Concurrent task iteration support #1239
23 comments · 1 reply
-
Do you think batch enumerators could help? (see #409) Regarding actual parallelism when running tasks, it's something we're thinking about, but we haven't made any formal plans, so we can't make any promises. We can keep this issue open to continue thinking about it, start fleshing out an API and behaviour, figure out the edge cases (e.g. it will require special handling for custom enumerators, which may not have a way to start a cursor at an arbitrary point, but only give out one item at a time), etc.
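To illustrate why custom enumerators complicate parallelism, here's a plain-Ruby sketch (the `cursor_enumerator` helper is hypothetical, in the cursor-yielding style job-iteration uses): each element carries a cursor so an interrupted Run can resume, but a real custom enumerator may only be able to hand out the next item, not jump to an arbitrary cursor.

```ruby
# Hypothetical cursor-yielding enumerator: each element is an [item, cursor]
# pair so an interrupted Run can resume from the last cursor it saw.
# Here the backing collection is an Array, so seeking is cheap; a real
# custom enumerator (e.g. wrapping a paginated API) may only be able to
# hand out the *next* item, which is what makes parallel splitting hard.
def cursor_enumerator(items, cursor: nil)
  start = cursor ? cursor + 1 : 0
  Enumerator.new do |yielder|
    items[start..].each_with_index do |item, offset|
      yielder.yield(item, start + offset)
    end
  end
end

cursor_enumerator(%w[a b c]).to_a            # => [["a", 0], ["b", 1], ["c", 2]]
cursor_enumerator(%w[a b c], cursor: 0).to_a # => [["b", 1], ["c", 2]]
```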
-
Batches could help with some of our task types, yes! But we have other types of tasks that require, for example, calling an external API with an individual record and then saving that value to our database, so batching would remove some of the overhead of the job queue itself, but wouldn't give us the speed-up we would get from concurrency.
-
I recently ran a migration on flow which mainly involves making GraphQL requests to core for certain things. Processing 874k rows would take about 7 days to complete. Allowing parallelism would really help in these cases.
-
This issue has been marked as stale because it has not been commented on in two months.
-
We would still like this!
-
This issue has been marked as stale because it has not been commented on in two months.
-
We would still really like this
-
This would be incredibly useful.
-
This issue has been marked as stale because it has not been commented on in two months.
-
Still valid
-
This issue has been marked as stale because it has not been commented on in two months.
-
still relevant
-
This issue has been marked as stale because it has not been commented on in two months.
-
Still, I would love this.
-
This issue has been marked as stale because it has not been commented on in two months.
-
Still want this.
-
This issue has been marked as stale because it has not been commented on in two months.
-
Not stale
-
This issue has been marked as stale because it has not been commented on in two months.
-
Not stale
-
This issue has been marked as stale because it has not been commented on in two months.
-
Still valid
-
Turned this into a discussion to avoid the stale bot problem.

There are a few things to think about if we want to introduce parallelism. The first part has already been done: allowing multiple Runs of a Task. Now we need a good way to split and coordinate the work. A naive approach is to take the ids of the first and last rows, split that range evenly, and start n Runs; however, id distribution can be sparse, so that can be problematic. There's also the issue of non-Active Record collections, in particular custom enumerators, as mentioned above. CSV collections might actually be the easiest ones to handle.

We don't really have any bandwidth for this however, so any contribution would be useful, even if it's not working code: scoping out the feature, what we need (e.g. should people choose the parallelism and that's it, or should it be automatic somehow?), and how it could work (e.g. the main Run coordinates multiple "sub-runs", each taking a cursor and an "end cursor"; when a sub-run finishes, does the main Run start a new one to keep the parallelism, or do the sub-runs stay running and get a new cursor to another part of the table that needs processing?).
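The naive id-range split described above can be sketched in plain Ruby (`split_id_ranges` is a hypothetical helper; `min_id`/`max_id` would come from something like `Post.minimum(:id)` and `Post.maximum(:id)`). It also makes the sparseness problem concrete: the ranges are even in id space, not in row count.

```ruby
# Hypothetical sketch of the naive approach: divide [min_id, max_id] into
# `parallelism` contiguous ranges, one per sub-run. If ids are sparsely
# distributed, the ranges will hold very uneven numbers of actual rows.
def split_id_ranges(min_id, max_id, parallelism)
  chunk = ((max_id - min_id + 1).to_f / parallelism).ceil
  (0...parallelism).filter_map do |i|
    low = min_id + i * chunk
    [low, [low + chunk - 1, max_id].min] if low <= max_id
  end
end

split_id_ranges(1, 10, 3) # => [[1, 4], [5, 8], [9, 10]]
split_id_ranges(1, 10, 4) # => [[1, 3], [4, 6], [7, 9], [10, 10]]
```

Each `[low, high]` pair could become a sub-run's starting cursor and "end cursor"; the open question above is what the coordinator does when a sub-run exhausts its range.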
-
Over at https://github.com/shopify/flow we have been trying to adopt the maintenance task framework and have enjoyed the benefits for our small data migrations, but our main hangup is the long runtimes of tasks that need to operate on large datasets (e.g. all records in one table: tens of thousands now, and much more in the future). When we tried running a recent data migration via a maintenance task, the total time to execute would have been months.
As such, our main desire with this library would be declarative concurrency support. Is #325 (comment) still the recommendation for concurrency in the future of this library?
No immediate need for action on this - we just wanted to provide feedback on our adoption!