Concurrent task iteration support #1239
23 comments · 1 reply
-
Do you think batch enumerators could help? (see #409) Regarding actual parallelism when running tasks, it's something we're thinking about, but we haven't made any formal plans, so we can't make any promises. We can keep this issue open to continue thinking about it, start fleshing out an API and behaviour, figure out the edge cases (e.g. it will require special handling for custom enumerators, which may not have a way to start a cursor at an arbitrary point, but only give out one item at a time), etc.
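To illustrate why custom enumerators complicate parallelism, here's a plain-Ruby sketch (the `cursor_enumerator` helper is hypothetical, in the cursor-yielding style job-iteration uses): each element carries a cursor so an interrupted Run can resume, but a real custom enumerator may only be able to hand out the next item, not jump to an arbitrary cursor.

```ruby
# Hypothetical cursor-yielding enumerator: each element is an [item, cursor]
# pair so an interrupted Run can resume from the last cursor it saw.
# Here the backing collection is an Array, so seeking is cheap; a real
# custom enumerator (e.g. wrapping a paginated API) may only be able to
# hand out the *next* item, which is what makes parallel splitting hard.
def cursor_enumerator(items, cursor: nil)
  start = cursor ? cursor + 1 : 0
  Enumerator.new do |yielder|
    items[start..].each_with_index do |item, offset|
      yielder.yield(item, start + offset)
    end
  end
end

cursor_enumerator(%w[a b c]).to_a            # => [["a", 0], ["b", 1], ["c", 2]]
cursor_enumerator(%w[a b c], cursor: 0).to_a # => [["b", 1], ["c", 2]]
```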
-
Batches could help with some of our task types, yes! But we have other types of tasks that require, for example, calling an external API with an individual record and then saving that value to our database, so batching would remove some of the overhead of the job queue itself, but wouldn't give us the speed-up we would get from concurrency.
-
I recently ran a migration on flow which mainly involves making GraphQL requests to core for certain things. Processing 874k rows would take about 7 days to complete. Allowing parallelism would really help in these cases.
-
This issue has been marked as stale because it has not been commented on in two months.
-
We would still like this!
-
This issue has been marked as stale because it has not been commented on in two months.
-
We would still really like this
-
This would be incredibly useful.
-
This issue has been marked as stale because it has not been commented on in two months.
-
Still valid
-
This issue has been marked as stale because it has not been commented on in two months.
-
still relevant
-
This issue has been marked as stale because it has not been commented on in two months.
-
Still, I would love this.
-
This issue has been marked as stale because it has not been commented on in two months.
-
Still want this.
-
This issue has been marked as stale because it has not been commented on in two months.
-
Not stale
-
This issue has been marked as stale because it has not been commented on in two months.
-
Not stale
-
This issue has been marked as stale because it has not been commented on in two months.
-
Still valid
-
Turned this into a discussion to avoid the stale bot problem.

There are a few things to think about if we want to introduce parallelism. The first part has already been done: allowing multiple Runs of a Task. Now we need a good way to split and coordinate the work. A naive approach is to take the ids of the first and last rows, split that range evenly, and start n Runs; however, id distribution can be sparse, so that can be problematic. There's also the issue of non-Active Record collections, in particular custom enumerators, as mentioned above. CSV collections might actually be the easiest ones to handle.

We don't really have any bandwidth for this however, so any contribution would be useful, even if it's not working code: scoping out the feature, what we need (e.g. should people choose the parallelism and that's it, or should it be automatic somehow?), and how it could work (e.g. the main Run coordinates multiple "sub-runs", each taking a cursor and an "end cursor"; when a sub-run finishes, does the main Run start a new one to keep the parallelism, or do the sub-runs stay running and get a new cursor to another part of the table that needs processing?).
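The naive id-range split described above can be sketched in plain Ruby (`split_id_ranges` is a hypothetical helper; `min_id`/`max_id` would come from something like `Post.minimum(:id)` and `Post.maximum(:id)`). It also makes the sparseness problem concrete: the ranges are even in id space, not in row count.

```ruby
# Hypothetical sketch of the naive approach: divide [min_id, max_id] into
# `parallelism` contiguous ranges, one per sub-run. If ids are sparsely
# distributed, the ranges will hold very uneven numbers of actual rows.
def split_id_ranges(min_id, max_id, parallelism)
  chunk = ((max_id - min_id + 1).to_f / parallelism).ceil
  (0...parallelism).filter_map do |i|
    low = min_id + i * chunk
    [low, [low + chunk - 1, max_id].min] if low <= max_id
  end
end

split_id_ranges(1, 10, 3) # => [[1, 4], [5, 8], [9, 10]]
split_id_ranges(1, 10, 4) # => [[1, 3], [4, 6], [7, 9], [10, 10]]
```

Each `[low, high]` pair could become a sub-run's starting cursor and "end cursor"; the open question above is what the coordinator does when a sub-run exhausts its range.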
-
Over at https://github.com/shopify/flow we have been trying to adopt the maintenance task framework and have enjoyed the benefits for our small data migrations, but our main hangup is the long runtimes of tasks that need to operate on large datasets (e.g. all records in one table: tens of thousands now, and much more in the future). When we tried running a recent data migration via a maintenance task, the total time to execute would have been months.
As such, our main desire with this library would be declarative concurrency support. Is #325 (comment) still the recommendation for concurrency in the future of this library?
No immediate need for action on this - we just wanted to provide feedback on our adoption!