
Matt Delacour

11/29/2022, 3:34 PM
👋 How can I "batch" my pipeline? I would like to run the full pipeline in batches of ~10k rows rather than as one massive update where a single failure would stop everything. So I would like to take a batch from get_user_ids(), run everything in parallel (LocalDaskExecutor), save the results to Redshift, and start again with the next batch. PS: Another approach would be to build "incremental" logic from the source data. PS2: I am using Prefect 1.0
Unsure whether Dynamic DAGs: Task Looping would overcomplicate things here 🤔
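(For reference, a minimal sketch of what the looping option could look like with Prefect 1.0's LOOP signal; fetch_user_id_batch, get_third_party_api_info, and save_to_redshift are hypothetical stand-ins for the real queries and tasks in this pipeline.)

```python
# Minimal sketch of Prefect 1.0 task looping ("Dynamic DAGs: Task Looping").
# The three helpers below are hypothetical placeholders, not Matt's real code.
import prefect
from prefect import task, Flow
from prefect.engine.signals import LOOP


def fetch_user_id_batch(offset, limit):
    # hypothetical: query the source table for the next `limit` user IDs
    return []


def get_third_party_api_info(user_id):
    # hypothetical: call the third-party API for one user
    return {"user_id": user_id}


def save_to_redshift(rows):
    # hypothetical: load one batch of rows into Redshift
    pass


@task
def process_in_batches():
    # task_loop_result carries state between iterations of the same task run
    payload = prefect.context.get("task_loop_result", {}) or {}
    offset = payload.get("offset", 0)

    user_ids = fetch_user_id_batch(offset=offset, limit=10_000)
    if not user_ids:
        return offset  # nothing left to process; the loop ends here

    rows = [get_third_party_api_info(uid) for uid in user_ids]
    save_to_redshift(rows)

    # raising LOOP re-runs this task with the next offset, so a failure only
    # affects the current ~10k batch rather than the whole run
    raise LOOP(
        message=f"Saved batch ending at offset {offset + len(user_ids)}",
        result={"offset": offset + len(user_ids)},
    )


with Flow("batched-pipeline") as flow:
    process_in_batches()
```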

Serina

11/29/2022, 7:13 PM
Off the top of my head: your get_user_ids task executes some SQL query, so maybe you could edit it to retrieve 10k user IDs where some flag is set to false in the table. Then you could have a final task in your flow that only runs if the save_to_redshift task completes successfully, perhaps updating the original table to set the flag for those user IDs to true. And provided you have some schedule as well, maybe add a statement in the flow like "while get_user_ids() returns some user IDs, execute the following tasks."
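(A rough sketch of that flag idea in Prefect 1.0 terms; the processed column, table name, and run_query helper are hypothetical, and the repetition comes from the schedule rather than an explicit while loop.)

```python
# Sketch of the flag-based incremental pattern: pull only unprocessed IDs,
# and flip the flag only after the Redshift save succeeds.
from prefect import task, Flow


def run_query(sql, params=None):
    # hypothetical: execute SQL against the source database
    return []


@task
def get_user_ids():
    # only pull user IDs that have not been processed yet
    return run_query("SELECT user_id FROM users WHERE processed = FALSE LIMIT 10000")


@task
def save_to_redshift(user_ids):
    # hypothetical: enrich and load the batch into Redshift
    pass


@task
def mark_processed(user_ids):
    # runs only if the upstream save succeeded (default all_successful trigger),
    # so a failed batch is picked up again on the next scheduled run
    run_query("UPDATE users SET processed = TRUE WHERE user_id = ANY(%s)", (user_ids,))


with Flow("incremental-batch") as flow:
    ids = get_user_ids()
    saved = save_to_redshift(ids)
    mark_processed(ids, upstream_tasks=[saved])
```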

Kalise Richmond

11/29/2022, 7:48 PM
Hey Matt, are you looking to do something where you chunk the get_user_ids() output into 10k batches, run the transformation get_third_party_api_info() over those chunks, and then, once all the chunks have run, either upload the entire result to Redshift or save each chunk's result to Redshift as it finishes?
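(If it ends up being the chunk-and-map route, a minimal sketch under those assumptions; every task body here is a hypothetical placeholder.)

```python
# Sketch of the chunk-and-map pattern on a LocalDaskExecutor: split the IDs
# into ~10k chunks, map the transformation over them, save each chunk.
from prefect import task, Flow
from prefect.executors import LocalDaskExecutor


@task
def get_user_ids():
    # hypothetical: return the full list of user IDs from the source query
    return list(range(100_000))


@task
def chunk(user_ids, size=10_000):
    # split the full ID list into ~10k-row chunks
    return [user_ids[i:i + size] for i in range(0, len(user_ids), size)]


@task
def get_third_party_api_info(id_chunk):
    # hypothetical: enrich one chunk of user IDs via the third-party API
    return id_chunk


@task
def save_to_redshift(info_chunk):
    # hypothetical: load one enriched chunk into Redshift
    pass


with Flow("chunked-pipeline", executor=LocalDaskExecutor()) as flow:
    ids = get_user_ids()
    chunks = chunk(ids)
    infos = get_third_party_api_info.map(chunks)  # chunks run in parallel on Dask
    save_to_redshift.map(infos)                   # each chunk lands in Redshift as it finishes
```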

Matt Delacour

11/29/2022, 9:19 PM
Yes, you got it @Kalise Richmond, and then the flow starts again with the next 10k