Hello @Milos Tomic! One of the ways we designed Prefect is to work with your existing tools as much as possible, while giving you the option to bring data handling/processing into Prefect as much as you like. The nice thing is this means “best practice” for what you’re describing can range from nothing more than wrapping an existing script that you know works to using Prefect idioms directly.
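On the “wrap an existing script” end of that spectrum, the whole flow can be as little as one task. A minimal sketch (the script name is hypothetical):
```python
import subprocess

from prefect import task, Flow

@task
def run_existing_script():
    # Shell out to a script you already trust; Prefect just adds
    # scheduling, retries, and observability around it
    subprocess.run(["python", "my_existing_etl.py"], check=True)

with Flow("wrapped-script") as flow:
    run_existing_script()
```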
1. Prefect will work with whatever resources you give it: if you run it on a thin client with access to an external cluster or warehouse, it can automate operations there; otherwise, make sure it has sufficient RAM / disk for the data you are passing. For working with data that exceeds available resources, we recommend a few things. First, we love using Dask to scale out-of-core computation. Depending on the data structure, Dask has a variety of tools for distributing work automatically, and Prefect can either submit a single job to Dask or run natively on top of our DaskExecutor. Second, you can use our map operator to load chunks of your data in parallel. Third, you can use our LOOP signal to work with paginated data. For reference, we use a combination of all three for our internal data warehouse (which we hope to open-source soon) - we map over tables and loop over a paginated query inside each mapped task.
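Here’s a minimal sketch of that map + LOOP pattern in the Prefect 1.x-style API (the table names, page size, and fetch_page stand-in are all hypothetical):
```python
import prefect
from prefect import task, Flow
from prefect.engine.signals import LOOP

def fetch_page(table_name, page, page_size):
    # Hypothetical stand-in for a paginated warehouse query:
    # serves 2,500 fake rows, then an empty page
    start = page * page_size
    stop = min((page + 1) * page_size, 2500)
    return [f"{table_name}-row-{i}" for i in range(start, stop)]

@task
def extract_table(table_name):
    # Raising LOOP re-runs this task, passing accumulated state
    # along in prefect.context["task_loop_result"]
    state = prefect.context.get("task_loop_result", {"page": 0, "rows": []})
    batch = fetch_page(table_name, state["page"], page_size=1000)
    state["rows"].extend(batch)
    if batch:  # more pages may remain, so queue another iteration
        raise LOOP(result={"page": state["page"] + 1, "rows": state["rows"]})
    return state["rows"]

with Flow("warehouse-extract") as flow:
    # map gives each table its own task run, which can execute in parallel
    all_rows = extract_table.map(["users", "orders", "events"])
```
Each mapped child loops independently, so a slow table doesn’t block the others.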
2. Yes, Prefect automatically gains support for parallelism through threads or processes via Dask (which can run locally as well as on a remote cluster). Check out the LocalDaskExecutor. Given the choice, we recommend a local Dask cluster over the LocalDaskExecutor, because the local cluster has better guarantees about shared memory during parallel execution.
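A quick sketch of both options (Prefect 1.x-style; on older versions the executors live under prefect.engine.executors instead, so treat the import path as an assumption):
```python
from prefect import task, Flow
from prefect.executors import DaskExecutor, LocalDaskExecutor

@task
def square(x):
    return x ** 2

with Flow("parallel-squares") as flow:
    results = square.map(list(range(10)))

# Thread- or process-based parallelism on a single machine
flow.run(executor=LocalDaskExecutor(scheduler="threads"))

# A temporary local Dask cluster: the recommended option above
flow.run(executor=DaskExecutor())
```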
3. Any infrastructure or process you’re looking for in particular?
Hope you have an easy time getting going!