Milos Tomic

04/02/2020, 8:24 AM
Hey everyone, I was researching ETL workflows and this came up as the best solution. As I'm a newbie here, I have a few simple questions. I want to make a workflow for extracting really big MongoDB collection(s), transforming them, and then loading them into something like Redshift, S3, or Kinesis.
1. How do you suggest loading a really big collection into a workflow? Chunking or something?
2. Does Prefect have some support for multithreading or multiprocessing?
3. Are there any resources on working with big data, rather than toy flows that just compute 1+2?
That's all for now! Thanks


04/02/2020, 12:36 PM
Hello @Milos Tomic! One of the ways we designed Prefect is to work with your existing tools as much as possible, while giving you the option to bring data handling/processing into Prefect as much as you like. The nice thing is this means “best practice” for what you’re describing can range from nothing more than wrapping an existing script that you know works to using Prefect idioms directly.
1. Prefect will work with whatever resources you give it. If you run it on a thin client with access to an external cluster or warehouse, it can automate operations there; otherwise, make sure it has enough RAM / disk for the data you are passing. For working with data that exceeds available resources, we recommend a few things. First, we love using Dask to scale out-of-core computation. Depending on the data structure, Dask has a variety of tools for distributing work automatically, and Prefect can either submit a single job to Dask or run natively on top of our DaskExecutor. Second, you can use our map operator to load chunks of your data in parallel. Third, you can use our LOOP signal to work with paginated data. For reference, we use a combination of all three for our internal data warehouse (which we hope to open-source soon): we map over tables and loop over a paginated query inside each mapped task.
2. Yes, Prefect automatically gains support for parallelism through threads or processes via Dask (which can be run locally, in addition to as a remote cluster). Check out the LocalDaskExecutor. Given the choice, we recommend a local Dask cluster over a LocalDaskExecutor because it has better guarantees about shared memory during parallel execution.
3. Any infrastructure or process you’re looking for in particular?
Hope you have an easy time getting going!
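To make the "loop over a paginated query, then process chunks in parallel" idea a bit more concrete, here is a rough sketch in plain Python using only the standard library. It is not Prefect's API and not how the internal warehouse is built; `fetch_page`, `iter_pages`, `transform`, and `run_etl` are hypothetical names, and an in-memory list stands in for the MongoDB collection.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def fetch_page(collection: List[dict], offset: int, page_size: int) -> List[dict]:
    """Hypothetical stand-in for a paginated query (e.g. skip/limit)."""
    return collection[offset:offset + page_size]


def iter_pages(collection: List[dict], page_size: int) -> Iterator[List[dict]]:
    """Yield one page at a time until the source returns an empty page."""
    offset = 0
    while True:
        page = fetch_page(collection, offset, page_size)
        if not page:
            break
        yield page
        offset += len(page)


def transform(page: List[dict]) -> List[dict]:
    """Toy transform step: double the 'value' field of every document."""
    return [{**doc, "value": doc["value"] * 2} for doc in page]


def run_etl(collection: List[dict], page_size: int = 3, workers: int = 4) -> List[dict]:
    """Transform pages in a thread pool and flatten the results in page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map submits every page up front, so all pages are held in
        # memory here; a real pipeline would bound this or write each
        # transformed page straight to the destination.
        transformed_pages = list(pool.map(transform, iter_pages(collection, page_size)))
    return [doc for page in transformed_pages for doc in page]


# Assumed sample data: ten small documents instead of a real collection.
docs = [{"_id": i, "value": i} for i in range(10)]
loaded = run_etl(docs)
```

In a Prefect flow, the per-page work would presumably live in a mapped task and the thread pool would be replaced by a Dask-backed executor, but the shape of the problem (page through the source, fan out the chunks, collect or load the results) stays the same.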