Hi there, Prefect 2 question: are there any examples of processing a database with 1 billion rows? I'm wary of tasks doing row-by-row operations because of performance. A better choice is likely pagination, with a generator task yielding successive pages that are then processed in parallel by another task. How can that be expressed in Prefect 2?
Kevin Kho
05/09/2022, 6:41 PM
I think this sounds like a painful operation to do with native Python, right? Do you have an idea of how you'd do it without Prefect? Are the 1 billion rows in a warehouse like Snowflake?
Jan Domanski
05/09/2022, 6:50 PM
Something like this is what I have in mind for the data-reading side:
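import pandas as pd
from prefect import task

@task
def get_db_generator():
    # get_connection() and query are assumed to be defined elsewhere
    with get_connection() as connection:
        # with chunksize set, read_sql_query returns an iterator of DataFrames
        return pd.read_sql_query(query, connection, chunksize=10)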
I want this to return the generator but instead it returns the future. So then
Kevin Kho
05/09/2022, 6:52 PM
You could call get_db_generator().result() to wait for the result, but you can also just pass the future to a downstream task and it will be resolved there. Not sure about generators though, I will have to check.
Jan Domanski
05/09/2022, 6:57 PM
Cheers, yeah, that's what I figured. But it's not even clear whether it makes sense for get_db_generator to be a task. Presumably I'll be in some context manager hell. If get_db_generator is not a task, then it's run at the flow level and presumably the pages are
Kevin Kho
05/09/2022, 7:01 PM
Yeah, from what I see generators are materialized so that they can be passed to downstream tasks. I understand what you are trying to do, but I think it won't work in Prefect because results need to be materialized before they are passed downstream. Let me ask other team members to be sure about that.
Jan Domanski
05/09/2022, 7:02 PM
Very much appreciated, thank you.
Kevin Kho
05/10/2022, 3:07 PM
Confirmed: not supported, and potentially tricky to support because Dask and Ray might not support generators.
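Since a generator cannot be handed from one task to the next, one possible workaround is to pass something that is already materialized: a list of page boundaries, with each worker opening its own connection and reading only its page. The sketch below shows that pattern in plain Python, not Prefect. Everything in it is an assumption for illustration: sqlite3 stands in for the real database, ThreadPoolExecutor stands in for whatever runs tasks in parallel, and the `items` table, `PAGE_SIZE`, and all function names are hypothetical. In a Prefect flow, `get_page_bounds` and `process_page` could presumably each be a task, but that part is not shown or verified here.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 250  # hypothetical page size; tune for the real table


def make_demo_db(path, n_rows):
    """Create a small stand-in table so the sketch is runnable."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value INTEGER)")
        conn.executemany(
            "INSERT INTO items (id, value) VALUES (?, ?)",
            ((i, i * 2) for i in range(1, n_rows + 1)),
        )
        conn.commit()
    finally:
        conn.close()


def get_page_bounds(path, page_size):
    """Return a materialized list of (low, high] id ranges, one per page."""
    conn = sqlite3.connect(path)
    try:
        (max_id,) = conn.execute("SELECT COALESCE(MAX(id), 0) FROM items").fetchone()
    finally:
        conn.close()
    return [(low, min(low + page_size, max_id)) for low in range(0, max_id, page_size)]


def process_page(path, bounds):
    """Open a fresh connection, read one page, and return a small result."""
    low, high = bounds
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT value FROM items WHERE id > ? AND id <= ?", (low, high)
        ).fetchall()
    finally:
        conn.close()
    return sum(value for (value,) in rows)


def run(path, page_size=PAGE_SIZE, workers=4):
    """Compute the page list up front, then process the pages in parallel."""
    pages = get_page_bounds(path, page_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: process_page(path, b), pages))


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.db")
    make_demo_db(db_path, 1000)
    page_bounds = get_page_bounds(db_path, PAGE_SIZE)
    page_sums = run(db_path)
```

Paging on an indexed key range rather than on OFFSET keeps each page query cheap on a large table, and because no connection or cursor crosses a worker boundary, nothing has to outlive its context manager.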