https://prefect.io logo
Title
j

Jan Domanski

05/09/2022, 6:37 PM
hi there, prefect2 question: are there any examples for processing a database with 1bln rows? I’m suspicious of tasks doing row-by-row operations and performance. Likely a better choice is pagination, with a generator task yielding successive pages, that are then processed in parallel by another task? How can that be expressed in prefect2?
k

Kevin Kho

05/09/2022, 6:41 PM
I think this sounds like a painful operation to do with native Python right? Do you have an idea of how you’d do it without Prefect? Is 1 bn rows a warehouse like Snowflake?
j

Jan Domanski

05/09/2022, 6:50 PM
Something like that is what I have in mind for the data reading side
import pandas as pd

@task
def get_db_generator():
    with get_connection() as connection:
        return pd.read_sql_query(query, connection, chunksize=10)
I want this to return the generator but instead it returns the future. So then
k

Kevin Kho

05/09/2022, 6:52 PM
You could call
get_db_generator().result()
to wait for the result, but you can just pass the future to a downstream task and the future will be used. Not sure about generators though, will have to check
j

Jan Domanski

05/09/2022, 6:57 PM
Cheers, yeah – that’s what I figured. But it’s not even clear if it makes sense for get_db_generator to be a task? Presumably I’ll be in some context manager hell. if get_db_generator is not a task, then it’s run at the flow-level and presumably the pages are
k

Kevin Kho

05/09/2022, 7:01 PM
Yeah generators are materialized from what I see so that they can be passed to downstream task. I understand what you are trying to do, I think it won’t work in Prefect because stuff needs to be materialized to pass downstream. Let me ask other team members to be sure about that.
j

Jan Domanski

05/09/2022, 7:02 PM
Very much appreciated thank you
k

Kevin Kho

05/10/2022, 3:07 PM
Confirmed not supported, and even potentially tricky to support because Dask and Ray might not support generators