# ask-community
j
Hi there, Prefect 2 question: are there any examples for processing a database with 1 billion rows? I’m wary of the performance of tasks doing row-by-row operations. A better choice is probably pagination: a generator task yields successive pages, which another task then processes in parallel. How can that be expressed in Prefect 2?
k
I think this sounds like a painful operation to do with native Python, right? Do you have an idea of how you’d do it without Prefect? And are the 1 billion rows in a warehouse like Snowflake?
j
Something like this is what I have in mind for the data-reading side:
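import pandas as pd
from prefect import task

@task
def get_db_generator():
    # get_connection and query are placeholders defined elsewhere.
    with get_connection() as connection:
        # With chunksize set, read_sql_query returns an iterator of DataFrames, not one DataFrame.
        return pd.read_sql_query(query, connection, chunksize=10)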
I want this to return the generator, but instead it returns a future. So then
k
You could call
get_db_generator().result()
to wait for the result, but you can also just pass the future to a downstream task, and Prefect will resolve it to its result there. I'm not sure about generators though; I'll have to check.
j
Cheers, yeah, that’s what I figured. But it’s not even clear whether it makes sense for get_db_generator to be a task. Presumably I’ll be in some context manager hell. If get_db_generator is not a task, then it’s run at the flow level and presumably the pages are
k
Yeah, from what I see, generators are materialized so that they can be passed to downstream tasks. I understand what you are trying to do, but I think it won’t work in Prefect because results need to be materialized before they can be passed downstream. Let me ask other team members to be sure about that.
j
Very much appreciated thank you
k
Confirmed: this is not supported, and it could even be tricky to support, because Dask and Ray might not support generators.
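
Since a lazy generator cannot cross a task boundary, one possible workaround is to pass something small and fully materialized instead: a plain list of page bounds. Each worker then opens its own connection and reads only its page. The sketch below is not from this thread and uses only the standard library so it can run anywhere. The `events` table, its `id` and `value` columns, and all function names are hypothetical. The thread pool stands in for whatever runs tasks in parallel; in a Prefect flow, `plan_pages` and `process_page` would presumably be tasks, with one `process_page` run per page. It also assumes a reasonably dense integer key; large gaps in `id` would make page sizes uneven.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def create_demo_db(path, n_rows):
    """Build a small hypothetical table so the sketch is runnable."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value INTEGER)")
        conn.executemany(
            "INSERT INTO events (id, value) VALUES (?, ?)",
            [(i, i * 2) for i in range(1, n_rows + 1)],
        )
        conn.commit()
    finally:
        conn.close()


def plan_pages(path, page_size):
    """Return a plain list of inclusive (low, high) id bounds, one per page."""
    conn = sqlite3.connect(path)
    try:
        low, high = conn.execute("SELECT MIN(id), MAX(id) FROM events").fetchone()
    finally:
        conn.close()
    if low is None:  # empty table
        return []
    return [
        (start, min(start + page_size - 1, high))
        for start in range(low, high + 1, page_size)
    ]


def process_page(path, bounds):
    """Open a fresh connection, read one page, and return a small result."""
    low, high = bounds
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT value FROM events WHERE id BETWEEN ? AND ?", (low, high)
        ).fetchall()
    finally:
        conn.close()
    return sum(value for (value,) in rows)


def run(path, page_size=4, workers=3):
    """Plan the pages up front, then process them in parallel."""
    pages = plan_pages(path, page_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda b: process_page(path, b), pages))
    return pages, sum(partials)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        create_demo_db(db_path, n_rows=10)
        print(run(db_path))
```

The point of the design is that only small, serializable values (the bounds and the per-page results) move between steps, and no connection or cursor outlives the function that opened it, which also avoids the context manager problem raised above.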