Prefect Community #ask-community

Jan Domanski

05/09/2022, 6:37 PM

Hi there, Prefect 2 question: are there any examples of processing a database with 1 billion rows? I’m suspicious of tasks doing row-by-row operations, for performance reasons. A better choice is likely pagination, with a generator task yielding successive pages that are then processed in parallel by another task. How can that be expressed in Prefect 2?

Kevin Kho

05/09/2022, 6:41 PM

I think this sounds like a painful operation to do with native Python right? Do you have an idea of how you’d do it without Prefect? Is 1 bn rows a warehouse like Snowflake?

Jan Domanski

05/09/2022, 6:50 PM

Something like that is what I have in mind for the data reading side

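import pandas as pd
from prefect import task

# get_connection() and query are assumed to be defined elsewhere.
@task
def get_db_generator():
    with get_connection() as connection:
        # With chunksize set, read_sql_query returns a lazy iterator of DataFrames.
        # The with block exits on return, likely before any chunk has been read.
        return pd.read_sql_query(query, connection, chunksize=10)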

I want this to return the generator but instead it returns the future. So then

Kevin Kho

05/09/2022, 6:52 PM

You could call get_db_generator().result() to wait for the result, but you can just pass the future to a downstream task and the future will be used. Not sure about generators though, will have to check.

Jan Domanski

05/09/2022, 6:57 PM

Cheers, yeah – that’s what I figured. But it’s not even clear whether it makes sense for get_db_generator to be a task. Presumably I’ll be in some context manager hell. If get_db_generator is not a task, then it’s run at the flow level and presumably the pages are
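
To illustrate the context-manager concern raised here, below is a minimal plain-Python sketch (no Prefect, no pandas) of a page generator that owns its connection: the `with` block wraps the `yield`, so the connection stays open for as long as pages are being consumed. The `items` table, the `iter_pages` and `demo` names, and the use of SQLite are all assumptions made for illustration; they are not from the thread.

```python
import os
import sqlite3
import tempfile
from contextlib import closing
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


def iter_pages(db_path: str, page_size: int) -> Iterator[List[Row]]:
    """Yield successive pages of rows, keeping the connection open while iterating."""
    # The with block wraps the yields, so the connection is only closed once
    # the caller has exhausted (or discarded) the generator.
    with closing(sqlite3.connect(db_path)) as connection:
        cursor = connection.execute("SELECT id, name FROM items ORDER BY id")
        while True:
            page = cursor.fetchmany(page_size)
            if not page:
                break
            yield page


def demo() -> List[int]:
    """Build a 7-row throwaway database and return the size of each page."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        # Hypothetical table used only for this sketch.
        with closing(sqlite3.connect(db_path)) as connection:
            connection.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
            connection.executemany(
                "INSERT INTO items (name) VALUES (?)",
                [(f"item-{i}",) for i in range(7)],
            )
            connection.commit()
        # The generator is fully consumed here, so its connection is closed
        # before the temporary directory is removed.
        return [len(page) for page in iter_pages(db_path, page_size=3)]
```

Calling `demo()` pages through the seven rows three at a time. A generator like this only works when the consumer runs in the same process and iterates it directly, which is why it does not map cleanly onto a task whose return value must be handed to other tasks.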

Kevin Kho

05/09/2022, 7:01 PM

Yeah, from what I see generators are materialized so that they can be passed to downstream tasks. I understand what you are trying to do, but I think it won’t work in Prefect because results need to be materialized before they are passed downstream. Let me ask other team members to be sure about that.

Jan Domanski

05/09/2022, 7:02 PM

Very much appreciated thank you

Kevin Kho

05/10/2022, 3:07 PM

Confirmed not supported, and even potentially tricky to support because Dask and Ray might not support generators
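
Since results have to be materialized before being passed downstream, one possible workaround is to materialize the page boundaries rather than the pages: build a concrete list of ID ranges up front and let each unit of work open its own connection and fetch its own page. The sketch below shows the pattern in plain Python, with `ThreadPoolExecutor` standing in for a task runner. It is not Prefect API; the `numbers` table, the function names, and the assumption of a reasonably dense integer primary key are all hypothetical.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import closing
from typing import List, Tuple


def page_bounds(max_id: int, page_size: int) -> List[Tuple[int, int]]:
    """Return a concrete list of (low, high] ID ranges instead of a lazy generator."""
    return [(low, low + page_size) for low in range(0, max_id, page_size)]


def process_page(db_path: str, low: int, high: int) -> int:
    """Open a private connection, read one ID range, and return the sum of its values."""
    with closing(sqlite3.connect(db_path)) as connection:
        rows = connection.execute(
            "SELECT value FROM numbers WHERE id > ? AND id <= ? ORDER BY id",
            (low, high),
        ).fetchall()
    return sum(value for (value,) in rows)


def process_all(db_path: str, max_id: int, page_size: int) -> List[int]:
    """Fan the ID ranges out to worker threads and collect results in order."""
    bounds = page_bounds(max_id, page_size)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(process_page, db_path, low, high) for low, high in bounds]
        return [future.result() for future in futures]


def demo() -> List[int]:
    """Create a throwaway table holding the values 1..10 and process it in pages of 4."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        with closing(sqlite3.connect(db_path)) as connection:
            connection.execute("CREATE TABLE numbers (id INTEGER PRIMARY KEY, value INTEGER)")
            connection.executemany(
                "INSERT INTO numbers (value) VALUES (?)",
                [(i,) for i in range(1, 11)],
            )
            connection.commit()
        return process_all(db_path, max_id=10, page_size=4)
```

Only small tuples of integers cross between workers, so nothing lazy needs to be serialized. Ranges on the primary key are used instead of `LIMIT`/`OFFSET` because offset scans tend to get slower the deeper they go, which would matter at a billion rows. In a Prefect 2 flow, `process_page` would presumably become the task that is called once per range, though that wiring is not shown or tested here.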
