
Zach Phillips

04/07/2023, 5:44 PM
QQ strategy question: I have written my first Prefect deployment to get a day's worth of event logs from one of our production servers (prefect-sqlalchemy + pandas). This works well, but due to capacity limitations and a table design I have no control over, at most I can pull about two days' worth of data before the query becomes unsustainable. My current workflow works fine for single-day pulls and supports data ingestion moving forward, but I am considering strategies to use Prefect effectively to backfill data more efficiently. This is a new initiative and I have room to play, but I was hoping to understand if there is a Prefect-minded way to execute multiple smaller queries within a flow and return the combined dataset.
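A minimal sketch of the idea being asked about here (not Zach's actual code), assuming a hypothetical event_logs table, activity_date column, and placeholder connection URL: map a per-day extraction task over a date range inside one flow, then concatenate the results.

```python
from datetime import date, timedelta

import pandas as pd
from prefect import flow, task
from sqlalchemy import create_engine, text


@task(retries=2, retry_delay_seconds=60)
def extract_day(day: date) -> pd.DataFrame:
    """Pull a single day's worth of event logs -- small enough to stay fast."""
    engine = create_engine("postgresql+psycopg2://user:pass@replica-host/eventdb")  # placeholder URL
    query = text(
        "SELECT * FROM event_logs "
        "WHERE activity_date >= :start AND activity_date < :end"
    )
    with engine.connect() as conn:
        return pd.read_sql(query, conn, params={"start": day, "end": day + timedelta(days=1)})


@flow
def backfill(start: date, end: date) -> pd.DataFrame:
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # .map() submits one task run per day; the default task runner executes them concurrently
    futures = extract_day.map(days)
    return pd.concat([f.result() for f in futures], ignore_index=True)


if __name__ == "__main__":
    combined = backfill(date(2023, 1, 1), date(2023, 1, 7))
```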

Kevin Grismore

04/07/2023, 5:46 PM
Is the root of the issue the query itself being slow, or the quantity of data being too large to fit in memory in the environment the flow is running in?

Zach Phillips

04/07/2023, 5:49 PM
Query being slow. It's a read replica of a prod DB, and while I am sure there is more that can be done, it's not going to be a priority for a while.
I have successfully used a strategy of batch-appending DataFrames to a list and using pd.concat to produce a larger file, but I am not sure if I should try something like that in Prefect flows.
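For reference, a minimal sketch of that list-append-and-pd.concat pattern, here combined with pandas' chunksize so no single fetch is large; the query and connection URL are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@replica-host/eventdb")  # placeholder URL

frames = []
# chunksize makes read_sql yield DataFrames of at most 50k rows each
for chunk in pd.read_sql(
    "SELECT * FROM event_logs WHERE activity_date = '2023-01-01'",  # placeholder query
    engine,
    chunksize=50_000,
):
    frames.append(chunk)

combined = pd.concat(frames, ignore_index=True)
```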

Kevin Grismore

04/07/2023, 5:51 PM
What's the destination of the data? Depending on how it's accessed, it doesn't necessarily need to all be grabbed in one go.

Zach Phillips

04/07/2023, 5:51 PM
It's going to end up in BigQuery, but it would be fine to drop to CSV/Parquet locally if needed
or even GCS if there was a need for it

Kevin Grismore

04/07/2023, 5:55 PM
A strategy I use all the time for extracting large data that needs to land in BQ is to create an external table and write the data out in smaller batches to the GCS location the external table points to. In my use case the data is scattered across different sources, so I unify the schema in Python, write each source out as Parquet to its own external table, and then union the external tables afterward.
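A hedged sketch of the external-table setup described above, using the google-cloud-bigquery client; the project, dataset, bucket, and prefix names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# External table config pointing at a GCS prefix full of Parquet files
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/event_logs/*.parquet"]  # placeholder bucket/prefix

table = bigquery.Table("my-project.my_dataset.event_logs_external")  # placeholder table id
table.external_data_configuration = external_config

client.create_table(table, exists_ok=True)
```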
But you could also use SQLAlchemy's fetch_many to get one or a few rows at a time, write those to your output file, then repeat, as a way to limit memory usage.
And if you need to divide up the queries by date ranges to keep them fast, continue to drop the files into the same external table dir until they're all finished and you've got your complete table (with the caveat that the more files your external table is made up of, the slower your queries against it will be).
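A sketch of the date-range batching idea: one query per day, each result written as its own Parquet file under the GCS prefix the external table points to. Writing directly to gs:// paths assumes gcsfs is installed alongside pandas; the table, bucket, and date window are placeholders.

```python
from datetime import date, timedelta

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@replica-host/eventdb")  # placeholder URL
query = text(
    "SELECT * FROM event_logs WHERE activity_date >= :start AND activity_date < :end"
)

day = date(2023, 1, 1)           # placeholder backfill window
backfill_end = date(2023, 2, 1)
while day < backfill_end:
    next_day = day + timedelta(days=1)
    with engine.connect() as conn:
        df = pd.read_sql(query, conn, params={"start": day, "end": next_day})
    # One file per day, all landing under the prefix the external table reads from
    df.to_parquet(f"gs://my-bucket/event_logs/{day.isoformat()}.parquet", index=False)
    day = next_day
```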

Zach Phillips

04/07/2023, 9:12 PM
Kevin, thanks so much for the continued feedback. Work got busy, so I'm just getting back here.
In this case, is it appropriate to do a while loop in the SQLAlchemy context manager? I have built out the logic to fetch daily batches based on activity date and can do start-here/end-here logic; I'm just trying to get my head around it. Doing manual runs is going to take forever.

Kevin Grismore

04/07/2023, 9:33 PM
Perfectly fine to do a while loop that breaks when there are no more records to fetch.
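A minimal sketch of that pattern: a while loop inside a SQLAlchemy connection context manager that fetches a batch at a time and breaks when nothing comes back. The connection URL, query, date range, and batch size are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@replica-host/eventdb")  # placeholder URL
BATCH_SIZE = 10_000

with engine.connect() as conn:
    result = conn.execution_options(stream_results=True).execute(
        text("SELECT * FROM event_logs WHERE activity_date >= :start AND activity_date < :end"),
        {"start": "2023-01-01", "end": "2023-01-03"},  # placeholder date range
    )
    batch_num = 0
    while True:
        rows = result.fetchmany(BATCH_SIZE)
        if not rows:  # no more records to fetch -> done
            break
        pd.DataFrame(rows, columns=list(result.keys())).to_parquet(
            f"event_logs_batch_{batch_num}.parquet", index=False
        )
        batch_num += 1
```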