
kevin

03/11/2022, 4:08 AM
I have a flow that runs a
select * from table
to snowflake using a SnowflakeQuery task. It appears that this task uses
fetchall()
to return this data into memory: https://github.com/PrefectHQ/prefect/blob/5d2732a30563591410cb11fe0f7e7dfe65cc5669/src/prefect/tasks/snowflake/snowflake.py#L186 I expect that this causes performance issues with extremely large queries, so I am wondering what Prefect Cloud’s tolerance for this is. Ideally I think it would be preferable to lazy load query results and/or allow for query pagination? Perhaps there’s an architecture limitation I’m overlooking? I’d appreciate any insight 🙂

Kevin Kho

03/11/2022, 4:12 AM
What do you think you’d do without Prefect? I would break up my query if I was trying to write out multiple files. Like for example, I would just run Jan 2022 and then Feb 2022 and so on. You can even use mapping like this if Snowflake accepts multiple connections. I suspect this is easier to tackle from the query side rather than the collection side. I dunno though. Wondering what you think?
Anyway, you are limited by what you can hold in memory. Prefect has no inherent limitation here. It’s more a limitation of your system
👍 1
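A minimal sketch of the month-by-month split described above, which would let each chunk be mapped as its own task so no single task holds the full result set. The table and column names (`my_table`, `created_at`) and the helper itself are illustrative, not from any Prefect API:

```python
from datetime import date

def monthly_queries(table, date_col, start, end):
    """Split a full-table scan into one query per month.

    Each query can then be mapped over in a flow, so no single
    task has to hold the whole result set in memory.
    """
    queries = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {date_col} >= '{date(year, month, 1)}' "
            f"AND {date_col} < '{date(next_year, next_month, 1)}'"
        )
        year, month = next_year, next_month
    return queries

# e.g. "just run Jan 2022 and then Feb 2022":
queries = monthly_queries("my_table", "created_at", date(2022, 1, 1), date(2022, 2, 1))
```

Each string in `queries` could then be passed to a separate SnowflakeQuery task run.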

kevin

03/11/2022, 4:15 AM
My line of thinking was that a lot of Python cursor libraries have built-in handling for pagination, so it would be cool to provide some sort of page size parameter at the task level. I suppose the downside to this path is that it puts the load on the database to keep iterating. Implementing pagination at the query side does seem to be the most lightweight solution
I guess finding a method to paginate a table would be a pain point if there’s no time series stored on it
But I’m sure there’s always some sort of value you can key off of for pagination
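"Keying off of a value" is essentially keyset pagination. A rough sketch, using an in-memory SQLite table as a stand-in for a Snowflake cursor (the `keyset_pages` helper and the numeric-key assumption are mine, not part of any connector API):

```python
import sqlite3

def keyset_pages(cursor, table, key_col, page_size):
    """Yield result pages by seeking past the last key seen (keyset
    pagination). Unlike OFFSET, the database never re-scans and skips
    earlier rows, so later pages don't get progressively slower."""
    last_key = None
    while True:
        where = f"WHERE {key_col} > {last_key} " if last_key is not None else ""
        cursor.execute(
            f"SELECT * FROM {table} {where}ORDER BY {key_col} LIMIT {page_size}"
        )
        rows = cursor.fetchall()
        if not rows:
            return
        yield rows
        # assumes key_col is numeric and is the first selected column
        last_key = rows[-1][0]

# Demo: 25 rows fetched in pages of 10
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (id INTEGER)")
cur.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(25)])
pages = list(keyset_pages(cur, "events", "id", page_size=10))
```

Any monotonically increasing column (an id, a load timestamp) works as the key, even without a time series on the table.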

Kevin Kho

03/11/2022, 4:17 AM
I see LIMIT and OFFSET at the query level when I search, not at the cursor level. I’m not super used to Snowflake, of course
Oh this is nice:
# Execute a statement that will generate a result set.
sql = "select * from t"
cur.execute(sql)
 
# Fetch the result set from the cursor and deliver it as the Pandas DataFrame.
for df in cur.fetch_pandas_batches():
    my_dataframe_processing_function(df)

kevin

03/11/2022, 4:21 AM
woah the cursor converts it to a dataframe right away
is that right?

Kevin Kho

03/11/2022, 4:21 AM
Yes and there is also
fetch_pandas_all()

kevin

03/11/2022, 4:23 AM
it looks like snowflake supports querying using LIMIT and OFFSET https://dwgeek.com/snowflake-limit-and-offset-uses-examples.html/ which I think I’ll go and implement
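A sketch of what that LIMIT/OFFSET implementation could look like: generate one query per page and run each through its own task. The base query, page size, and row count are placeholders; note OFFSET generally needs a stable ORDER BY to give consistent pages:

```python
def paginated_queries(base_query, page_size, total_rows):
    """Generate LIMIT/OFFSET variants of a query, one per page.

    base_query should include an ORDER BY so page boundaries
    are deterministic across executions.
    """
    return [
        f"{base_query} LIMIT {page_size} OFFSET {offset}"
        for offset in range(0, total_rows, page_size)
    ]

qs = paginated_queries("SELECT * FROM my_table ORDER BY id", 10000, 25000)
```

The total row count would come from a cheap `SELECT COUNT(*)` first; each resulting query string can then be mapped over in the flow.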
I gotta be honest I was thinking too much with my python brain and not enough with my sql brain
thanks for the feedback 😄

Kevin Kho

03/11/2022, 4:24 AM
Of course!