Greg Adams

02/28/2022, 4:07 PM
@Anna Geller @Kevin Kho (or anyone else) - Just curious, has there been anyone making a Pandas task library? I seem to be making my own as I go along, but just making sure I'm not doing it the hard way

Anna Geller

02/28/2022, 4:09 PM
I haven't heard of anyone doing that, mostly because it's so easy to call any Pandas operation in a function decorated with "task". We also have PandasSerializer for storing the results of Pandas operations. What are you looking for in such a task? We would definitely be open to that, especially for Orion.

Kevin Kho

02/28/2022, 4:15 PM
There is the serializer for results but I think Pandas has too much functionality to encompass in a few tasks?
:upvote: 1

Greg Adams

02/28/2022, 4:15 PM
A lot of what we're doing is basic ETL-to-data-warehouse stuff, and I'm wrapping a lot of the Pandas operations with the decorator just to keep the tasks in the flow "small" and "discrete" (and hopefully to make debugging simpler). I like the idea of standardizing the use of Pandas for ETL since it translates easily to our longer ML pipelines.
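
To make the "small and discrete" idea concrete, here is a minimal sketch using plain pandas only. The function names, column names, and sample data are hypothetical and not from this thread. In a Prefect flow, each function would additionally carry the task decorator mentioned above; that part is left out so the sketch runs without Prefect installed.

```python
import pandas as pd


def extract() -> pd.DataFrame:
    """Stand-in for reading from a source system; the data is made up."""
    return pd.DataFrame(
        {
            "order_id": [1, 2, 2, 3],
            "amount": ["10.5", "20.0", "20.0", None],
        }
    )


def drop_duplicate_orders(df: pd.DataFrame) -> pd.DataFrame:
    """One small step: keep the first row for each order_id."""
    return df.drop_duplicates(subset="order_id").reset_index(drop=True)


def cast_amount(df: pd.DataFrame) -> pd.DataFrame:
    """One small step: parse amount as a number and drop rows without one."""
    out = df.copy()  # avoid mutating the caller's frame
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out.dropna(subset=["amount"]).reset_index(drop=True)


def run_pipeline() -> pd.DataFrame:
    # In a Prefect flow, each function above would be a decorated task and
    # this chaining would happen inside the flow definition instead.
    return cast_amount(drop_duplicate_orders(extract()))


result = run_pipeline()
```

Because each step takes a DataFrame and returns a new one, a failing step can be rerun on its input in isolation, which is the debugging benefit described above.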

Kevin Kho

02/28/2022, 4:16 PM
Have you seen hamilton?

Greg Adams

02/28/2022, 4:21 PM
Hmmm, looking at hamilton now. Trying to understand it in this context. Definitely looks useful for some of the pipelines we have.

Kevin Kho

02/28/2022, 4:22 PM
I think it builds the DAG for you from your Pandas functions, which makes them more reusable (by enforcing naming conventions).
I dunno though, I just saw it and haven’t tried it myself, but it might be worth checking out.
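
As a rough illustration of "building the DAG from naming conventions", here is a toy resolver written from scratch. It is not hamilton's API. The assumption is only that each function's name is the output it produces and each parameter name is an input it depends on. All names and data are hypothetical, and plain lists stand in for Pandas columns to keep the sketch dependency-free.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def resolve(
    name: str,
    functions: Dict[str, Callable[..., Any]],
    inputs: Dict[str, Any],
    cache: Optional[Dict[str, Any]] = None,
) -> Any:
    """Compute `name` by first computing every parameter its function asks for."""
    cache = {} if cache is None else cache
    if name in inputs:  # raw input supplied by the caller
        return inputs[name]
    if name not in cache:  # compute each node at most once
        fn = functions[name]
        kwargs = {
            param: resolve(param, functions, inputs, cache)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]


# Each function name is an output; each parameter name is a dependency.
def total_spend(spend):
    return sum(spend)


def spend_share(spend, total_spend):
    return [s / total_spend for s in spend]


def spend_per_signup(spend, signups):
    return [s / n for s, n in zip(spend, signups)]


functions = {f.__name__: f for f in (total_spend, spend_share, spend_per_signup)}
inputs = {"spend": [10.0, 30.0], "signups": [2, 3]}

share = resolve("spend_share", functions, inputs)
per_signup = resolve("spend_per_signup", functions, inputs)
```

Here `spend_share` depends on `total_spend` purely because it has a parameter with that name, so the execution order falls out of the function signatures.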

Anna Geller

02/28/2022, 4:26 PM
Also, to throw in a recommendation from my side: if you are loading a lot of data to and from an AWS data lake or Redshift, check out awswrangler, which is basically Pandas on AWS. It's extremely useful for writing Pandas-based ETL workflows and makes it kind of fun (rather than daunting) to work with the AWS Glue data catalog.

Greg Adams

02/28/2022, 4:49 PM
Dang, that does look nice but we're using Google BigQuery.

Matthias

02/28/2022, 4:53 PM
I don't know the exact use case, but why not use dbt to transform the data in BigQuery?
:upvote: 1

Anna Geller

02/28/2022, 4:58 PM
Agree with Matthias. For BigQuery, I was also just using the native GCS and BigQuery clients and then transforming the data with dbt. Sample repo: https://github.com/anna-geller/prefect-monte-carlo/tree/workshop
👍 1