@Anna Geller @Kevin Kho (or anyone else) - Just curious, has anyone made a Pandas task library? I seem to be making my own as I go along, but I want to make sure I'm not doing it the hard way.
Anna Geller
02/28/2022, 4:09 PM
I haven't heard of anyone, especially because it's so easy to call any Pandas operation in a function decorated with "task". Also, we have PandasSerializer for storing the results of Pandas operations.
What are you looking for in such a task? We would definitely be open to that, especially for Orion.
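A minimal sketch of that pattern, assuming only pandas is installed: each function does one small thing, so each could be wrapped with Prefect's `task` decorator and composed in a flow. The function names, column names, and data are hypothetical, and the decorator appears only as a comment so the sketch runs without Prefect.

```python
import pandas as pd


# @task  # Prefect's decorator would go here; omitted so this runs with pandas alone
def extract() -> pd.DataFrame:
    # Hypothetical in-memory data standing in for a real extract step.
    return pd.DataFrame(
        {
            "customer": ["a", "b", "a", None],
            "amount": [10.0, 5.0, 2.5, 1.0],
        }
    )


# @task
def drop_missing_customers(df: pd.DataFrame) -> pd.DataFrame:
    # Remove rows that have no customer value.
    return df.dropna(subset=["customer"])


# @task
def total_by_customer(df: pd.DataFrame) -> pd.DataFrame:
    # Sum the amounts per customer, keeping "customer" as a regular column.
    return df.groupby("customer", as_index=False)["amount"].sum()


result = total_by_customer(drop_missing_customers(extract()))
print(result)
```

Keeping each step this small means a failure points at a single transformation rather than at one large function.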
Kevin Kho
02/28/2022, 4:15 PM
There is the serializer for results, but I think Pandas has too much functionality to encompass in a few tasks?
Greg Adams
02/28/2022, 4:15 PM
A lot of what we're doing is basic ETL-to-data warehouse stuff, and I'm wrapping a lot of the Pandas operations with the decorator just to keep the tasks in the flow "small" and "discrete" (and hopefully to make debugging simpler). I like the idea of standardizing the use of Pandas for ETL since it translates easily to our longer ML pipelines.
Hmmm, looking at hamilton now. Trying to understand it in this context. Definitely looks useful for some of the pipelines we have.
Kevin Kho
02/28/2022, 4:22 PM
I think it builds the DAG for you from your Pandas functions, which makes them more reusable (by enforcing naming conventions).
I dunno though, I just saw it and haven’t tried it myself, but it might be worth checking out.
Anna Geller
02/28/2022, 4:26 PM
Also, to throw in a recommendation from my side: if you are doing a lot of loading data to and from an AWS data lake or Redshift, check out awswrangler, which is basically Pandas on AWS. It's extremely useful for writing Pandas-based ETL workflows, and it makes working with the AWS Glue Data Catalog kind of fun (rather than daunting).
Greg Adams
02/28/2022, 4:49 PM
Dang, that does look nice but we're using Google BigQuery.
Matthias
02/28/2022, 4:53 PM
I don't know the exact use case, but why not use dbt to transform data in BigQuery?