# prefect-contributors-archived
i
Guys, there's something I want to discuss. When I originally worked on `upload_from_dataframe`, we went with `.parquet.snappy` and `.parquet.gz` for the compressed parquet files. However, comma, ergo, vis-a-vis, it came to my attention that tools like Tad and others that visualize tabular data EXPECT the file name to end with `.parquet` instead (e.g. it works if I rename `file.parquet.snappy` to `file.snappy.parquet`, or `file.parquet.gz` to `file.gz.parquet`). I also noticed that Spark and Flink actually save compressed parquets as `.snappy.parquet` or `.gz.parquet`.
Long story short, we'd be more consistent with the industry-standard distributed frameworks (Spark and Flink), and also give a better developer experience to people browsing data with Tad or similar tools. So I want to fix that with a PR. It's basically 2-3 lines in `DataFrameSerializationFormat`, plus updating the pytests to expect `.gz.parquet` or `.snappy.parquet` instead.
My question is: are you guys cool with that? cc: @alex @Nate @Zanie
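For anyone skimming the thread, here's a minimal sketch of the naming change being proposed. It is not the actual prefect-gcp code: `ParquetCompression` and `with_parquet_suffix` are hypothetical names made up for illustration, and the real `DataFrameSerializationFormat` may be laid out differently. It only shows the idea of putting the compression codec before `.parquet` so the name always ends in `.parquet`.

```python
from enum import Enum
from pathlib import PurePosixPath


class ParquetCompression(Enum):
    """Hypothetical stand-in for the parquet members of a serialization-format enum."""

    # (compression codec, file suffix) -- the suffix always ends with ".parquet"
    NONE = (None, ".parquet")
    SNAPPY = ("snappy", ".snappy.parquet")
    GZIP = ("gzip", ".gz.parquet")

    @property
    def compression(self):
        return self.value[0]

    @property
    def suffix(self):
        return self.value[1]


def with_parquet_suffix(path: str, fmt: ParquetCompression) -> str:
    """Strip any parquet-related suffixes from `path` and append the one for `fmt`."""
    known = {".parquet", ".snappy", ".gz"}
    p = PurePosixPath(path)
    # Peel off trailing suffixes such as ".parquet.snappy" one at a time.
    while p.suffix in known:
        p = p.with_suffix("")
    return str(p) + fmt.suffix


if __name__ == "__main__":
    print(with_parquet_suffix("data/file.parquet.snappy", ParquetCompression.SNAPPY))
    print(with_parquet_suffix("data/file.parquet.gz", ParquetCompression.GZIP))
```

With this layout, the old-style names from the message above map onto the Spark/Flink-style ones, and tools that key off the trailing `.parquet` should be able to open the files without a manual rename.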
a
That sounds like a good change to me!
upvote 2
i
Awesome, let me work on it asap. Thank you, Alex
Is there any way I can retrigger the CI pipeline myself? It failed, but apparently for something not related at all to the change I made. (Also, I made sure to run pytest locally before pushing, and it passed.) https://github.com/PrefectHQ/prefect-gcp/pull/166
You guys are just too fast on code review. I love it! I'm gonna start picking some stuff from the open issues for fun, then, if you don't mind! Hahahah
z
😄 let me know if you want help finding one
👌🏻 1