To have Prefect 2 persist pandas DataFrames as CSV or Parquet files instead of saving them to a PersistedResultBlob JSON, consider the following approaches and utilities:
1. GCP Cloud Storage Integration: Prefect has built-in tasks for interacting with GCP Cloud Storage that support uploading DataFrames in multiple formats, such as CSV, CSV_GZIP, and PARQUET.
   - Documentation: See Prefect's integration guide for GCP Cloud Storage: Prefect GCP Cloud Storage
2. Custom Result Handlers: There are PRs and discussions about integrating custom result handlers that support pandas DataFrame serialization directly within Prefect.
   - GitHub PR: A relevant PR for a Pandas DataFrame Result Handler: Pandas DataFrame Result
3. Serialization Options: There are discussions and proposed enhancements that would let users control the serialization and deserialization methods, so you could specify how data is serialized before it is persisted.
   - GitHub Issue: An issue discussing exposing serialization/deserialization options in Prefect: Expose serialization/deserialization
   - Consolidate data persistence documentation: Document
Using these resources, you can build a custom storage solution that persists pandas DataFrames as CSV or Parquet files.
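To make the serialization idea concrete, here is a minimal sketch of a dumps/loads pair that turns a DataFrame into CSV bytes and back, using only pandas. The function names `dataframe_to_csv_bytes` and `csv_bytes_to_dataframe` are hypothetical and not part of Prefect. In Prefect 2, logic like this would typically sit inside a custom result serializer exposing `dumps` and `loads`, but that wiring depends on your Prefect version and is not shown here.

```python
import io

import pandas as pd


def dataframe_to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Serialize a DataFrame to UTF-8 encoded CSV bytes (the index is dropped)."""
    return df.to_csv(index=False).encode("utf-8")


def csv_bytes_to_dataframe(blob: bytes) -> pd.DataFrame:
    """Rebuild a DataFrame from bytes produced by dataframe_to_csv_bytes."""
    return pd.read_csv(io.BytesIO(blob))


if __name__ == "__main__":
    # Hypothetical sample data, used only to show the round trip.
    frame = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 1.25, 2.0]})
    blob = dataframe_to_csv_bytes(frame)
    restored = csv_bytes_to_dataframe(blob)
    print(restored.equals(frame))
```

For Parquet, the same shape should work by writing to an `io.BytesIO` buffer with `df.to_parquet(buffer)` and reading back with `pd.read_parquet(buffer)`, assuming a Parquet engine such as pyarrow is installed. Note that CSV does not carry dtype information, so columns such as datetimes may need explicit parsing on load.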
If you have more specific scenarios or if you need code examples, feel free to ask!