Kamilly Victória Ruseler
08/22/2024, 12:00 PM

Marvin
08/22/2024, 12:00 PM
This `ValueError` appears to be related to handling indices within Dask. It seems to be a known issue, and there are a few related discussions:
1. Error in addition of dask dataframe and array when reading from parquet
2. Forbid columns with duplicate names
3. Cannot process datasets created by older versions of Dask
These references suggest various underlying issues, and potential fixes are discussed.
For Prefect-specific integrations, you might find this Dask usage guide helpful.
If the issue persists, consider searching for similar open issues on Dask's GitHub issue tracker.
If you need further assistance or a specific workaround, let me know!

Kamilly Victória Ruseler
08/22/2024, 12:04 PM
I'm not sure why this is happening. The only change is that I upgraded Dask and I'm using `dask-expr` now.

Saving dataframe @ s3://kdt-staging-engorders-dump/dataset/prefect_2/2024-08-22-02:03:55

11:03:55 PM SaveDataSet-0 prefect.task_runs Encountered exception during execution:
Traceback (most recent call last):
  File "/opt/prefect/flows/forge/domain/features/tasks/save_dataset.py", line 29, in save_dataset
    save_dataframe(
  File "/opt/prefect/flows/forge/domain/features/tasks/save_dataset.py", line 58, in save_dataframe
    df.to_parquet(
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_collection.py", line 3296, in to_parquet
    return to_parquet(self, path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/io/parquet.py", line 594, in to_parquet
    df.to_legacy_dataframe(),
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/d
Kamilly Victória Ruseler
08/22/2024, 12:05 PM
@Marvin I'm not sure why this is happening. The only change is that I upgraded Dask and I'm using `dask-expr` now.
Marvin
08/22/2024, 12:06 PM
The error appears to be related to `dask-expr`. Several related issues have been reported:
1. Backward Compatibility: There are issues processing datasets created by older versions of Dask when upgrading to the latest versions. This is detailed in the issue Can not process datasets created by the older version of Dask. This might be related to your Parquet read/write problems.
2. ProcessPoolExecutor: Using `ProcessPoolExecutor` with Dask can cause problems with `to_parquet` operations due to pickling issues. Refer to the issue Dask DataFrame fails to export to parquet when using ProcessPoolExecutor.
3. Repartitioning Time Series: There's an issue regarding repartitioning a time series by frequency when loaded from a Parquet file. This might indicate further complications with Dask's handling of datetime indices. Details are available in Issue repartitioning a time series by frequency when loaded from parquet file.
### Recommendations:
1. Check Dataset Compatibility: Ensure your datasets are compatible with the new version of Dask you upgraded to.
2. Configuration: Adjust configurations around Dask's usage of `dask-expr` and `ProcessPoolExecutor`.
3. Debug with Smaller Chunks: Try processing smaller portions of your dataset, as observed issues might not manifest with smaller datasets.
### Links for Further Reading:
- Can not process datasets created by the older version of Dask
- Dask DataFrame fails to export to parquet when using ProcessPoolExecutor
- Issue repartitioning a time series by frequency when loaded from parquet file
If the problem persists, consider creating a minimal reproducible example and reporting it to the Dask GitHub issues page.