File "/opt/conda/envs/forge/lib/python3.11/site-pa...
# marvin-in-the-wild
k
File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_categorical.py", line 184, in _simplify_up return Categorize(result.frame, cats, result.operand("index")) ^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_core.py", line 176, in operand return self.operands[type(self)._parameters.index(key)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ValueError: 'index' is not in list @Marvin
m
It appears you're encountering a `ValueError` related to handling indices within Dask. This seems to be a known issue, and there are a few related discussions:

1. Error in addition of dask dataframe and array when reading from parquet
2. Forbid columns with duplicate names
3. Cannot process datasets created by older versions of Dask

These references suggest various underlying issues, and potential fixes are discussed there. For Prefect-specific integrations, you might find this Dask usage guide helpful. If the issue persists, consider checking for open or similar issues on Dask's GitHub page. If you need further assistance or a specific workaround, let me know!
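One commonly suggested mitigation for dask-expr optimizer bugs is to opt out of query planning entirely. Here is a minimal sketch, assuming the failure comes from dask-expr's `simplify()` pass (as the traceback suggests) and that your Dask version still ships the legacy DataFrame engine; whether it sidesteps this particular `Categorize` error is untested here:

```python
# Sketch: fall back to the legacy (non dask-expr) DataFrame engine.
# The config key must be set BEFORE dask.dataframe is first imported.
import dask

dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd  # noqa: E402 -- now backed by the legacy engine

# Later calls such as df.to_parquet(...) will skip the dask-expr
# optimize()/simplify() pass where the 'index' operand lookup fails.
```

The trade-off is losing dask-expr's query optimizations, so treat this as a stopgap while the upstream bug is open rather than a permanent setting.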
k
I'm not sure why this is happening. The only change is that I upgraded Dask and I'm using dask-expr now.

```
Saving dataframe @ s3://kdt-staging-engorders-dump/dataset/prefect_2/2024-08-22-02:03:55
110355 PM SaveDataSet-0 prefect.task_runs Encountered exception during execution:
Traceback (most recent call last):
  File "/opt/prefect/flows/forge/domain/features/tasks/save_dataset.py", line 29, in save_dataset
    save_dataframe(
  File "/opt/prefect/flows/forge/domain/features/tasks/save_dataset.py", line 58, in save_dataframe
    df.to_parquet(
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_collection.py", line 3296, in to_parquet
    return to_parquet(self, path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/io/parquet.py", line 594, in to_parquet
    df.to_legacy_dataframe(),
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_collection.py", line 1396, in to_legacy_dataframe
    df = self.optimize(**optimize_kwargs) if optimize else self
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_collection.py", line 591, in optimize
    return new_collection(self.expr.optimize(fuse=fuse))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_expr.py", line 94, in optimize
    return optimize(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_expr.py", line 3063, in optimize
    return optimize_until(expr, stage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_expr.py", line 3014, in optimize_until
    expr = result.simplify()
           ^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_core.py", line 374, in simplify
    new = expr.simplify_once(dependents=dependents, simplified={})
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_core.py", line 352, in simplify_once
    new = operand.simplify_once(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_core.py", line 352, in simplify_once
    new = operand.simplify_once(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_core.py", line 352, in simplify_once
    new = operand.simplify_once(
          ^^^^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 5 more times]
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_core.py", line 335, in simplify_once
    out = child._simplify_up(expr, dependents)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_categorical.py", line 184, in _simplify_up
    return Categorize(result.frame, cats, result.operand("index"))
                                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_core.py", line 176, in operand
    return self.operands[type(self)._parameters.index(key)]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: 'index' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/prefect/engine.py", line 2146, in orchestrate_task_run
    result = await call.aresult()
             ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 327, in aresult
    return await asyncio.wrap_future(self.future)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 352, in _run_sync
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/prefect/flows/forge/domain/features/tasks/save_dataset.py", line 38, in save_dataset
    raise ValueError(f"Task failed: {str(e)}")
ValueError: Task failed: 'index' is not in list

110355 PM SaveDataSet-0 prefect.task_runs Running hook 'task_hook_failure' in response to entering state 'Failed'
```
@Marvin
m
Your message was too long by 915 tokens - please shorten it and try again. For reference, here's your message at the allowed limit:
```
i'm not sure why this is happening. the only changes is that i upgraded dask and im using dask-expr now
Saving dataframe @ s3://kdt-staging-engorders-dump/dataset/prefect_2/2024-08-22-02:03:55
110355 PM SaveDataSet-0 prefect.task_runs Encountered exception during execution:
Traceback (most recent call last):
  File "/opt/prefect/flows/forge/domain/features/tasks/save_dataset.py", line 29, in save_dataset
    save_dataframe(
  File "/opt/prefect/flows/forge/domain/features/tasks/save_dataset.py", line 58, in save_dataframe
    df.to_parquet(
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_collection.py", line 3296, in to_parquet
    return to_parquet(self, path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/io/parquet.py", line 594, in to_parquet
    df.to_legacy_dataframe(),
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/d
```
k
I'm not sure why this is happening. The only change is that I upgraded Dask and I'm using dask-expr now.

@Marvin
```
110355 PM SaveDataSet-0 prefect.task_runs Encountered exception during execution:
Traceback (most recent call last):
  File "/opt/prefect/flows/forge/domain/features/tasks/save_dataset.py", line 29, in save_dataset
    save_dataframe(
  File "/opt/prefect/flows/forge/domain/features/tasks/save_dataset.py", line 58, in save_dataframe
    df.to_parquet(
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/_collection.py", line 3296, in to_parquet
    return to_parquet(self, path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/dask_expr/io/parquet.py", line 594, in to_parquet
    df.to_legacy_dataframe(),
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/forge/lib/python3.11/site-packages/d
```
m
Your issue seems to stem from changes introduced with the usage of `dask-expr`. Several related issues have been reported:

1. Backward Compatibility: There are issues processing datasets created by older versions of Dask when upgrading to the latest versions. This is detailed in the issue "Can not process datasets created by the older version of Dask" and might be related to your Parquet read/write problems.
2. ProcessPoolExecutor: Using `ProcessPoolExecutor` with Dask can cause problems with `to_parquet` operations due to pickling issues. Refer to the issue "Dask DataFrame fails to export to parquet when using ProcessPoolExecutor".
3. Repartitioning Time Series: There's an issue with repartitioning a time series by frequency when it is loaded from a Parquet file, which might indicate further complications in Dask's handling of datetime indices. Details are available in "Issue repartitioning a time series by frequency when loaded from parquet file".

### Recommendations:

1. Check Dataset Compatibility: Ensure your datasets are compatible with the Dask version you upgraded to.
2. Configuration: Adjust configurations around Dask's usage of `dask-expr` and `ProcessPoolExecutor`.
3. Debug with Smaller Chunks: Try processing smaller portions of your dataset, as the issue might not manifest with smaller datasets.

### Links for Further Reading:

- Can not process datasets created by the older version of Dask
- Dask DataFrame fails to export to parquet when using ProcessPoolExecutor
- Issue repartitioning a time series by frequency when loaded from parquet file

If the problem persists, consider creating a minimal reproducible example and reporting it on the Dask GitHub issues page, as sketched below.
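A minimal-repro sketch for that last suggestion. This is a guess at the triggering pattern, not a confirmed reproducer: it assumes the failure involves a categorical column with unknown categories hitting dask-expr's `Categorize` expression during `to_parquet` (consistent with the traceback), and it writes to a local directory instead of S3.

```python
# Minimal-repro sketch for the dask-expr Categorize/to_parquet failure.
# Assumptions: dask with query planning (dask-expr) enabled, pyarrow installed.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(
    {
        "cat_col": pd.Categorical(["a", "b", "a", "c"]),
        "value": [1, 2, 3, 4],
    }
)
ddf = dd.from_pandas(pdf, npartitions=2)

# from_pandas yields *known* categories; mark them unknown so that
# categorize() actually inserts a Categorize node into the expression graph.
ddf["cat_col"] = ddf["cat_col"].cat.as_unknown()
ddf = ddf.categorize()

# to_parquet runs optimize()/simplify() over the expression graph -- the
# pass where the reported "'index' is not in list" lookup would fail.
ddf.to_parquet("repro_output/", engine="pyarrow")
```

If this reproduces the error on your versions, it is small enough to paste straight into a Dask GitHub issue along with your `dask` and `dask-expr` version numbers.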