and am finding that if something goes wrong with the workers on the cluster (e.g. the cluster is misconfigured and the workers crash), the Prefect tasks just hang forever in Pending status. I even see an ERROR log message from Dask in the console:
2024-04-10 13:17:00,323 - distributed.scheduler - ERROR - Task parse_raw_data-0-cb42b0a8543b470ca6699484874854ac-1 marked as failed because 4 workers died while trying to run it
What's the recommended way of dealing with this? Ideally the error would propagate to Prefect and the program would terminate.