Justin Trautmann

02/16/2023, 9:46 AM
hello community, hello Prefect team, i am using the Ray task runner with a cluster that consists mainly of AWS spot instances. i would love to just restart tasks when the underlying spot instance is unexpectedly terminated. this functionality is already supported by Ray but Prefect's
orchestration policy prevents a repeated task execution as the task remains in status
when the instance is terminated and a new replacement instance will try to start the task from scratch, including proposing the state
again which leads to:
Copy code
Task run <task_run_id> received abort during orchestration: This run cannot transition to the RUNNING state from the RUNNING state. Task run is in RUNNING state
I feel like this policy has a legitimate reason of preventing multiple agents from attempting to orchestrate the same run but also catches the case where the same agent tries to execute the same task multiple times which should be allowed imo. Anyone having experience with Prefect + Ray on AWS spot and ideas on how to handle this case? PS: @Tim Galvin not sure about Dask and SLURM but maybe your issue is somewhat related

Yaron Levi

02/16/2023, 11:55 AM
@Justin Trautmann Hi, we are also seeing this issue but in a different configuration (running on Render, with continuous loop). You can see all the details here, in the comment I’ve made:
Also, if you search for “This run cannot transition to the RUNNING state from the RUNNING state” in prefect’s GitHub repo, you’ll find around 4-5 issues with that problem.