Miremad Aghili

11/14/2022, 4:15 PM
Hey guys, here is a question: we currently have agents set up on multiple computers. Some of the tasks they run are big, and sometimes the computer crashes, mostly due to RAM issues. When this happens we expect the status of the flow to change to Failed or something like that, but in reality it stays in the Running state for days and never reports that it has stopped working. Is there a way to solve this issue? We are running the agents on Windows computers and we use DockerRun for our Prefect agents. (This is on Prefect 1.)

Mason Menges

11/14/2022, 7:09 PM
Hey @Miremad Aghili, are you just using the Local Executor for your flow runs, or a different one? For context, this can sometimes happen if the underlying infrastructure running the flow disappears before the state of the flow is updated. Ideally the Zombie Killer process should pick up these flow runs and cancel/fail them, though there are some circumstances where that doesn't happen. One way to address this is to use Automations (see https://docs-v1.prefect.io/orchestration/concepts/automations.html#overview for an overview). Forgoing the automation route, if you wanted to do this yourself, you could also have a separate flow that picks up running flows and cancels/fails them if their start time is older than some predetermined threshold (really the same thing as the automation, but with more fine-grained control), using these queries
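
For illustration only, here is a minimal sketch of the "separate cleanup flow" idea. It is not the set of queries referred to above. It assumes the Prefect 1 `Client.graphql` and `Client.set_flow_run_state` calls, a Hasura-style `flow_run` table exposing `state` and `start_time`, and a 12-hour threshold chosen arbitrarily. The helper names (`stale_cutoff`, `fail_stale_runs`, `run_with_prefect`) are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed query shape: flow runs still marked Running that started before a cutoff.
STALE_RUNS_QUERY = """
query($cutoff: timestamptz) {
  flow_run(where: {state: {_eq: "Running"}, start_time: {_lt: $cutoff}}) {
    id
    name
    start_time
  }
}
"""


def stale_cutoff(max_age, now=None):
    """Return an ISO-8601 timestamp; runs started before it count as stale."""
    now = now or datetime.now(timezone.utc)
    return (now - max_age).isoformat()


def fail_stale_runs(client, make_failed_state, max_age, now=None):
    """Mark every flow run older than max_age as failed; return the affected ids."""
    cutoff = stale_cutoff(max_age, now)
    result = client.graphql(STALE_RUNS_QUERY, variables={"cutoff": cutoff})
    failed_ids = []
    for run in result["data"]["flow_run"]:
        message = f"Running since {run['start_time']}; exceeded maximum age, assumed dead."
        client.set_flow_run_state(
            flow_run_id=run["id"],
            state=make_failed_state(message),
        )
        failed_ids.append(run["id"])
    return failed_ids


def run_with_prefect(max_age=timedelta(hours=12)):
    """Wire the helper to a real Prefect 1 client (threshold is an assumption)."""
    # Imported lazily so the helpers above can be exercised without Prefect installed.
    from prefect import Client
    from prefect.engine.state import Failed

    return fail_stale_runs(Client(), Failed, max_age)
```

Calling `run_with_prefect()` from a task in a small scheduled flow would give roughly the behaviour described above. The threshold should comfortably exceed the longest legitimate run, otherwise healthy flow runs would be failed as well.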

Miremad Aghili

11/14/2022, 7:29 PM
Hi @Mason Menges, yes, the agents I have are local agents connected to Docker Desktop. What I think I am missing here is that Prefect, for some reason, does not pick up the container crash or the agent going down.
These automations are nice, but for CI/CD this can't be the best solution.