Prefect Community #ask-community

Alan

07/28/2024, 7:32 AM

Hello Team, I've got a really weird issue that I need help resolving: when I run a master flow that spawns many subflows, I get a dreaded SIGTERM signal that causes some of the jobs to fail. Here are some example logs:

Crash detected! Execution was cancelled by the runtime environment.
prefect.task_runs
Job 'uptight-swan-txg8h': Job reached backoff limit.
prefect.flow_runs.worker
Job 'uptight-swan-txg8h': No pods found for job.
prefect.flow_runs.worker
Crash detected! Execution was aborted by a termination signal.
prefect.flow_runs
Reported flow run 'b26c5eee-718f-4833-8816-5b29d5c7a2c8' as crashed: Flow run infrastructure exited with non-zero status code -1.

Progress: 20.0%
prefect.task_runs
Job 'swinging-pug-6kxhq': Job reached backoff limit.
prefect.flow_runs.worker
Job 'swinging-pug-6kxhq': No pods found for job.
prefect.flow_runs.worker
Process for flow run 'swinging-pug' exited with status code: -15; This indicates that the process exited due to a SIGTERM signal. Typically, this is caused by manual cancellation.
prefect.flow_runs.runner
Downloading flow code from storage at '.'

Progress: 0.0%
prefect.task_runs
Job 'famous-mongoose-cxc5t': Job reached backoff limit.
prefect.flow_runs.worker
Job 'famous-mongoose-cxc5t': No pods found for job.
prefect.flow_runs.worker
Process for flow run 'famous-mongoose' exited with status code: -15; This indicates that the process exited due to a SIGTERM signal. Typically, this is caused by manual cancellation.
prefect.flow_runs.runner
Downloading flow code from storage at '.'

In the image below, the crashed jobs correspond to the orange progress bars. As you can see, some jobs are green and run to completion, but others crash with the logs shown above. I am deliberately setting the jobBackOffLimit to 0 in order to catch these crashes. This is especially problematic because it happens in the master flow, which kicks off the remaining subflows. Obviously I wouldn't want the master flow to restart, as that would spawn even more subflows. My question is: where is this SIGTERM coming from? From my understanding, a SIGTERM is something I can handle in my code/container, and the container can choose to ignore it. If that is the case, how can I make the Prefect container ignore the SIGTERM and keep going until it completes the job?
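
For reference, here is a minimal sketch of what "handling SIGTERM in my code" could look like in plain Python. It is not a Prefect API: `SIGNALS_SEEN`, `record_signal`, `install_sigterm_handler`, and `run_job` are hypothetical names used only for illustration, and Prefect may install its own signal handlers that interact with this. It also does not answer where the signal comes from. Keep in mind that SIGKILL cannot be caught or ignored, and Kubernetes normally follows a SIGTERM with a SIGKILL once the pod's termination grace period expires, so a handler like this can only delay termination, not prevent it.

```python
import os
import signal
import time

# Hypothetical record of received signals; not part of Prefect.
SIGNALS_SEEN = []


def record_signal(signum, frame):
    """Note the signal and return, instead of letting it end the process."""
    SIGNALS_SEEN.append(signum)


def install_sigterm_handler():
    """Replace the default SIGTERM action (terminate) with record_signal.

    Must be called from the main thread. Returns the previous handler
    so it can be restored later with signal.signal().
    """
    return signal.signal(signal.SIGTERM, record_signal)


def run_job(steps=3):
    """Stand-in for real work; returns the number of completed steps."""
    completed = 0
    for _ in range(steps):
        time.sleep(0.01)
        completed += 1
    return completed


if __name__ == "__main__":
    install_sigterm_handler()
    # Simulate the runtime sending SIGTERM to this process.
    os.kill(os.getpid(), signal.SIGTERM)
    done = run_job()
    print(f"completed {done} steps, signals seen: {SIGNALS_SEEN}")
```

Recording the signal rather than using `signal.SIG_IGN` keeps a trace that a termination was requested, which may help when trying to work out what is sending it.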
