Alan
07/28/2024, 7:32 AM
prefect.task_runs | Crash detected! Execution was cancelled by the runtime environment.
prefect.flow_runs.worker | Job 'uptight-swan-txg8h': Job reached backoff limit.
prefect.flow_runs.worker | Job 'uptight-swan-txg8h': No pods found for job.
prefect.flow_runs | Crash detected! Execution was aborted by a termination signal.
Reported flow run 'b26c5eee-718f-4833-8816-5b29d5c7a2c8' as crashed: Flow run infrastructure exited with non-zero status code -1.
prefect.task_runs | Progress: 20.0%
prefect.flow_runs.worker | Job 'swinging-pug-6kxhq': Job reached backoff limit.
prefect.flow_runs.worker | Job 'swinging-pug-6kxhq': No pods found for job.
prefect.flow_runs.runner | Process for flow run 'swinging-pug' exited with status code: -15; This indicates that the process exited due to a SIGTERM signal. Typically, this is caused by manual cancellation.
Downloading flow code from storage at '.'
prefect.task_runs | Progress: 0.0%
prefect.flow_runs.worker | Job 'famous-mongoose-cxc5t': Job reached backoff limit.
prefect.flow_runs.worker | Job 'famous-mongoose-cxc5t': No pods found for job.
prefect.flow_runs.runner | Process for flow run 'famous-mongoose' exited with status code: -15; This indicates that the process exited due to a SIGTERM signal. Typically, this is caused by manual cancellation.
Downloading flow code from storage at '.'
In the image below, the crashed jobs correspond to the orange progress bars. As you can see, some jobs are green and run to completion, while others crash with the logs shown above. I am deliberately setting jobBackOffLimit to 0 in order to catch these crashes. This is especially problematic because it happens in the master flow, which kicks off the remaining subflows. I don't want the master flow to restart, since that would spawn even more subflows.

My question is: where is this SIGTERM coming from? My understanding is that SIGTERM is something my code or container can handle, and can even choose to ignore. If that is the case, how can I make the Prefect container ignore the SIGTERM and keep running until the job completes?
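A minimal sketch of the SIGTERM behavior described above, in plain Python and assuming a POSIX system (it does not involve Prefect or Kubernetes). A process with no SIGTERM handler is terminated by the signal, and Python's `subprocess` module reports that as return code -15, the same value seen in the runner logs; a process that installs a handler with `signal.signal` keeps running and exits normally. The child script, its self-sent signal (a hypothetical stand-in for the container being stopped), and the `run_child` helper are illustrations only. Two caveats: if the container is still running after the pod's `terminationGracePeriodSeconds` (30 seconds by default), Kubernetes follows up with SIGKILL, which cannot be caught or ignored, so a handler can only buy time; and the "Execution was aborted by a termination signal" log line suggests Prefect's engine reacts to SIGTERM itself, so a handler installed in flow code may interact with Prefect's own handling in ways this sketch does not test.

```python
import subprocess
import sys

# Child program (hypothetical). It optionally installs a SIGTERM handler, then
# sends SIGTERM to itself as a stand-in for the container being stopped.
# Note: SIGKILL, sent by Kubernetes after the grace period, cannot be handled.
CHILD = r"""
import os, signal, sys

if sys.argv[1] == "handle":
    def on_sigterm(signum, frame):
        # Log and return instead of exiting; execution continues afterwards.
        print("got SIGTERM, continuing", flush=True)
    signal.signal(signal.SIGTERM, on_sigterm)

os.kill(os.getpid(), signal.SIGTERM)  # stand-in for the external SIGTERM
print("finished", flush=True)
"""


def run_child(mode: str) -> tuple[int, list[str]]:
    """Run CHILD in a fresh interpreter and return (returncode, stdout lines)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode, result.stdout.splitlines()


if __name__ == "__main__":
    # Default action: killed by signal 15, reported by subprocess as -15.
    print(run_child("default"))  # (-15, [])
    # With a handler installed: the process keeps going and exits with 0.
    print(run_child("handle"))
```

The child is run in a separate interpreter so that the unhandled case can actually die without taking the caller down, which is what makes the -15 return code observable.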