A flow run that crashes with a SIGKILL signal (status code -9) is a common problem. It typically means the process was killed for using too much memory, even if it seems like there is enough memory available. Here are some insights and potential solutions:
1. Memory Management: Ensure that your flow and tasks are optimized for memory usage. Consider breaking down your tasks into smaller units or using more efficient data structures to handle large datasets.
2. Resource Limits: Check the resource limits set on your Prefect Cloud cluster. The constraints might be tighter than expected, in which case the operating system terminates the process once it exceeds them.
3. Graceful Handling: Prefect 3.x uses workers and work pools in place of agents and infrastructure blocks. Make sure your deployment and execution environment are correctly configured to handle resource scaling and management. Prefect Workers and Work Pools might need tuning for your specific workload.
4. Crash Handling: Crash handling is an area of active development in Prefect. One example is ensuring that flows go through crash detection even when they receive a SIGTERM signal. You might want to ensure your setup takes advantage of these improvements.
5. Check Logs: Always review the logs for any additional context or errors that might provide more insight into what is causing the high memory usage.
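One way to act on the points above is to compare the memory limit imposed on the process with its actual peak usage. The following is a minimal sketch, assuming a Linux host or container with cgroups mounted at the standard paths. It uses only the Python standard library, not any Prefect API, and the function names are hypothetical.

```python
import resource  # Unix only
import sys
from pathlib import Path
from typing import Optional

# Standard cgroup locations on Linux (assumption: your container runtime
# may mount them elsewhere).
_CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def parse_cgroup_limit(raw: str) -> Optional[int]:
    """Parse a cgroup memory limit; "max" (cgroup v2) means no limit."""
    raw = raw.strip()
    if not raw.isdigit():
        return None
    return int(raw)


def container_memory_limit_bytes() -> Optional[int]:
    """Return the memory limit in bytes, or None if unlimited or unknown."""
    for path in _CGROUP_LIMIT_FILES:
        try:
            return parse_cgroup_limit(path.read_text())
        except OSError:
            continue  # file missing or unreadable, try the next location
    return None


def peak_rss_bytes() -> int:
    """Return the peak resident set size of this process so far."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


if __name__ == "__main__":
    print("memory limit (bytes):", container_memory_limit_bytes())
    print("peak RSS (bytes):", peak_rss_bytes())
```

Logging `peak_rss_bytes()` at the end of each task and comparing it with `container_memory_limit_bytes()` shows how much headroom a run has before the process is at risk of being killed.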
For more detailed guidance, you might find these GitHub issues and discussions helpful:
- Hook on_crashed not being called on WorkerProcess
- RayTaskRunner with Prefect 3: Retrieval of Task Results Failing
- Handle SIGTERM received by prefect-agent gracefully
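The crash-handling discussions above turn on the difference between SIGTERM, which a process can catch and respond to, and SIGKILL, which it cannot. A minimal sketch using only the Python standard library on a POSIX system shows why a SIGKILL surfaces as status code -9. It is not Prefect's own crash-detection code, and the `CHILD` script and `run_and_signal` helper are hypothetical names for illustration.

```python
import signal
import subprocess
import sys

# Child script: installs a SIGTERM handler that exits cleanly, reports that
# it is ready, then sleeps. No handler can be installed for SIGKILL.
CHILD = """
import signal, sys, time
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
print("ready", flush=True)
time.sleep(60)
"""


def run_and_signal(sig: int) -> int:
    """Start the child, send it `sig`, and return its exit code."""
    with subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE
    ) as proc:
        proc.stdout.readline()  # block until the handler is installed
        proc.send_signal(sig)
        return proc.wait(timeout=30)


if __name__ == "__main__":
    # The SIGTERM handler runs, so the child exits with code 0.
    print("SIGTERM ->", run_and_signal(signal.SIGTERM))
    # SIGKILL cannot be handled; subprocess reports the negated signal number.
    print("SIGKILL ->", run_and_signal(signal.SIGKILL))
```

Because the killed process never gets a chance to run its own cleanup code, any crash reporting after a SIGKILL has to come from something outside that process.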
If you need further assistance, consider reaching out to the Prefect community or support for tailored advice.