Giovanni Giacco
10/18/2024, 2:48 AM
Marvin
10/18/2024, 2:48 AM
`AmazonECS_FullAccess`, `AmazonEC2FullAccess`, `IAMFullAccess`, and `AmazonEC2ContainerRegistryFullAccess`, but be aware that these can be more permissive than needed. Docker is required to build and push images to your registry. More details on setting up infrastructure can be found here.
2. Task Queue Management: Implement robust task queue management to handle various failure scenarios effectively, such as using exponential backoff for retrying transient errors, and ensuring support for graceful failure handling and notifications. More insights on deploying task queues are discussed in a Prefect blog.
3. Monitoring and Optimization: Be attentive to resource usage, especially when dealing with large volumes of subflows. For instance, issues have been reported with the `CancellationCleanup` service consuming excessive CPU when managing numerous subflows in a deployment. Implement resource limits and consider tuning how the Prefect server interacts with its database. You can find more about such infrastructure-related issues here.
4. Retry Logic: Be cautious with recent changes in Prefect's retry logic. The `dynamic_key` changes have affected how retries are managed, which can lead to completed tasks being re-executed. Adjust your result storage logic to better handle caching and retries. More specifics on the retry logic behavior changes and a temporary solution were discussed in a GitHub issue.
5. Flow and Agent Management: If using Prefect with Helm on EKS, ensure your agents are managed properly to prevent them from hanging or getting stuck. This involves ensuring sufficient capacity and implementing robust error handling within agents, especially in high-load scenarios.
By following these guidelines, you can optimize your Prefect deployment on AWS EKS to be resilient, scalable, and efficient. If you encounter any issues, referring to the Prefect GitHub issues and latest documentation can be beneficial.
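For the exponential backoff mentioned in item 2, here is a minimal, library-agnostic sketch in plain Python. It is not Prefect's own implementation; the names `backoff_delay` and `retry_with_backoff`, the default delays, and the choice of which exceptions count as transient are all assumptions made for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, factor=2.0, max_delay=60.0):
    """Un-jittered delay in seconds before retry number `attempt` (0-based)."""
    return min(base * factor ** attempt, max_delay)


def retry_with_backoff(fn, retryable=(ConnectionError, TimeoutError),
                       attempts=5, base=1.0, factor=2.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fn() up to `attempts` times, sleeping longer after each failure.

    Only exceptions listed in `retryable` (assumed to be transient) trigger
    a retry; anything else propagates immediately.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            delay = backoff_delay(attempt, base, factor, max_delay)
            # Add up to 10% random jitter so many workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Within Prefect itself, tasks accept `retries` and `retry_delay_seconds` options, which are usually preferable to a hand-rolled loop; check the documentation for your Prefect version before relying on specific behavior.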