# ask-marvin
**User:**
Hi @Marvin, we have around 50 deployments running and they all run smoothly when scheduled. However, when we try to trigger a flow manually, using either the API or the UI, we often get a 503 Service Unavailable once or twice before it actually goes through. What could this issue be? We're on Prefect 3.1.2, which we host ourselves.
**Marvin:**
The "503 Service Unavailable" errors you're seeing when manually triggering flows in your self-hosted Prefect 3.1.2 environment could come from several places:

1. **Server load and resource constraints**: A 503 typically means the server is temporarily unable to handle the request, often because it is overloaded or undergoing maintenance. Make sure the server has enough CPU and memory to absorb manual triggers on top of the scheduled deployments.
2. **Network or connectivity issues**: Check for network or connectivity problems between your client and the server; these can cause intermittent failures when triggering flows manually.
3. **Configuration issues**: Review your self-hosted setup to confirm everything is configured correctly. The self-hosting documentation is a good reference.
4. **Concurrency limits**: Check whether any concurrency limits set on the server might be restricting the number of simultaneous requests.

While there are no specific GitHub issues directly related to 503 errors, there are known changes in Prefect 3.x, such as agents being replaced by workers and infrastructure blocks being replaced by work pools, which might indirectly affect your setup; make sure it aligns with these changes. For more detailed troubleshooting, check your server logs for the specific error messages at the moment the 503 occurs; a small reproduction script like the sketch below can help correlate the client-side failures with those log entries. If the problem persists, consider reaching out to the Prefect community or support for more targeted assistance.
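For example, here is a minimal reproduction sketch, assuming Prefect 3.x and a deployment named `my-flow/my-deployment` (a placeholder for one of yours), that retries the manual trigger a few times and prints each failure so you can line it up with the server logs:

```python
import time

from prefect.deployments import run_deployment
from prefect.exceptions import PrefectHTTPStatusError

# Try the manual trigger up to three times, printing any HTTP error we get back.
for attempt in range(1, 4):
    try:
        # timeout=0 returns as soon as the flow run is created, without waiting for it to finish.
        flow_run = run_deployment(name="my-flow/my-deployment", timeout=0)
        print(f"created flow run {flow_run.id} on attempt {attempt}")
        break
    except PrefectHTTPStatusError as exc:
        # A 503 raised here points at the API server/database, not at your flow code.
        print(f"attempt {attempt} failed: {exc}")
        time.sleep(2)
```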
**User:**
@Marvin Looking at the logs, it seems we get an sqlite3 error "Database is locked" on the heartbeat of the different work queues; could that be related?
**Marvin:**
Yes, the "Database is locked" error from SQLite can indeed be related to the 503 errors you're experiencing. This problem usually arises when multiple processes attempt to write to the SQLite database simultaneously, which can cause a lock since SQLite does not support high concurrency well. Here are a few steps you can take to address this issue: 1. Switch to PostgreSQL: For high orchestration volume and better concurrency support, it's recommended to use PostgreSQL instead of SQLite. SQLite is suitable for lightweight, single-server deployments but not for handling multiple concurrent write operations. You can find more information about setting up PostgreSQL in the self-hosting documentation. 2. Review Worker Configuration: Ensure that your workers are configured correctly and check the heartbeat settings. You can adjust the
PREFECT_WORKER_HEARTBEAT_SECONDS
setting if needed. More details on worker configuration can be found here. 3. Reduce Concurrency: If switching databases is not immediately feasible, you might try reducing the number of concurrent operations to see if it alleviates the locking issue. By addressing these database constraints, you should see an improvement in your ability to manually trigger flows without encountering the 503 errors.
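As a minimal sketch of the settings involved (the Postgres host, user, and database name below are placeholders, and in practice you would export the variable in whatever environment launches `prefect server start` and your workers):

```python
# Print the database connection URL the API server would use, so you can confirm
# whether it is still pointing at SQLite. Run this in the same environment
# (same profile / env vars) as the server process.
from prefect.settings import PREFECT_API_DATABASE_CONNECTION_URL

print(PREFECT_API_DATABASE_CONNECTION_URL.value() or "unset (default SQLite database)")

# To move to PostgreSQL, point the same setting at a Postgres DSN before starting
# the server, e.g. as an environment variable (placeholder credentials/host):
#   PREFECT_API_DATABASE_CONNECTION_URL="postgresql+asyncpg://prefect:***@db-host:5432/prefect"
```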
**User:**
@Marvin The issue happens even when no other flows are running, so unless the 10 work pools we have running are too much, I doubt it's a concurrency issue.
**Marvin:**
Given that the issue occurs even when no other flows are running, the problem is more likely related to the configuration or state of the work pools themselves than to the volume of concurrent flow runs. A few considerations:

1. **Work pool configuration**: Ensure that each work pool is configured correctly and not inadvertently causing database locks. Review the configuration settings for each work pool and check for anomalies; the sketch below lists each pool's queues so you can see how many heartbeat writers you actually have.
2. **Database health**: Even if concurrency isn't the issue, the SQLite database might still be hitting internal locks due to its inherent limitations. Check the database health and consider clearing out old or unnecessary data.
3. **Heartbeat frequency**: The heartbeats from the work pools might still be causing locks if they fire too frequently. Experiment with a longer heartbeat interval and see whether it alleviates the issue.
4. **Logs and monitoring**: Continue monitoring the logs for additional errors or warnings, and look for patterns or specific times when the issue occurs to identify underlying causes.

If these steps do not resolve the issue, transitioning to a more robust database like PostgreSQL is still a beneficial long-term solution, even if concurrency isn't the immediate problem.
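As a quick way to see how many work queues are heartbeating against the database, here is a minimal sketch assuming the Prefect 3.x client APIs (`read_work_pools` / `read_work_queues`):

```python
import asyncio

from prefect.client.orchestration import get_client


async def main():
    # List every work pool known to the API and count its queues; each active
    # queue is another periodic writer against the orchestration database.
    async with get_client() as client:
        pools = await client.read_work_pools()
        for pool in pools:
            queues = await client.read_work_queues(work_pool_name=pool.name)
            print(f"{pool.name}: {len(queues)} queue(s)")


asyncio.run(main())
```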