Walter Cavinaw

01/31/2023, 11:21 PM
I am having a sporadic issue. I can't create an MRE because it seems to occur randomly, so I'm mostly hoping someone can point me at other discussions or GitHub issues where this has come up. I've been searching Discourse, Slack, and GitHub issues for a possible solution. We run a dozen or so flows at the same time. Sometimes one or two are marked as Failed because a number of tasks crashed with no error. Sometimes a flow is marked as Crashed even though I can see in GKE that it's still running. Has anyone encountered similar issues? Setup: Prefect 2, GKE, Kubernetes jobs, Dask task runner; a parent flow starts deployments and waits for them to report a status. Is our agent potentially too small? It has 0.5 vCPU and 2Gi memory.

Timo Vink

01/31/2023, 11:57 PM
Is it possible it's taking more than 60 seconds for the pods to be created and move out of the Pending state? Perhaps waiting for infrastructure to spin up, or large images to download? By default I believe a KubernetesJob will be marked as crashed if it takes 60+ seconds for the Pod to get to any state other than Pending. This can be controlled with the pod_watch_timeout_seconds property on the KubernetesJob.
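
(For reference, a minimal sketch of raising that timeout on a Prefect 2 KubernetesJob infrastructure block; the block name, image, namespace, and timeout value are placeholders.)

```
# Hedged sketch: raise the pod watch timeout on a Prefect 2 KubernetesJob
# infrastructure block. Block name, image, namespace, and value are placeholders.
from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    image="my-registry/my-flow-image:latest",   # placeholder image
    namespace="prefect",
    pod_watch_timeout_seconds=1200,  # wait up to 20 min for the pod to leave Pending
)
k8s_job.save("my-k8s-job", overwrite=True)
```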

Walter Cavinaw

02/01/2023, 1:24 AM
Our pod watch timeout is set to 20 mins. The pod goes to Running and then crashes at some random point during the run. I think the Prefect Cloud service endpoint might be unavailable from time to time, because I've noticed this error pop up: Crash detected! Execution was interrupted by an unexpected exception: httpx.HTTPStatusError: Server error '503 Service Unavailable' for url 'https://api.prefect.cloud/api/accounts/6c7cb1ac-fa86-457c-8973-9a9fbb5bf90d/workspaces/6b67b596-70ff-4f0b-842f-4d2f5f8ff471/task_runs/64bf86e3-b1cc-411e-a1a8-29ea30138a04/set_state' For more information check: https://httpstatuses.com/503 It seems likely that interrupted communication between Prefect Cloud and the agent (or flow?) causes the run to be marked as Crashed. Can this be fixed by a retry? Can it retry on Crashed?
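
(For illustration, a minimal sketch of retrying transient 5xx responses like the 503 above with tenacity; the wrapped call and function names are placeholders, not Prefect APIs.)

```
# Hedged sketch: retry a call that may raise httpx.HTTPStatusError with a 5xx
# status, using tenacity. The wrapped call is a placeholder for whatever
# client call is failing.
import httpx
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

def _is_server_error(exc: BaseException) -> bool:
    return (
        isinstance(exc, httpx.HTTPStatusError)
        and exc.response.status_code >= 500
    )

@retry(
    retry=retry_if_exception(_is_server_error),
    wait=wait_exponential(multiplier=1, max=30),
    stop=stop_after_attempt(5),
    reraise=True,
)
def call_with_retries(fn, *args, **kwargs):
    # Re-invoke the wrapped callable; tenacity retries only on 5xx errors.
    return fn(*args, **kwargs)
```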

Clovis

02/01/2023, 8:38 AM
Hello! I encountered the exact same 503 error last night. I'd be interested to know what the best practice is in this case.

Christopher Boyd

02/01/2023, 3:29 PM
Interesting, do you have a timeframe for when this occurred?

Walter Cavinaw

02/01/2023, 5:45 PM
I spent a lot of time yesterday and today looking into the communication between flows and the workspace and writing a "wait_for_deployment" task so we can orchestrate a v1-style flow-of-flows pattern. From what I've seen, our tasks crash due to errors like "can't set task state", 503s, or other communication problems. We run a dozen flows at a time with hundreds of tasks, and the more flows are running, the more likely tasks are to crash. We solved one of our "Crashed" problems by taking the Orion code for run_deployment and putting retries around the HTTP calls, but this looks like a load and service problem: when we run one flow at a time, very few tasks crash. What would help alleviate this? Maybe reducing logging to reduce traffic? Batching task submissions instead of running them all at once?
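
(Roughly the kind of workaround described above, as a minimal sketch: wrap run_deployment in a task with Prefect retries so a transient 503 retries the call instead of crashing the parent flow. Deployment names, retry counts, and the number of children are placeholders.)

```
# Hedged sketch of a "flow of flows" parent that retries transient failures
# when triggering child deployments. Names and retry settings are placeholders.
from prefect import flow, task
from prefect.deployments import run_deployment

@task(retries=3, retry_delay_seconds=30)
def run_child(deployment_name: str):
    # run_deployment waits for the child flow run to finish by default,
    # so a transient httpx error raised here is retried by the task.
    return run_deployment(name=deployment_name)

@flow
def parent():
    deployments = [f"child-flow/child-{i}" for i in range(12)]  # placeholders
    futures = [run_child.submit(name) for name in deployments]
    return [f.result() for f in futures]
```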

Christopher Boyd

02/02/2023, 2:15 PM
I don’t have a direct answer on this, more a generality. If you are seeing 503s or other server-side responses (or responses you suspect are caused on the server side), I would encourage you to raise that feedback here. There isn’t always a silver bullet to fix these, but if it’s a resilience or availability issue on the service itself, we want to know and work towards addressing it so it becomes a non-issue.
in other words, you shouldn’t have to write code to retry a failure on our end; it should either be handled in the prefect package itself or mitigated on the service