# prefect-community
Severin Ryberg [sevberg]:
I am having a recurring issue where jobs I submit with Prefect (using a DaskKubernetesEnvironment on a Kubernetes cluster on Amazon EKS) fail prematurely. In the screenshot below you can see that I've mapped over 3000 tasks within my flow, but only a small fraction of them (in this case 173) actually run. Note, the tasks labeled as 'running' in the picture are actually already cancelled; this just isn't reflected in the Cloud UI.
Anyway, eventually the Prefect agent will just stop submitting new Kubernetes jobs and all of the running pods will get automatically terminated. I think this has something to do with the TimeoutError pictured in the lower left.
This is odd, though, since I would have expected the Prefect agent to keep submitting tasks (without caring whether some, or in this case most, of them fail) until all of the mapped tasks have had an opportunity to run. But it seems like this isn't the case? Or maybe I'm missing something šŸ¤” . Either way, some help would be greatly appreciated!
Marwan Sarieddine:
Hi @Severin Ryberg [sevberg] - I have encountered this before - it is most likely your dask-scheduler being resource-constrained
the default resources in the job spec are far from sufficient if you are running a flow with a lot of tasks (you'll probably want to pass your own custom `scheduler_spec_file` in case you are using a `DaskKubernetesEnvironment`)
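For illustration, wiring a custom scheduler spec into the environment might look roughly like the sketch below. This assumes Prefect's legacy `DaskKubernetesEnvironment` API; the spec file names and the mapped task are placeholders, and the YAML files would need to exist alongside the flow when it is registered.

```python
from prefect import Flow, task
from prefect.environments import DaskKubernetesEnvironment


@task
def do_work(x):
    return x * 2


with Flow("mapped-flow-example") as flow:
    # Placeholder for a flow that maps over a few thousand inputs
    do_work.map(list(range(3000)))

# Point the environment at custom Kubernetes specs whose CPU/memory requests and
# limits for the dask-scheduler (and optionally the dask-workers) are raised well
# above the defaults. The file names here are hypothetical placeholders.
flow.environment = DaskKubernetesEnvironment(
    scheduler_spec_file="scheduler_spec.yaml",
    worker_spec_file="worker_spec.yaml",
)
```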
Severin Ryberg [sevberg]:
Thanks for the fast tip @Marwan Sarieddine! I'll try it out. Do you think 4000m of CPU would be enough?
Marwan Sarieddine:
sure - sorry for the late reply - yes that should be enough I believe
Severin Ryberg [sevberg]:
Hello again @Marwan Sarieddine (and others). As you guessed, I confirmed that the K8S dask scheduler needed more CPU (for my use case it maxes out at around 1000 mCPU, which is now far below the limit I gave it of 4000). But unfortunately the problem persists šŸ˜ž I've also:
• Completely disabled the K8S cluster autoscaler after coming across this issue
• Noticed that some of the worker pods needed more mCPU, so I've increased this as well
• Increased the "warehouse size" of the cloud database (Snowflake) that I read from and write to during the flow to the largest size possible, to ensure that it isn't a bottleneck
• Set the job timeout to 10 minutes (far longer than the 2-5 minutes it takes the overall flow to fail)
Unfortunately these also did not resolve the issue šŸ˜ž šŸ˜ž Does anything else come to mind? If not a solution, then at least an investigation angle to poke at it from? Thanks again!!
In case it helps, here's another example of the flow failing prematurely with a mysterious TimeoutError after only 82 of 3142 mapped runs have exited (all of which "succeeded"). The related error log is also attached.
Marwan Sarieddine:
Hi @Severin Ryberg [sevberg] - the other thing that might be happening is that some operation is blocking the workers from communicating with the scheduler - as a quick fix I would increase the timeout. You mention you tried this:
> Set the job timeout to 10 minutes (far longer than the 2-5 minutes it takes the overall flow to fail)
did you do so by setting this environment variable?
```
"DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT": "600s",
```
As an aside, I am using the cluster autoscaler without any issues - but perhaps just because I am not encountering the edge case described in the issue you referenced.
and if that's what you did - did you confirm the error statement is now something like `failed to connect after 600s`?
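For context, dask translates that environment variable (the `DASK_` prefix plus double-underscore nesting) into the `distributed.comm.timeouts.connect` config key. Below is a minimal sketch of checking the value the scheduler and workers would pick up, assuming the variable is exported in their pod specs; the 600s value is just the example from above.

```python
import os

import dask

# The env var form - this is what you would add to the scheduler/worker pod specs.
# dask maps DASK_ + double-underscore nesting onto dotted config keys at startup.
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"] = "600s"
dask.config.refresh()  # re-collect config, picking up the env var set above

# Equivalent explicit override of the same key from Python:
dask.config.set({"distributed.comm.timeouts.connect": "600s"})

# Should print "600s" - the time distributed waits before giving up on a
# connection and raising the "failed to connect" TimeoutError.
print(dask.config.get("distributed.comm.timeouts.connect"))
```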