# prefect-community
Severin Ryberg [sevberg]:
I am having a recurring issue where jobs I submit with Prefect (using a DaskKubernetesEnvironment on a Kubernetes cluster on Amazon EKS) fail prematurely. In the screenshot below you can see that I've mapped over 3000 tasks within my flow, but only a small fraction of them (in this case 173) actually run. Note, the tasks labeled as 'running' in the picture are actually already cancelled; this just isn't reflected in the Cloud UI.
Anyway, eventually the Prefect agent will just stop submitting new Kubernetes jobs and all of the running pods will get automatically terminated. I think this has something to do with the TimeoutError pictured in the lower left.
This is odd, though, since I would have expected the Prefect agent to keep submitting tasks (without caring whether some, or in this case most, of them fail) until all of the mapped tasks have had an opportunity to run. But it seems like this isn't the case? Or maybe I'm missing something šŸ¤” . Either way, some help would be greatly appreciated!
Marwan Sarieddine:
Hi @Severin Ryberg [sevberg] - I have encountered this before - it is most likely your dask-scheduler being resource-constrained
the default resources in the job spec are far from sufficient if you are running a flow with a lot of tasks (you'll probably want to pass your own custom `scheduler_spec_file` in case you are using a `DaskKubernetesEnvironment`)
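For illustration, wiring a custom scheduler spec into the environment might look roughly like the sketch below. This assumes Prefect's legacy `DaskKubernetesEnvironment` API; the spec file names and the mapped task are placeholders, and the YAML files would need to exist alongside the flow when it is registered.

```python
from prefect import Flow, task
from prefect.environments import DaskKubernetesEnvironment


@task
def do_work(x):
    return x * 2


with Flow("mapped-flow-example") as flow:
    # Placeholder for a flow that maps over a few thousand inputs
    do_work.map(list(range(3000)))

# Point the environment at custom Kubernetes specs whose CPU/memory requests and
# limits for the dask-scheduler (and optionally the dask-workers) are raised well
# above the defaults. The file names here are hypothetical placeholders.
flow.environment = DaskKubernetesEnvironment(
    scheduler_spec_file="scheduler_spec.yaml",
    worker_spec_file="worker_spec.yaml",
)
```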
Severin Ryberg [sevberg]:
Thanks for the fast tip @Marwan Sarieddine! I'll try it out. Do you think 4000m of CPU would be enough?
Marwan Sarieddine:
sure - sorry for the late reply - yes that should be enough I believe
Severin Ryberg [sevberg]:
Hello again @Marwan Sarieddine (and others). As you guessed, I confirmed that the K8S dask scheduler needed more CPU (for my use case it maxes out at around 1000 mCPU, which is now far below the limit I gave it of 4000). But unfortunately the problem persists šŸ˜ž I've also:
• Completely disabled the K8S cluster autoscaler after coming across this issue
• Noticed that some of the worker pods needed more mCPU, so I've increased this as well
• Increased the "warehouse size" of the cloud database (Snowflake) that I read from and write to during the flow to the largest size possible, to ensure that it isn't a bottleneck
• Set the job timeout to 10 minutes (far longer than the 2-5 minutes it takes the overall flow to fail)
Unfortunately these also did not resolve the issue šŸ˜ž šŸ˜ž Does anything else come to mind? If not a solution, then at least an investigation angle to poke at it from? Thanks again!!
In case it helps, here's another example of the flow failing prematurely with a mysterious TimeoutError after only 82 of 3142 mapped runs have exited (all of which "succeeded"). The related error log is also attached.
Marwan Sarieddine:
Hi @Severin Ryberg [sevberg] - the other thing that might be happening is that some operation is blocking the workers from communicating with the scheduler - as a quick fix I would increase the timeout. You mention you tried this:
> Set the job timeout to 10 minutes (far longer than the 2-5 minutes it takes the overall flow to fail)
did you do so by setting this environment variable?
```
"DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT": "600s",
```
As an aside, I am using the cluster autoscaler without any issues - but perhaps just because I am not encountering the edge case described in the issue you referenced.
and if that's what you did - did you confirm the error statement is now something like `failed to connect after 600s`?
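For context, dask translates that environment variable (the `DASK_` prefix plus double-underscore nesting) into the `distributed.comm.timeouts.connect` config key. Below is a minimal sketch of checking the value the scheduler and workers would pick up, assuming the variable is exported in their pod specs; the 600s value is just the example from above.

```python
import os

import dask

# The env var form - this is what you would add to the scheduler/worker pod specs.
# dask maps DASK_ + double-underscore nesting onto dotted config keys at startup.
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"] = "600s"
dask.config.refresh()  # re-collect config, picking up the env var set above

# Equivalent explicit override of the same key from Python:
dask.config.set({"distributed.comm.timeouts.connect": "600s"})

# Should print "600s" - the time distributed waits before giving up on a
# connection and raising the "failed to connect" TimeoutError.
print(dask.config.get("distributed.comm.timeouts.connect"))
```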