# ask-community
b
anyone have any idea why my task run concurrency settings would be getting ignored? these tasks started around the same time (within 10 mins) but not like the same nanosecond or anything
k
Will ask the team about it
a
@Brett Naul could you send me or Kevin the flow run ids and task run ids of the affected tasks? this would be helpful for debugging and identifying why that happened. Also, could you perhaps share the Prefect version that you used when you encountered that issue, and your flow definition to see how you assigned tags to those tasks? We would then investigate further and, if this is a bug on our end, we would open an issue for that.
b
hmm I'm not sure that I can figure it out, the flow was going all day (say like 300 big tasks) and mostly the limit was being respected, until it wasn't anymore 🤷 if I can reproduce I'll let you know
a
That would be great! If it happens again, send us the flow run and task run ids, the prefect version, the flow definition and we’ll dig deeper then. Thanks a lot! 🙌
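(For context: in Prefect 0.15.x, task concurrency limits are matched against task tags, so the relevant part of the flow definition is where the tags get attached. A minimal sketch of the usual pattern — the tag, task, and flow names here are placeholders, not the actual flow from this thread:)

from prefect import Flow, task

# a Cloud task concurrency limit is keyed on this tag value
@task(tags=["travel-activities"])
def generate_activities(config):
    ...

with Flow("example-flow") as flow:
    generate_activities("some-config")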
b
all right now I've got the reverse problem...this is on 0.15.7, flow run 1e300487-b461-41d3-abee-8a038b1ea538, one task run is 99842cc1-5cf0-47b9-90b8-a797d6c798e3
a
Thanks! Can you share:
• what agent do you use with this flow?
• how do you define your flow - especially the task tags and run configuration?
This would help us debug the issue.
b
the 🎢 ride continues...but as far as I can tell nothing is actually running, still all queued. we're using the k8s agent, the flow is generated dynamically so it's not straightforward to share. the tag is just coming from the same place as the task name. cc @George Coyne who we just e-met 👋
also I cancelled that flow but still see "14 running tasks", not sure how long it should take to update..
possible this is related to the concurrency dashboard lying to me
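(Since the flow is built dynamically and the tag "is just coming from the same place as the task name", the tagging presumably looks something like the sketch below; the class name and config paths are made up for illustration:)

from prefect import Flow, Task

class GenerateActivities(Task):
    def run(self, config):
        ...

with Flow("dynamic-example") as flow:
    for config in ["configs/mini_nor_cal.yaml", "configs/so_cal.yaml"]:
        name = f"activities-{config}"
        # name and tag come from the same string, so the concurrency limit
        # has to be registered against exactly this tag value
        GenerateActivities(name=name, tags=[name])(config)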
k
I think the flow cancellation should be reasonably quick. Are those tasks still running when you visit them in the UI?
b
ah it does look like there's a Cancelled run with some straggler tasks that still think they're Running. isn't cancelling supposed to change all the task states as well?
k
I believe so but will ask
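(One way to check for stragglers from the API side — a sketch against the flow run id mentioned above, assuming the standard task_run fields in the backend GraphQL schema:)

from prefect.client import Client

client = Client()

# list task runs of the cancelled flow run that are still marked Running
query = """{
  task_run(
    where: {flow_run_id: {_eq: "1e300487-b461-41d3-abee-8a038b1ea538"}, state: {_eq: "Running"}}
  ) {
    id
    state
    task { name }
  }
}"""
print(client.graphql(query)["data"]["task_run"])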
a
@Brett Naul which executor do you use? if Dask, is it some external Dask cluster or one running within the same Kubernetes cluster as your agent?
b
dask executor in the same cluster
👍 1
g
Using KubeCluster or local Dask?
b
it's using a wrapper around dask_kubernetes.HelmCluster...but more or less KubeCluster
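(For reference, the two usual ways a DaskExecutor gets wired to a Kubernetes Dask cluster in 0.15.x — the scheduler address and flow name below are placeholders, not the actual setup described here:)

from prefect import Flow
from prefect.executors import DaskExecutor

# connect to an existing in-cluster Dask scheduler (e.g. one managed by HelmCluster)
executor = DaskExecutor(address="tcp://dask-scheduler.dask.svc.cluster.local:8786")

# or let Prefect create an ephemeral KubeCluster per flow run:
# executor = DaskExecutor(cluster_class="dask_kubernetes.KubeCluster")

flow = Flow("example-flow", executor=executor)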
after deleting and recreating that task limit....now it just doesn't work at all. nothing is entering a Queued state ever 😕
k
Even other Flows are not submitting?
b
no I mean just the concurrency limiting on this task is not working. although technically yes bc my run w/o that concurrency limiting starts so many flow runs that it effectively shuts down our prefect account 😅
k
Are the pods of that KubeCluster still running?
b
no they're all gone. I don't really see what any of these infra q's have to do with the task concurrency stuff though...it seems pretty clear that the tasks are getting the wrong rate limit status back from the server for one reason or another, it isn't related to kubernetes or anything like that. in particular the fact that the task concurrency status page often fails to load bc of graphql timeouts seems like a bad sign
k
Because if the cancellation didn’t succeed and there is something still in a running state, then that would contribute to the wrong rate limit status.
You are right that the graphql timeout is a bad sign. Will find people to help.
z
Just an update, this has been escalated and we're investigating.
Did you change tags since your screenshot? We see GenerateTravelActivitiesDay in the backend, not GenerateTravelActivities.
It'd also be helpful if you shared some more details from that timed out query so I can narrow down which one it is.
b
ah yeah sorry I was fiddling with the names to try and unstick things. and I'm actually not sure about the query, it was timing out from the UI and the javascript error is a bit inscrutable to me. if there's something I can pull out here that would be useful I can copy/paste whatever
z
Hey Brett, if the task concurrency query times out repeatedly, the server reports that no concurrency is available so that your limits are not exceeded. We're still investigating a fix for the timeouts.
Can you ensure that all of the relevantly tagged task runs are moved to final states before running your next concurrency limited flow? It looks like there are still some hanging around in running states which can cause problems.
We're going to look into improving performance for the relevant queries as well, but that's not going to be fixable today.
b
yeah I noticed the straggler tasks too, I think I have a script now to clean them all up...this was the issue with cancelled flows not transitioning all their tasks, maybe just bc there's so many of them and some similarly time out when trying to update to Cancelled?
z
That's a good observation, I think we may not be properly returning slots for cancellation.
I think this may be specific to flow run cancellation though and setting the task run states to cancelled manually should return slots. I'll have to dive in with someone more familiar with that API later though.
b
got it, it does look like I managed to get rid of those ghost tasks and now the limit looks like it's working again! so this seems like a decent workaround for now
import pandas as pd

from prefect.client import Client
from prefect.engine.state import Cancelled

# assumption: `p` in the original snippet is an authenticated Prefect Client
p = Client()

# find task runs of the named task that are still stuck in a Running state
query = """{
  task_run(
    where: {_and: [{task: {name: {_eq: "activities-travel_activities/configs/mini_nor_cal.yaml"}}}, {state: {_eq: "Running"}}]}
  ) {
    id
    task {
      id
      name
    }
  }
}
"""
# flatten the matching task runs into a DataFrame
result = pd.json_normalize(p.graphql(query)['data']['task_run'])
# manually move each straggler to Cancelled so its concurrency slot is released
result.id.map(lambda _id: p.set_task_run_state(_id, Cancelled()))