# ask-community
b
anyone have any idea why my task run concurrency settings would be getting ignored? these tasks started around the same time (within 10 mins) but not like the same nanosecond or anything
k
Will ask the team about it
a
@Brett Naul could you send me or Kevin the flow run ids and task run ids of the affected tasks? this would be helpful for debugging and identifying why that happened. Also, could you perhaps share the Prefect version that you used when you encountered that issue, and your flow definition to see how you assigned tags to those tasks? We would then investigate further and, if this is a bug on our end, we would open an issue for that.
b
hmm I'm not sure that I can figure it out, the flow was going all day (say like 300 big tasks) and mostly the limit was being respected, until it wasn't anymore 🤷 if I can reproduce I'll let you know
a
That would be great! If it happens again, send us the flow run and task run ids, the prefect version, the flow definition and we’ll dig deeper then. Thanks a lot! 🙌
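(For context: in Prefect 0.15.x, task concurrency limits are matched against task tags, so the relevant part of the flow definition is where the tags get attached. A minimal sketch of the usual pattern — the tag, task, and flow names here are placeholders, not the actual flow from this thread:)

from prefect import Flow, task

# a Cloud task concurrency limit is keyed on this tag value
@task(tags=["travel-activities"])
def generate_activities(config):
    ...

with Flow("example-flow") as flow:
    generate_activities("some-config")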
b
all right now I've got the reverse problem...this is on 0.15.7, flow run 1e300487-b461-41d3-abee-8a038b1ea538, one task run is 99842cc1-5cf0-47b9-90b8-a797d6c798e3
a
Thanks! Can you share:
• what agent do you use with this flow?
• how do you define your flow - especially the task tags and run configuration?
This would help us debug the issue.
b
the 🎢 ride continues...but as far as I can tell nothing is actually running, still all queued. we're using the k8s agent, the flow is generated dynamically so it's not straightforward to share. the tag is just coming from the same place as the task name. cc @George Coyne who we just e-met 👋
also I cancelled that flow but still see "14 running tasks", not sure how long it should take to update..
possible this is related to the concurrency dashboard lying to me
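(Since the flow is built dynamically and the tag "is just coming from the same place as the task name", the tagging presumably looks something like the sketch below; the class name and config paths are made up for illustration:)

from prefect import Flow, Task

class GenerateActivities(Task):
    def run(self, config):
        ...

with Flow("dynamic-example") as flow:
    for config in ["configs/mini_nor_cal.yaml", "configs/so_cal.yaml"]:
        name = f"activities-{config}"
        # name and tag come from the same string, so the concurrency limit
        # has to be registered against exactly this tag value
        GenerateActivities(name=name, tags=[name])(config)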
k
I think the flow cancellation should be reasonably quick. Are those tasks still running when you visit them in the UI?
b
ah it does look like there's a Cancelled run with some straggler tasks that still think they're Running. isn't cancelling supposed to change all the task states as well?
k
I believe so but will ask
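(One way to check for stragglers from the API side — a sketch against the flow run id mentioned above, assuming the standard task_run fields in the backend GraphQL schema:)

from prefect.client import Client

client = Client()

# list task runs of the cancelled flow run that are still marked Running
query = """{
  task_run(
    where: {flow_run_id: {_eq: "1e300487-b461-41d3-abee-8a038b1ea538"}, state: {_eq: "Running"}}
  ) {
    id
    state
    task { name }
  }
}"""
print(client.graphql(query)["data"]["task_run"])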
a
@Brett Naul which executor do you use? if Dask, is it some external Dask cluster or one running within the same Kubernetes cluster as your agent?
b
dask executor in the same cluster
👍 1
g
Using KubeCluster or local Dask?
b
it's using a wrapper around dask_kubernetes.HelmCluster...but more or less KubeCluster
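(For reference, the two usual ways a DaskExecutor gets wired to a Kubernetes Dask cluster in 0.15.x — the scheduler address and flow name below are placeholders, not the actual setup described here:)

from prefect import Flow
from prefect.executors import DaskExecutor

# connect to an existing in-cluster Dask scheduler (e.g. one managed by HelmCluster)
executor = DaskExecutor(address="tcp://dask-scheduler.dask.svc.cluster.local:8786")

# or let Prefect create an ephemeral KubeCluster per flow run:
# executor = DaskExecutor(cluster_class="dask_kubernetes.KubeCluster")

flow = Flow("example-flow", executor=executor)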
after deleting and recreating that task limit....now it just doesn't work at all. nothing is entering a Queued state ever 😕
k
Even other Flows are not submitting?
b
no I mean just the concurrency limiting on this task is not working. although technically yes bc my run w/o that concurrency limiting starts so many flow runs that it effectively shuts down our prefect account 😅
k
Are the pods of that KubeCluster still running?
b
no they're all gone. I don't really see what any of these infra q's have to do with the task concurrency stuff though...it seems pretty clear that the tasks are getting the wrong rate limit status back from the server for one reason or another, it isn't related to kubernetes or anything like that. in particular the fact that the task concurrency status page often fails to load bc of graphql timeouts seems like a bad sign
k
Because if the cancellation didn’t succeed and there is something still in a running state, then that would contribute to the wrong rate limit status.
You are right that the graphql timeout is a bad sign. Will find people to help.
z
Just an update, this has been escalated and we're investigating.
Did you change tags since your screenshot? We see GenerateTravelActivitiesDay in the backend, not GenerateTravelActivities.
It'd also be helpful if you shared some more details from that timed out query so I can narrow down which one it is.
b
ah yeah sorry I was fiddling with the names to try and unstick things. and I'm actually not sure about the query, it was timing out from the UI and the javascript error is a bit inscrutable to me. if there's something I can pull out here that would be useful I can copy/paste whatever
z
Hey Brett, if the task concurrency query times out repeatedly, the server reports that no concurrency is available so that your limits are not exceeded. We're still investigating a fix for the timeouts.
Can you ensure that all of the relevantly tagged task runs are moved to final states before running your next concurrency limited flow? It looks like there are still some hanging around in running states which can cause problems.
We're going to look into improving performance for the relevant queries as well, but that's not going to be fixable today.
b
yeah I noticed the straggler tasks too, I think I have a script now to clean them all up...this was the issue with cancelled flows not transitioning all their tasks, maybe just bc there's so many of them and some similarly time out when trying to update to Cancelled?
z
That's a good observation, I think we may not be properly returning slots for cancellation.
I think this may be specific to flow run cancellation though and setting the task run states to cancelled manually should return slots. I'll have to dive in with someone more familiar with that API later though.
b
got it, it does look like I managed to get rid of those ghost tasks and now the limit looks like it's working again! so this seems like a decent workaround for now
import pandas as pd

from prefect.client import Client
from prefect.engine.state import Cancelled

# assumption: `p` in the original snippet is an authenticated Prefect Client
p = Client()

# find task runs of the named task that are still stuck in a Running state
query = """{
  task_run(
    where: {_and: [{task: {name: {_eq: "activities-travel_activities/configs/mini_nor_cal.yaml"}}}, {state: {_eq: "Running"}}]}
  ) {
    id
    task {
      id
      name
    }
  }
}
"""
# flatten the matching task runs into a DataFrame
result = pd.json_normalize(p.graphql(query)['data']['task_run'])
# manually move each straggler to Cancelled so its concurrency slot is released
result.id.map(lambda _id: p.set_task_run_state(_id, Cancelled()))