# ask-community
v
Hello all. I recently bumped into this issue for my flow. Has anyone seen anything similar?
```
Failed to set task state with error: ClientError([{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID 929b8b8d-b75c-4a7b-b0ff-e0433a903fee: provided a running state but associated flow run 3775f8a8-6d8c-452b-a69f-e51eb6bc07e7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}])
Traceback (most recent call last):
  File "/opt/conda/envs/dev/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/opt/conda/envs/dev/lib/python3.8/site-packages/prefect/client/client.py", line 1917, in set_task_run_state
    result = self.graphql(
  File "/opt/conda/envs/dev/lib/python3.8/site-packages/prefect/client/client.py", line 569, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID 929b8b8d-b75c-4a7b-b0ff-e0433a903fee: provided a running state but associated flow run 3775f8a8-6d8c-452b-a69f-e51eb6bc07e7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
```
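For extra context, the message seems to say the task tried to enter a Running state while its flow run was no longer Running. A rough way to confirm the flow run's state is a query through the same GraphQL client; the field names below are my assumption based on the Interactive API schema, so double-check them for your version:
```python
# Rough sketch: check what state the flow run was in when the task failed.
# The GraphQL fields (flow_run, state, state_message) are assumptions based
# on the Interactive API schema -- verify against your Prefect version.
from prefect.client import Client

client = Client()  # uses the API credentials already configured for the agent
result = client.graphql(
    """
    query {
      flow_run(where: {id: {_eq: "3775f8a8-6d8c-452b-a69f-e51eb6bc07e7"}}) {
        state
        state_message
      }
    }
    """
)
print(result.data.flow_run)
```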
k
Hey @Vincent, I haven’t. Does this happen regularly or just once?
v
This happened on our most recent submission, so I can't say how often it occurs. However, it hit multiple tasks in that run and ultimately cost us a significant number of CPU hours.
k
Ok will ask the team about this
v
As an additional point of feedback, I think it would be ideal if flows did not fail because of client communication errors. I'd be interested to hear the team's thoughts on that as well.
k
Will bring it up
Hey @Vincent, I chatted with the team about this. We looked into the flow, and it looks like the flow run died but the workers continued processing it. There was an error in the logs specifically about a Kubernetes job failing. The internal server error is just a generic message from an API call that failed.
v
Thanks for the follow-up. Were there any logs that indicated what caused the flow to die? If I want the other workers to continue to completion, is there any setting I can change? (i.e. would turning off the heartbeat allow the other task runs to finish?)
k
So in this case it’s Prefect specifically reporting that the job just failed. If you are using Dask, it might be that the scheduler/client actually died. If a worker dies, Dask has some fault tolerance to spin up new workers and continue where it left off.
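To make that concrete, the worker-level resilience lives in Dask itself; on the Prefect side it's just whatever executor you attach. A rough sketch with adaptive scaling (the executor settings and min/max worker counts here are illustrative, not specific to your deployment):
```python
# Illustrative only: a flow on a DaskExecutor with adaptive scaling, so Dask
# can replace workers that die mid-run. The min/max values are made up.
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def work(x):
    return x * 2

with Flow("adaptive-dask-example") as flow:
    work.map(list(range(10)))

flow.executor = DaskExecutor(adapt_kwargs={"minimum": 2, "maximum": 8})
```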
So turning off the heartbeat is unlikely to fix this, because the FlowRunner itself (maybe) actually died
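That said, if you still want to experiment with heartbeats, they're controlled through Prefect config. A minimal sketch, assuming a Prefect 1.x release that exposes the cloud.heartbeat_mode setting (please verify for your version):
```python
# Sketch only: routing the heartbeat setting through the flow's run config
# environment. Assumes your Prefect 1.x release supports cloud.heartbeat_mode
# ("process" / "thread" / "off") -- please confirm before relying on it.
from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def noop():
    return None

with Flow("heartbeat-off-example") as flow:
    noop()

flow.run_config = KubernetesRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "off"})
```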
v
I see. I will pay attention to the scheduler to see if there are any clues as to what may have caused it to exit. One thing I did notice in my logs is that, prior to the client failure, there was an attribute access error caused by `None` being passed to the task. It seems the task started but did not receive the appropriate input from the upstream task. This may be related to how Dask chooses to store/fetch results. Thanks for investigating.
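In case it helps anyone else, the direction I'm considering for hardening this is to pin task results to shared storage and fail fast when an upstream value arrives as None. The S3Result and bucket name below are just placeholders for whatever storage fits your setup, so treat this as a sketch:
```python
# Sketch: make task results explicit so a downstream task doesn't silently
# proceed when an upstream result can't be fetched from Dask memory.
# S3Result and the bucket name are illustrative assumptions.
from prefect import Flow, task
from prefect.engine.results import S3Result

@task(result=S3Result(bucket="my-prefect-results"))  # hypothetical bucket
def produce():
    return {"value": 42}

@task
def consume(payload):
    # Fail loudly instead of continuing with a missing upstream result.
    if payload is None:
        raise ValueError("Upstream result was not available")
    return payload["value"]

with Flow("result-storage-example") as flow:
    consume(produce())
```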
I want to follow up on this thread and ask what would trigger this error:
[8 October 2021 12:17pm]: Kubernetes Error: pods ['prefect-job-7db50922-smzgb'] failed for this job
(as reported in the Cloud UI). Is this message coming from the Prefect agent?
k
It is, but we wouldn't know the cause, as that's happening specifically on your infrastructure. If the pod still exists, maybe opening the pod logs would give a clue?
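If the pod is still around, something like the Kubernetes Python client can pull its logs; the namespace below is an assumption, so swap in wherever your agent actually submits jobs:
```python
# Sketch: fetch logs from the failed Prefect job pod while it still exists.
# The namespace ("default") is an assumption -- use the namespace your agent
# submits jobs to.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()
logs = v1.read_namespaced_pod_log(
    name="prefect-job-7db50922-smzgb",
    namespace="default",
)
print(logs)
```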
v
Unfortunately, the Prefect job exited before I could fetch any logs.
I noticed that my agent has been reporting 429, 404, and 502 errors. I wonder if these would affect a running job.
k
The 404 might be us; we had an incident at 5:22 PM ET that lasted a few seconds. The 429 is rate limiting. The 502… I have no immediate idea. I think they would affect the running job, as the job would error out.