# ask-community
v
Hello all. I recently bumped into this issue for my flow. Has anyone seen anything similar?
```
Failed to set task state with error: ClientError([{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID 929b8b8d-b75c-4a7b-b0ff-e0433a903fee: provided a running state but associated flow run 3775f8a8-6d8c-452b-a69f-e51eb6bc07e7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}])
Traceback (most recent call last):
  File "/opt/conda/envs/dev/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/opt/conda/envs/dev/lib/python3.8/site-packages/prefect/client/client.py", line 1917, in set_task_run_state
    result = self.graphql(
  File "/opt/conda/envs/dev/lib/python3.8/site-packages/prefect/client/client.py", line 569, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID 929b8b8d-b75c-4a7b-b0ff-e0433a903fee: provided a running state but associated flow run 3775f8a8-6d8c-452b-a69f-e51eb6bc07e7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
```
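For extra context, the message seems to say the task tried to enter a Running state while its flow run was no longer Running. A rough way to confirm the flow run's state is a query through the same GraphQL client; the field names below are my assumption based on the Interactive API schema, so double-check them for your version:
```python
# Rough sketch: check what state the flow run was in when the task failed.
# The GraphQL fields (flow_run, state, state_message) are assumptions based
# on the Interactive API schema -- verify against your Prefect version.
from prefect.client import Client

client = Client()  # uses the API credentials already configured for the agent
result = client.graphql(
    """
    query {
      flow_run(where: {id: {_eq: "3775f8a8-6d8c-452b-a69f-e51eb6bc07e7"}}) {
        state
        state_message
      }
    }
    """
)
print(result.data.flow_run)
```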
k
Hey @Vincent, I haven’t. Does this happen regularly or just once?
v
This happened on our most recent submission, so I can't say how often it occurs. However, it hit multiple tasks in that run and ultimately cost us a significant number of CPU hours.
k
Ok will ask the team about this
v
As an additional point of feedback, I think it would be ideal if flows did not fail because of client communication errors. I'd be interested to hear the team's thoughts on that as well.
k
Will bring it up
Hey @Vincent, I chatted with the team about this. We looked into the flow, and it looks like the flow run died but the workers continued processing it. There was an error in the logs specifically about a Kubernetes job failing. The internal server error is just a generic message from an API call that failed.
v
Thanks for the follow-up. Were there any logs that indicated what caused the flow to die? If I want the other workers to continue to completion, is there any setting I can change? (i.e. would turning off the heartbeat allow the other task runs to finish?)
k
So in this case it’s Prefect specifically reporting that the job just failed. If you are using Dask, it might be that the scheduler/client actually died. If a worker dies, Dask has some fault tolerance to spin up new workers and continue where it left off.
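To make that concrete, the worker-level resilience lives in Dask itself; on the Prefect side it's just whatever executor you attach. A rough sketch with adaptive scaling (the executor settings and min/max worker counts here are illustrative, not specific to your deployment):
```python
# Illustrative only: a flow on a DaskExecutor with adaptive scaling, so Dask
# can replace workers that die mid-run. The min/max values are made up.
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def work(x):
    return x * 2

with Flow("adaptive-dask-example") as flow:
    work.map(list(range(10)))

flow.executor = DaskExecutor(adapt_kwargs={"minimum": 2, "maximum": 8})
```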
So turning off the heartbeat is unlikely to fix this, because the FlowRunner itself (maybe) actually died
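That said, if you still want to experiment with heartbeats, they're controlled through Prefect config. A minimal sketch, assuming a Prefect 1.x release that exposes the cloud.heartbeat_mode setting (please verify for your version):
```python
# Sketch only: routing the heartbeat setting through the flow's run config
# environment. Assumes your Prefect 1.x release supports cloud.heartbeat_mode
# ("process" / "thread" / "off") -- please confirm before relying on it.
from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def noop():
    return None

with Flow("heartbeat-off-example") as flow:
    noop()

flow.run_config = KubernetesRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "off"})
```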
v
I see. I will pay attention to the scheduler to see if there are any clues as to what may have caused it to exit. One thing I did notice in my logs is that, prior to the client failure, there was an attribute access error caused by `None` being passed to the task. It seems the task started but did not receive the appropriate input from the upstream task. This may be related to how Dask chooses to store/fetch results. Thanks for investigating.
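In case it helps anyone else, the direction I'm considering for hardening this is to pin task results to shared storage and fail fast when an upstream value arrives as None. The S3Result and bucket name below are just placeholders for whatever storage fits your setup, so treat this as a sketch:
```python
# Sketch: make task results explicit so a downstream task doesn't silently
# proceed when an upstream result can't be fetched from Dask memory.
# S3Result and the bucket name are illustrative assumptions.
from prefect import Flow, task
from prefect.engine.results import S3Result

@task(result=S3Result(bucket="my-prefect-results"))  # hypothetical bucket
def produce():
    return {"value": 42}

@task
def consume(payload):
    # Fail loudly instead of continuing with a missing upstream result.
    if payload is None:
        raise ValueError("Upstream result was not available")
    return payload["value"]

with Flow("result-storage-example") as flow:
    consume(produce())
```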
I want to follow up on this thread and ask what would trigger this error:
[8 October 2021 12:17pm]: Kubernetes Error: pods ['prefect-job-7db50922-smzgb'] failed for this job
(as reported in the Cloud UI). Is this message coming from the Prefect agent?
k
It is, but we wouldn't know the cause, as that's happening specifically on your infrastructure. If the pod still exists, maybe opening the pod logs would give a clue?
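If the pod is still around, something like the Kubernetes Python client can pull its logs; the namespace below is an assumption, so swap in wherever your agent actually submits jobs:
```python
# Sketch: fetch logs from the failed Prefect job pod while it still exists.
# The namespace ("default") is an assumption -- use the namespace your agent
# submits jobs to.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()
logs = v1.read_namespaced_pod_log(
    name="prefect-job-7db50922-smzgb",
    namespace="default",
)
print(logs)
```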
v
Unfortunately, the Prefect job exited before I could fetch any logs.
I noticed that my agent has been reporting 429, 404, and 502 errors. I wonder if these would affect a running job.
k
The 404 might be us; we had an incident at 5:22 PM ET that lasted a few seconds. The 429 is rate limiting. The 502… I have no immediate idea. I think they would affect the running job, as the job would error out.