
Sandeep Aggarwal

06/11/2020, 12:31 PM
Need help with the error below:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/graphql/execution/execute.py", line 668, in complete_value_catching_error
    return_type, field_nodes, info, path, result
  File "/usr/local/lib/python3.7/site-packages/graphql/execution/execute.py", line 733, in complete_value
    raise result
  File "/prefect-server/src/prefect_server/graphql/states.py", line 73, in set_state
    task_run_id=state_input["task_run_id"], state=state,
  File "/prefect-server/src/prefect_server/api/states.py", line 91, in set_task_run_state
    f"State update failed for task run ID {task_run_id}: provided "
graphql.error.graphql_error.GraphQLError: State update failed for task run ID 63293e14-b1d4-4d2e-ae21-e9aeb8edfade: provided a running state but associated flow run 73a41de3-adc3-4a48-9b57-9b7bdb6094f7 is not in a running state.
So my workflow involves running some commands inside docker containers. The workflows themselves aren't huge, but the docker execution can take several seconds (should be under 1 min, though). I am currently running with a couple of dask workers with limited memory, i.e. 500MB each. The workflow works fine for a small number of requests, but as I start hitting it with multiple requests, the workers start dying and I see this error in the prefect server logs. This is just a testing system and the actual prod environment will have higher memory limits, but I would still like to know whether this error is expected and if there is any way to avoid/handle it?
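Roughly, the setup looks like the sketch below (heavily simplified; the image name, command, and scheduler address are placeholders, and the executor import path may differ by prefect version):

import docker
from prefect import task, Flow
from prefect.engine.executors import DaskExecutor  # lives under prefect.executors in newer releases

@task
def run_in_container(cmd):
    # each task launches a short-lived container; usually finishes in under a minute
    client = docker.from_env()
    return client.containers.run("my-image:latest", cmd, remove=True)

with Flow("docker-commands") as flow:
    run_in_container("do-some-work")

# points at an existing dask cluster whose workers have ~500MB memory limits
flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))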
👀 1

Laura Lorenz (she/her)

06/11/2020, 1:33 PM
Hi! I don’t think that is expected. Fundamentally the error is about the flow run not being in a Running state when one of its constituent tasks tries to start, which is not a situation we commonly expect. Is there any output from the FlowRunner itself, either visible in Server (does the flow run have a state other than ‘Running’, and if so, what?) or in the agent’s logs?
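If it’s easier than clicking through the UI, something like this against Server’s GraphQL API should show the flow run’s current state (field names from memory -- worth double checking in the interactive API):

from prefect import Client

client = Client()
result = client.graphql(
    """
    query {
      flow_run(where: {id: {_eq: "73a41de3-adc3-4a48-9b57-9b7bdb6094f7"}}) {
        state
        state_message
      }
    }
    """
)
print(result)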

Sandeep Aggarwal

06/11/2020, 4:30 PM
Hello Laura, this happens when a dask worker runs out of memory. I see the error below in the worker logs:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 271.47 MB -- Worker memory limit: 314.57 MB
The nanny process restarts the worker, and the tasks that were stuck fail with the above error. Apart from that, I don't see any specific error in any of the logs. I will try to reproduce the error and see if I can find anything useful.
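I might also experiment with the nanny's kill threshold; I believe the relevant dask settings are roughly these (values are the documented defaults from the distributed docs, not verified against my setup):

import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # stop accepting new work
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})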

Laura Lorenz (she/her)

06/11/2020, 5:00 PM
Gotcha. What type of agent are you using? I’m particularly interested in any logs or exceptions from the FlowRunner, which should be emitted by the agent. I know your dask workers have limited memory, but given the error you are seeing, I would expect it to be related not to whether the workers are resource constrained but to whether the agent (or the platform the agent runs on -- docker, kubernetes, etc.) is resource constrained. And again, this isn’t expected behavior, and it is harder to guarantee expected behavior in resource-constrained environments anyway 😅 a caveat I’m sure you’re aware of haha
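If the agent itself is running in a container, something like this with the docker SDK would pull its recent logs and a rough memory reading (the container name here is just a guess):

import docker

client = docker.from_env()
agent = client.containers.get("prefect-agent")    # container name is a guess
print(agent.logs(tail=200).decode())              # FlowRunner exceptions should surface here
print(agent.stats(stream=False)["memory_stats"])  # quick check on memory pressure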

Sandeep Aggarwal

06/13/2020, 7:54 AM
Yup, I agree. Constrained environments can produce unexpected outcomes. I am having a hard time reproducing the issue, but thanks for the pointers anyway. I now know where to look in case this pops up again. I am using the docker agent, btw.