# ask-community
i
Hello - we are experiencing intermittent errors on flows that are submitting a large number of tasks to DaskTaskRunner. No individual tasks seem to fail, but the flow will periodically error out with the following message (full traceback in thread):
Flow run encountered an exception. MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.
We are on Prefect version 2.8.2, but this has been occurring for a few versions now. Other users seem to have experienced this behavior as well (see thread here). Is this being worked on, and if so, where can we track it?
Encountered exception during execution:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 651, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 154, in run_sync_in_interruptible_worker_thread
    async with anyio.create_task_group() as tg:
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 135, in capture_worker_thread_and_result
    result = __fn(*args, **kwargs)
  File "/opt/prefect/flow.py", line 128, in entrypoint
    if future.result() is not None:
  File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 226, in result
    return sync(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 267, in sync
    return run_async_from_worker_thread(__async_fn, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 177, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/usr/local/lib/python3.10/site-packages/anyio/from_thread.py", line 49, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
    return f.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 237, in _result
    return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
  File "/usr/local/lib/python3.10/site-packages/prefect/states.py", line 101, in _get_state_result
    raise MissingResult(
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.
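For reference, the flow roughly follows this pattern - a simplified sketch with placeholder names (process_item, items, etc.), not our exact code:

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task  # the error text suggests passing persist_result=True here may be relevant, but we haven't confirmed it
def process_item(item):
    # per-item work; individual tasks do not appear to fail
    return item


@flow(task_runner=DaskTaskRunner())
def entrypoint(items: list):
    # submit a large number of tasks to the Dask task runner
    futures = [process_item.submit(item) for item in items]
    results = []
    for future in futures:
        # this is the call that intermittently raises MissingResult
        # (flow.py line 128 in the traceback above)
        if future.result() is not None:
            results.append(future.result())
    return results
```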
z
Generally, this is because of an upstream problem
Do you have additional logs for the flow run?
i
Thank you, Zaine. I looked over this GitHub issue and we are seeing similar behavior to what BStefansen-lightbox describes here. The flow fails after a short amount of time - in the most recent instance, <5 minutes. I’ve attached the logs from the pod around the time of the crash.
z
What is your kubernetes job timeout set to?
i
180 secs
z
If the timeout is 180s, it’ll fail at 180s
So if it runs longer, i.e. approaching 5 minutes, it’ll be marked as failed by the agent and then exit when the next task runs
i
Sorry, I’m a little confused - what would need to run longer than 180 seconds for this to crash? This flow typically runs >2 hours without hitting this issue; the failure does seem to be intermittent.
Happy to set our job watch timeout to None but I’m not sure how this would address the issue.
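For context, here’s roughly how our infrastructure block is set up today - simplified, with placeholder image/namespace/block names:

```python
from prefect.infrastructure import KubernetesJob

# Simplified sketch of our current KubernetesJob block;
# job_watch_timeout_seconds is the value we would change to None.
k8s_job = KubernetesJob(
    namespace="prefect",
    image="our-registry/flow-image:latest",
    job_watch_timeout_seconds=180,
)
k8s_job.save("our-k8s-job", overwrite=True)
```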
z
If your job watch timeout is less than the runtime of the flow, the flow should fail
If the issue is intermittent, it is likely because we were not enforcing job watch timeouts correctly until 2.8.3
It’s not clear to me yet whether your issue is something else; the logs did not seem to include the full run
i
I’ll try setting our timeout to None then, thank you. I was under the impression job_watch_timeout_seconds listens for events and will time out if no events are received within the span passed here; has this changed? I can definitely provide a longer log if that would be helpful, but it might contain some semi-sensitive output; is there somewhere I can forward it privately/through DM?