# ask-community
i
Hello - we are experiencing intermittent errors on flows that are submitting a large number of tasks to DaskTaskRunner. No individual tasks seem to fail, but the flow will periodically error out with the following message (full traceback in thread):
Flow run encountered an exception. MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.
We are on Prefect version 2.8.2, but this has been occurring for a few versions now. Other users seem to have experienced this behavior as well (see thread here). Is this being worked on, and if so, where can we track it?
Encountered exception during execution:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 651, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 154, in run_sync_in_interruptible_worker_thread
    async with anyio.create_task_group() as tg:
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 135, in capture_worker_thread_and_result
    result = __fn(*args, **kwargs)
  File "/opt/prefect/flow.py", line 128, in entrypoint
    if future.result() is not None:
  File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 226, in result
    return sync(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 267, in sync
    return run_async_from_worker_thread(__async_fn, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 177, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/usr/local/lib/python3.10/site-packages/anyio/from_thread.py", line 49, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
    return f.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 237, in _result
    return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
  File "/usr/local/lib/python3.10/site-packages/prefect/states.py", line 101, in _get_state_result
    raise MissingResult(
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.
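For reference, the flow roughly follows this pattern - a simplified sketch with placeholder names (process_item, items, etc.), not our exact code:

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task  # the error text suggests passing persist_result=True here may be relevant, but we haven't confirmed it
def process_item(item):
    # per-item work; individual tasks do not appear to fail
    return item


@flow(task_runner=DaskTaskRunner())
def entrypoint(items: list):
    # submit a large number of tasks to the Dask task runner
    futures = [process_item.submit(item) for item in items]
    results = []
    for future in futures:
        # this is the call that intermittently raises MissingResult
        # (flow.py line 128 in the traceback above)
        if future.result() is not None:
            results.append(future.result())
    return results
```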
z
Generally, this is because of an upstream problem
Do you have additional logs for the flow run?
i
Thank you, Zaine. I looked over this GitHub issue and we are seeing similar behavior to what BStefansen-lightbox describes here. The flow fails after a short amount of time - in the most recent instance, <5 minutes. I’ve attached the logs from the pod around the time of the crash.
z
What is your kubernetes job timeout set to?
i
180 secs
z
If the timeout is 180s, it’ll fail at 180s
So if it runs longer, i.e. approaching 5 minutes, it’ll be marked as failed by the agent and then exit when the next task runs
i
Sorry, I’m a little confused - what would need to run longer than 180 seconds for this to crash? This flow typically runs >2 hours without hitting this issue; the failure does seem to be intermittent.
Happy to set our job watch timeout to None but I’m not sure how this would address the issue.
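For context, here’s roughly how our infrastructure block is set up today - simplified, with placeholder image/namespace/block names:

```python
from prefect.infrastructure import KubernetesJob

# Simplified sketch of our current KubernetesJob block;
# job_watch_timeout_seconds is the value we would change to None.
k8s_job = KubernetesJob(
    namespace="prefect",
    image="our-registry/flow-image:latest",
    job_watch_timeout_seconds=180,
)
k8s_job.save("our-k8s-job", overwrite=True)
```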
z
If your job watch timeout is less than the runtime of the flow, the flow should fail
If the issue is intermittent, it is likely because we were not enforcing job watch timeouts correctly until 2.8.3
It’s not clear to me yet whether your issue is something else; the logs did not seem to include the full run
i
I’ll try setting our timeout to None then, thank you. I was under the impression job_watch_timeout_seconds listens for events and will time out if no events are received within the span passed here; has this changed? I can definitely provide a longer log if that would be helpful, but it might contain some semi-sensitive output; is there somewhere I can forward it privately/through DM?