
Ben Ayers-Glassey

04/21/2023, 5:50 PM
We're running Prefect Orion in a self-hosted Kubernetes cluster. We upgraded from 2.7.4 to 2.8.0, and now when we run any flow, we immediately run into an error where Prefect can't load its state.
2023-04-21T15:16:04.920982508Z /usr/local/lib/python3.8/runpy.py:127: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
2023-04-21T15:16:04.921075059Z   warn(RuntimeWarning(msg))
2023-04-21T15:16:04.997977913Z 15:16:04.996 | INFO    | Flow run 'phi321-lysenko-tensor' - Created task run 'generate_operation_id-a9e17ed9-0' for task 'generate_operation_id'
2023-04-21T15:16:04.999103995Z 15:16:04.997 | INFO    | Flow run 'phi321-lysenko-tensor' - Executing 'generate_operation_id-a9e17ed9-0' immediately...
2023-04-21T15:16:05.043778806Z 15:16:05.042 | INFO    | Task run 'generate_operation_id-a9e17ed9-0' - Task run '43baff18-8292-4cd7-a694-2dcf6d1a0c77' already finished.
2023-04-21T15:16:05.119806977Z 15:16:05.044 | ERROR   | Flow run 'phi321-lysenko-tensor' - Encountered exception during execution:
2023-04-21T15:16:05.119866174Z Traceback (most recent call last):
2023-04-21T15:16:05.119875789Z   File "/usr/local/lib/python3.8/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
2023-04-21T15:16:05.119882275Z     result = await run_sync(flow_call)
2023-04-21T15:16:05.119888470Z   File "/usr/local/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
2023-04-21T15:16:05.119896590Z     return await anyio.to_thread.run_sync(
2023-04-21T15:16:05.119904639Z   File "/usr/local/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
2023-04-21T15:16:05.119911921Z     return await get_asynclib().run_sync_in_worker_thread(
2023-04-21T15:16:05.119917612Z   File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
2023-04-21T15:16:05.119924062Z     return await future
 ...etc...
2023-04-21T15:16:05.120058513Z   File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
2023-04-21T15:16:05.120063946Z     raise self._exception
2023-04-21T15:16:05.120069588Z   File "/usr/local/lib/python3.8/site-packages/prefect/engine.py", line 1068, in get_task_call_return_value
2023-04-21T15:16:05.120078014Z     return await future._result()
2023-04-21T15:16:05.120085086Z   File "/usr/local/lib/python3.8/site-packages/prefect/futures.py", line 237, in _result
2023-04-21T15:16:05.120091377Z     return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
2023-04-21T15:16:05.120097177Z   File "/usr/local/lib/python3.8/site-packages/prefect/states.py", line 100, in _get_state_result
2023-04-21T15:16:05.120103360Z     raise MissingResult(
2023-04-21T15:16:05.120110136Z prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.
2023-04-21T15:16:05.141538481Z 15:16:05.140 | INFO    | prefect.engine - Engine execution of flow run '5c996762-8edc-4584-9f70-4d1d6adf2b3b' aborted by orchestrator: This run has already terminated.
Some background: we do not have result persistence turned on; we were considering enabling it after upgrading Prefect.
Also, we were running into this same error ("prefect.exceptions.MissingResult: State data is missing") before the upgrade, but only when tasks ran for more than 4 hours. That's why we were considering turning on result persistence -- it seemed like the "non-persistent" state storage was only good for about 4 hours. Now it happens immediately, even with test flows that only have one dummy task.
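(For context, this is roughly what enabling result persistence looks like in Prefect 2; a minimal sketch with made-up task and flow names, set either per task/flow or globally:)

from prefect import flow, task

# Persist this task's return value to result storage so its state data can
# be re-hydrated later instead of living only in the API's state payload.
@task(persist_result=True)
def my_task():
    return 42

# persist_result can also be set at the flow level; a global default is
# available via the PREFECT_RESULTS_PERSIST_BY_DEFAULT setting.
@flow(persist_result=True)
def my_flow():
    return my_task()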

Zanie

04/21/2023, 6:06 PM
Task run ‘generate_operation_id-a9e17ed9-0’ - Task run ‘43baff18-8292-4cd7-a694-2dcf6d1a0c77’ already finished.
That indicates the task run ID already exists and is in a terminal state
That’s pretty weird for a new flow run

Ben Ayers-Glassey

04/21/2023, 7:46 PM
Yeah. Here's an example flow:
...and here's a run of it:
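(A minimal flow of that shape, one dummy task and nothing else, might look roughly like this; every name here is hypothetical except generate_operation_id, which appears in the logs above:)

from uuid import uuid4

from prefect import flow, task

@task
def generate_operation_id() -> str:
    # Dummy task: just produce a unique id for this run.
    return str(uuid4())

@flow
def example_flow():
    operation_id = generate_operation_id()
    print(f"operation id: {operation_id}")

if __name__ == "__main__":
    # Calling the flow directly like this (no deployment or agent involved)
    # is the "local run" case discussed below.
    example_flow()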

Deceivious

04/21/2023, 8:45 PM
I've been noticing this a lot as well, but I don't know how to replicate it, so I haven't been able to create a ticket. I honestly think this should be treated as a cache miss and the task should be rerun, ignoring the cache. But like I said, I don't know how to replicate it.
Also, I have result persistence on, so that isn't the solution to this.

Zanie

04/21/2023, 8:47 PM
The "cache miss" behavior you're looking for is what I'm implementing in https://github.com/PrefectHQ/prefect/pull/9152
I'm very confused by this behavior for a simple flow run like the one shared. Are you both self-hosting? Any interesting logs from the agent, or output not captured by our logger? Does this occur without a deployment (i.e. just calling the flow)?

Ben Ayers-Glassey

04/21/2023, 8:50 PM
Ah, good question. It does not happen when running locally.

Zanie

04/21/2023, 8:51 PM
I wonder if the flow is running twice concurrently somehow when managed by an agent.

Ben Ayers-Glassey

04/21/2023, 8:51 PM
Oh, that's a thought. 🤔
Here are agent logs:
20:44:50.907 | INFO    | prefect.agent - Submitting flow run 'ddf9a230-7086-4b7e-bdbf-4474d8a9ce66'
20:44:51.139 | INFO    | prefect.agent - Completed submission of flow run 'ddf9a230-7086-4b7e-bdbf-4474d8a9ce66'
20:44:51.160 | INFO    | prefect.infrastructure.kubernetes-job - Job 'cerulean-skink-j694d': Pod has status 'Pending'.
20:44:52.475 | INFO    | prefect.infrastructure.kubernetes-job - Job 'cerulean-skink-j694d': Pod has status 'Running'.
20:45:01.102 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'cerulean-skink-j694d': Job did not complete within timeout of 5s.
20:45:01.138 | INFO    | prefect.agent - Reported flow run 'ddf9a230-7086-4b7e-bdbf-4474d8a9ce66' as crashed: Flow run infrastructure exited with non-zero status code -1.
So it's complaining about a timeout. I tried setting
pod_watch_timeout_seconds: null
but that didn't fix it.
I also tried
job_watch_timeout_seconds: null
but that didn't fix it either.
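(For reference, both of those fields live on the KubernetesJob infrastructure block, which the generated deployment YAML reflects. A quick sketch of constructing one directly, with arbitrary example values:)

from prefect.infrastructure import KubernetesJob

# job_watch_timeout_seconds: how long the agent watches the Kubernetes Job
# for completion before reporting the flow run as crashed.
# pod_watch_timeout_seconds: how long the agent waits for the Pod to start.
k8s_job = KubernetesJob(
    job_watch_timeout_seconds=None,  # None = no timeout on the job watch
    pod_watch_timeout_seconds=60,
)
print(k8s_job.job_watch_timeout_seconds, k8s_job.pod_watch_timeout_seconds)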

Zanie

04/21/2023, 8:53 PM
Huh, the job watch one looks set to 5s there, which would definitely cause problems. If the agent marks a flow as failed, then the next task run will be aborted.
It looks like the error message on the run is just misleading; this is likely the root cause

Ben Ayers-Glassey

04/21/2023, 8:54 PM
Oh nice.

Zanie

04/21/2023, 8:54 PM
Sounds like you're having a hard time changing that setting, so perhaps that's the next thing to investigate.

Ben Ayers-Glassey

04/21/2023, 8:55 PM
Yeah, I'll focus on that. Any tips before I go poking at it? Like, any idea why changing the value in the deployment .yaml didn't seem to affect what was logged?

Zanie

04/21/2023, 8:56 PM
I don't have a great intuition for how things are updated from the yaml - that's changed a lot recently.
Does
prefect deployment inspect
or
prefect block inspect
for the relevant objects show a value for that field?

Ben Ayers-Glassey

04/21/2023, 9:38 PM
Ah ha, it does not. Or rather, it shows the default value of 5. So, that field of the deployment .yaml isn't being picked up by
prefect deployment apply
? Is that a known issue that people have run into before? Or is it possible I'm doing something wrong?
Or I guess... the deployment .yaml file has a scary comment which says:
###
### DO NOT EDIT BELOW THIS LINE
###
...and we are totally editing below that line. It worked until now, but maybe the time has come to stop doing that. So I guess... I need to look at our "infrastructure block" and see if I can change
job_watch_timeout_seconds
there.
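(Roughly what that change could look like, assuming the standard block load/save API; the block name here is a placeholder:)

from prefect.infrastructure import KubernetesJob

# Load the existing infrastructure block, raise (or clear) the job watch
# timeout, and save it back under the same name.
k8s_job = KubernetesJob.load("k8s-infra")  # placeholder block name
k8s_job.job_watch_timeout_seconds = None  # or a value longer than the slowest flow run
k8s_job.save("k8s-infra", overwrite=True)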
AMAZING! It worked. Thank you @Zanie!
(attached screenshot: Screenshot from 2023-04-21 14-54-16.png)

Dekel R

05/04/2023, 1:58 PM
@Ofek Katriel