# ask-community
b
We're running Prefect Orion in a self-hosted Kubernetes cluster. We upgraded from 2.7.4 to 2.8.0, and now when we run any flow, we immediately run into an error where Prefect can't load its state.
Copy code
2023-04-21T15:16:04.920982508Z /usr/local/lib/python3.8/runpy.py:127: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
2023-04-21T15:16:04.921075059Z   warn(RuntimeWarning(msg))
2023-04-21T15:16:04.997977913Z 15:16:04.996 | INFO    | Flow run 'phi321-lysenko-tensor' - Created task run 'generate_operation_id-a9e17ed9-0' for task 'generate_operation_id'
2023-04-21T15:16:04.999103995Z 15:16:04.997 | INFO    | Flow run 'phi321-lysenko-tensor' - Executing 'generate_operation_id-a9e17ed9-0' immediately...
2023-04-21T15:16:05.043778806Z 15:16:05.042 | INFO    | Task run 'generate_operation_id-a9e17ed9-0' - Task run '43baff18-8292-4cd7-a694-2dcf6d1a0c77' already finished.
2023-04-21T15:16:05.119806977Z 15:16:05.044 | ERROR   | Flow run 'phi321-lysenko-tensor' - Encountered exception during execution:
2023-04-21T15:16:05.119866174Z Traceback (most recent call last):
2023-04-21T15:16:05.119875789Z   File "/usr/local/lib/python3.8/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
2023-04-21T15:16:05.119882275Z     result = await run_sync(flow_call)
2023-04-21T15:16:05.119888470Z   File "/usr/local/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
2023-04-21T15:16:05.119896590Z     return await anyio.to_thread.run_sync(
2023-04-21T15:16:05.119904639Z   File "/usr/local/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
2023-04-21T15:16:05.119911921Z     return await get_asynclib().run_sync_in_worker_thread(
2023-04-21T15:16:05.119917612Z   File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
2023-04-21T15:16:05.119924062Z     return await future
 ...etc...
2023-04-21T15:16:05.120058513Z   File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
2023-04-21T15:16:05.120063946Z     raise self._exception
2023-04-21T15:16:05.120069588Z   File "/usr/local/lib/python3.8/site-packages/prefect/engine.py", line 1068, in get_task_call_return_value
2023-04-21T15:16:05.120078014Z     return await future._result()
2023-04-21T15:16:05.120085086Z   File "/usr/local/lib/python3.8/site-packages/prefect/futures.py", line 237, in _result
2023-04-21T15:16:05.120091377Z     return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
2023-04-21T15:16:05.120097177Z   File "/usr/local/lib/python3.8/site-packages/prefect/states.py", line 100, in _get_state_result
2023-04-21T15:16:05.120103360Z     raise MissingResult(
2023-04-21T15:16:05.120110136Z prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.
2023-04-21T15:16:05.141538481Z 15:16:05.140 | INFO    | prefect.engine - Engine execution of flow run '5c996762-8edc-4584-9f70-4d1d6adf2b3b' aborted by orchestrator: This run has already terminated.
Some background: we do not have state persistence turned on; we were considering doing so after upgrading Prefect.
Also, we were running into this same error ("prefect.exceptions.MissingResult: State data is missing") before the upgrade, but only when tasks ran for more than 4 hours. That's why we were considering turning on result persistence -- it seemed like the "non-persistent" state storage was only good for 4 hours, or something. Now it happens immediately, even with test flows that only have one dummy task.
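For what it's worth, our test flow is basically this shape (a rough sketch, not our exact code; names are made up). The persist_result flags shown are the Prefect 2 decorator options we'd turn on if we enable persistence; today they're off for us.
Copy code
from prefect import flow, task

# Sketch of a one-dummy-task test flow. persist_result=True is the option
# we'd enable if we turn on result persistence; today we run with the default (off).
@task(persist_result=True)
def generate_operation_id() -> str:
    return "dummy-operation-id"

@flow(persist_result=True)
def dummy_test_flow():
    return generate_operation_id()

if __name__ == "__main__":
    dummy_test_flow()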
z
Task run ‘generate_operation_id-a9e17ed9-0’ - Task run ‘43baff18-8292-4cd7-a694-2dcf6d1a0c77’ already finished.
Indicates that the task run id already exists and has a terminal state
That’s pretty weird for a new flow run
b
Yeah. Here's an example flow:
...and here's a run of it:
d
I've been noticing this a lot as well, but I don't know how to replicate it, so I haven't been able to create a ticket. I honestly think this should be treated as a cache miss and the task should be rerun, ignoring the cache. But like I said, I don't know how to replicate it.
Also, I have result persistence on, so that isn't the solution to this.
z
The "cache miss" behavior you're looking for is what I'm implementing in https://github.com/PrefectHQ/prefect/pull/9152
I'm very confused by this behavior for a simple flow run like the one shared. Are you both self-hosting? Any interesting logs from the agent, or output not captured by our logger? Does this occur without a deployment (i.e., just calling the flow)?
b
Ah, good question. It does not happen when running locally.
z
I wonder if the flow is running twice concurrently somehow when managed by an agent.
b
Oh, that's a thought. 🤔
Here are agent logs:
Copy code
20:44:50.907 | INFO    | prefect.agent - Submitting flow run 'ddf9a230-7086-4b7e-bdbf-4474d8a9ce66'
20:44:51.139 | INFO    | prefect.agent - Completed submission of flow run 'ddf9a230-7086-4b7e-bdbf-4474d8a9ce66'
20:44:51.160 | INFO    | prefect.infrastructure.kubernetes-job - Job 'cerulean-skink-j694d': Pod has status 'Pending'.
20:44:52.475 | INFO    | prefect.infrastructure.kubernetes-job - Job 'cerulean-skink-j694d': Pod has status 'Running'.
20:45:01.102 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'cerulean-skink-j694d': Job did not complete within timeout of 5s.
20:45:01.138 | INFO    | prefect.agent - Reported flow run 'ddf9a230-7086-4b7e-bdbf-4474d8a9ce66' as crashed: Flow run infrastructure exited with non-zero status code -1.
So it's complaining about a timeout. I tried setting
pod_watch_timeout_seconds: null
but that didn't fix it.
I also tried
job_watch_timeout_seconds: null
but that didn't fix it either.
z
Huh, the job watch one looks set to 5s there, which would definitely cause problems. If the agent marks a flow as failed, then the next task run will be aborted.
It looks like the error message on the run is just misleading; this is likely the root cause
b
Oh nice.
z
Sounds like you're having a hard time changing that setting, so perhaps that's the next thing to investigate.
b
Yeah, I'll focus on that. Any tips before I go poking at it? Like, why didn't changing the value in the deployment .yaml seem to affect what was logged?
z
I don't have a great intuition for how things are updated from the yaml - that's changed a lot recently.
Does
prefect deployment inspect
or
prefect block inspect
for the relevant objects show a value for that field?
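Or, if it's easier to check in Python, something like this should show what's actually set on the block (the block name here is just a placeholder for whatever your deployment references):
Copy code
from prefect.infrastructure import KubernetesJob

# "my-k8s-job" is a placeholder; use whatever block your deployment references
job_block = KubernetesJob.load("my-k8s-job")
print(job_block.job_watch_timeout_seconds, job_block.pod_watch_timeout_seconds)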
b
Ah ha, it does not. Or rather, it shows the default value of 5. So, that field of the deployment .yaml isn't being picked up by
prefect deployment apply
? Is that a known kind of issue people have encountered in the past? Or is it possible I'm doing something wrong?
Or I guess... the deployment .yaml file has a scary comment which says:
Copy code
###
### DO NOT EDIT BELOW THIS LINE
###
...and we are totally editing below that line. It worked until now, but maybe the time has come when we shouldn't do that anymore. So I guess... I need to look at our "infrastructure block" and see if I can change
job_watch_timeout_seconds
there.
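Roughly what I ended up doing, sketched from memory (the block name is a placeholder for our real one):
Copy code
from prefect.infrastructure import KubernetesJob

# placeholder block name; ours is the block the deployment points at
job_block = KubernetesJob.load("my-k8s-job")
job_block.job_watch_timeout_seconds = None  # it was showing 5 before; None (or a generous value) keeps the watch from timing out
job_block.save("my-k8s-job", overwrite=True)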
AMAZING! It worked. Thank you @Zanie!
d
@Ofek Katriel