# ask-community
b
We're running Prefect Orion in a self-hosted Kubernetes cluster. We upgraded from 2.7.4 to 2.8.0, and now when we run any flow, we immediately run into an error where Prefect can't load its state.
Copy code
2023-04-21T15:16:04.920982508Z /usr/local/lib/python3.8/runpy.py:127: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
2023-04-21T15:16:04.921075059Z   warn(RuntimeWarning(msg))
2023-04-21T15:16:04.997977913Z 15:16:04.996 | INFO    | Flow run 'phi321-lysenko-tensor' - Created task run 'generate_operation_id-a9e17ed9-0' for task 'generate_operation_id'
2023-04-21T15:16:04.999103995Z 15:16:04.997 | INFO    | Flow run 'phi321-lysenko-tensor' - Executing 'generate_operation_id-a9e17ed9-0' immediately...
2023-04-21T15:16:05.043778806Z 15:16:05.042 | INFO    | Task run 'generate_operation_id-a9e17ed9-0' - Task run '43baff18-8292-4cd7-a694-2dcf6d1a0c77' already finished.
2023-04-21T15:16:05.119806977Z 15:16:05.044 | ERROR   | Flow run 'phi321-lysenko-tensor' - Encountered exception during execution:
2023-04-21T15:16:05.119866174Z Traceback (most recent call last):
2023-04-21T15:16:05.119875789Z   File "/usr/local/lib/python3.8/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
2023-04-21T15:16:05.119882275Z     result = await run_sync(flow_call)
2023-04-21T15:16:05.119888470Z   File "/usr/local/lib/python3.8/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
2023-04-21T15:16:05.119896590Z     return await anyio.to_thread.run_sync(
2023-04-21T15:16:05.119904639Z   File "/usr/local/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
2023-04-21T15:16:05.119911921Z     return await get_asynclib().run_sync_in_worker_thread(
2023-04-21T15:16:05.119917612Z   File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
2023-04-21T15:16:05.119924062Z     return await future
 ...etc...
2023-04-21T15:16:05.120058513Z   File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
2023-04-21T15:16:05.120063946Z     raise self._exception
2023-04-21T15:16:05.120069588Z   File "/usr/local/lib/python3.8/site-packages/prefect/engine.py", line 1068, in get_task_call_return_value
2023-04-21T15:16:05.120078014Z     return await future._result()
2023-04-21T15:16:05.120085086Z   File "/usr/local/lib/python3.8/site-packages/prefect/futures.py", line 237, in _result
2023-04-21T15:16:05.120091377Z     return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
2023-04-21T15:16:05.120097177Z   File "/usr/local/lib/python3.8/site-packages/prefect/states.py", line 100, in _get_state_result
2023-04-21T15:16:05.120103360Z     raise MissingResult(
2023-04-21T15:16:05.120110136Z prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.
2023-04-21T15:16:05.141538481Z 15:16:05.140 | INFO    | prefect.engine - Engine execution of flow run '5c996762-8edc-4584-9f70-4d1d6adf2b3b' aborted by orchestrator: This run has already terminated.
Some background: we do not have state persistence turned on; we were considering doing so after upgrading Prefect.
Also, we were running into this same error ("prefect.exceptions.MissingResult: State data is missing") before the upgrade, but only when tasks ran for more than 4 hours. That's why we were considering turning on result persistence -- it seemed like the "non-persistent" state storage was only good for 4 hours, or something. Now it happens immediately, even with test flows that only have one dummy task.
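For what it's worth, our test flow is basically this shape (a rough sketch, not our exact code; names are made up). The persist_result flags shown are the Prefect 2 decorator options we'd turn on if we enable persistence; today they're off for us.
Copy code
from prefect import flow, task

# Sketch of a one-dummy-task test flow. persist_result=True is the option
# we'd enable if we turn on result persistence; today we run with the default (off).
@task(persist_result=True)
def generate_operation_id() -> str:
    return "dummy-operation-id"

@flow(persist_result=True)
def dummy_test_flow():
    return generate_operation_id()

if __name__ == "__main__":
    dummy_test_flow()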
z
Task run ‘generate_operation_id-a9e17ed9-0’ - Task run ‘43baff18-8292-4cd7-a694-2dcf6d1a0c77’ already finished.
Indicates that the task run id already exists and has a terminal state
That’s pretty weird for a new flow run
b
Yeah. Here's an example flow:
...and here's a run of it:
d
I've been noticing this a lot as well, but I don't know how to replicate it, so I haven't been able to create a ticket. I honestly think this should be treated as a cache miss and the task should be rerun, ignoring the cache. But like I said, I don't know how to replicate it.
Also, I have result persistence on, so that isn't the solution to this.
z
The "cache miss" behavior you're looking for is what I'm implementing in https://github.com/PrefectHQ/prefect/pull/9152
I'm very confused by this behavior for a simple flow run like the one shared. Are you both self-hosting? Any interesting logs from the agent, or output not captured by our logger? Does this occur without a deployment (i.e., just calling the flow)?
b
Ah, good question. It does not happen when running locally.
z
I wonder if the flow is running twice concurrently somehow when managed by an agent.
b
Oh, that's a thought. 🤔
Here are agent logs:
Copy code
20:44:50.907 | INFO    | prefect.agent - Submitting flow run 'ddf9a230-7086-4b7e-bdbf-4474d8a9ce66'
20:44:51.139 | INFO    | prefect.agent - Completed submission of flow run 'ddf9a230-7086-4b7e-bdbf-4474d8a9ce66'
20:44:51.160 | INFO    | prefect.infrastructure.kubernetes-job - Job 'cerulean-skink-j694d': Pod has status 'Pending'.
20:44:52.475 | INFO    | prefect.infrastructure.kubernetes-job - Job 'cerulean-skink-j694d': Pod has status 'Running'.
20:45:01.102 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'cerulean-skink-j694d': Job did not complete within timeout of 5s.
20:45:01.138 | INFO    | prefect.agent - Reported flow run 'ddf9a230-7086-4b7e-bdbf-4474d8a9ce66' as crashed: Flow run infrastructure exited with non-zero status code -1.
So it's complaining about a timeout. I tried setting
pod_watch_timeout_seconds: null
but that didn't fix it.
I also tried
job_watch_timeout_seconds: null
but that didn't fix it either.
z
Huh, the job watch one looks set to 5s there, which would definitely cause problems. If the agent marks a flow as failed, then the next task run will be aborted.
It looks like the error message on the run is just misleading; this is likely the root cause
b
Oh nice.
z
Sounds like you're having a hard time changing that setting, so perhaps that's the next thing to investigate.
b
Yeah, I'll focus on that. Any tips before I go poking at it? Like, why didn't changing the value in the deployment .yaml seem to affect what was logged?
z
I don't have a great intuition for how things are updated from the yaml - that's changed a lot recently.
Does
prefect deployment inspect
or
prefect block inspect
for the relevant objects show a value for that field?
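Or, if it's easier to check in Python, something like this should show what's actually set on the block (the block name here is just a placeholder for whatever your deployment references):
Copy code
from prefect.infrastructure import KubernetesJob

# "my-k8s-job" is a placeholder; use whatever block your deployment references
job_block = KubernetesJob.load("my-k8s-job")
print(job_block.job_watch_timeout_seconds, job_block.pod_watch_timeout_seconds)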
b
Ah ha, it does not. Or rather, it shows the default value of 5. So, that field of the deployment .yaml isn't being picked up by
prefect deployment apply
? Is that a known kind of issue people have encountered in the past? Or is it possible I'm doing something wrong?
Or I guess... the deployment .yaml file has a scary comment which says:
Copy code
###
### DO NOT EDIT BELOW THIS LINE
###
...and we are totally editing below that line. It worked until now, but maybe the time has come when we shouldn't do that anymore. So I guess... I need to look at our "infrastructure block" and see if I can change
job_watch_timeout_seconds
there.
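Roughly what I ended up doing, sketched from memory (the block name is a placeholder for our real one):
Copy code
from prefect.infrastructure import KubernetesJob

# placeholder block name; ours is the block the deployment points at
job_block = KubernetesJob.load("my-k8s-job")
job_block.job_watch_timeout_seconds = None  # it was showing 5 before; None (or a generous value) keeps the watch from timing out
job_block.save("my-k8s-job", overwrite=True)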
AMAZING! It worked. Thank you @Zanie!
d
@Ofek Katriel