
John-Craig Borman

04/05/2023, 2:24 PM
Hi all, has anyone seen flows crash with messages like this?
13:29:44.526 | INFO    | prefect.engine - Engine execution of flow run '0f87d3ab-658a-4fb8-bf9d-9daf903bcaf1' aborted by orchestrator: This run cannot transition to the RUNNING state from the RUNNING state.

13:29:43.035 | INFO    | Flow run 'unnatural-leech' - Downloading flow code from storage at None
/usr/lib/python3.10/runpy.py:126: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
prefect version:
Version:             2.8.7
API version:         0.8.4
Python version:      3.10.10
Git commit:          a6d6c6fc
Built:               Thu, Mar 23, 2023 3:27 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.31.1

Zanie

04/05/2023, 2:25 PM
We’ve removed that orchestration rule in the latest version, so you shouldn’t see that error if you upgrade.
However, the basic idea is that the flow run is attempting to enter the RUNNING state when it has already started RUNNING elsewhere. This is generally due to a restarted pod or something similar?
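If you want to confirm that, you can read the run's state history back from the API. Here's a minimal sketch, assuming Prefect 2.x and using the flow run ID from your log above (adjust for your own run):
```
import asyncio
from uuid import UUID

from prefect import get_client

FLOW_RUN_ID = UUID("0f87d3ab-658a-4fb8-bf9d-9daf903bcaf1")  # from the log above

async def main():
    async with get_client() as client:
        # Every state transition the orchestrator recorded for this run;
        # a RUNNING entry followed by a second RUNNING attempt is the
        # rejected transition from the error message.
        states = await client.read_flow_run_states(FLOW_RUN_ID)
        for state in states:
            print(state.timestamp, state.type.value, state.name)

asyncio.run(main())
```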

John-Craig Borman

04/05/2023, 2:26 PM
That could be the case; these runs are deployed in a K8s environment
But this is helpful. I'll get in touch with our infra team and hopefully we can isolate the root cause
Still seeing this error on 2.10.5 in a k8s environment

Zanie

05/01/2023, 4:33 PM
Your server is upgraded as well?

John-Craig Borman

05/01/2023, 4:44 PM
Yeah, this is being run on a worker with the same version
I sent this to one of my colleagues regarding their flow:
the logs there are probably misleading. What most likely happened is:
1. Your deployment triggered a flow run on some Pod 1
2. The Prefect flow entered the RUNNING state
a. Pod 1 then crashed mid-flow (for some unknown reason, possibly OOM as you mention)
3. K8s saw the crashed pod and tried to recover by creating Pod 2 to execute the same flow
4. The Prefect flow tried to enter the RUNNING state again, but it was already RUNNING
a. Prefect's state machine complains that this transition is invalid
b. The error we see above gets raised after the OOM (or other failure) that actually crashed Pod 1
We suspect this is roughly what happened in our K8s deployment, with the error message above being a symptom of some underlying failure rather than the root cause (one possible K8s-side mitigation is sketched below)
It doesn't help that we're not able to find the logs for Pod 1
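One mitigation we're looking at, assuming you're using Prefect 2's KubernetesJob infrastructure block: set backoffLimit to 0 on the generated Job so K8s never creates a Pod 2 to re-run an already-RUNNING flow. Rough sketch (the block name is made up, and the patch is a guess at what fits your base manifest):
```
from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    customizations=[
        # JSON 6902 patch applied on top of Prefect's base Job manifest.
        # backoffLimit: 0 tells K8s not to create replacement pods for a
        # failed Job, so a crashed flow run surfaces as crashed instead
        # of being re-submitted against the orchestrator.
        {"op": "add", "path": "/spec/backoffLimit", "value": 0},
    ],
)
k8s_job.save("no-retry-job", overwrite=True)  # hypothetical block name
```
The trade-off is that K8s won't retry a pod lost to OOM at all, so you'd lean on Prefect-level retries instead.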