# prefect-community
Blake Stefansen
Hi everyone, my team is getting Kubernetes deployment jobs where the job log states the flow run cannot transition to RUNNING:
```
Engine execution of flow run '148d81ea-3cfe-4db1-a0d4-3f3f17748fb0' aborted by orchestrator: This run cannot transition to the RUNNING state from the RUNNING state.
```
Will add more details to thread
ENV: prefect==2.6.4, python==3.10

We have a for loop that uses the prefect.deployments Python library to create deployment runs: https://docs.prefect.io/api-ref/prefect/deployments/?h=run_deplo#prefect.deployments.run_deployment

The deployment uses S3 storage and Kubernetes job infrastructure. This flow run state error only shows up on some flow runs, while others work just fine. Our Kubernetes cluster doesn't appear to be overloaded (it was at first, and we thought that was the issue, but we still get this error at normal load levels). We are exploring whether this happens because flow runs in the for loop are using the same flow run names and are therefore trying to create duplicate jobs, confusing the Kubernetes agent.
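For context, a minimal sketch of the loop pattern described above, under stated assumptions: the deployment name, parameters, and run names are placeholders, and the `flow_run_name` and `timeout` arguments should be verified against the installed Prefect 2.x release.

```python
from prefect.deployments import run_deployment

# Hypothetical work items; in the real setup these drive the deployment parameters.
ITEMS = ["customer-a", "customer-b", "customer-c"]

for item in ITEMS:
    flow_run = run_deployment(
        name="my-flow/my-k8s-deployment",   # "<flow name>/<deployment name>", placeholder
        parameters={"target": item},
        # Assumed parameter: gives each run a distinct name so duplicate-looking
        # runs are easy to tell apart in the UI and in the Kubernetes job names.
        flow_run_name=f"my-flow-{item}",
        # Assumed behavior: timeout=0 returns immediately instead of blocking
        # until the triggered flow run finishes.
        timeout=0,
    )
    print(flow_run.id, flow_run.state)
```

If every iteration produces the same flow run name, giving each run a unique name at submission time at least makes it obvious in the UI whether two Kubernetes jobs ended up pointing at the same flow run.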
Mason Menges
Hey @Blake Stefansen, I'm assuming you have multiple agents set up on the same work queue. This is just an orchestration rule that prevents both agents from running the same flow run.
Blake Stefansen
@Mason Menges Thank you for reaching out! I can't seem to track down any other agents listening on the same queue. Does Prefect have an API or some other method to see all the agents polling a particular work queue?
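One way to narrow this down, short of a list of agents per queue, is to look at the recorded state history of an affected flow run and see what set it to RUNNING and when. A rough sketch, assuming the client exposes `read_flow_run` and `read_flow_run_states` as in Prefect 2.x (verify against your installed version; the flow run ID below is the one from the error above):

```python
import asyncio

from prefect import get_client


async def show_state_history(flow_run_id: str) -> None:
    """Print every state transition recorded for a flow run.

    Assumes the Prefect 2.x client methods read_flow_run and
    read_flow_run_states; check your version's client API.
    """
    async with get_client() as client:
        flow_run = await client.read_flow_run(flow_run_id)
        states = await client.read_flow_run_states(flow_run_id)
        print(f"{flow_run.name}: current state = {flow_run.state}")
        for state in states:
            print(f"{state.timestamp}  {state.type}  {state.name}")


asyncio.run(show_state_history("148d81ea-3cfe-4db1-a0d4-3f3f17748fb0"))
```

Two RUNNING transitions close together would point at a second pod (or second agent) picking up the same run, rather than a stuck single run.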
I believe I have the same issue as here: https://github.com/PrefectHQ/prefect/issues/7116. It sounds like our Kubernetes cluster is rescheduling pods, and when it does, the new pod fails to run because the flow run state in Prefect is already RUNNING. This seems like a big issue, unless there is a better way to spec our job infrastructure. Below is the default spec generated by the CLI (minus the image pull secret):
```json
{
  "kind": "Job",
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "env": [],
            "name": "prefect-job"
          }
        ],
        "completions": 1,
        "parallelism": 1,
        "restartPolicy": "Never",
        "imagePullSecrets": [
          {
            "name": "dockercloud-secret"
          }
        ]
      }
    }
  },
  "metadata": {
    "labels": {}
  },
  "apiVersion": "batch/v1"
}
```
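One possible mitigation for the retry side of this, sketched below under assumptions: the block name, image, and namespace are placeholders, and the `customizations` field is assumed to accept JSON 6902 patch operations as on the Prefect 2.x `KubernetesJob` block. Setting `backoffLimit: 0` tells the Kubernetes Job controller not to create a replacement pod after a failure, so a retried pod can't try to move an already-RUNNING flow run back into RUNNING. It does not help with evictions or node pressure, only with failure retries.

```python
from prefect.infrastructure import KubernetesJob

# Sketch only: image, namespace, and block name are placeholders.
k8s_job = KubernetesJob(
    image="my-registry/my-image:latest",
    namespace="prefect",
    customizations=[
        # JSON 6902 patch applied to the generated Job manifest (assumed
        # behavior of the customizations field in Prefect 2.x).
        # backoffLimit: 0 disables the Job controller's pod retries.
        {"op": "add", "path": "/spec/backoffLimit", "value": 0},
    ],
)
k8s_job.save("my-k8s-job", overwrite=True)
```

The trade-off is that a transient pod failure then leaves the flow run in a crashed/failed state instead of being retried by Kubernetes, which may or may not be acceptable depending on how the runs are triggered.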
Mason Menges
Hmm, I'm not super familiar with K8s itself, so I'm not totally certain whether there's another way to spec it that would address this. I know we're talking about this and other improvements we can make to the engine that would ideally address it, but I can't point to anything concrete at the moment. The issue currently is that if we change the existing rules, we end up in a situation where two agents could be duplicating work, which is arguably the more destructive pattern, but it's definitely on the radar for enhancements.
Dekel
Hey, do you happen to have any update about this? We are having the same issue: “aborted by orchestrator: This run cannot transition to the RUNNING state from the RUNNING state”. Thanks
b
Hello Dekel, we received your report through email. We will continue our investigation and provide updates through the email thread if that's alright with you! If someone has any additional information, I'm sure they'll contribute here as well.
Dekel
Yeah sure, I assumed I’m not the only one who sees this log, so I posted it here too (-: Thanks
j
@Ton Steijvers same issue