# prefect-community
d
I’m seeing a bug whereby the agent sometimes submits the same flow run twice in quick succession when I hit Quick Run via the UI - is this a known issue? I’d have expected idempotency controls in either the UI or the agent to ensure the same flow run doesn’t get submitted twice. I’ve hit this twice today already; details in 🧵
```
11:34:24.700 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
11:34:24.823 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
11:34:24.998 | INFO    | prefect.agent - Submitting flow run 'fbf28ef3-41d2-4f70-915c-eca600972955'
11:34:25.656 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rgk4z': Starting watch for pod start...
11:34:25.670 | INFO    | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rgk4z': Pod has status 'Pending'.
11:34:25.804 | INFO    | prefect.agent - Completed submission of flow run 'fbf28ef3-41d2-4f70-915c-eca600972955'
11:34:27.465 | INFO    | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rgk4z': Pod has status 'Running'.
11:34:27.466 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rgk4z': Starting watch for job completion
11:34:33.123 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
11:34:33.456 | INFO    | prefect.agent - Submitting flow run 'fbf28ef3-41d2-4f70-915c-eca600972955'
11:34:34.084 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rbp5m': Starting watch for pod start...
11:34:34.092 | INFO    | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rbp5m': Pod has status 'Pending'.
11:34:34.224 | INFO    | prefect.agent - Completed submission of flow run 'fbf28ef3-41d2-4f70-915c-eca600972955'
11:34:35.082 | INFO    | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rbp5m': Pod has status 'Running'.
11:34:35.083 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rbp5m': Starting watch for job completion
11:34:37.752 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
11:34:43.503 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
11:34:48.388 | DEBUG   | prefect.agent - Checking for cancelled flow runs...



-- another occurrence below


14:35:49.229 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:35:49.538 | INFO    | prefect.agent - Submitting flow run '8ccab014-7150-43d8-a701-13d557d66c51'
14:35:50.459 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'silky-serval-8f9xc': Starting watch for pod start...
14:35:50.479 | INFO    | prefect.infrastructure.kubernetes-job - Job 'silky-serval-8f9xc': Pod has status 'Pending'.
14:35:50.906 | INFO    | prefect.agent - Completed submission of flow run '8ccab014-7150-43d8-a701-13d557d66c51'
14:35:51.385 | INFO    | prefect.infrastructure.kubernetes-job - Job 'silky-serval-8f9xc': Pod has status 'Running'.
14:35:51.386 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'silky-serval-8f9xc': Starting watch for job completion
14:35:55.925 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:35:57.197 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:35:57.494 | INFO    | prefect.agent - Submitting flow run '8ccab014-7150-43d8-a701-13d557d66c51'
14:35:58.139 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'silky-serval-4vx98': Starting watch for pod start...
14:35:58.148 | INFO    | prefect.infrastructure.kubernetes-job - Job 'silky-serval-4vx98': Pod has status 'Pending'.
14:35:58.262 | INFO    | prefect.agent - Completed submission of flow run '8ccab014-7150-43d8-a701-13d557d66c51'
14:35:58.987 | INFO    | prefect.infrastructure.kubernetes-job - Job 'silky-serval-4vx98': Pod has status 'Running'.
14:35:58.987 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'silky-serval-4vx98': Starting watch for job completion
14:36:05.043 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:36:06.432 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:36:17.168 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
```
All timestamps in UTC
Prefect 2 Cloud - https://app.prefect.cloud/account/1235831b-7019-4987-b759-4341850b4c5a/workspace/85fb93b5-ea3a-4da7-8be7-aa4e13c6397d
```
Version:             2.7.11
API version:         0.8.4
Python version:      3.8.7
Git commit:          6b27b476
Built:               Thu, Feb 2, 2023 7:22 PM
OS/Arch:             darwin/x86_64
Profile:             bi-p
Server type:         cloud
```
Specifically, this is a Kubernetes agent.
It spawns two k8s Jobs with the same name (but different suffixes), which then compete with one another; in one case this confused and crashed the Dask workers and the whole flow run.
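The matching prefixes with differing suffixes are consistent with Kubernetes `generateName` semantics: each create call produces a distinct Job whose name is the given prefix plus a random suffix, so a double submission shows up as two Jobs. A minimal sketch of that naming behavior, assuming an illustrative `build_job_manifest` helper rather than Prefect’s actual manifest-building code:
```python
# Sketch: why two submissions of one flow run appear as Jobs with the
# same prefix but different suffixes. With metadata.generateName the
# Kubernetes API server appends a random suffix on every create, e.g.
# 'knowing-macaque-' -> 'knowing-macaque-rgk4z', 'knowing-macaque-rbp5m'.
# Illustrative only; not Prefect's actual manifest-building code.
def build_job_manifest(flow_run_slug: str) -> dict:
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"generateName": f"{flow_run_slug}-"},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "flow", "image": "prefecthq/prefect:2-latest"}
                    ],
                    "restartPolicy": "Never",
                }
            }
        },
    }

print(build_job_manifest("knowing-macaque")["metadata"])
```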
z
Hm, this is weird. This shouldn’t have anything to do with the UI; it looks like a bug in the agent implementation.
cc @alex / @Peyton Runyan - is this something that could have broken with the work pool changes?
a
@David Elliott What command did you run to start your agent?
d
It’s in a k8s YAML template, but it’s:
```yaml
command: ["/bin/bash", "-c"]
args:
  - "prefect agent start -q bi-pipeline-other-local-dev-k8s -q bi-pipeline-main-local-dev-k8s"
```
a
Thanks! I’m not seeing anything in the work pool changes to the agent implementation that would cause this. Based on the timestamps, it looks like the flow run is submitted but then gets returned again when the work queue is queried for available flow runs on the next pass. That makes me think the flow run state isn’t being updated correctly, or that there’s some sort of race condition. @David Elliott how often do you see this bug when creating quick runs?
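Reading the timestamps that way, the failure mode would be: the agent polls, submits the run, and marks it pending, but the next poll roughly ten seconds later still sees the run as scheduled and hands it out again. Below is a minimal sketch of a poll loop with a client-side de-duplication guard that would mask such a race; the helpers are hypothetical stand-ins, not Prefect’s actual agent code:
```python
import asyncio
from dataclasses import dataclass

# Sketch of a client-side guard against double submission, assuming a
# hypothetical agent poll loop; not Prefect's actual implementation.

@dataclass(frozen=True)
class FlowRun:
    id: str

async def get_scheduled_flow_runs():
    """Hypothetical stand-in for the agent's work-queue query."""
    return [FlowRun(id="fbf28ef3-41d2-4f70-915c-eca600972955")]

async def submit(run):
    """Hypothetical stand-in for infrastructure submission."""
    print(f"submitting {run.id}")

submitting_flow_run_ids = set()

async def poll_once():
    for run in await get_scheduled_flow_runs():
        if run.id in submitting_flow_run_ids:
            # The server handed back a run we already submitted;
            # its pending transition hasn't become visible yet.
            print(f"skipping duplicate {run.id}")
            continue
        submitting_flow_run_ids.add(run.id)
        await submit(run)

async def main():
    await poll_once()  # first poll: submits the run
    await poll_once()  # second poll: same run returned, skipped

asyncio.run(main())
```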
d
First time I’ve seen this in Prefect 2, and only starting today - the two examples above are all I’ve had so far. Agreed, it feels like a race condition around the flow run state getting updated… is it the agent that updates the flow run state, or the flow run pod itself? If it’s the agent, I’d expect it to finish updating the flow run state before pulling new flow runs off the queue?
a
Yeah, the agent updates the flow run state to pending before spinning up a pod to execute the flow run, so the flow run shouldn’t be returned by subsequent queries of the work queues. I’ll check whether we’ve made any updates to the flow runs / work queues API recently.
👍 1
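For context, the idempotency here normally comes from the orchestration rules rather than the agent: a run can move from scheduled to pending once, and a second pending proposal for the same run is rejected. A duplicate submission therefore suggests the first transition wasn’t visible to the second query. A minimal sketch of that guard, illustrative rather than Prefect’s real orchestration logic:
```python
from enum import Enum

class StateType(Enum):
    SCHEDULED = "SCHEDULED"
    PENDING = "PENDING"
    RUNNING = "RUNNING"

def accept_pending_proposal(current):
    """Accept SCHEDULED -> PENDING; reject everything else.

    Illustrative guard only; Prefect's real orchestration rules are
    richer, but the idea is the same: a second pending proposal for an
    already-pending run is aborted, so a duplicate submission implies
    the first transition wasn't visible yet.
    """
    return current is StateType.SCHEDULED

assert accept_pending_proposal(StateType.SCHEDULED)    # first submission
assert not accept_pending_proposal(StateType.PENDING)  # duplicate rejected
```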
z
We’re encountering some database issues today, but a fix has just been deployed to production.
We suspect that is the cause of the issue.
👍 1
d
Just happened again now btw
```
14:40:32.058 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:40:40.974 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:40:43.314 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:40:43.678 | INFO    | prefect.agent - Submitting flow run '5592f7a9-c2a0-467b-a135-10aa3fca19c3'
14:40:44.551 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-sgxdv': Starting watch for pod start...
14:40:44.559 | INFO    | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-sgxdv': Pod has status 'Pending'.
14:40:44.674 | INFO    | prefect.agent - Completed submission of flow run '5592f7a9-c2a0-467b-a135-10aa3fca19c3'
14:40:46.559 | INFO    | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-sgxdv': Pod has status 'Running'.
14:40:46.560 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-sgxdv': Starting watch for job completion
14:40:51.083 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:40:52.187 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:40:52.782 | INFO    | prefect.agent - Submitting flow run '5592f7a9-c2a0-467b-a135-10aa3fca19c3'
14:40:53.471 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-tcqwl': Starting watch for pod start...
14:40:53.481 | INFO    | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-tcqwl': Pod has status 'Pending'.
14:40:53.717 | INFO    | prefect.agent - Completed submission of flow run '5592f7a9-c2a0-467b-a135-10aa3fca19c3'
14:40:54.823 | INFO    | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-tcqwl': Pod has status 'Running'.
14:40:54.824 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-tcqwl': Starting watch for job completion
14:41:03.236 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:41:03.541 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:41:12.157 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
```