
David Elliott

02/22/2023, 2:41 PM
I’m seeing a bug whereby the agent sometimes submits the same flow run twice in quick succession when I hit Quick Run via the UI - is this a known issue? I’d have expected there to be some idempotency controls in either the UI or the agent to ensure the same flow run doesn’t get submitted twice. I’ve had this twice today already; details in 🧵
11:34:24.700 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
11:34:24.823 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
11:34:24.998 | INFO    | prefect.agent - Submitting flow run 'fbf28ef3-41d2-4f70-915c-eca600972955'
11:34:25.656 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rgk4z': Starting watch for pod start...
11:34:25.670 | INFO    | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rgk4z': Pod has status 'Pending'.
11:34:25.804 | INFO    | prefect.agent - Completed submission of flow run 'fbf28ef3-41d2-4f70-915c-eca600972955'
11:34:27.465 | INFO    | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rgk4z': Pod has status 'Running'.
11:34:27.466 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rgk4z': Starting watch for job completion
11:34:33.123 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
11:34:33.456 | INFO    | prefect.agent - Submitting flow run 'fbf28ef3-41d2-4f70-915c-eca600972955'
11:34:34.084 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rbp5m': Starting watch for pod start...
11:34:34.092 | INFO    | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rbp5m': Pod has status 'Pending'.
11:34:34.224 | INFO    | prefect.agent - Completed submission of flow run 'fbf28ef3-41d2-4f70-915c-eca600972955'
11:34:35.082 | INFO    | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rbp5m': Pod has status 'Running'.
11:34:35.083 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'knowing-macaque-rbp5m': Starting watch for job completion
11:34:37.752 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
11:34:43.503 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
11:34:48.388 | DEBUG   | prefect.agent - Checking for cancelled flow runs...



-- another occurrence below


14:35:49.229 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:35:49.538 | INFO    | prefect.agent - Submitting flow run '8ccab014-7150-43d8-a701-13d557d66c51'
14:35:50.459 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'silky-serval-8f9xc': Starting watch for pod start...
14:35:50.479 | INFO    | prefect.infrastructure.kubernetes-job - Job 'silky-serval-8f9xc': Pod has status 'Pending'.
14:35:50.906 | INFO    | prefect.agent - Completed submission of flow run '8ccab014-7150-43d8-a701-13d557d66c51'
14:35:51.385 | INFO    | prefect.infrastructure.kubernetes-job - Job 'silky-serval-8f9xc': Pod has status 'Running'.
14:35:51.386 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'silky-serval-8f9xc': Starting watch for job completion
14:35:55.925 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:35:57.197 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:35:57.494 | INFO    | prefect.agent - Submitting flow run '8ccab014-7150-43d8-a701-13d557d66c51'
14:35:58.139 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'silky-serval-4vx98': Starting watch for pod start...
14:35:58.148 | INFO    | prefect.infrastructure.kubernetes-job - Job 'silky-serval-4vx98': Pod has status 'Pending'.
14:35:58.262 | INFO    | prefect.agent - Completed submission of flow run '8ccab014-7150-43d8-a701-13d557d66c51'
14:35:58.987 | INFO    | prefect.infrastructure.kubernetes-job - Job 'silky-serval-4vx98': Pod has status 'Running'.
14:35:58.987 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'silky-serval-4vx98': Starting watch for job completion
14:36:05.043 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:36:06.432 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:36:17.168 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
All timestamps in UTC
Prefect 2 cloud - https://app.prefect.cloud/account/1235831b-7019-4987-b759-4341850b4c5a/workspace/85fb93b5-ea3a-4da7-8be7-aa4e13c6397d
Version:             2.7.11
API version:         0.8.4
Python version:      3.8.7
Git commit:          6b27b476
Built:               Thu, Feb 2, 2023 7:22 PM
OS/Arch:             darwin/x86_64
Profile:             bi-p
Server type:         cloud
Specifically, this is a Kubernetes agent.
It spawns two k8s jobs with the same name (but different suffixes), which then compete with one another; in one case this confused and crashed the Dask workers and the whole flow run.
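For what it’s worth, something like this is the kind of client-side guard I’d have expected on the agent side - a rough sketch only, not actual Prefect code, and all the names are made up:

# Rough sketch of a client-side dedupe guard -- illustrative only, not the
# actual Prefect agent implementation.
from datetime import datetime, timedelta, timezone
from typing import Dict


class RecentSubmissions:
    """Remembers flow run IDs submitted in the last few minutes so a run that
    is still visible in the work queue (e.g. because its state update hasn't
    landed yet) doesn't get submitted a second time."""

    def __init__(self, ttl_seconds: int = 300):
        self._ttl = timedelta(seconds=ttl_seconds)
        self._seen: Dict[str, datetime] = {}

    def should_submit(self, flow_run_id: str) -> bool:
        now = datetime.now(timezone.utc)
        # Forget entries older than the TTL so the cache doesn't grow forever.
        self._seen = {rid: ts for rid, ts in self._seen.items() if now - ts < self._ttl}
        if flow_run_id in self._seen:
            return False  # submitted recently -- skip the duplicate
        self._seen[flow_run_id] = now
        return True


# Hypothetical usage inside the agent's polling loop:
#   guard = RecentSubmissions()
#   if guard.should_submit(flow_run.id):
#       submit(flow_run)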

Zanie

02/22/2023, 4:07 PM
Hm, this is weird. It shouldn’t have anything to do with the UI; it looks like a bug in the agent implementation.
cc @alex / @Peyton Runyan - is this something that could have broken with the work pool changes?

alex

02/22/2023, 4:10 PM
@David Elliott What command did you run to start your agent?

David Elliott

02/22/2023, 4:15 PM
It’s in a k8s yaml template, but it’s
command: ["/bin/bash", "-c"]
args:
  - "prefect agent start -q bi-pipeline-other-local-dev-k8s -q bi-pipeline-main-local-dev-k8s"

alex

02/22/2023, 4:47 PM
Thanks! I’m not seeing anything in the work pool changes to the agent implementation that would cause this. Based on the timestamps, it looks like the flow run is being submitted but then gets returned again when the work queue is queried for available flow runs on the next pass. That makes me think the flow run state isn’t being updated correctly, or there’s some sort of race condition. @David Elliott, how often do you see this bug when creating quick runs?

David Elliott

02/22/2023, 4:50 PM
This is the first time I’ve seen it in Prefect 2, and only starting today - I’ve had the two examples above so far. Agreed, it feels like a race condition to do with the flow run state getting updated… Is it the agent that updates the flow run state, or the flow run pod itself? If it’s the agent, I’d expect it to finish updating the flow run state before pulling new flow runs off the queue?

alex

02/22/2023, 4:53 PM
Yeah, the agent updates the flow run state to pending before spinning up a pod to execute the flow run, so the flow run shouldn’t be returned on subsequent queries of the work queues. I’ll check to see if we’ve made any updates to the flow runs/work queues API recently.
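To make the suspected race concrete, the intended sequence is roughly this - a simplified sketch, not the actual agent source, with stand-in helper stubs so it runs on its own:

# Illustrative sketch of the intended agent loop -- not the real Prefect agent
# source. The helpers are stand-in stubs so the example is self-contained.
import time
from typing import List


def get_scheduled_runs() -> List[str]:
    """Stand-in for querying the work queues for scheduled flow run IDs."""
    return []


def propose_pending_state(flow_run_id: str) -> bool:
    """Stand-in for asking the API to move the run from SCHEDULED to PENDING.
    Returns False if the transition is rejected (e.g. already PENDING)."""
    return True


def create_kubernetes_job(flow_run_id: str) -> None:
    """Stand-in for creating the Kubernetes job that executes the run."""


def agent_loop(poll_interval: float = 10.0) -> None:
    while True:
        for flow_run_id in get_scheduled_runs():
            # Moving the run to PENDING is what should keep it out of the next
            # poll's results. If that state change is slow to land or isn't
            # reflected by the next query (~10s later, matching the timestamps
            # above), the same run comes back and a second job gets created.
            if not propose_pending_state(flow_run_id):
                continue  # run already claimed elsewhere; skip it
            create_kubernetes_job(flow_run_id)
        time.sleep(poll_interval)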
👍 1

Zanie

02/22/2023, 6:03 PM
We’re encountering some database issues today, but a fix has just been deployed to production.
We suspect that is the cause of the issue.
👍 1

David Elliott

02/24/2023, 2:42 PM
Just happened again now, btw:
14:40:32.058 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:40:40.974 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:40:43.314 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:40:43.678 | INFO    | prefect.agent - Submitting flow run '5592f7a9-c2a0-467b-a135-10aa3fca19c3'
14:40:44.551 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-sgxdv': Starting watch for pod start...
14:40:44.559 | INFO    | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-sgxdv': Pod has status 'Pending'.
14:40:44.674 | INFO    | prefect.agent - Completed submission of flow run '5592f7a9-c2a0-467b-a135-10aa3fca19c3'
14:40:46.559 | INFO    | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-sgxdv': Pod has status 'Running'.
14:40:46.560 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-sgxdv': Starting watch for job completion
14:40:51.083 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:40:52.187 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:40:52.782 | INFO    | prefect.agent - Submitting flow run '5592f7a9-c2a0-467b-a135-10aa3fca19c3'
14:40:53.471 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-tcqwl': Starting watch for pod start...
14:40:53.481 | INFO    | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-tcqwl': Pod has status 'Pending'.
14:40:53.717 | INFO    | prefect.agent - Completed submission of flow run '5592f7a9-c2a0-467b-a135-10aa3fca19c3'
14:40:54.823 | INFO    | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-tcqwl': Pod has status 'Running'.
14:40:54.824 | DEBUG   | prefect.infrastructure.kubernetes-job - Job 'blond-pillbug-tcqwl': Starting watch for job completion
14:41:03.236 | DEBUG   | prefect.agent - Checking for cancelled flow runs...
14:41:03.541 | DEBUG   | prefect.agent - Checking for scheduled flow runs...
14:41:12.157 | DEBUG   | prefect.agent - Checking for cancelled flow runs...