Colin

06/05/2021, 3:48 PM
Hi, I'm running flows on Kubernetes using Docker storage and Kubernetes hosting. Sometimes, just sometimes, my flows don't start at the scheduled time; they start 15 minutes later and then run successfully. When I look at the logs I can see that they are not submitted for execution at the required time, but about 15 minutes later. Can anyone help me find the relevant logs?

Kevin Kho

06/05/2021, 3:51 PM
Hey @Colin, I think the relevant logs might be on the agent side. Do you have access to your agent logs?

Colin

06/05/2021, 3:51 PM
yes
I have turned on debugging on the agent, but I'm still not seeing much.

Kevin Kho

06/05/2021, 3:53 PM
So you just see it picking up flows 15 minutes late but not much else?

Colin

06/05/2021, 3:58 PM
OK, I don't have the agent log at the moment; I'm waiting for it to occur again and I will capture it. However, when I look in flow_run_state (in the db) I am seeing the following for flow run 87701bbf-5f30-47e2-a9b7-7a630b960398:
2021-06-05 14:21:50.318645+00  Scheduled  "Flow run scheduled."  (start time 2021-06-05 14:30:00+00)
2021-06-05 14:30:00.429779+00  Submitted  "Submitted for execution"
2021-06-05 14:45:42.167453+00  Running    "Running flow."
2021-06-05 14:45:43.5566+00    Success    "All reference tasks succeeded."
Sorry for the format. The flow is scheduled for 14:30 and is submitted for execution at 14:30, but it doesn't start running until 14:45.

Kevin Kho

06/05/2021, 4:07 PM
No worries about the format! I see. What I'm trying to work out is when the agent picked it up, and whether there's more information there. Just confirming: you're on Server, right?

Colin

06/05/2021, 4:10 PM
Yes, Server.
From memory it was picked up at 14:45, but I'm waiting for another blip so I can confirm.

Kevin Kho

06/05/2021, 4:13 PM
I'd have to ask the team for more info, but I think we sometimes see this behavior on Kubernetes when the flow doesn't get enough resources to kick off immediately. I'll have to get back to you on Monday.

Colin

06/05/2021, 4:13 PM
OK, that would be great, thanks.
I think I may have found the problem: I changed the agent config to "Always" pull the image from the repository, and that seems to have done the trick.
Sadly the problem has recurred. Any ideas?

Kevin Kho

06/09/2021, 2:59 PM
Asking the team now
So this happens when the agent needs space on one of the nodes in the cluster to schedule the job but can't get it. It's a matter of not having enough resources; you could try configuring autoscaling.
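In the meantime you can also set explicit resource requests/limits on the flow's job so the Kubernetes scheduler knows what it needs. Roughly something like this (the values are placeholders, not a recommendation):
from prefect.run_configs import KubernetesRun

flow.run_config = KubernetesRun(
    cpu_request="500m",      # placeholder request
    memory_request="512Mi",  # placeholder request
    memory_limit="1Gi",      # placeholder limit
)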

Colin

06/09/2021, 3:28 PM
Is there a way I can see this error? It looks to me like the nodes have plenty of available capacity.

Kevin Kho

06/09/2021, 3:44 PM
If the Flow Run has debug logs enabled, you would see it.

Colin

06/09/2021, 3:56 PM
OK, how would I turn on debug logs for the flow run?

Kevin Kho

06/09/2021, 4:00 PM
You need an environment variable:
PREFECT__LOGGING__LEVEL=DEBUG
This can be passed through the RunConfig, like
KubernetesRun(env={"PREFECT__LOGGING__LEVEL": "DEBUG"})
or you can set it on the agent, like
prefect agent kubernetes start --env PREFECT__LOGGING__LEVEL=DEBUG
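For example, in your flow code it would look roughly like this (a sketch; adjust to however you already configure the flow):
from prefect.run_configs import KubernetesRun

flow.run_config = KubernetesRun(
    env={"PREFECT__LOGGING__LEVEL": "DEBUG"},  # turns on debug logging for this flow's runs
)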

Colin

06/09/2021, 4:03 PM
Hmm, thanks. I have that set on my agent and I am seeing debug logs; see the attached:
14:30:40 INFO  agent            Submitted for execution: Job prefect-job-ea0a4ebb
14:30:43 INFO  CloudFlowRunner  Beginning Flow run for 'ClientEmailer'
14:30:43 DEBUG CloudFlowRunner  Using executor type LocalExecutor
14:30:43 DEBUG CloudFlowRunner  Flow 'ClientEmailer': Handling state change from Scheduled to Running
14:30:44 INFO  CloudTaskRunner  Task 'init_logger': Starting task run...
The issue is that this flow was scheduled to start at 14:15 but only started at 14:30, and hence only started logging at 14:30.
See here; don't worry about the failure, the point is that it didn't start until 15 minutes after the scheduled time.

Kevin Kho

06/09/2021, 4:05 PM
Do you see Debug Logs in the UI?

Colin

06/09/2021, 4:05 PM
yes

Kevin Kho

06/09/2021, 4:05 PM
I assume there’s no information there about the delay?

Colin

06/09/2021, 4:06 PM
No, it doesn't even start logging until 14:30... Looking at the agent logs I can see this:
[2021-06-09 13:15:00,011] INFO - agent | Deploying flow run 17684062-f853-47bd-b207-6fd582f89229 to execution environment...
[2021-06-09 13:30:40,533] INFO - agent | Completed deployment of flow run 17684062-f853-47bd-b207-6fd582f89229
which seems to indicate that it's taking 15 minutes to deploy the flow run?

Kevin Kho

06/09/2021, 4:06 PM
Gotcha. Will ask the team again.

Colin

06/09/2021, 4:06 PM
But I can't see any reason why that would take 15 minutes.

Kevin Kho

06/09/2021, 4:10 PM
Does your flow in the UI contain debug logs like this?

Colin

06/09/2021, 4:17 PM
yes
There are debug messages in the extract I attached above, so I guess DEBUG logging is on; it seems to be the deployment that is taking a long time.

Kevin Kho

06/09/2021, 4:18 PM
Ok will ask the team. One sec
Ok, so I chatted with team members with more Kubernetes experience. The recommendation, if you can, is to run this Flow on a cluster where it's the only job, as a test to see whether you get the same behavior. If it still gets delayed there, it's at least easier to tell that the agent/flow itself is causing the issue. With that set-up, check if the pod starts executing right away: the download of the flow may be slow, or the agent may be having difficulty communicating. You can check in DEBUG mode whether the agent is still pinging the Server. The second thing to do is to run the flow multiple times at the same time. If any of them don't go straight from Submitted to Running, check the lifecycle status of that pod. The thing is that your cluster CPU and memory may look fine, but that's separate from the allocatable CPU and memory. It's possible the flow can't get resources even though there appear to be some available in the cluster.
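If it helps, you can compare allocatable resources and look for scheduling events with plain kubectl. Rough commands (adjust the namespace and pod names to yours):
kubectl describe nodes                                                     # check the "Allocatable" and "Allocated resources" sections
kubectl get events --namespace <your-namespace> --sort-by=.lastTimestamp   # look for FailedScheduling events
kubectl describe pod <prefect-job-pod> --namespace <your-namespace>        # pod events / lifecycle status
A pending prefect-job pod with a FailedScheduling event would point at resources; if the job pod never appears at all, that would suggest the problem is upstream of Kubernetes.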

Colin

06/10/2021, 8:09 AM
Thanks, I will try this.
OK, I have tried this; the agents look fine and I am not seeing any out-of-memory issues.
It's almost as if the Kubernetes pods are never created, and then after 15 minutes the Lazarus process notices and tries again.
I have also added limits to the jobs to protect memory: we have a 6 GB machine and I have set limits at 1 GB, so that should ensure plenty of capacity (at most one running at a time).

Kevin Kho

06/10/2021, 3:26 PM
So you weren’t able to replicate when trying to? Lazarus should appear in the Flow logs for sure though. Have you seen it anywhere?

Colin

06/10/2021, 5:49 PM
Yes, I can see that Lazarus has restarted the flow and it ran correctly, but there's nothing about why it started late.
This is the strange thing: if I look at the agent logs I see the following:
[2021-06-10 17:30:00,007] INFO - agent | Deploying flow run 3796b051-a60c-4458-8dad-25a740223e1c to execution environment...
[2021-06-10 17:45:40,111] INFO - agent | Completed deployment of flow run 3796b051-a60c-4458-8dad-25a740223e1c
The deployment starts on time.
Is there any extra logging I can turn on for the agent?

Kevin Kho

06/10/2021, 6:10 PM
prefect agent _______ start --log-level=DEBUG --show-flow-logs
will give the most logs

Colin

06/10/2021, 7:27 PM
--show-flow-logs gives me an error and won't allow the agent to start, but I have the log level turned on.
Looks like --show-flow-logs is not available for the kubernetes agent.

Kevin Kho

06/10/2021, 7:31 PM
Oh I see that’s right

Colin

06/10/2021, 7:31 PM
I am wondering if I should move away from Docker storage; is there a recommendation on how to run in a Kubernetes environment?

Kevin Kho

06/10/2021, 7:36 PM
We are generally moving away from recommending Docker storage, but not because of the interaction with Kubernetes; it has more to do with the serialization that happens with pickle-based storage causing some difficulties. With that said, we recommend script-based storage like GitHub or S3 with stored_as_script set to true. Docker storage does make sense, though, if you have dependencies you need to package with your Flow.
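For example, with S3 it would look roughly like this (the bucket and key are placeholders; stored_as_script tells Prefect to store the flow as a plain script rather than a pickle):
from prefect.storage import S3

flow.storage = S3(
    bucket="my-flow-bucket",                      # placeholder bucket name
    key="flows/client_emailer.py",                # placeholder key for the flow script
    stored_as_script=True,
    local_script_path="flows/client_emailer.py",  # local copy uploaded at registration
)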

Colin

06/10/2021, 7:36 PM
Yes, we do have some dependencies, largely Python modules.
Frustrating, as it's a great tool and generally working great; it's just this weird intermittent problem.

Kevin Kho

06/10/2021, 7:42 PM
I know. Sorry about that. What makes it harder is that it only happens sometimes. The situations where I've seen Lazarus triggered with Kubernetes are when preemptible/spot compute is being used and the Flow could not secure it. I have also seen Lazarus revive Flows where the hardware underneath hangs on some process. There have also been reports, on the other hand, of Lazarus not kicking in.
Do you really rely on Lazarus? What some people do is turn off heartbeats on the Flow, and that can stop the Zombie Killer from killing something.
Or you can try turning off Lazarus itself and see what happens.
Maybe it just sees the run as distressed when it really isn't.

Colin

06/10/2021, 7:48 PM
I have tried without Lazarus and we see the same issue.
OK, I found some logs... The agent finds the job:
DEBUG:agent:Querying for ready flow runs...
DEBUG:agent:Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'}
The one ending cad2 doesn't start:
[2021-06-10 19:44:57,355] DEBUG - agent | Waiting 2.644376s to deploy flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 on time...
[2021-06-10 19:44:59,202] DEBUG - agent | Querying for ready flow runs...
DEBUG:agent:Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:45:00,017] INFO - agent | Deploying flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 to execution environment...
[2021-06-10 19:45:00,019] DEBUG - agent | Updating flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 state from Scheduled -> Submitted...
But the deployment of cad2 never completes.
No reason why...

Kevin Kho

06/10/2021, 8:00 PM
Just making sure I understand: there are two runs scheduled for the same time, the agent finds both, and the one ending in cad2 just goes to Submitted and has no activity afterward?

Colin

06/10/2021, 8:06 PM
correct
Looks like the agent is not actually submitting it; because it pulled the two together, it thinks it has already submitted it.

Kevin Kho

06/10/2021, 8:08 PM
Ok will forward to team

Colin

06/10/2021, 8:08 PM
See these logs:
kubectl logs prefect-agent-78cf68c98b-b2wk8 --namespace prefect-server | findstr "cad2"
[2021-06-10 19:44:57,324] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'}
[2021-06-10 19:44:57,353] DEBUG - agent | Submitting flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 for deployment...
[2021-06-10 19:44:57,355] DEBUG - agent | Waiting 2.644376s to deploy flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 on time...
[2021-06-10 19:44:57,646] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
(the same "Found 2 ready flow run(s) ... (2 already being submitted)" line repeats at 19:44:58,201 and 19:44:59,244)
[2021-06-10 19:45:00,017] INFO - agent | Deploying flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 to execution environment...
[2021-06-10 19:45:00,019] DEBUG - agent | Updating flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 state from Scheduled -> Submitted...
[2021-06-10 19:57:07,434] DEBUG - agent | Found 1 ready flow run(s): {'0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (1 already being submitted: ['0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
(the same "Found 1 ready flow run(s) ... (1 already being submitted)" line then repeats roughly every 10 seconds from 19:57:07 through 19:58:38)
Great. I think it's probably a bug in the agent and how it handles multiple jobs pulled together.
Thanks.
So I played with this a little more, and I am seeing the same behavior with the local agent. I think there is an edge case in the agent where, if multiple flows kick off at about the same time, some of them get stuck in the Submitted state. Trying to reproduce now, with something like the sketch below.
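Roughly what I'm using to try to reproduce it (simplified; the flow names, schedule, and project are just placeholders):
from prefect import Flow, task
from prefect.schedules import CronSchedule

@task
def noop():
    pass

# two trivial flows registered on the same cron schedule so they kick off together
schedule = CronSchedule("*/15 * * * *")
for name in ["repro-a", "repro-b"]:
    with Flow(name, schedule=schedule) as flow:
        noop()
    flow.register(project_name="test")  # placeholder project name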

Kevin Kho

06/13/2021, 3:25 AM
Oh man, I had a response typed out for you yesterday but it seems I forgot to send it. Attaching it below.
Hey @Colin, chatted with the team. Looking at the logs it still seems like a k8s issue, because the submission of the job is taking a long time, which suggests the k8s API is taking a long time and hanging. Something fishy does seem to be going on within these calls: https://github.com/PrefectHQ/prefect/blob/ac6bb08f70319e627b5e4a03b64ebcc1dc169ef6/src/prefect/agent/agent.py#L377-L386
The _deploy_flow_run_completed_callback isn't being reached. It would be helpful to know if this log happened:
self.logger.debug("Creating namespaced job {}".format(job_name))
We recommend installing an editable version of Prefect on your agent and adding more print/log statements in the k8s agent to better isolate whether the hang is coming from Prefect or Kubernetes (we suspect k8s); you can add more granular information here: https://github.com/PrefectHQ/prefect/blob/ac6bb08f70319e627b5e4a03b64ebcc1dc169ef6/src/prefect/agent/kubernetes/agent.py#L389
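For instance, around the job-creation call you could add something rough like this (treat it as a sketch; I'm assuming the agent goes through the kubernetes BatchV1Api client, and the exact attribute names may differ slightly in your version):
import time

start = time.time()
self.logger.debug("Creating namespaced job %s", job_name)
# assumed call into the kubernetes BatchV1Api client
job = self.batch_client.create_namespaced_job(namespace=self.namespace, body=job_spec)
self.logger.debug("create_namespaced_job for %s returned after %.1fs", job_name, time.time() - start)
That would at least tell us whether the hang is inside the Kubernetes API call or before it.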
Went over this thread and I realized I never asked you what version you were on. What is your Prefect version?
And yes replicating on the local agent would be great!

Colin

06/14/2021, 3:45 PM
version 0.14.21

Kevin Kho

06/14/2021, 3:46 PM
Oh ok I’m on the same one