# prefect-server
c
Hi, I am running Kubernetes using Docker storage and Kubernetes hosting. Sometimes, just sometimes, my flows don't start at the scheduled time; they start 15 minutes later and work successfully. When I look at the logs I can see that they are not submitted for execution at the required time but about 15 mins later. Can anyone help me find the logs?
k
Hey @Colin, I think the relevant logs might be on the agent side. Do you have access to your agent logs?
c
yes
I have turned on debugging on the agent, but I'm still not seeing much?
k
So you just see it picking up flows 15 minutes late but not much else?
c
ok, I don't have the agent log at the moment; I'm waiting for it to occur again and will capture it. However, when I look in flow_run_state (in the db) I am seeing the following:
87701bbf-5f30-47e2-a9b7-7a630b960398 | 2021-06-05 14:21:50.318645+00 | Scheduled | "Flow run scheduled." | 2021-06-05 14:30:00+00
87701bbf-5f30-47e2-a9b7-7a630b960398 | 2021-06-05 14:30:00.429779+00 | Submitted | "Submitted for execution"
87701bbf-5f30-47e2-a9b7-7a630b960398 | 2021-06-05 14:45:42.167453+00 | Running | "Running flow."
87701bbf-5f30-47e2-a9b7-7a630b960398 | 2021-06-05 14:45:43.5566+00 | Success | "All reference tasks succeeded."
sorry for the format. The task is scheduled for 14:30, it's then submitted for execution at 14:30, but it only starts running at 14:45.
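(side note: I pulled these rows straight from Postgres, but the same history should be visible through the Server GraphQL API; a rough sketch, with field names per the Hasura-generated schema that Server exposes, so they may differ by version:)

from prefect import Client

client = Client()  # reads the Server endpoint from your Prefect config
result = client.graphql(
    """
    query {
      flow_run_state(
        where: {flow_run_id: {_eq: "87701bbf-5f30-47e2-a9b7-7a630b960398"}}
        order_by: {timestamp: asc}
      ) { timestamp state message }
    }
    """
)
for s in result.data.flow_run_state:
    print(s.timestamp, s.state, s.message)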
k
No worries about the format! I see. What I'm trying to work out is when the agent picked it up, or whether there's more information there. Just confirming you're on Server, right?
c
yes server
from memory the server picked it up at 14:45, but I am waiting for another blip and then I can confirm
k
I’d have to ask the team for more info but I think we sometimes see this behavior on Kubernetes when the flow doesn’t get enough resources to kick off immediately. I’ll have to get back to you on Monday
c
ok, that would be great, thanks.
I think I may have found the problem: I changed the agent config to Always pull the image from the repository, and that seems to have done the trick.
Sadly the problem has re-occurred. Any ideas?
k
Asking the team now
So this happens when the agent needs space on one of the nodes in the cluster to schedule the job, but can't get it. It's a matter of not having enough resources; you could try configuring autoscaling.
c
is there a way I can see this error? It looks to me like the nodes have plenty of available capacity.
k
If the Flow Run has debug logs enabled, you would see it.
c
ok, how would I turn on debug logs for the flow run?
k
You need the environment variable
PREFECT__LOGGING__LEVEL=DEBUG
. This can be passed through the RunConfig like
KubernetesRun(env={"PREFECT__LOGGING__LEVEL": "DEBUG"})
or you can set it on the agent like
prefect agent kubernetes start --env PREFECT__LOGGING__LEVEL=DEBUG
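For example, a minimal sketch of attaching that to a flow at registration time (flow and project names here are placeholders):

from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def hello():
    print("hello")

with Flow("debug-example") as flow:  # placeholder flow name
    hello()

# ship DEBUG-level logging into the flow run's Kubernetes job
flow.run_config = KubernetesRun(env={"PREFECT__LOGGING__LEVEL": "DEBUG"})
flow.register(project_name="my-project")  # placeholder project name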
c
Hmm, thanks. I have that set on my agent and I am seeing debug logs; see the attached:
14:30:40 INFO agent Submitted for execution: Job prefect-job-ea0a4ebb
14:30:43 INFO CloudFlowRunner Beginning Flow run for 'ClientEmailer'
14:30:43 DEBUG CloudFlowRunner Using executor type LocalExecutor
14:30:43 DEBUG CloudFlowRunner Flow 'ClientEmailer': Handling state change from Scheduled to Running
14:30:44 INFO CloudTaskRunner Task 'init_logger': Starting task run...
the issue is that this task was scheduled to start at 14:15 but only started at 14:30, and hence only started logging at 14:30.
see here; don't worry about the failure, it's more that it didn't start until 15 mins after the scheduled time
k
Do you see Debug Logs in the UI?
c
yes
k
I assume there’s no information there about the delay?
c
no, it doesn't even start logging until 14:30... looking at the agent logs I can see this:
[2021-06-09 13:15:00,011] INFO - agent | Deploying flow run 17684062-f853-47bd-b207-6fd582f89229 to execution environment...
[2021-06-09 13:30:40,533] INFO - agent | Completed deployment of flow run 17684062-f853-47bd-b207-6fd582f89229
which seems to indicate that it's taking 15 mins to deploy the flow run?
k
Gotcha. Will ask the team again.
c
but I can't see any reason why that would take 15 mins?
k
Does your flow in the UI contain debug logs like this?
c
yes
there are debug messages in the extract I attached above, so I guess DEBUG is logging; it seems to be the deployment which is taking a long time?
k
Ok will ask the team. One sec
Ok so I chatted with team members with more Kubernetes experience. The recommendation, if you can, is to run this Flow on a cluster where it's the only job, as a test of whether you see the same behavior. If it gets delayed there too, then it's at least easier to identify that the agent/flow itself is causing the issue. With this set-up, check whether the pod starts executing right away. The download of the flow may be slow, or the agent may be having difficulty communicating; you can check in DEBUG mode whether the agent is still pinging the Server. The second thing to do is to run the flow multiple times at the same time. If any of them don't go straight from Submitted to Running, check the lifecycle status of that pod. The thing is that your cluster cpu and memory may look fine, but that's separate from the allocatable cpu and memory. It's possible the flow can't get resources even if some appear available in the cluster.
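To compare capacity against allocatable quickly, a sketch along these lines with the official kubernetes Python client should do it (assumes your kubeconfig can reach the cluster):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
for node in client.CoreV1Api().list_node().items:
    cap, alloc = node.status.capacity, node.status.allocatable
    print(node.metadata.name)
    print("  capacity:    cpu={} memory={}".format(cap["cpu"], cap["memory"]))
    print("  allocatable: cpu={} memory={}".format(alloc["cpu"], alloc["memory"]))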
c
thanks, I will try this
ok, I have tried this; the agents look fine and I am not seeing any out-of-memory issues.
it's almost as if the kubernetes pods are not being created, and then after 15 mins the Lazarus process notices and tries again
I have also added limits to the jobs to protect the memory. We have a 6GB machine and I have set limits at 1GB, so that should ensure plenty of capacity (at most 1 running at a time). I'm setting them roughly as shown below.
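for reference, the limits on the RunConfig look roughly like this (values are mine; flow is the Flow object being registered):

from prefect.run_configs import KubernetesRun

# keep a single run comfortably inside the node's 6GB
flow.run_config = KubernetesRun(
    cpu_request="250m",
    cpu_limit="500m",
    memory_request="512Mi",
    memory_limit="1Gi",
)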
k
So you weren’t able to replicate when trying to? Lazarus should appear in the Flow logs for sure though. Have you seen it anywhere?
c
yes, I can see that Lazarus has restarted the flow and it ran correctly, but nothing as to why it started late
this is the strange thing: if I look at the agent logs I see the following
[2021-06-10 17:30:00,007] INFO - agent | Deploying flow run 3796b051-a60c-4458-8dad-25a740223e1c to execution environment...
[2021-06-10 17:45:40,111] INFO - agent | Completed deployment of flow run 3796b051-a60c-4458-8dad-25a740223e1c
the deployment starts on time
is there any extra logging I can turn on for the agent?
k
prefect agent <agent type> start --log-level=DEBUG --show-flow-logs
will give the most logs
c
show flow logs gives me an error; it won't allow the agent to start, but I have the log level turned on
looks like --show-flow-logs is not available for the kubernetes agent
k
Oh I see that’s right
c
I am wondering if I should move away from Docker storage; is there a recommendation on how to run in a Kubernetes env?
k
We generally are moving away from recommending Docker storage, but not because of the interaction with Kubernetes. It has more to do with the serialization that happens with pickle-based storage causing some difficulties. With that said, we recommend script-based storage like GitHub or S3 with
stored_as_script
set to true. Docker storage does make sense, though, if you have dependencies you need to package with your Flow.
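A sketch of what that looks like with S3, for example (bucket and paths are placeholders):

from prefect.storage import S3

flow.storage = S3(
    bucket="my-flows-bucket",               # placeholder bucket
    key="flows/client_emailer.py",          # where the script lives in the bucket
    stored_as_script=True,
    local_script_path="client_emailer.py",  # uploaded from here at registration
)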
c
yes, we do have some dependencies, largely python modules
frustrating, as it's a great tool and generally working great; it's just this weird intermittent problem
k
I know. Sorry about that. What makes it harder is that it only happens sometimes. The situations where I've seen Lazarus triggered with Kubernetes are when preemptible/spot compute is being used and the Flow could not secure it. I have also seen Lazarus revive Flows where the hardware underneath hangs on some process. There have also been reports, on the other hand, of Lazarus not kicking in.
Do you really rely on Lazarus? What some people do is turn off heartbeats on the Flow, and that can stop the Zombie Killer from killing something
Or you can try turning off Lazarus itself and see what happens
Maybe it just sees it as distressed when it really isn’t
c
I have tried without Lazarus and we see the same issue
ok, I found some logs... the agent finds the job:
DEBUG - agent | Querying for ready flow runs...
DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'}
the one ending cad2 doesn't start:
[2021-06-10 19:44:57,355] DEBUG - agent | Waiting 2.644376s to deploy flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 on time...
[2021-06-10 19:44:59,202] DEBUG - agent | Querying for ready flow runs...
DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:45:00,017] INFO - agent | Deploying flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 to execution environment...
[2021-06-10 19:45:00,019] DEBUG - agent | Updating flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 state from Scheduled -> Submitted...
but the deployment of cad2 never completes
no reason why....
k
Just making sure I understand: there are two runs scheduled for the same time and the agent finds both. The one ending in cad2 just goes to Submitted and has no activity afterward?
c
correct
looks like the agent is not actually submitting it, but because it pulled the two together it thinks it has submitted it
k
Ok will forward to team
c
see these logs:
kubectl logs prefect-agent-78cf68c98b-b2wk8 --namespace prefect-server | findstr "cad2"

[2021-06-10 19:44:57,324] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'}
[2021-06-10 19:44:57,353] DEBUG - agent | Submitting flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 for deployment...
[2021-06-10 19:44:57,355] DEBUG - agent | Waiting 2.644376s to deploy flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 on time...
[2021-06-10 19:44:57,646] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:44:58,201] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:44:59,244] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:45:00,017] INFO - agent | Deploying flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 to execution environment...
[2021-06-10 19:45:00,019] DEBUG - agent | Updating flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 state from Scheduled -> Submitted...

[2021-06-10 19:57:07,434] DEBUG - agent | Found 1 ready flow run(s): {'0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (1 already being submitted: ['0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:57:17,478] DEBUG - agent | Found 1 ready flow run(s): {'0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (1 already being submitted: ['0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
(the same "Found 1 ready flow run(s) ... 1 already being submitted" line then repeats every ~10s through 19:58:38)
great, I think it's probably a bug in the agent and how it handles multiple jobs pulled together
thanks
so I played with this a little more, and I am seeing the same behavior with the local agent. I think there is an edge case in the agent where, if multiple flows kick off at about the same time, some of them get stuck in the Submitted state. Trying to reproduce now.
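roughly what I'm registering to try to reproduce it (names are placeholders):

from datetime import timedelta

import prefect
from prefect import Flow, task
from prefect.schedules import IntervalSchedule

@task
def say_hi():
    prefect.context.get("logger").info("hi")

# two flows on an identical schedule so the agent picks both up in one poll
schedule = IntervalSchedule(interval=timedelta(minutes=5))
for name in ["repro-a", "repro-b"]:
    with Flow(name, schedule=schedule) as flow:
        say_hi()
    flow.register(project_name="testing")  # placeholder project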
k
Oh man, I had a response typed out for you yesterday but it seems I forgot to send it. Attaching it below.
Hey @Colin, chatted with the team. It still seems like a k8s issue looking at the logs, because the submission of the job is taking a long time, which means the k8s API is taking a long time and hanging. It does seem something fishy is going on within these calls: https://github.com/PrefectHQ/prefect/blob/ac6bb08f70319e627b5e4a03b64ebcc1dc169ef6/src/prefect/agent/agent.py#L377-L386
The _deploy_flow_run_completed_callback isn't being reached. It would be helpful to know if this log happened:
self.logger.debug("Creating namespaced job {}".format(job_name))
We recommend installing an editable version of Prefect on your agent and adding more print / log statements in the k8s agent to better isolate whether the hang is coming from Prefect or Kubernetes (we suspect k8s); you can add more granular information here: https://github.com/PrefectHQ/prefect/blob/ac6bb08f70319e627b5e4a03b64ebcc1dc169ef6/src/prefect/agent/kubernetes/agent.py#L389
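For example, a paraphrased sketch of the kind of instrumentation we mean (the surrounding agent code is simplified here):

# inside KubernetesAgent.deploy_flow in prefect/agent/kubernetes/agent.py
import time

self.logger.debug("Creating namespaced job {}".format(job_name))
start = time.monotonic()
job = self.batch_client.create_namespaced_job(
    namespace=self.namespace, body=job_spec
)
self.logger.debug(
    "create_namespaced_job for {} returned after {:.1f}s".format(
        job_name, time.monotonic() - start
    )
)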
Went over this thread and I realized I never asked you what version you were on. What is your Prefect version?
And yes replicating on the local agent would be great!
c
version 0.14.21
k
Oh ok I’m on the same one