# prefect-server
c
Hi, I am running Kubernetes using Docker storage and Kubernetes hosting. Sometimes, just sometimes, my flows don't start at the scheduled time; they start 15 minutes later and work successfully. When I look at the logs I can see that they are not submitted for execution at the required time but about 15 mins later. Can anyone help me find the logs?
k
Hey @Colin, I think the relevant logs might be on the agent side. Do you have access to your agent logs?
c
yes
I have turned on debugging on the agent, but I'm still not seeing much?
k
So you just see it picking up flows 15 minutes late but not much else?
c
ok, I don't have the agent log at the moment; I'm waiting for it to occur again and will capture it. However, when I look in flow_run_state (in the db) I am seeing the following:
87701bbf-5f30-47e2-a9b7-7a630b960398 | 2021-06-05 14:21:50.318645+00 | Scheduled | "Flow run scheduled." | 2021-06-05 14:30:00+00
87701bbf-5f30-47e2-a9b7-7a630b960398 | 2021-06-05 14:30:00.429779+00 | Submitted | "Submitted for execution"
87701bbf-5f30-47e2-a9b7-7a630b960398 | 2021-06-05 14:45:42.167453+00 | Running | "Running flow."
87701bbf-5f30-47e2-a9b7-7a630b960398 | 2021-06-05 14:45:43.5566+00 | Success | "All reference tasks succeeded."
sorry for the format. The task is scheduled for 14:30, it's then submitted for execution at 14:30, but it only starts running at 14:45.
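(side note: I pulled these rows straight from Postgres, but the same history should be visible through the Server GraphQL API; a rough sketch, with field names per the Hasura-generated schema that Server exposes, so they may differ by version:)

from prefect import Client

client = Client()  # reads the Server endpoint from your Prefect config
result = client.graphql(
    """
    query {
      flow_run_state(
        where: {flow_run_id: {_eq: "87701bbf-5f30-47e2-a9b7-7a630b960398"}}
        order_by: {timestamp: asc}
      ) { timestamp state message }
    }
    """
)
for s in result.data.flow_run_state:
    print(s.timestamp, s.state, s.message)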
k
No worries about the format! I see. What I'm trying to work out is when the agent picked it up, or whether there's more information there. Just confirming you're on Server, right?
c
yes server
from memory the server picked it up at 14:45, but I am waiting for another blip and then I can confirm
k
I’d have to ask the team for more info but I think we sometimes see this behavior on Kubernetes when the flow doesn’t get enough resources to kick off immediately. I’ll have to get back to you on Monday
c
ok, that would be great, thanks.
I think I may have found the problem: I changed the agent config to Always pull the image from the repository, and that seems to have done the trick.
Sadly the problem has re-occurred. Any ideas?
k
Asking the team now
So this happens when the agent needs space on one of the nodes in the cluster to schedule the job, but can't get it. It's a matter of not having enough resources; you could try configuring autoscaling.
c
is there a way I can see this error? It looks to me like the nodes have plenty of available capacity.
k
If the Flow Run has debug logs enabled, you would see it.
c
ok, how would I turn on debug logs for the flow run?
k
You need the environment variable
PREFECT__LOGGING__LEVEL=DEBUG
. This can be passed through the RunConfig like
KubernetesRun(env={"PREFECT__LOGGING__LEVEL": "DEBUG"})
or you can set it on the agent like
prefect agent kubernetes start --env PREFECT__LOGGING__LEVEL=DEBUG
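For example, a minimal sketch of attaching that to a flow at registration time (flow and project names here are placeholders):

from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def hello():
    print("hello")

with Flow("debug-example") as flow:  # placeholder flow name
    hello()

# ship DEBUG-level logging into the flow run's Kubernetes job
flow.run_config = KubernetesRun(env={"PREFECT__LOGGING__LEVEL": "DEBUG"})
flow.register(project_name="my-project")  # placeholder project name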
c
Hmm, thanks. I have that set on my agent and I am seeing debug logs; see the attached:
14:30:40 INFO agent Submitted for execution: Job prefect-job-ea0a4ebb
14:30:43 INFO CloudFlowRunner Beginning Flow run for 'ClientEmailer'
14:30:43 DEBUG CloudFlowRunner Using executor type LocalExecutor
14:30:43 DEBUG CloudFlowRunner Flow 'ClientEmailer': Handling state change from Scheduled to Running
14:30:44 INFO CloudTaskRunner Task 'init_logger': Starting task run...
the issue is that this task was scheduled to start at 14:15 but only started at 14:30, and hence only started logging at 14:30.
see here; don't worry about the failure, it's more that it didn't start until 15 mins after the scheduled time
k
Do you see Debug Logs in the UI?
c
yes
k
I assume there’s no information there about the delay?
c
no, it doesn't even start logging until 14:30... looking at the agent logs I can see this:
[2021-06-09 13:15:00,011] INFO - agent | Deploying flow run 17684062-f853-47bd-b207-6fd582f89229 to execution environment...
[2021-06-09 13:30:40,533] INFO - agent | Completed deployment of flow run 17684062-f853-47bd-b207-6fd582f89229
which seems to indicate that it's taking 15 mins to deploy the flow run?
k
Gotcha. Will ask the team again.
c
but I can't see any reason why that would take 15 mins?
k
Does your flow in the UI contain debug logs like this?
c
yes
there are debug messages in the extract I attached above, so I guess DEBUG is logging; it seems to be the deployment which is taking a long time?
k
Ok will ask the team. One sec
Ok so I chatted with team members with more Kubernetes experience. The recommendation, if you can, is to run this Flow on a cluster where it's the only job, as a test of whether you see the same behavior. If it gets delayed there too, then it's at least easier to identify that the agent/flow itself is causing the issue. With this set-up, check whether the pod starts executing right away. The download of the flow may be slow, or the agent may be having difficulty communicating; you can check in DEBUG mode whether the agent is still pinging the Server. The second thing to do is to run the flow multiple times at the same time. If any of them don't go straight from Submitted to Running, check the lifecycle status of that pod. The thing is that your cluster cpu and memory may look fine, but that's separate from the allocatable cpu and memory. It's possible the flow can't get resources even if some appear available in the cluster.
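To compare capacity against allocatable quickly, a sketch along these lines with the official kubernetes Python client should do it (assumes your kubeconfig can reach the cluster):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
for node in client.CoreV1Api().list_node().items:
    cap, alloc = node.status.capacity, node.status.allocatable
    print(node.metadata.name)
    print("  capacity:    cpu={} memory={}".format(cap["cpu"], cap["memory"]))
    print("  allocatable: cpu={} memory={}".format(alloc["cpu"], alloc["memory"]))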
c
thanks, I will try this
ok, I have tried this; the agents look fine and I am not seeing any out-of-memory issues.
it's almost as if the kubernetes pods are not being created, and then after 15 mins the Lazarus process notices and tries again
I have also added limits to the jobs to protect the memory. We have a 6GB machine and I have set limits at 1GB, so that should ensure plenty of capacity (at most 1 running at a time). I'm setting them roughly as shown below.
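for reference, the limits on the RunConfig look roughly like this (values are mine; flow is the Flow object being registered):

from prefect.run_configs import KubernetesRun

# keep a single run comfortably inside the node's 6GB
flow.run_config = KubernetesRun(
    cpu_request="250m",
    cpu_limit="500m",
    memory_request="512Mi",
    memory_limit="1Gi",
)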
k
So you weren’t able to replicate when trying to? Lazarus should appear in the Flow logs for sure though. Have you seen it anywhere?
c
yes, I can see that Lazarus has restarted the flow and it ran correctly, but nothing as to why it started late
this is the strange thing: if I look at the agent logs I see the following
[2021-06-10 17:30:00,007] INFO - agent | Deploying flow run 3796b051-a60c-4458-8dad-25a740223e1c to execution environment...
[2021-06-10 17:45:40,111] INFO - agent | Completed deployment of flow run 3796b051-a60c-4458-8dad-25a740223e1c
the deployment starts on time
is there any extra logging I can turn on for the agent?
k
prefect agent <agent type> start --log-level=DEBUG --show-flow-logs
will give the most logs
c
show flow logs gives me an error; it won't allow the agent to start, but I have the log level turned on
looks like --show-flow-logs is not available for the kubernetes agent
k
Oh I see that’s right
c
I am wondering if I should move away from Docker storage; is there a recommendation on how to run in a Kubernetes env?
k
We generally are moving away from recommending Docker storage, but not because of the interaction with Kubernetes. It has more to do with the serialization that happens with pickle-based storage causing some difficulties. With that said, we recommend script-based storage like GitHub or S3 with
stored_as_script
set to true. Docker storage does make sense, though, if you have dependencies you need to package with your Flow.
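A sketch of what that looks like with S3, for example (bucket and paths are placeholders):

from prefect.storage import S3

flow.storage = S3(
    bucket="my-flows-bucket",               # placeholder bucket
    key="flows/client_emailer.py",          # where the script lives in the bucket
    stored_as_script=True,
    local_script_path="client_emailer.py",  # uploaded from here at registration
)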
c
yes, we do have some dependencies, largely python modules
frustrating, as it's a great tool and generally working great; it's just this weird intermittent problem
k
I know. Sorry about that. What makes it harder is that it only happens sometimes. The situations where I've seen Lazarus triggered with Kubernetes are when preemptible/spot compute is being used and the Flow could not secure it. I have also seen Lazarus revive Flows where the hardware underneath hangs on some process. There have also been reports, on the other hand, of Lazarus not kicking in.
Do you really rely on Lazarus? What some people do is turn off heartbeats on the Flow, and that can stop the Zombie Killer from killing something
Or you can try turning off Lazarus itself and see what happens
Maybe it just sees it as distressed when it really isn’t
c
I have tried without Lazarus and we see the same issue
ok, I found some logs... the agent finds the job:
DEBUG - agent | Querying for ready flow runs...
DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'}
the one ending cad2 doesn't start:
[2021-06-10 19:44:57,355] DEBUG - agent | Waiting 2.644376s to deploy flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 on time...
[2021-06-10 19:44:59,202] DEBUG - agent | Querying for ready flow runs...
DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:45:00,017] INFO - agent | Deploying flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 to execution environment...
[2021-06-10 19:45:00,019] DEBUG - agent | Updating flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 state from Scheduled -> Submitted...
but the deployment of cad2 never completes
no reason why....
k
Just making sure I understand: there are two runs scheduled for the same time and the agent finds both. The one ending in cad2 just goes to Submitted and has no activity afterward?
c
correct
looks like the agent is not actually submitting it, but because it pulled the two together it thinks it has submitted it
k
Ok will forward to team
c
see these logs:
kubectl logs prefect-agent-78cf68c98b-b2wk8 --namespace prefect-server | findstr "cad2"

[2021-06-10 19:44:57,324] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'}
[2021-06-10 19:44:57,353] DEBUG - agent | Submitting flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 for deployment...
[2021-06-10 19:44:57,355] DEBUG - agent | Waiting 2.644376s to deploy flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 on time...
[2021-06-10 19:44:57,646] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:44:58,201] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:44:59,244] DEBUG - agent | Found 2 ready flow run(s): {'640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (2 already being submitted: ['640a3238-c959-4a1f-a266-7438fc7ec102', '0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:45:00,017] INFO - agent | Deploying flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 to execution environment...
[2021-06-10 19:45:00,019] DEBUG - agent | Updating flow run 0d52ee48-90dd-4be6-8c2d-11eafaaecad2 state from Scheduled -> Submitted...

[2021-06-10 19:57:07,434] DEBUG - agent | Found 1 ready flow run(s): {'0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (1 already being submitted: ['0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
[2021-06-10 19:57:17,478] DEBUG - agent | Found 1 ready flow run(s): {'0d52ee48-90dd-4be6-8c2d-11eafaaecad2'} (1 already being submitted: ['0d52ee48-90dd-4be6-8c2d-11eafaaecad2'])
(the same "Found 1 ready flow run(s) ... 1 already being submitted" line then repeats every ~10s through 19:58:38)
great, I think it's probably a bug in the agent and how it handles multiple jobs pulled together
thanks
so I played with this a little more, and I am seeing the same behavior with the local agent. I think there is an edge case in the agent where, if multiple flows kick off at about the same time, some of them get stuck in the Submitted state. Trying to reproduce now.
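roughly what I'm registering to try to reproduce it (names are placeholders):

from datetime import timedelta

import prefect
from prefect import Flow, task
from prefect.schedules import IntervalSchedule

@task
def say_hi():
    prefect.context.get("logger").info("hi")

# two flows on an identical schedule so the agent picks both up in one poll
schedule = IntervalSchedule(interval=timedelta(minutes=5))
for name in ["repro-a", "repro-b"]:
    with Flow(name, schedule=schedule) as flow:
        say_hi()
    flow.register(project_name="testing")  # placeholder project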
k
Oh man, I had a response typed out for you yesterday but it seems I forgot to send it. Attaching it below.
Hey @Colin, chatted with the team. It still seems like a k8s issue looking at the logs, because the submission of the job is taking a long time, which means the k8s API is taking a long time and hanging. It does seem something fishy is going on within these calls: https://github.com/PrefectHQ/prefect/blob/ac6bb08f70319e627b5e4a03b64ebcc1dc169ef6/src/prefect/agent/agent.py#L377-L386
The _deploy_flow_run_completed_callback isn't being reached. It would be helpful to know if this log happened:
self.logger.debug("Creating namespaced job {}".format(job_name))
We recommend installing an editable version of Prefect on your agent and adding more print / log statements in the k8s agent to better isolate whether the hang is coming from Prefect or Kubernetes (we suspect k8s); you can add more granular information here: https://github.com/PrefectHQ/prefect/blob/ac6bb08f70319e627b5e4a03b64ebcc1dc169ef6/src/prefect/agent/kubernetes/agent.py#L389
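For example, a paraphrased sketch of the kind of instrumentation we mean (the surrounding agent code is simplified here):

# inside KubernetesAgent.deploy_flow in prefect/agent/kubernetes/agent.py
import time

self.logger.debug("Creating namespaced job {}".format(job_name))
start = time.monotonic()
job = self.batch_client.create_namespaced_job(
    namespace=self.namespace, body=job_spec
)
self.logger.debug(
    "create_namespaced_job for {} returned after {:.1f}s".format(
        job_name, time.monotonic() - start
    )
)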
Went over this thread and I realized I never asked you what version you were on. What is your Prefect version?
And yes replicating on the local agent would be great!
c
version 0.14.21
k
Oh ok I’m on the same one