Matias Godoy

08/20/2020, 12:22 PM
Hi guys! I found a weird behaviour with the agents: We have a flow that has been working perfectly for a while now. The problem is that every now and then we have a run in which every task (in the same flow run) is executed twice. We have only two agents running in an EC2 instance. I started looking today and I found that sometimes both of the agents pick the same flow run, and both execute it! [more info in the comments so I don't pollute the main thread]
👀 2
Both agents are running under `supervisord` on the same EC2 instance, so I have a GUI that lets me easily see their logs. Here's what I found:
Agent 1:
[2020-08-19 06:39:07,450] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-08-19 06:39:07,582] INFO - agent | Deploying flow run 8d061e63-8b9e-40a6-a1b8-103cecedac01
Agent 2:
[2020-08-19 06:39:07,672] INFO - agent | Found 1 flow run(s) to submit for execution.
[2020-08-19 06:39:07,805] INFO - agent | Deploying flow run 8d061e63-8b9e-40a6-a1b8-103cecedac01
As you can see, both agents found the pending run and both picked it up, about 220 milliseconds apart.
Also in the cloud UI, you can clearly see what's happening in the logs:
The strange part is that this only happens every now and then. The agents work perfectly most of the time.
Is there something I'm doing wrong? Has this been reported before? Let me know if you need me to provide more information.
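For anyone who wants to check their own agent logs for the same symptom, here is a minimal sketch that scans log lines in the format shown above and reports flow run IDs deployed by more than one agent. The function name `find_duplicate_deploys` and the agent labels are hypothetical; the only thing taken from the logs above is the `Deploying flow run <id>` message format.

```python
import re
from collections import defaultdict

# Matches the "Deploying flow run <uuid>" message seen in the agent logs above.
DEPLOY_RE = re.compile(r"Deploying flow run ([0-9a-f-]{36})")


def find_duplicate_deploys(logs_by_agent):
    """Return {flow_run_id: [agents]} for runs deployed by two or more agents.

    logs_by_agent maps an agent label (hypothetical, chosen by the caller)
    to an iterable of log lines from that agent.
    """
    agents_by_run = defaultdict(set)
    for agent, lines in logs_by_agent.items():
        for line in lines:
            match = DEPLOY_RE.search(line)
            if match:
                agents_by_run[match.group(1)].add(agent)
    # Keep only the runs that more than one agent picked up.
    return {
        run_id: sorted(agents)
        for run_id, agents in agents_by_run.items()
        if len(agents) > 1
    }


if __name__ == "__main__":
    logs = {
        "agent-1": [
            "[2020-08-19 06:39:07,582] INFO - agent | Deploying flow run 8d061e63-8b9e-40a6-a1b8-103cecedac01",
        ],
        "agent-2": [
            "[2020-08-19 06:39:07,805] INFO - agent | Deploying flow run 8d061e63-8b9e-40a6-a1b8-103cecedac01",
        ],
    }
    print(find_duplicate_deploys(logs))
```

This only detects the duplication after the fact; it does not prevent it.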

Jeremiah

08/20/2020, 1:49 PM
Hi @Matias Godoy! We actually experienced a similar issue with one of our test workflows yesterday morning and identified a race condition with multiple agents. We pushed a fix for that issue yesterday afternoon, so if it’s the same, I hope you won’t see this behavior anymore. Of course, if you do see anything unexpected, please let us know!
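To illustrate the general kind of race described here, the sketch below simulates two agents polling for the same run and shows one common way to avoid double submission: an atomic "claim" that only one caller can win. This is not Prefect's actual fix, which is not described in this thread. `RunQueue`, `claim`, the state names, and the agent labels are all hypothetical.

```python
import threading


class RunQueue:
    """Hypothetical in-memory stand-in for a backend that hands out flow runs."""

    def __init__(self, run_ids):
        self._state = {run_id: "Scheduled" for run_id in run_ids}
        self._owner = {}
        self._lock = threading.Lock()

    def claim(self, run_id, agent_name):
        """Atomically move a run from Scheduled to Submitted.

        The check and the update happen under one lock, so exactly one
        caller gets True for a given run.
        """
        with self._lock:
            if self._state.get(run_id) != "Scheduled":
                return False
            self._state[run_id] = "Submitted"
            self._owner[run_id] = agent_name
            return True


def agent_poll(queue, run_id, agent_name, deployed):
    # An agent deploys the run only if it won the claim.
    if queue.claim(run_id, agent_name):
        deployed.append(agent_name)


def simulate(run_id="8d061e63-8b9e-40a6-a1b8-103cecedac01"):
    """Run two agents concurrently against one pending run."""
    queue = RunQueue([run_id])
    deployed = []
    threads = [
        threading.Thread(target=agent_poll, args=(queue, run_id, name, deployed))
        for name in ("agent-1", "agent-2")
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return deployed


if __name__ == "__main__":
    print(simulate())
```

Without the lock, both agents could read "Scheduled" before either writes "Submitted", which would match the double deployment seen in the logs above.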

Matias Godoy

08/20/2020, 1:50 PM
Nice! Is this fix for the agents or for Cloud? I'm asking so I know whether I have to take any action 🙂
๐Ÿ‘ 1

Chris White

08/20/2020, 4:10 PM
It’s for Cloud - no action required on your part 👍
๐Ÿ‘ 2

Jeremiah

08/20/2020, 4:16 PM
Thanks Chris - apologies, I missed your follow-up @Matias Godoy

Matias Godoy

08/20/2020, 4:27 PM
Excellent. Thanks a lot!

Eldho Suresh

11/16/2020, 3:49 AM
@Narasimhan Ramaswamy

Narasimhan Ramaswamy

11/16/2020, 4:02 AM
@Jeremiah - just an extension to this problem: we are seeing Prefect jobs submitted twice. We have just one agent running, but randomly, within a single flow, all tasks are run twice. The flow uses DaskKubernetesEnvironment. Our agent is hosted in AKS and we manage flows with Prefect Cloud. We noticed that these random failed flows were rescheduled by Lazarus.
Can you please help here?

Jeremiah

11/16/2020, 4:14 PM
@Narasimhan Ramaswamy it sounds like you are experiencing a different issue than the one in this thread, since your symptom is repeated tasks, not runs (apart from normal Lazarus rescheduling). Sometimes seeing repeated tasks in combination with Dask means that your Dask worker ran out of memory and died, and when it spun back up the Dask scheduler caused it to re-run all work. If you need further assistance I recommend starting a new thread here so other folks will be able to see it (this thread is 3 months old) or a GitHub discussion to maximize visibility!