    Matias Godoy

    2 years ago
    Hi guys! I found a weird behaviour with the agents: We have a flow that has been working perfectly for a while now. The problem is that every now and then we have a run in which every task (in the same flow run) is executed twice. We have only two agents running in an EC2 instance. I started looking today and I found that sometimes both of the agents pick the same flow run, and both execute it! [more info in the comments so I don't pollute the main thread]
    Both agents are running under supervisord in the same EC2 instance, so I have a GUI that allows me to easily see their logs. Here's what I found:
    Agent 1:
    [2020-08-19 06:39:07,450] INFO - agent | Found 1 flow run(s) to submit for execution.
    [2020-08-19 06:39:07,582] INFO - agent | Deploying flow run 8d061e63-8b9e-40a6-a1b8-103cecedac01
    Agent 2:
    [2020-08-19 06:39:07,672] INFO - agent | Found 1 flow run(s) to submit for execution.
    [2020-08-19 06:39:07,805] INFO - agent | Deploying flow run 8d061e63-8b9e-40a6-a1b8-103cecedac01
    As you can see, both agents found the pending run and both picked it up, roughly 220 milliseconds apart.
    Also in the cloud UI, you can clearly see what's happening in the logs:
    The strange part is that this only happens every now and then. The agents work perfectly most of the time.
    Is there something I'm doing wrong? Has this been reported before? Let me know if you need me to provide more information.
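For anyone wanting to check their own agent logs for the same symptom, here is a minimal sketch that scans log lines in the format quoted above and reports any flow run ID deployed by more than one agent. The function name `find_duplicate_deployments` and the dictionary-of-lines input are assumptions made for illustration; they are not part of Prefect.

```python
import re
from collections import defaultdict

# Matches agent log lines of the form shown in this thread, e.g.
# [2020-08-19 06:39:07,582] INFO - agent | Deploying flow run 8d061e63-...
DEPLOY_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] INFO - agent \| "
    r"Deploying flow run (?P<run_id>[0-9a-f-]{36})"
)


def find_duplicate_deployments(logs_by_agent):
    """Return {run_id: [(agent, timestamp), ...]} for runs deployed by 2+ agents.

    `logs_by_agent` is a hypothetical input shape: agent name -> list of log lines.
    """
    seen = defaultdict(list)
    for agent, lines in logs_by_agent.items():
        for line in lines:
            match = DEPLOY_RE.match(line.strip())
            if match:
                seen[match.group("run_id")].append((agent, match.group("ts")))
    # Keep only runs that were picked up by more than one distinct agent.
    return {
        run_id: hits
        for run_id, hits in seen.items()
        if len({agent for agent, _ in hits}) > 1
    }
```

Feeding it the four log lines quoted above (two per agent) should flag run `8d061e63-8b9e-40a6-a1b8-103cecedac01` as deployed by both agents.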
    Jeremiah

    2 years ago
    Hi @Matias Godoy! We actually experienced a similar issue with one of our test workflows yesterday morning and identified a race condition with multiple agents. We pushed a fix for that issue yesterday afternoon, so if it's the same I hope you won't see this behavior anymore. Of course if you do see anything unexpected, please let us know!
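The thread does not describe how the race was actually fixed, so the following is only an illustrative sketch of the general idea: when several agents poll for the same pending run, the "check state, then mark as submitted" step has to be atomic so that only one agent wins. `RunQueue`, `claim`, and `simulate` are hypothetical names, and the in-memory lock stands in for whatever the real backend uses.

```python
import threading


class RunQueue:
    """Hypothetical in-memory stand-in for the backend that hands out flow runs."""

    def __init__(self, run_ids):
        self._state = {run_id: "Scheduled" for run_id in run_ids}
        self._lock = threading.Lock()

    def claim(self, run_id, agent):
        """Atomically move a run from Scheduled to Submitted.

        The check and the update happen under one lock, so a second agent
        arriving a few milliseconds later sees the new state and backs off.
        """
        with self._lock:
            if self._state.get(run_id) != "Scheduled":
                return False
            self._state[run_id] = f"Submitted by {agent}"
            return True


def simulate(n_agents=2, run_id="8d061e63-8b9e-40a6-a1b8-103cecedac01"):
    """Let n_agents race for one run and return the names of those that won."""
    queue = RunQueue([run_id])
    winners = []

    def agent_loop(name):
        if queue.claim(run_id, name):
            winners.append(name)  # only the winner would deploy the run

    threads = [
        threading.Thread(target=agent_loop, args=(f"agent-{i + 1}",))
        for i in range(n_agents)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return winners
```

Without the lock, both agents could observe `Scheduled` before either writes the new state, which is the double-deployment pattern visible in the logs earlier in this thread.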
    Matias Godoy

    2 years ago
    Nice! Is this fix for the agents or for Cloud? I'm asking so I know if I have to take any action 🙂
    Chris White

    2 years ago
    It's for Cloud - no action required on your part 👍
    Jeremiah

    2 years ago
    Thanks Chris - Apologies I missed your follow-up @Matias Godoy
    Matias Godoy

    2 years ago
    Excellent. Thanks a lot!
    Eldho Suresh

    1 year ago
    @Narasimhan Ramaswamy
    Narasimhan Ramaswamy

    1 year ago
    @Jeremiah - just an extension to this problem: we are seeing Prefect jobs submitted twice. We have only one agent running, but at random, all tasks within a single flow are run twice. The flow uses DaskKubernetesEnvironment. Our agent is hosted in AKS and we manage flows with Prefect Cloud. We noticed that these random failed flows were rescheduled by Lazarus.
    Can you please help here?
    Jeremiah

    1 year ago
    @Narasimhan Ramaswamy it sounds like you are experiencing a different issue than the one in this thread, since your symptom is repeated tasks, not runs (apart from normal Lazarus rescheduling). Sometimes seeing repeated tasks in combination with Dask means that your Dask worker ran out of memory and died, and when it spun back up the Dask scheduler caused it to re-run all work. If you need further assistance I recommend starting a new thread here so other folks will be able to see it (this thread is 3 months old) or a GitHub discussion to maximize visibility!