
Marko Jamedzija

09/10/2021, 7:10 AM
Hello 🙂 I’m experiencing a weird problem where Prefect runs the same task twice, one after another, within a span of several seconds. Looking at the schematic, it’s clear that there’s only one task. I’m using Prefect version 0.15.2. Thanks!
It turns out that the flow was scheduled, submitted, and run twice by the agent within the same flow run, which caused this. Any idea why this happened?
Also, in the (Kubernetes) agent logs I see read timeouts like this one quite often when it’s trying to start a flow run:
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out. (read timeout=None)

Kevin Kho

09/10/2021, 1:53 PM
Are you by chance using flow.run() inside the registered flow?
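(For context, the pattern Kevin is asking about looks roughly like the sketch below. With script-based storage the flow file is re-executed when a run starts, so a module-level flow.run() can kick off an extra, unmanaged run. The flow name here is illustrative, not from this thread.)

from prefect import Flow, task

@task
def say_hello():
    print("hello")

with Flow("example-flow") as flow:
    say_hello()

# Problematic: a module-level call executes the flow every time this file is run,
# including when a backend run loads the script.
# flow.run()

# Safer: only run ad hoc when the file is executed directly.
if __name__ == "__main__":
    flow.run()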

Marko Jamedzija

09/10/2021, 2:21 PM
No 🙂

Kevin Kho

09/10/2021, 2:23 PM
Do they have different flow run ids?

Marko Jamedzija

09/10/2021, 2:25 PM
No, it was the same flow run ID.
It’s not really visible here, but in the logs the same task was started twice,
with several seconds between starts.
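(One way to confirm this kind of duplication from inside the run itself is to log the identifiers Prefect exposes in its context; this is just a diagnostic sketch, not something from the thread.)

import prefect
from prefect import task

@task
def log_run_identity():
    logger = prefect.context.get("logger")
    # flow_run_id stays the same when the same flow run is executed again,
    # while task_run_count increments when a task run is executed more than once.
    logger.info(
        "flow_run_id=%s task_run_id=%s task_run_count=%s",
        prefect.context.get("flow_run_id"),
        prefect.context.get("task_run_id"),
        prefect.context.get("task_run_count"),
    )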

Kevin Kho

09/10/2021, 2:27 PM
Does it happen for all tasks or some tasks?

Marko Jamedzija

09/10/2021, 2:29 PM
For all tasks. Just checked.

Kevin Kho

09/10/2021, 2:34 PM
Do you have multiple agents capable of picking up this flow? Did it get picked up by another agent?

Marko Jamedzija

09/10/2021, 2:39 PM
There’s only one agent 🙂

Kevin Kho

09/10/2021, 2:45 PM
Does it happen all the time for this flow?

Marko Jamedzija

09/10/2021, 2:46 PM
No, it happened only today and the day before yesterday (it runs once a day).

Kevin Kho

09/10/2021, 2:48 PM
Ok, I have no ideas at the moment. Will ask the team about it.

Marko Jamedzija

09/10/2021, 2:49 PM
Thanks! 🙂
Hi @Kevin Kho 🙂 I just wanted to see if there are any updates on what might be the cause of this? Thanks!

Kevin Kho

09/14/2021, 2:26 PM
There were no ideas, honestly. Has it happened again since then?

Marko Jamedzija

09/14/2021, 2:29 PM
It happens almost every day (some days it works ok) 😅 I didn’t have time to explore it in more detail. I’ll try upgrading Prefect to the latest version and see if there are any improvements. My guess so far is that the agent has problems, so we’ll see if the new version helps.

Kevin Kho

09/14/2021, 2:31 PM
What version are you on?

Marko Jamedzija

09/14/2021, 2:31 PM
0.15.2

Kevin Kho

09/14/2021, 2:31 PM
Oh, 0.15.2… that should be stable. Could you show me what you see in the Agents tab? Just blur out sensitive info.

Marko Jamedzija

09/14/2021, 2:33 PM
This one?
Or did you mean the detailed view?

Kevin Kho

09/14/2021, 2:35 PM
Yeah this one. Do you have other agents at all?

Marko Jamedzija

09/14/2021, 2:36 PM
nope, just this one 🙂
is this a good practice? How many do you recommend?

Kevin Kho

09/14/2021, 2:37 PM
Ok let me try to get input from the team again. It is. We really just recommend one.

Marko Jamedzija

09/14/2021, 2:37 PM
👍
We plan to run more of them, which would spawn jobs in different namespaces for different teams, with labels set accordingly. But for now we have just this one.
Thanks 🙂
which would spawn jobs
because all our flows are k8s runs (using KubernetesRun for run_config)
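(For illustration, that multi-agent setup is usually wired with labels: each flow’s KubernetesRun carries a team label and the matching agent is started with the same label in that team’s namespace. The names and image below are placeholders.)

from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow(
    "team-a-flow",
    run_config=KubernetesRun(
        labels=["team-a"],  # only an agent started with this label picks up the run
        image="registry.example.com/team-a/flows:latest",  # placeholder image
    ),
) as flow:
    ...

# Matching agent, started in that team's namespace, e.g.:
#   prefect agent kubernetes start --label team-a --namespace team-a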

Kevin Kho

09/14/2021, 2:41 PM
Oh, that is a good use case for having multiple agents, yep.
m

Marko Jamedzija

09/14/2021, 2:42 PM
👍

Kevin Kho

09/15/2021, 3:33 AM
Hey Marko, so this behavior happens when a Dask worker dies and the tasks that live on that worker are re-run. Dask has a computation graph of the tasks that were submitted, and if the worker dies, due to memory for example, it will re-run the tasks that were submitted on that machine.
Prefect Cloud has a mechanism to stop these tasks from re-running, called Version Locking, but it is not available on Server.

Marko Jamedzija

09/15/2021, 9:00 AM
Thanks Kevin! I’ll take a closer look to see if this is what’s happening to us as well, because we run our tasks mostly using LocalDaskExecutor, and this happens before any task has even started. But in any case, I’ll try to get more details and see how to resolve this some other way 🙂
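(For reference, the executor Marko mentions is typically attached like this in 0.15.x; LocalDaskExecutor runs tasks in local threads or processes inside the flow-run job rather than on a separate Dask cluster. The scheduler and worker count below are illustrative.)

from prefect import Flow, task
from prefect.executors import LocalDaskExecutor

@task
def double(x):
    return x * 2

with Flow("example-local-dask") as flow:
    double.map(list(range(10)))

flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)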

Kevin Kho

09/15/2021, 2:25 PM
I think we will have someone working on implementing the locking for Server. This is the issue to track, so feel free to chime in.

Marko Jamedzija

09/17/2021, 8:43 AM
Great, thanks! 🙂 🙂
Hello again. I checked this issue and the PR that was merged, and I’m not sure it will fix our problem, since it addresses runs picked up by duplicate agents (and the check window is 30 seconds). The problem we have is related to the Lazarus process, which reschedules the flow run and doesn’t kill the previous one. An example from today is below. GraphQL request:
query {
  flow_run(
    where: {
      flow: { name: { _eq: "<flow-name>" } }
      scheduled_start_time: { _eq: "2021-09-28T05:10:00" }
    }
  ) {
    flow {
      name
    }
    id
    states {
      state
      message
      created
      start_time
    }
    state
    scheduled_start_time
    auto_scheduled
  }
}
Response:
{
  "data": {
    "flow_run": [
      {
        "flow": {
          "name": "<name>"
        },
        "id": "<id>",
        "states": [
          {
            "state": "Success",
            "message": "All reference tasks succeeded.",
            "created": "2021-09-28T05:36:09.345878+00:00",
            "start_time": null
          },
          {
            "state": "Failed",
            "message": "Kubernetes Error: pods ['prefect-job-<job-id>'] failed for this job",
            "created": "2021-09-28T05:35:56.861441+00:00",
            "start_time": null
          },
          {
            "state": "Running",
            "message": "Running flow.",
            "created": "2021-09-28T05:25:45.80988+00:00",
            "start_time": null
          },
          {
            "state": "Submitted",
            "message": "Submitted for execution",
            "created": "2021-09-28T05:25:42.091069+00:00",
            "start_time": null
          },
          {
            "state": "Scheduled",
            "message": "Rescheduled by a Lazarus process.",
            "created": "2021-09-28T05:20:08.821885+00:00",
            "start_time": "2021-09-28T05:20:08.815652+00:00"
          },
          {
            "state": "Submitted",
            "message": "Submitted for execution",
            "created": "2021-09-28T05:10:00.035923+00:00",
            "start_time": null
          },
          {
            "state": "Scheduled",
            "message": "Flow run scheduled.",
            "created": "2021-09-27T19:10:28.001225+00:00",
            "start_time": "2021-09-28T05:10:00+00:00"
          }
        ],
        "state": "Success",
        "scheduled_start_time": "2021-09-28T05:10:00+00:00",
        "auto_scheduled": true
      }
    ]
  }
}
Is there any suggestion on how to resolve this? Thanks! 🙂
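(For reference, the same kind of query can also be issued programmatically with the Prefect client; the flow name and timestamp below are placeholders, as above.)

from prefect import Client

client = Client()  # uses the configured Server endpoint

query = """
query {
  flow_run(
    where: {
      flow: { name: { _eq: "<flow-name>" } }
      scheduled_start_time: { _eq: "2021-09-28T05:10:00" }
    }
  ) {
    id
    states { state message created }
  }
}
"""

result = client.graphql(query)
for run in result["data"]["flow_run"]:
    for state in run["states"]:
        print(state["created"], state["state"], state["message"])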

Kevin Kho

09/28/2021, 2:21 PM
Will check with the team on this

Marko Jamedzija

10/06/2021, 10:11 AM
Hi Kevin 🙂 Any news on this one? Thanks!

Kevin Kho

10/06/2021, 2:42 PM
Sorry, I don’t think I got a response from the team. I’ll follow up.
Please feel free to follow up sooner if I don’t get back to you in like 1 or 2 days. No need to wait a week.

Marko Jamedzija

10/06/2021, 2:43 PM
No worries 🙂 I was just busy with other things 😄 and also didn’t want to be too annoying 😄

Kevin Kho

10/06/2021, 9:38 PM
Hey Marko, I chatted with the team and unfortunately there’s not much that can be done on Server here, because this is exactly what Version Locking on Cloud is for, and Version Locking requires some services that aren’t shipped with Server.

Marko Jamedzija

10/07/2021, 3:53 PM
Ok 🙂 Thanks for the answer in any case 🙂

Kevin Kho

10/07/2021, 3:54 PM
Sorry about that!