# prefect-server
m
Hello 🙂 I’m experiencing a weird problem where Prefect runs the same task twice, one run after the other, within a span of several seconds. Looking at the schematic, it’s clear that there’s only one task. I’m using Prefect version 0.15.2. Thanks!
It turns out that the flow was scheduled, submitted, and run twice by the agent within the same flow run, which caused this. Any reason why this happened?
Also, in the (Kubernetes) agent logs I see that it quite often hits read timeouts like this one when trying to start a flow run:
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out. (read timeout=None)
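That (read timeout=None) is raised by the Kubernetes Python client the agent uses to talk to the cluster API (10.0.0.1:443 is most likely the in-cluster API service), and it means no read timeout was set for that call. A minimal sketch, not the agent's actual code, of how that client accepts a per-call timeout; the namespace is a placeholder:

```python
# Minimal sketch, NOT the agent's actual code: the Prefect Kubernetes agent
# talks to the cluster API through the official kubernetes Python client,
# whose calls accept a per-call _request_timeout. "(read timeout=None)" in
# the traceback means no read timeout was set for that call.
from kubernetes import client, config

config.load_incluster_config()      # the agent runs inside the cluster

batch = client.BatchV1Api()         # flow runs are created as Kubernetes Jobs
jobs = batch.list_namespaced_job(
    namespace="prefect",            # placeholder namespace
    _request_timeout=(5, 30),       # (connect, read) timeouts in seconds
)
print(f"{len(jobs.items)} jobs visible to the agent")
```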
k
Are you by chance using flow.run() inside the registered flow?
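For context, a minimal sketch of the pattern being asked about, with placeholder flow and project names: calling flow.run() in the same script that registers the flow executes the flow locally, on top of the scheduled runs the agent kicks off.

```python
from prefect import Flow, task

@task
def say_hello():
    print("hello")

with Flow("example-flow") as flow:  # flow name is a placeholder
    say_hello()

# The pattern being asked about: calling flow.run() here would execute the
# flow locally every time this script runs, in addition to the scheduled
# runs the agent starts.
# flow.run()

# Registration only; scheduled runs are then picked up by the agent.
flow.register(project_name="example-project")  # project name is a placeholder
```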
m
No 🙂
k
Do they have different flow run ids?
m
No, it was the same flow run Id.
it’s not really visible here, but in the logs the same task was started twice
with several seconds between starts
k
Does it happen for all tasks or some tasks?
m
For all tasks. Just checked.
k
Do you have multiple agents capable of picking up the flow? Did it register on another agent?
m
There’s only one agent 🙂
k
Does it happen all the time for this flow?
m
No, it happened only today and the day before yesterday (the flow runs once a day).
k
Ok, I have no ideas at the moment. Will ask the team about it
m
Thanks! 🙂
Hi @Kevin Kho 🙂 I just wanted to see if there are any updates on what might be the cause of this? Thanks!
k
There were no ideas, honestly. Has it happened again since then?
m
It happens almost every day (some days it works ok) 😅 I didn’t have time to explore it in more detail. I’ll look into upgrading Prefect to the latest version and see if there are any improvements. My guess so far is that the agent has problems, so we’ll see if the new version helps.
k
What version are you on?
m
0.15.2
k
Oh 0.15.2… that should be stable. Could you show me what you see in the agents tab? Just blur out sensitive info
m
this one?
or did you mean the detailed view?
k
Yeah this one. Do you have other agents at all?
m
nope, just this one 🙂
is this a good practice? How many do you recommend?
k
Ok, let me try to get input from the team again. It is good practice. We really just recommend one.
m
👍
We plan to run more of them, which would spawn jobs in different namespaces for different teams, with labels set accordingly. But for now we have just this one.
Thanks 🙂
which would spawn jobs, because all our flows are k8s runs (KubernetesRun for the run_config)
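A minimal sketch of the setup described above, with placeholder flow, image, label, and project names: each team's flow gets a KubernetesRun run_config whose labels match that team's agent, and the agent spawns the flow-run job in the namespace it is configured for.

```python
from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def extract():
    return 42

with Flow("team-a-flow") as flow:   # placeholder flow name
    extract()

# Only an agent started with the matching label picks this flow up; that
# agent then spawns the flow-run job in the namespace it is configured for.
flow.run_config = KubernetesRun(
    image="registry.example.com/team-a-flows:latest",  # placeholder image
    labels=["team-a"],                                 # must match the team's agent labels
)

flow.register(project_name="team-a")                   # placeholder project
```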
k
Oh, that is a good use case for multiple agents, yep
m
👍
k
Hey Marko, so this behavior happens when a Dask worker dies and the tasks that live on that worker are re-run. Dask keeps a computation graph of the tasks that are submitted, and if a worker dies, due to memory pressure for example, it will re-run the tasks that were submitted on that machine
Prefect Cloud has a mechanism to stop these tasks from re-running, called Version Locking, but it is not available on Server
m
Thanks Kevin! I’ll take a closer look to see if this is what’s happening to us as well, because we run our tasks mostly using LocalDaskExecutor, and this happens before any task has been started. But in any case, I’ll try to get more details and see how to resolve this some other way 🙂
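For context, a minimal sketch of how a LocalDaskExecutor is attached to a flow (placeholder names): it runs tasks on a local Dask scheduler (threads or processes) inside the flow-run job, not on a separate Dask cluster of workers.

```python
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor

@task
def double(x):
    return x * 2

with Flow("local-dask-flow") as flow:   # placeholder flow name
    double.map(list(range(10)))

# LocalDaskExecutor uses a local Dask scheduler (threads or processes)
# inside the flow-run job, rather than a separate Dask cluster.
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)
```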
k
I think we will have someone working on implementing the locking for Server. This is the issue to track; feel free to chime in
m
Great, thanks! 🙂 🙂
Hello again. I checked this issue and the PR that was merged, and I’m not sure it will fix the problem, since it addresses duplicate agents running the same flow (and the check window is 30 seconds). The problem we have is related to the Lazarus process, which reschedules the flow run but doesn’t kill the previous one. An example from today is below. GraphQL request:
query {
  flow_run(
    where: {
      flow: { name: { _eq: "<flow-name>" } }
      scheduled_start_time: { _eq: "2021-09-28T05:10:00" }
    }
  ) {
    flow {
      name
    }
    id
    states {
      state
      message
      created
      start_time
    }
    state
    scheduled_start_time
    auto_scheduled
  }
}
Response:
{
  "data": {
    "flow_run": [
      {
        "flow": {
          "name": "<name>"
        },
        "id": "<id>",
        "states": [
          {
            "state": "Success",
            "message": "All reference tasks succeeded.",
            "created": "2021-09-28T05:36:09.345878+00:00",
            "start_time": null
          },
          {
            "state": "Failed",
            "message": "Kubernetes Error: pods ['prefect-job-<job-id>'] failed for this job",
            "created": "2021-09-28T05:35:56.861441+00:00",
            "start_time": null
          },
          {
            "state": "Running",
            "message": "Running flow.",
            "created": "2021-09-28T05:25:45.80988+00:00",
            "start_time": null
          },
          {
            "state": "Submitted",
            "message": "Submitted for execution",
            "created": "2021-09-28T05:25:42.091069+00:00",
            "start_time": null
          },
          {
            "state": "Scheduled",
            "message": "Rescheduled by a Lazarus process.",
            "created": "2021-09-28T05:20:08.821885+00:00",
            "start_time": "2021-09-28T05:20:08.815652+00:00"
          },
          {
            "state": "Submitted",
            "message": "Submitted for execution",
            "created": "2021-09-28T05:10:00.035923+00:00",
            "start_time": null
          },
          {
            "state": "Scheduled",
            "message": "Flow run scheduled.",
            "created": "2021-09-27T19:10:28.001225+00:00",
            "start_time": "2021-09-28T05:10:00+00:00"
          }
        ],
        "state": "Success",
        "scheduled_start_time": "2021-09-28T05:10:00+00:00",
        "auto_scheduled": true
      }
    ]
  }
}
Is there any suggestion on how to resolve this? Thanks! 🙂
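The query above can also be issued programmatically through the Prefect client; a minimal sketch, assuming the client is already configured to point at the Server API, using the same placeholder flow name and timestamp:

```python
# Sketch: the same flow_run query issued through the Prefect client,
# assuming the client is already configured for the Server API.
# Flow name and timestamp are the placeholders from the example above.
from prefect import Client

client = Client()
result = client.graphql(
    """
    query {
      flow_run(
        where: {
          flow: { name: { _eq: "<flow-name>" } }
          scheduled_start_time: { _eq: "2021-09-28T05:10:00" }
        }
      ) {
        id
        state
        states { state message created }
      }
    }
    """
)
print(result)   # same shape as the response pasted above
```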
k
Will check with the team on this
m
Hi Kevin 🙂 Any news on this one? Thanks!
k
Sorry I think I didn’t get a response from the team. I’ll follow up
Please feel free to follow up faster if I don’t get back to you in like 1 or 2 days. No need to wait a week
m
No worries 🙂 I was just busy with other things 😄 and also didn’t want to be too annoying 😄
k
Hey Marko, I chatted with the team and unfortunately there’s not much that can be done on Server here, because this is exactly what Version Locking solves on Cloud, and Version Locking requires some services that aren’t shipped with Server
m
Ok 🙂 Thanks for the answer in any case 🙂
k
Sorry about that!