
Rafał Bielicki

05/12/2023, 11:40 AM
Hello everyone, I have a strange case where my Flow gets executed 2 times in a row while retries are set to 0 and the execution is successful. I have persistence disabled, and the second run breaks due to missing result storage. Here is a screenshot: I have 2 runs and retries set to 0. Any ideas what is wrong here? What is more, the first run completes successfully and then immediately runs again.

Jeff Hale

05/12/2023, 3:25 PM
What’s the result of prefect version?
A few more questions that could help us figure out what’s up with your flow:
• What infrastructure is being used for the run?
• Are retries configured for the run?
• Are any automations configured for the run?
• Is the run managed by a worker or an agent?
• Is it a subflow run?
• Was the run triggered by run_deployment or a schedule?
• Are you using Cloud or OSS?
• If Cloud, provide workspace and flow run IDs.
If using OSS, state transitions can be retrieved with the client with this script, and that information could be helpful:
# usage: python <file>.py <FLOW_RUN_ID>
import asyncio
import sys

from prefect import get_client


async def main(flow_run_id):
    # Print every state transition recorded for the given flow run
    async with get_client() as client:
        states = await client.read_flow_run_states(flow_run_id)
        for state in states:
            print(state.timestamp, state.type.name, state.name)


asyncio.run(main(sys.argv[1]))

Rafał Bielicki

05/12/2023, 8:15 PM
Agent version is 2.7.10, Orion server is 2.10.9.
1. For this particular flow we use Process infrastructure.
2. The flow has 0 retries and the task has 3 retries with backoff. But this run is not failing, it is just running 2 times in a row.
3. The run is managed by an agent (2.7.10).
4. It is just a task inside a flow (not a subflow).
5. Triggered by run_deployment.
6. OSS
:thank-you: 1
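For reference, a retry setup like the one described above typically looks something like the sketch below in Prefect 2; the function names are illustrative placeholders, not the actual flow in question.
# Minimal sketch (illustrative names): the flow has 0 retries, while the task
# retries up to 3 times with exponential backoff between attempts.
from prefect import flow, task
from prefect.tasks import exponential_backoff


@task(retries=3, retry_delay_seconds=exponential_backoff(backoff_factor=10))
def do_work():
    ...


@flow(retries=0)
def example_flow():
    do_work()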

Zanie

05/12/2023, 8:39 PM
The state transition information is the most important thing
Can you get us that?

Rafał Bielicki

05/15/2023, 6:28 PM
@Zanie Sorry for the delay, I had a rough weekend. Let me send the data. The pattern is pretty much the same for all the cases I checked:
2023-05-15T16:15:41.963169+00:00 SCHEDULED Scheduled
2023-05-15T16:15:42.289343+00:00 PENDING Pending
2023-05-15T16:15:42.304417+00:00 PENDING Pending
2023-05-15T16:15:48.667443+00:00 RUNNING Running
2023-05-15T16:15:49.092928+00:00 RUNNING Running
2023-05-15T16:15:49.396154+00:00 COMPLETED Completed
2023-05-15T16:15:49.489030+00:00 FAILED Failed
We do have multiple agents consuming this queue.
But all logs seem to come from the same agent.

Zanie

05/15/2023, 7:04 PM
Interesting, so a single agent displays logs showing that it is submitting the run twice?

Rafał Bielicki

05/15/2023, 7:49 PM
Let me double-check.
Yes, the second run is even taking its result from the agent cache:
Finished in state Cached(type=COMPLETED)

Zanie

05/15/2023, 7:53 PM
Finished in state is a log from the run itself.
We’re interested in the infrastructure management logs on the agent, like: is Submitting flow run <id> displayed on one agent twice, or on both agents once?

Rafał Bielicki

05/15/2023, 7:56 PM
This is shown on both agents once.
We have 2 pods with the Orion server; might that be an issue?

Zanie

05/15/2023, 8:00 PM
Maybe. Two agents and two APIs seem likely to increase the chance of a race condition here. We should ban the second PENDING transition, though; we’ll have to look into that.
We definitely avoid this with a single agent by storing submitting runs in a local set.
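Roughly, the guard described here works like the sketch below; the names are illustrative rather than the actual agent source.
# Illustrative sketch: the agent keeps a local, in-memory set of flow run IDs
# it is currently submitting, so a single agent never submits the same run
# twice. Being process-local, it cannot protect against two different agents
# picking up the same run, which is why two agents can still race.
submitting_flow_run_ids: set = set()


async def start_infrastructure(flow_run_id: str) -> None:
    """Hypothetical stand-in for creating the run's infrastructure."""
    ...


async def submit_flow_run(flow_run_id: str) -> None:
    if flow_run_id in submitting_flow_run_ids:
        return  # this agent is already submitting that run; skip it
    submitting_flow_run_ids.add(flow_run_id)
    try:
        await start_infrastructure(flow_run_id)
    finally:
        submitting_flow_run_ids.discard(flow_run_id)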

Rafał Bielicki

05/15/2023, 8:03 PM
The thing is that we run multiple agents for horizontal scaling. We could try to use one big agent with more resources and a higher concurrency limit, but it runs on a spot instance and there are a lot of tasks, so there is a risk of it being evicted, and 2.7.10 doesn’t handle SIGTERM properly (or at least not in our case, because tasks were left in a “running” state forever).
Do you need anything more from me? Also, is it more likely an Orion or an agent issue? That might be a good reason to update from 2.7.10.

Zanie

05/15/2023, 8:33 PM
It’s more likely to be a server-side issue
I haven’t seen any recent reports of this; it’s possible it’s been fixed, since you’re pretty far behind on versions.

Rafał Bielicki

05/16/2023, 6:46 AM
As I mentioned HERE, the server is 2.10.9 and the agents are 2.7.10.

Zanie

05/16/2023, 2:15 PM
Hm interesting

Rafał Bielicki

05/16/2023, 2:17 PM
I observed this issue on 2.10.9 agents as well. I moved everything to a single agent for the time being and it has worked fine since (6 hours), but I would love to have more agents, especially for Process infra runs.

Zanie

05/16/2023, 4:59 PM
I believe I have a fix at https://github.com/PrefectHQ/prefect/pull/9590
🙌 1
❤️ 1

Rafał Bielicki

05/16/2023, 5:40 PM
Should we expect this fix to be included in the next release?
I have one more concern, going back to the initial question HERE. When running, we get more than one run for a flow that has 0 retries.
So the task run has a retry with backoff, but the flow does not. Why is it that the flow retries itself?
@Jeff Hale Can we continue this issue in this topic? Or should I create a separate one, since those 2 issues are probably different?
1. Running such flows with run_deployment
2. Infra: K8s job
3. Prefect 2.10.9
The flow has one task inside, and the task has a retry with backoff, but the flow itself is run multiple times. Is it because the Job is being retried by the cluster?
I think that’s it, because I see backoffLimit: 6 in the manifest. Sorry for the hassle.

Zanie

05/16/2023, 8:54 PM
Yes we’ll have that fix in the next release
And yes, the backoffLimit is the cause of the other issue. If your flow runs are on 2.10.3+, they won’t back off unless the flow run crashes. We’re also going to change the default backoff limit.
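For anyone hitting the same thing: backoffLimit: 6 is the Kubernetes default when the Job manifest leaves it unset, so the cluster itself re-runs the Job on pod failure. One way to pin it is through the KubernetesJob infrastructure block’s customizations (a JSON 6902 patch applied to the base Job manifest); the image, namespace, and block name below are placeholders, so treat this as a sketch rather than the exact manifest change used here.
# Hedged sketch: set spec.backoffLimit to 0 in the generated Job manifest so
# Kubernetes never retries the Job itself and Prefect's retries stay in charge.
from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    image="my-registry/my-flow:latest",  # placeholder image
    namespace="prefect",  # placeholder namespace
    customizations=[
        # JSON 6902 patch applied to the base Job manifest
        {"op": "add", "path": "/spec/backoffLimit", "value": 0},
    ],
)
k8s_job.save("no-k8s-job-retries", overwrite=True)  # placeholder block name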

Rafał Bielicki

05/17/2023, 7:38 AM
I updated the manifest and it is no longer an issue 🙂 Thank you :)
🙌 1