
Rafał Bielicki

05/12/2023, 11:40 AM
Hello everyone, I have a strange case where my Flow gets executed 2 times in a row while retries are set to 0 and the execution is successful. I have persistence disabled, and the second run breaks due to missing result storage. Here is a screenshot: I have 2 runs and retries set to 0. Any ideas what is wrong here? What is more, the first run completes successfully and then immediately runs again.

Jeff Hale

05/12/2023, 3:25 PM
What’s the result of prefect version?
A few more questions that could help us figure out what’s up with your flow:
• What infrastructure is being used for the run?
• Are retries configured for the run?
• Are any automations configured for the run?
• Is the run managed by a worker or an agent?
• Is it a subflow run?
• Was the run triggered by run_deployment or a schedule?
• Are you using Cloud or OSS?
• If Cloud, provide workspace and flow run IDs.
If using OSS, state transitions can be retrieved with the client with this script, and that information could be helpful:
# usage: python <file>.py <FLOW_RUN_ID>
import asyncio
import sys

from prefect import get_client


async def main(flow_run_id):
    # Print every state transition recorded for the given flow run
    async with get_client() as client:
        states = await client.read_flow_run_states(flow_run_id)
        for state in states:
            print(state.timestamp, state.type.name, state.name)


asyncio.run(main(sys.argv[1]))

Rafał Bielicki

05/12/2023, 8:15 PM
Agent version is 2.7.10, Orion server is 2.10.9.
1. For this particular flow we use Process infrastructure.
2. The flow has 0 retries and the task has 3 retries with backoff. But this run is not failing, it is just running 2 times in a row.
3. The run is managed by an agent (2.7.10).
4. It is just a task inside a flow (not a subflow).
5. Triggered by run_deployment.
6. OSS
:thank-you: 1
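For reference, a retry setup like the one described above typically looks something like the sketch below in Prefect 2; the function names are illustrative placeholders, not the actual flow in question.
# Minimal sketch (illustrative names): the flow has 0 retries, while the task
# retries up to 3 times with exponential backoff between attempts.
from prefect import flow, task
from prefect.tasks import exponential_backoff


@task(retries=3, retry_delay_seconds=exponential_backoff(backoff_factor=10))
def do_work():
    ...


@flow(retries=0)
def example_flow():
    do_work()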

Zanie

05/12/2023, 8:39 PM
The state transition information is the most important thing
Can you get us that?

Rafał Bielicki

05/15/2023, 6:28 PM
@Zanie Sorry for the delay, I had a rough weekend. Let me send the data. The pattern is pretty much the same for all the cases I checked:
2023-05-15T16:15:41.963169+00:00 SCHEDULED Scheduled
2023-05-15T16:15:42.289343+00:00 PENDING Pending
2023-05-15T16:15:42.304417+00:00 PENDING Pending
2023-05-15T16:15:48.667443+00:00 RUNNING Running
2023-05-15T16:15:49.092928+00:00 RUNNING Running
2023-05-15T16:15:49.396154+00:00 COMPLETED Completed
2023-05-15T16:15:49.489030+00:00 FAILED Failed
We do have multiple agents consuming this queue.
But all logs seem to come from the same agent.

Zanie

05/15/2023, 7:04 PM
Interesting, so a single agent displays logs showing that it is submitting the run twice?

Rafał Bielicki

05/15/2023, 7:49 PM
Let me double-check.
Yes, the second run is even taking its result from the agent cache:
Finished in state Cached(type=COMPLETED)

Zanie

05/15/2023, 7:53 PM
Finished in state is a log from the run itself.
We’re interested in the infrastructure management logs on the agent, like: is Submitting flow run <id> displayed on one agent twice, or on both agents once?

Rafał Bielicki

05/15/2023, 7:56 PM
This is shown on both agents once.
We have 2 pods with the Orion server; might that be an issue?

Zanie

05/15/2023, 8:00 PM
Maybe. Two agents and two APIs seem likely to increase the chance of a race condition here. We should ban the second PENDING transition, though; we’ll have to look into that.
We definitely avoid this with a single agent by storing submitting runs in a local set.
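Roughly, the guard described here works like the sketch below; the names are illustrative rather than the actual agent source.
# Illustrative sketch: the agent keeps a local, in-memory set of flow run IDs
# it is currently submitting, so a single agent never submits the same run
# twice. Being process-local, it cannot protect against two different agents
# picking up the same run, which is why two agents can still race.
submitting_flow_run_ids: set = set()


async def start_infrastructure(flow_run_id: str) -> None:
    """Hypothetical stand-in for creating the run's infrastructure."""
    ...


async def submit_flow_run(flow_run_id: str) -> None:
    if flow_run_id in submitting_flow_run_ids:
        return  # this agent is already submitting that run; skip it
    submitting_flow_run_ids.add(flow_run_id)
    try:
        await start_infrastructure(flow_run_id)
    finally:
        submitting_flow_run_ids.discard(flow_run_id)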

Rafał Bielicki

05/15/2023, 8:03 PM
The thing is that we run multiple agents for horizontal scaling. We could try to use one big agent with more resources and a higher concurrency limit, but it runs on a spot instance and there are a lot of tasks, so there is a risk of it being evicted, and 2.7.10 doesn’t handle SIGTERM properly (or at least not in our case, because tasks were left in a “running” state forever).
Do you need anything more from me? Also, is it more likely an Orion or an agent issue? That might be a good reason to update from 2.7.10.

Zanie

05/15/2023, 8:33 PM
It’s more likely to be a server-side issue
I haven’t seen any recent reports of this; it’s possible it’s been fixed, since you’re pretty far behind on versions.

Rafał Bielicki

05/16/2023, 6:46 AM
As I mentioned HERE, the server is 2.10.9 and the agents are 2.7.10.

Zanie

05/16/2023, 2:15 PM
Hm interesting

Rafał Bielicki

05/16/2023, 2:17 PM
I observed this issue on 2.10.9 agents as well. I moved everything to a single agent for the time being and it has worked fine since (6 hours), but I would love to have more agents, especially for Process infra runs.

Zanie

05/16/2023, 4:59 PM
I believe I have a fix at https://github.com/PrefectHQ/prefect/pull/9590
🙌 1
❤️ 1

Rafał Bielicki

05/16/2023, 5:40 PM
Should we expect this fix to be included in the next release?
I have one more concern, going back to the initial question HERE. When running, we get more than one run for a flow that has 0 retries.
So the task run has a retry with backoff, but the flow does not. Why is it that the flow retries itself?
@Jeff Hale Can we continue this issue in this topic? Or should I create a separate one, since those 2 issues are probably different?
1. Running such flows with run_deployment
2. Infra: K8s job
3. Prefect 2.10.9
The flow has one task inside, and the task has a retry with backoff, but the flow itself is run multiple times. Is it because the Job is being retried by the cluster?
I think that’s it, because I see backoffLimit: 6 in the manifest. Sorry for the hassle.

Zanie

05/16/2023, 8:54 PM
Yes we’ll have that fix in the next release
And yes, the backoffLimit is the cause of the other issue. If your flow runs are on 2.10.3+, they won’t back off unless the flow run crashes. We’re also going to change the default backoff limit.
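For anyone hitting the same thing: backoffLimit: 6 is the Kubernetes default when the Job manifest leaves it unset, so the cluster itself re-runs the Job on pod failure. One way to pin it is through the KubernetesJob infrastructure block’s customizations (a JSON 6902 patch applied to the base Job manifest); the image, namespace, and block name below are placeholders, so treat this as a sketch rather than the exact manifest change used here.
# Hedged sketch: set spec.backoffLimit to 0 in the generated Job manifest so
# Kubernetes never retries the Job itself and Prefect's retries stay in charge.
from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    image="my-registry/my-flow:latest",  # placeholder image
    namespace="prefect",  # placeholder namespace
    customizations=[
        # JSON 6902 patch applied to the base Job manifest
        {"op": "add", "path": "/spec/backoffLimit", "value": 0},
    ],
)
k8s_job.save("no-k8s-job-retries", overwrite=True)  # placeholder block name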

Rafał Bielicki

05/17/2023, 7:38 AM
I updated the manifest and it is no longer an issue 🙂 Thank you :)
🙌 1