Prefect Community #ask-community

David Elliott

02/22/2022, 4:53 PM

Hey all, we’re seeing some unexpected / buggy behaviour with Prefect Cloud, hoping you could advise? We have a flow with a daily cron schedule which gets redeployed once a day (as code changes are made). Previously, when the new version of the flow was registered (part of our deployment), the old version was archived and its scheduled runs were cancelled, but over the last week we’ve been seeing double-scheduled flows remaining - one scheduled flow_run for the current flow, and one for the previous (now archived) flow. Wondering if you’ve seen this issue elsewhere? In addition, we’ve seen some (likely) related oddities with our Automation, which triggers a PagerDuty incident if the flow doesn’t start within the first 5 mins. It looks as though these phantom archived-but-still-scheduled flows are trying to run, hitting our concurrency limit of 1 and stalling, triggering PagerDuty, but then when we click through on the flow_run link, the flow run literally doesn’t exist (either in the UI or in GraphQL). It’s super weird. Hope someone can advise! We’re the Deliveroo Prefect tenant, happy to send over some specifics / URLs if helpful.

Kevin Kho

02/22/2022, 5:02 PM

Hi @David Elliott, I haven’t seen this before. How do you register this? I assume you had the same setup before but this just started acting up recently?

David Elliott

02/22/2022, 5:19 PM

Yep, exactly the same setup, no changes there. We have a CircleCI job which builds the Docker image from our git repo and then essentially does storage.build, attaches the storage to the flow, then finally runs flow.register()

Kevin Kho

02/22/2022, 5:20 PM

I haven’t seen any changes to Cloud that would be related to this. Do you use the set_schedule_active flag with flow.register()?

David Elliott

02/22/2022, 5:22 PM

No we don’t use that flag, though looks like it defaults to True (and we always set a schedule in the flow definition itself)

David Elliott

02/22/2022, 5:23 PM

Looks like the first time we noticed this was Feb 16th 🤔

David Elliott

02/22/2022, 5:30 PM

It does seem weird that an archived flow would still be able to have a scheduled run though..? Not sure if you’re able to see our flows / flow_runs, but here’s a current example:
Flow ID
• This scheduled flow-run is a correctly scheduled run of the current flow (version 1071)
• This scheduled flow-run is an incorrectly scheduled run of the archived flow (version 1066)
If I view the archived flow (1066) in the UI, it still shows its full 9 upcoming scheduled flow_runs, despite being archived - super weird.

David Elliott

02/22/2022, 5:31 PM

Forgot to mention we’re on this prefect version

prefect[aws,kubernetes]==0.15.5

David Elliott

02/22/2022, 5:51 PM

This may help explain it better

{
  flow_run(
    where: {
      auto_scheduled: {_eq: true},
      state: {_in: ["Scheduled"]},
      flow: {
        name: {_eq: "bi_pipeline_v2"},
        project: {name: {_eq: "bi-pipeline-v2-staging"}}
      }
    }
    order_by: {scheduled_start_time: asc_nulls_last}
  ) {
    id
    start_time
    flow {
      name,
      version,
      archived,
      project {
        name
      }
    }
    version
    end_time
    updated
    state
    scheduled_start_time
  }
}
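
For triage, the result of a query like the one above can be narrowed down to the runs that should not exist: Scheduled runs whose parent flow is archived. Below is a minimal Python sketch, assuming the response has already been fetched as plain JSON in the standard GraphQL shape ({"data": {"flow_run": [...]}}). The helper name archived_scheduled_runs and the sample run IDs are made up for illustration and are not part of Prefect.

```python
from typing import Any, Dict, List


def archived_scheduled_runs(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the Scheduled flow runs whose parent flow is archived."""
    runs = response.get("data", {}).get("flow_run", [])
    return [
        run
        for run in runs
        if run.get("state") == "Scheduled" and run.get("flow", {}).get("archived")
    ]


# Hypothetical response shaped like the query above; the IDs are placeholders.
sample = {
    "data": {
        "flow_run": [
            {
                "id": "run-current",
                "state": "Scheduled",
                "flow": {"name": "bi_pipeline_v2", "version": 1071, "archived": False},
            },
            {
                "id": "run-stale",
                "state": "Scheduled",
                "flow": {"name": "bi_pipeline_v2", "version": 1066, "archived": True},
            },
        ]
    }
}

for run in archived_scheduled_runs(sample):
    print(run["id"], run["flow"]["version"])
```

On the sample data, only the run attached to the archived version 1066 is reported; the run for the current version 1071 is left alone.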

Kevin Kho

02/22/2022, 6:29 PM

It is pretty strange and it’s the first time I’ve seen this. Will ask some people

Anna Geller

02/22/2022, 6:42 PM

I remember I had a similar use case, and such scheduled flow runs for an archived flow version didn't manage to get into a Running state in the end. They got cancelled by the Scheduler service with a message: "Flow run was cancelled because this flow is archived". So even if the UI shows scheduled runs for this flow version, those flow runs shouldn't manage to get into a Running state - LMK if you notice something different. Having said that, Kevin will ask the team to be sure.

David Elliott

02/22/2022, 6:49 PM

Unfortunately we’ve had instances (one this afternoon) where they both run (albeit one after the other, because we have a flow concurrency limit of 1). The current version always seems to run first, and then the archived one runs as soon as the first one is complete and we have to manually cancel it, so they do appear to be able to get into a Running state. In addition, on our production deployment we have an Automation, as described above, which checks whether a run hasn’t started for >5 mins and triggers PagerDuty. In our case the archived flow run falls into that and so raises an incident.

Anna Geller

02/22/2022, 6:54 PM

Can you share the flow run ID of such archived flow run which still managed to get into a Running state?

Kevin Kho

02/22/2022, 7:01 PM

Do you have custom role definitions? Does someone have permission to archive but not to delete?

Anna Geller

02/22/2022, 7:10 PM

It would really help if you could send the flow run ID of an archived flow version that managed to get into a Running state. So far, the one you shared (86dfa65c-542b-45de-8010-265ed96e3d4c) only got Scheduled, but it wasn't running - as expected

David Elliott

02/22/2022, 7:37 PM

No custom role defs, and the service acct I’m fairly sure is an admin anyway, and no changes to our roles in the last few months. Yep happy to - this flow run (a2b2af53-aa99-40db-a05e-cd1ad54b9ce6) was version 1066 and got to Running today prior to us cancelling it. The current version was 1071 which ran first (7295ddff-69b4-4ef9-b638-14fa94e1ca7a) and then as soon as that was done the archived one above started running, as they’d both been scheduled for 4pm (but the flow concurrency = 1 stopped them from both running simultaneously)

Kevin Kho

02/22/2022, 7:39 PM

How often is the schedule?

David Elliott

02/22/2022, 7:40 PM

For that one it’s daily (weekdays) at 4pm

David Elliott

02/22/2022, 7:47 PM

Btw we’ve just had 3x PagerDuty alerts raised by these dodgy Automations on production as well, i.e. our production flow_run has kicked off, but 3x others (all archived) also tried to start but had to wait due to flow concurrency, and triggered PD. Here’s one of the PagerDuty messages:

Run `successful-sloth` (`3c295d6f-16bf-4cf8-964c-4437c8891bfc`) of flow `bi_pipeline_v2` failed `SCHEDULED_NOT_STARTED` SLA (`4cb3b9a7-93e8-4353-8759-48601a357106`) after 300 seconds. See [the UI](<https://cloud.prefect.io/deliveroo/flow-run/3c295d6f-16bf-4cf8-964c-4437c8891bfc>) for more details.

What’s weird is that the linked flow run doesn’t exist. But it must have existed at some point in order to trigger the Automation..? When you click on the flow_run URL it provides, you get a blank page, and GraphQL returns nothing for that flow_run. i.e. we have 2 slightly different but related issues:
• on staging we have multiple flow runs scheduled (you can see the archived scheduled flow_runs in the UI), some of which belong to archived flows, and they actually start running when given a chance to
• on production we can only see 1 flow run per day scheduled (which is correct), but when it comes to the scheduled start time, we’ve got some phantom archived flow runs that try to start and trigger the PagerDuty automation. But the flow_runs don’t seem to exist

Anna Geller

02/22/2022, 7:52 PM

With respect to your CI/CD, can you share a bit more about that? Any chance you can share your yaml file (redacted for security)? I wonder how you got to over 1000 flow versions - you said you run this daily, but it seems like more, unless you've had this flow for 3 years 🙂 My suspicion is that the cancelling of flow runs works in most cases for you, but the service that does it may be getting slowed down by the large number of flow versions. If we can reduce the number of new flow versions, perhaps we can tackle the root cause of the issue

Anna Geller

02/22/2022, 7:55 PM

successful-sloth wasn't that successful in the end 😄

David Elliott

02/22/2022, 9:01 PM

Sure - so on production we’re on version ~270, but on staging yeah we’re on 1071 - that’s because our CI/CD builds + registers the flow each time we merge to staging, which is multiple times per day. This pipeline is the entirety of our company’s SQL-based ETL, the flow is 1500 tasks, and we have tonnes of people working on the SQL logic in this flow. As such, we merge legitimate changes to staging multiple times per day, and then at 4pm whatever’s in staging at that time runs per this cron, and if it’s successful, we then merge staging into master, which happens once per day. i.e. that’s why we have so many versions of this flow, but equally we’re seeing similar issues on production with the Automation as described above, and that has only ~270 versions? I feel like regardless of how many flow versions there are though, flow.register ought to be able to handle this..? If it is a scale issue with the number of flow versions, people will start running into that over time anyway? We’ve maybe just hit it early due to the amount of development that happens on this flow?

David Elliott

02/22/2022, 9:05 PM

Do you have any logging on the service that archives old flows and cancels the schedules? Wondering if you can see a noticeable difference in our flows when registering, and whether we can confirm your hunch from the backend logs or something

Anna Geller

02/22/2022, 10:37 PM

Great question regarding the logging service - I asked and we'll check if we can see anything unusual during that time. Thanks for explaining the process around new flow versions. I think having many versions is not an issue, but how frequently you register new versions could cause some issues at scale. But it doesn't look like you are randomly creating new versions every 5 minutes 🙂 You seem to have a structured process for this, so it shouldn't be an issue. I think long-term it won't be a problem, since Orion has a better way of handling that: you determine the flow version yourself in your flow, and you can redeploy a flow without bumping up the flow version (unless you want to - either way, it's more decoupled). We can check the log for this service but not sure what else we can do atm. How many flows/flow runs are affected? Does it happen often?

David Elliott

02/22/2022, 11:00 PM

Thanks. Currently it’s just affecting one of our flows (this bi-pipeline-v2 flow) which is one of our most business critical ones. It’s manageable short-term from an on-call perspective but we definitely need to get the issue resolved as this is our company’s main use of Prefect, ie it’s the one that matters most to us that it’s working and reliable! Let me know what you find in the logs - more than happy to manually register the flow at any time if it helps with log watching / retrieval too. Surely must be something weird with how the flow.register process is working with this flow - definitely shouldn’t be possible to have scheduled archived flows..!

Anna Geller

02/24/2022, 9:45 PM

I promised to keep you posted and here is our update:
• the PagerDuty automation issue should be fixed now, we deployed a fix for it,
• there was nothing suspicious in the logs,
• to mitigate the issue, you can run the registration process a bit further away from the scheduled time. E.g. if your flow is scheduled to run at 4 pm, perhaps you can run this registration first thing in the morning? This way, there is enough time to cancel those flow runs. It’s a hack, but it can mitigate the issue, especially if this flow registration is automated anyway: you can shift it a bit further away from the scheduled time.
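
The timing mitigation above could be enforced in CI with a small guard that refuses to register too close to the next scheduled run. Below is a minimal sketch, assuming the weekday 4 pm schedule mentioned earlier in the thread, naive datetimes in the schedule's own timezone, and an arbitrary 4-hour buffer. The names next_weekday_run and safe_to_register and the buffer value are illustrative assumptions, not anything Prefect provides or the Prefect team recommended.

```python
from datetime import datetime, time, timedelta


def next_weekday_run(now: datetime, run_at: time) -> datetime:
    """Next Monday-Friday occurrence of `run_at` strictly after `now`."""
    candidate = datetime.combine(now.date(), run_at)
    # Skip forward past times that already happened and past weekends.
    while candidate <= now or candidate.weekday() >= 5:
        candidate = datetime.combine(candidate.date() + timedelta(days=1), run_at)
    return candidate


def safe_to_register(
    now: datetime,
    run_at: time = time(16, 0),               # 4 pm, per the thread
    min_gap: timedelta = timedelta(hours=4),  # assumed buffer, tune as needed
) -> bool:
    """True if the next scheduled run is at least `min_gap` away from `now`."""
    return next_weekday_run(now, run_at) - now >= min_gap


# A CI job could call this before flow.register() and skip or delay otherwise.
print(safe_to_register(datetime(2022, 2, 22, 9, 0)))    # morning of a weekday
print(safe_to_register(datetime(2022, 2, 22, 15, 30)))  # 30 minutes before the run
```

The first call reports that registering at 9 am leaves enough room before the 4 pm run; the second reports that 3:30 pm does not.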
