# prefect-community
d
Hey all, we’re seeing some unexpected / buggy behaviour with Prefect Cloud, hoping you could advise? We have a flow with a daily cron schedule which gets redeployed once a day (as code changes are made). Previously, when the new version of the flow was registered (part of our deployment), the old version of the flow was archived and the scheduled runs for that old version were cancelled. Over the last week, though, we’ve been seeing double-scheduled flows remaining: one scheduled flow_run for the current flow, and one for the previous (now archived) flow. Wondering if you’ve seen this issue elsewhere?
In addition, we’ve seen some (likely) related oddities with our Automation, which triggers a PagerDuty incident if the flow doesn’t start within the first 5 minutes. It looks as though these phantom archived-but-still-scheduled flows are trying to run, hitting our concurrency limit of 1 and stalling, triggering PagerDuty, but then when we click through on the flow_run link, the flow run literally doesn’t exist (either in the UI or in GraphQL). It’s super weird. Hope someone can advise! We’re the Deliveroo Prefect tenant, happy to send over some specifics / URLs if helpful.
k
Hi @David Elliott, I haven’t seen this before. How do you register this? I assume you had the same setup before but this just started acting up recently?
d
Yep, exactly the same setup, no changes there. We have a CircleCI job which builds the Docker image from our git repo, then essentially does `storage.build`, attaches the storage to the flow, and finally runs `flow.register()`.
k
I haven’t seen any changes to Cloud that would be related to this. Do you use the `set_schedule_active` flag with `flow.register()`?
d
No we don’t use that flag, though looks like it defaults to True (and we always set a schedule in the flow definition itself)
Looks like the first time we noticed this was Feb 16th 🤔
It does seem weird that an archived flow would still be able to have a scheduled run though..? Not sure if you’re able to see our flows / flow_runs, but here’s a current example:
Flow ID
• This scheduled flow-run is a correctly scheduled run of the current flow (version 1071)
• This scheduled flow-run is an incorrectly scheduled run of the archived flow (version 1066)
If I view the archived flow (1066) in the UI it still shows its full 9 upcoming scheduled flow_runs, despite being archived - super weird.
Forgot to mention we’re on this prefect version
prefect[aws,kubernetes]==0.15.5
This may help explain it better
{
  flow_run(
    where: {
      auto_scheduled: {_eq: true},
      state: {_in: ["Scheduled"]},
      flow: {
        name: {_eq: "bi_pipeline_v2"},
        project: {name: {_eq: "bi-pipeline-v2-staging"}}
      }
    }
    order_by: {scheduled_start_time: asc_nulls_last}
  ) {
    id
    start_time
    flow {
      name,
      version,
      archived,
      project {
        name
      }
    }
    version
    end_time
    updated
    state
    scheduled_start_time
  }
}
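A minimal sketch of how the result of the query above could be filtered to surface the problem runs, assuming the response has already been fetched and unpacked into a list of `flow_run` dicts. The helper name `stale_scheduled_runs` and the sample IDs are hypothetical; only the field names (`state`, `flow.archived`, `flow.version`) come from the query itself.

```python
from typing import Any, Dict, List


def stale_scheduled_runs(flow_runs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return Scheduled runs whose parent flow version is archived."""
    return [
        run
        for run in flow_runs
        if run.get("state") == "Scheduled" and run.get("flow", {}).get("archived")
    ]


# Hypothetical sample shaped like the `flow_run` list the query returns.
sample = [
    {"id": "run-current", "state": "Scheduled",
     "flow": {"name": "bi_pipeline_v2", "version": 1071, "archived": False}},
    {"id": "run-stale", "state": "Scheduled",
     "flow": {"name": "bi_pipeline_v2", "version": 1066, "archived": True}},
]

for run in stale_scheduled_runs(sample):
    # Each hit is a scheduled run that should have been cancelled on archive.
    print(run["id"], run["flow"]["version"])
```

Any run this returns is one that, per the expected behaviour described in this thread, should not still be in a `Scheduled` state.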
k
It is pretty strange and it’s the first time I’ve seen this. Will ask some people
🙏 1
a
I remember I had a similar use case, and in the end such scheduled flow runs for an archived flow version didn't manage to get into a Running state. They got cancelled by the Scheduler service with the message "Flow run was cancelled because this flow is archived". So even if the UI shows scheduled runs for this flow version, those flow runs shouldn't manage to get into a Running state - LMK if you notice something different. Having said that, Kevin will ask the team to be sure.
d
Unfortunately we’ve had instances (one this afternoon) where they both run (albeit one after the other because we have a flow concurrency limit of 1). The current version always seems to run first, and then the archived one runs as soon as the first one is complete and we have to manually cancel it, so they do appear to be able to get into a Running state. In addition, on our production deployment we have an Automation as described above which checks if a run hasn’t started for >5mins and triggers PagerDuty. The archived flow run falls into that case and so raises an incident.
a
Can you share the flow run ID of such archived flow run which still managed to get into a Running state?
k
Do you have custom role definitions? Does someone have permission to archive but not to delete?
a
it would really help if you could send the flow run ID of an archived flow version that managed to get into a Running state. So far, the one you shared (86dfa65c-542b-45de-8010-265ed96e3d4c) only got Scheduled, but it wasn't running - as expected
d
No custom role defs, and the service acct I’m fairly sure is an admin anyway, and no changes to our roles in the last few months. Yep happy to - this flow run (`a2b2af53-aa99-40db-a05e-cd1ad54b9ce6`) was version 1066 and got to `Running` today prior to us cancelling it. The current version was 1071 which ran first (`7295ddff-69b4-4ef9-b638-14fa94e1ca7a`) and then as soon as that was done the archived one above started running as they’d both been scheduled for 4pm (but the flow concurrency = 1 stopped them from both running simultaneously)
k
How often is the schedule?
d
For that one it’s daily (weekdays) at 4pm
Btw we’ve just had 3x Pagerduty alerts raised by these dodgy Automations on production as well, ie our production flow_run has kicked off, but 3x others (all archived) also tried to start but had to wait due to flow concurrency, and triggered PD. Here’s one of the PagerDuty messages:
Run `successful-sloth` (`3c295d6f-16bf-4cf8-964c-4437c8891bfc`) of flow `bi_pipeline_v2` failed `SCHEDULED_NOT_STARTED` SLA (`4cb3b9a7-93e8-4353-8759-48601a357106`) after 300 seconds. See [the UI](<https://cloud.prefect.io/deliveroo/flow-run/3c295d6f-16bf-4cf8-964c-4437c8891bfc>) for more details.
What’s weird is that the linked flow run doesn’t exist. But it must have done at some point in order to trigger the Automation..? When you click on the URL to the flow_run it provides, you get a blank page, and GraphQL returns nothing for that flow_run. ie we have 2 slightly different but related issues:
• on staging we have multiple flow runs scheduled (you can see the archived scheduled flow_runs in the UI), some of which are for archived flows, and they actually start running when given a chance to
• on production we can only see 1 flow run per day scheduled (which is correct) but when it comes to the scheduled start time, we’ve got some phantom archived flow runs that are trying to start and triggering the PagerDuty automation. But the flow_runs don’t seem to exist
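One way to cross-check alerts like the one above against what GraphQL returns is sketched below. It assumes the alert text keeps the format shown in the PagerDuty message, and that `known` holds the flow run IDs a GraphQL query returned; the helper names `alerted_run_id` and `is_phantom` are hypothetical, not part of Prefect or PagerDuty.

```python
import re
from typing import Optional, Set

UUID_RE = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# Matches: Run `<run-name>` (`<flow-run-uuid>`)
RUN_ID_RE = re.compile(r"Run `[^`]+` \(`(" + UUID_RE + r")`\)")


def alerted_run_id(message: str) -> Optional[str]:
    """Extract the flow run ID from an SLA alert message, if present."""
    match = RUN_ID_RE.search(message)
    return match.group(1) if match else None


def is_phantom(message: str, known_run_ids: Set[str]) -> bool:
    """True when the alert names a run ID that GraphQL did not return."""
    run_id = alerted_run_id(message)
    return run_id is not None and run_id not in known_run_ids


alert = (
    "Run `successful-sloth` (`3c295d6f-16bf-4cf8-964c-4437c8891bfc`) "
    "of flow `bi_pipeline_v2` failed `SCHEDULED_NOT_STARTED` SLA"
)
# Assumed: IDs returned by a GraphQL flow_run query for the same time window.
known = {"7295ddff-69b4-4ef9-b638-14fa94e1ca7a"}

print(alerted_run_id(alert), is_phantom(alert, known))
```

Collecting the phantom IDs this way gives a concrete list to hand over when reporting the issue.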
a
With respect to your CI/CD, can you share a bit more about that? Any chance you can share your yaml file (redacted for security)? I wonder how you got to over 1000 flow versions - you said you run this daily, but it seems like more, unless you've had this flow for 3 years 🙂 My suspicion is that the cancelling of flow runs works in most cases for you, but the service that does it may have been slowed down by the large number of flow versions. If we can reduce the number of new flow versions, perhaps we can tackle the root cause of the issue.
successful-sloth wasn't that successful in the end 😄
d
Sure - so on production we’re on version ~270, but on `staging` yeah we’re on 1071 - that’s because our CI/CD builds + registers the flow each time we merge to staging, which is multiple times per day. This pipeline is the entirety of our company’s SQL-based ETL, the flow is 1500 tasks, and we have tonnes of people working on the SQL logic in this flow. As such, we merge legitimate changes to `staging` multiple times per day, and then at 4pm whatever’s in `staging` at that time runs per this cron, and if it’s successful, we then merge `staging` to `master`, which happens once per day. i.e. that’s why we have so many versions of this flow, but equally we’re seeing similar issues on production with the above Automation as described, and that has only ~270 versions? I feel like regardless of how many flow versions there are though, flow.register ought to be able to handle this..? If it is a scale issue with number of flow versions, people will start running into that over time anyway? We’ve maybe just hit it early due to the amount of development that happens on this flow?
Do you have any logging on the service that archives old flows and cancels the schedules? wondering if you can see a noticeable difference in our flows when registering? wondering if we can confirm your hunch from the backend logs or something
a
Great question regarding the logging service - I asked and we'll check if we can see anything unusual during that time. Thanks for explaining the process around new flow versions. I think having many versions is not an issue, but how frequently you register new versions could cause some issues at scale. But it doesn't look like you are randomly creating new versions every 5 minutes 🙂 You seem to have a structured process for this, so it shouldn't be an issue. I think long-term it won't be a problem, since Orion has a better way of handling that: you determine the flow version yourself in your flow, and you can redeploy a flow without bumping up the flow version (unless you want to - either way, it's more decoupled). We can check the log for this service but not sure what else we can do atm. How many flows/flow runs are affected? Does it happen often?
d
Thanks. Currently it’s just affecting one of our flows (this bi-pipeline-v2 flow) which is one of our most business critical ones. It’s manageable short-term from an on-call perspective but we definitely need to get the issue resolved as this is our company’s main use of Prefect, ie it’s the one that matters most to us that it’s working and reliable! Let me know what you find in the logs - more than happy to manually register the flow at any time if it helps with log watching / retrieval too. Surely must be something weird with how the flow.register process is working with this flow - definitely shouldn’t be possible to have scheduled archived flows..!
a
I promised to keep you posted and here is our update:
• the PagerDuty automation issue should be fixed now, we deployed a fix for it,
• there was nothing suspicious in the logs,
• to mitigate the issue, you can run the registration process a bit further away from the scheduled time. e.g. if your flow is scheduled to run at 4 pm, perhaps you can run this registration in the morning? This way, there is enough time to cancel those flow runs. It’s a hack, but it can mitigate the issue, especially if this flow registration is automated anyway: you can shift it a bit further away from the scheduled time.
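The mitigation in the last bullet could be enforced with a small guard in the CI job, sketched here under stated assumptions: the schedule is weekdays at 16:00 (as described earlier in the thread), times are naive datetimes in the schedule's own timezone, and the 4-hour `min_gap` is an arbitrary illustrative value. The function names are hypothetical and nothing here calls Prefect.

```python
from datetime import datetime, time, timedelta


def next_weekday_run(now: datetime, run_at: time = time(16, 0)) -> datetime:
    """Next Monday-to-Friday occurrence of `run_at` strictly after `now`."""
    candidate = datetime.combine(now.date(), run_at)
    # Skip today's slot if it has passed, and skip Saturday (5) / Sunday (6).
    while candidate <= now or candidate.weekday() >= 5:
        candidate = datetime.combine(candidate.date() + timedelta(days=1), run_at)
    return candidate


def safe_to_register(now: datetime, min_gap: timedelta = timedelta(hours=4)) -> bool:
    """True when the next scheduled run is at least `min_gap` away."""
    return next_weekday_run(now) - now >= min_gap


# Registering at 09:00 leaves 7 hours before the 16:00 run; 15:00 leaves 1.
print(safe_to_register(datetime(2022, 2, 23, 9, 0)))
print(safe_to_register(datetime(2022, 2, 23, 15, 0)))
```

A CI step could call `safe_to_register(datetime.now())` and defer registration when it returns `False`, which keeps merges flowing while leaving time for the old version's runs to be cancelled.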