Good Morning - I am experiencing an outage on my e...
# ask-community
r
Good Morning - I am experiencing an outage on my end w/ self-hosted agents. Something is not allowing my agents to pick up flows - is anyone able to help me? Are ports 8080 and 4200 the firewall rules I should check? Not sure where to begin.
k
Are your agents visible in the UI? When you check the Agents tab?
r
yes
my agents look like they are responding to queries; however, the agents haven't picked up any flows since Friday around 6:30 pm CT
k
So you have flows stuck in the Scheduled state?
r
the flows are all late
k
So it picked up flows before and then just stopped all of a sudden? How many total late flow runs do you have?
r
yes - all of a sudden stopped working
was there something in the last release that could have caused our agents to stop working?
k
Not really, but at 725 total late flows, I think the scheduler will stop working. Could you try deleting some flow runs to go down to like 500?
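A minimal sketch of how one might choose which late runs to clear so that roughly 500 remain. It only does the selection; the actual deletion would go through the Prefect UI or API and is not shown. The record shape (`id`, `state`, `scheduled_start_time`) and the helper name `late_runs_to_delete` are assumptions made for illustration, not Prefect's own schema.

```python
from datetime import datetime, timedelta, timezone


def late_runs_to_delete(runs, now, keep=500):
    """Return IDs of the oldest late runs so that at most `keep` late runs remain.

    `runs` is a list of dicts with hypothetical keys:
    "id", "state", and "scheduled_start_time" (timezone-aware datetime).
    """
    # A run counts as late if it is still Scheduled past its start time.
    late = [
        r for r in runs
        if r["state"] == "Scheduled" and r["scheduled_start_time"] < now
    ]
    late.sort(key=lambda r: r["scheduled_start_time"])  # oldest first
    excess = max(0, len(late) - keep)
    return [r["id"] for r in late[:excess]]


# Example with made-up data: 725 late runs, one per minute going back in time.
now = datetime(2021, 10, 25, 13, 31, tzinfo=timezone.utc)
runs = [
    {
        "id": f"run-{i}",
        "state": "Scheduled",
        "scheduled_start_time": now - timedelta(minutes=i + 1),
    }
    for i in range(725)
]
to_delete = late_runs_to_delete(runs, now)
print(len(to_delete))
```

Deleting the oldest runs first is just one reasonable choice here; depending on the flows, it may make more sense to drop the newest ones or to filter by flow.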
a
@Richard Hughes did you upgrade Prefect version on your agents? if so, from which to which version did you migrate? Also: can you confirm that labels are matching between flows and agents? Normally, if flow runs are stuck in Scheduled State (causing late runs) it’s usually because of some label misconfiguration
r
I have not upgraded my agents - I was wondering if you have upgraded the cloud service in a way that could have caused my on-premise agent to stop picking up flows
a
Do you use Prefect Server or Prefect Cloud?
r
cloud
a
@Richard Hughes can you access agent logs? If so, can you see anything suspicious there? Alternatively, could you start a new agent by including the --show-flow-logs option so that we can see all logs on the agent for debugging?
r
yes i can see logs
2021-10-25 13:31:07.323: [2021-10-25 13:31:07,323] INFO - agent | Waiting for flow runs...
a
Can you show what labels are configured on that agent vs. what labels are configured on the flows stuck in a Scheduled state? Are those labels matching 100%?
since you mentioned you use Prefect Cloud, could you check whether all your API keys are still valid? if some API key expired, then flows get stuck in a Scheduled state as well.
r
give me just a minute
looking into some of these items
api tokens - show deprecated - we are using these and they show no expiration
a
don’t worry about the deprecation, it’s just a warning for now. But if you happen to restart any agents, it would be great to switch to using API keys instead of RUNNER tokens. There’s more about this in this blog
r
i started a new agent w/ an extra label - but both flows and agents have the same labels, for example: "PROD" vs. "PROD"
a
how do you start your agent? Unless you include the --no-hostname-label flag, the agent also gets a hostname as a label, and it’s likely that this label is missing on your flow. So your flow should have both PROD and the hostname label attached, unless you disabled this default label.
r
was this a recent change?
our flows have the hostname labels
a
afaik this has been the default behavior in local agents for a while, so it’s not a recent change
r
prefect agent start -t "{API_TOKEN}" -l "PROD"
a
Thanks! And what are the labels on your flow?
r
the hostname
a
it should be both: the hostname and “PROD”
because your agent has both labels, so the flow should also have the same labels
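A rough sketch of the label check, in case it helps with debugging. It assumes the Prefect 1.x rule as I understand it from the docs: an agent picks up a flow run when the run's labels are a subset of the agent's labels. If that holds, a flow carrying only the hostname label would still match an agent labeled with both "PROD" and the hostname, so it is worth verifying against your own setup. The helper names and the hostname "prod-box-01" are hypothetical, and nothing here calls Prefect itself.

```python
import socket


def effective_agent_labels(labels, no_hostname_label=False, hostname=None):
    """Labels a local agent would end up with (assumed behavior).

    Unless the hostname label is disabled, the machine's hostname is
    added to the labels passed on the command line.
    """
    result = set(labels)
    if not no_hostname_label:
        result.add(hostname or socket.gethostname())
    return result


def agent_can_pick_up(flow_labels, agent_labels):
    """True if every flow label is also present on the agent (subset rule)."""
    return set(flow_labels) <= set(agent_labels)


# Hypothetical agent started with -l "PROD" on a host named "prod-box-01".
agent = effective_agent_labels(["PROD"], hostname="prod-box-01")

print(agent_can_pick_up(["prod-box-01"], agent))          # flow with hostname only
print(agent_can_pick_up(["PROD", "prod-box-01"], agent))  # flow with both labels
print(agent_can_pick_up(["PROD", "other-host"], agent))   # flow bound to another host
```

Comparing the two label sets printed side by side is usually the quickest way to rule labels in or out as the cause.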
r
we don't have "PROD" label on the flows
a
I think this might be the issue
r
we have been running this way for a couple of years
a
could you try adding the hostname label to your flow and see whether this solves the issue?
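from prefect import Flow
from prefect.run_configs import LocalRun

# FLOW_NAME and STORAGE are placeholders for your own flow name and storage object
with Flow(
    FLOW_NAME,
    storage=STORAGE,
    run_config=LocalRun(labels=["PROD", "your-hostname"]),
) as flow:
    ...  # add your tasks here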
r
we source control our flows and pipelines deploy to all flows from the agent machines
we haven't changed anything on our side
a
Sorry to hear that you have an issue with this. As far as I can tell, in the current Prefect version when I don’t disable the hostname label, and my flow doesn’t have this hostname label attached, then it won’t be picked up by the agent. Can you share which Prefect version you were using so that I can cross check if anything changed since this version?
and I would encourage you to give it a try with the exact label configuration just to see whether it helps
r
something just happened - we are running flows all of a sudden
525 flows just kicked off all of a sudden
all we did was clear the late runs
k
Oh ok so it was not a labelling issue. Yeah we didn’t change anything on that front. Just in 0.14.21 .
r
i think we are still on 13.9 or somewhere back - we need to upgrade - it requires us to adjust a template
k
Oh no worries about that. The 725 limit is across all versions anyway so it wouldn’t have helped in this scenario
r
I've cleared all the late flows and stopped all the flows - still have something weird going on
k
Is it the same issue where stuff is not being picked up?
r
it seems like it was a bug in the concurrency limit - it said there were 45 flows running, but none were running - removed the limit and re-added it and now we are running
a
Nice work! And thanks for letting us know what the issue was