# ask-community
a
Hey everyone, we're using a test setup very similar to the one detailed here. The only differences are that we're not using any labels for the local agents, most flows use LocalDaskExecutor, and we used docker-compose scale to run a total of 10 agents. We have observed a few issues and would like to clear them up before we decide to adopt Prefect.

We noticed several tasks are logging multiple times, in tandem with the number of running agents. When we used 3 agents, we would get 3 log messages for each task status (and for custom messages too, for that matter). When we scaled to 10 agents, we get 10 messages, as shown in the screenshot. Not sure if it's related, but I also noticed that the Docker logs seemed to show that a certain agent that only picked up a single flow nevertheless seemed to deploy every flow that had been scheduled for that 20:00 slot. So it looks like every agent is trying to pick up every flow?

I have also noticed a few cases where the server seemed to show statuses for tasks that would break their dependency chain (example below): earlier tasks killed by the Zombie Killer, but later ones marked as success. Also, many long-running tasks show as failures but still generate their proper outputs. Maybe my basic setup can't handle all the requests being thrown at the API?

For long-running tasks, it looks like the only option would be to turn off the Zombie Killer? From the source code, it appears the 10-minute heartbeat timeout is hard-coded, so if I have a task expected to run for 20 minutes, will it be forced to fail every time? I'd appreciate it if anyone can clear these up and guide me towards a proper configuration. Thanks!
k
Hey @Andre Muraro, I’ll respond more to this tomorrow
m
Hello Andre! According to the screenshots, you're creating several flow runs, and your agents are deploying/registering those flow runs with Server. Can you show a redacted version of a flow you're running? Agents and flow runs both have labels. If the flow's labels match the agent's labels, the agent can deploy the flow run; moreover, an agent can have a superset of labels. So if your agents all have the same labels, any of them can deploy your flow runs.
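For example (a rough sketch assuming Prefect 1.x; the flow name and label names here are made up), labels are attached to the flow through its run config and to the agent when it starts:

from prefect import Flow
from prefect.run_configs import LocalRun
from prefect.agent.local import LocalAgent

# This flow run can only be deployed by agents whose labels are a
# superset of ["etl"] (label name is made up for illustration).
with Flow("example-flow", run_config=LocalRun(labels=["etl"])) as flow:
    ...

# This agent's labels are a superset of the flow's, so it can deploy it.
LocalAgent(labels=["etl", "prod"]).start()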
k
All agents poll for any flow runs they can pick up. In this case, you likely only need 1 agent on the machine: 1 agent is capable of deploying all of the flows. Having multiple agents that can pick up the same flows can lead to these race conditions (they are handled better in Cloud). In the UI, does it only show up as 1 agent? If you do keep multiple agents, labels could help specify where each flow goes, as Mariia is suggesting. The broken dependencies are a side effect of this.
a
Hello @Mariia Kerimova, from what I gather, these were all actual runs that ran concurrently at that scheduled time; they were instances of different flows that run hourly. What confuses me is that agent a5d8 seemed to deploy all of the flows scheduled for 20:00 GMT, but only ever executed "discreet-ermine", according to the API query. The other flow runs were actually run by other agents. Like I said, I'm not sure if that's related to the log duplication, just thought I would mention it. My intended design is that any agent should be capable of running any flow, which is why I haven't used labels. I assumed all the agents would divide the workload among themselves?
k
The agent is not responsible for running the Flow itself; it just kicks off the flow run as a separate process, so the agent is really just a lightweight polling process. Or do your agents live on different machines?
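To illustrate (a minimal sketch; the agent name is made up), whether you start it from the CLI or from Python, the agent is just that small polling loop:

from prefect.agent.local import LocalAgent

# A single lightweight process: it polls for scheduled flow runs and
# spawns each one as a separate subprocess, rather than running tasks itself.
LocalAgent(name="single-local-agent").start()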
a
@Kevin Kho Yes, in the UI it shows only one agent having run that specific flow. But that flow, specifically "Importar Ordens Typeform", is usually very fast (less than 1 minute). However, when run at the hourly schedule together with other long-running flows, it starts to take very long and sometimes gets killed by the Zombie Killer, even in cases where the agent that ran it did not take any other flow.
@Kevin Kho For the moment, all agents live on one machine. So you think a possible solution would involve giving each flow a label that matches one agent only?
Also, I don't think a single agent would be enough. My long-running flows (about 9 of them) can take upwards of 20-30 minutes each right now, before I can optimize them. If I stacked them sequentially they would take more than an hour, causing them to accumulate, since I need to run them hourly.
FYI, all my flows follow this pattern for the moment:
import prefect
import pendulum as pn  # assuming pn refers to pendulum
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.schedules import Schedule
from prefect.schedules.clocks import CronClock

# teams_handler and tz are defined elsewhere in this project
with Flow('Importar Ordens Typeform', executor=LocalDaskExecutor(), state_handlers=[teams_handler]) as flow:
    ...  # Several tasks here

if prefect.config.debug == False:
    flow.schedule = Schedule(clocks=[
        CronClock("*/15 7-13,16-20 * * 1-5", start_date=pn.now(tz)),
        CronClock("*/5 13-16 * * 1-5", start_date=pn.now(tz)),
    ])
if prefect.config.environment == 'dev':
    flow.visualize(filename='prototyping/boletas_mov')
    flow.run(parameters=dict(since='2021-07-26', until='2021-07-27'))
I deploy them to prod via GitLab, where the script calls the CLI to register.
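(For reference, that registration step is roughly equivalent to the following in Python; the project name is made up and the Server API has to be reachable.)

from prefect import Flow

with Flow("Importar Ordens Typeform") as flow:
    ...

# Rough equivalent of the CLI registration call (hypothetical project name).
flow.register(project_name="orders-etl")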
k
The agent does not run anything. It just deploys the flow run and then continues to listen for future flow runs to deploy, so you get the same effect whether it's 3 agents or 10 agents, since the Flow starts in a different process (although here the effect is the duplicate flow runs). Multiple local agents on the same machine do not give you any additional compute power or resources; the resources are constrained by the machine the agents live on, and one agent will be able to utilize all of them. An agent is not "busy" while a deployed flow is running. The solution is really to reduce to one agent. Labels only make sense if you have multiple agents on different machines, so that you can dictate which flows get sent to which machines.

The Zombie Killer is just reporting problems that happen. I think what is going on here is that you use a LocalDaskExecutor that presumably tries to occupy all of the cores available. When other concurrent Flows run, they won't be able to get more resources, since the previously running flows are already occupying the cores of the machine. In this case, I think the Zombie Killer is being activated because of out-of-memory issues.
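If that contention does turn out to be the cause, one lever is to cap how much of the machine each flow run's LocalDaskExecutor grabs (a rough sketch; the scheduler choice and worker count here are arbitrary, not a recommendation from this thread):

from prefect import Flow
from prefect.executors import LocalDaskExecutor

# Rough sketch: cap each flow run at 4 worker threads instead of letting
# the local Dask scheduler try to use every core on the machine.
with Flow(
    "Importar Ordens Typeform",
    executor=LocalDaskExecutor(scheduler="threads", num_workers=4),
) as flow:
    ...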
a
@Kevin Kho Ah I understand it now. Makes a lot of sense. I'll do that and report later on my results, thanks a lot!
👍 1
So yeah, I've been observing for about a day now, and it seems all my problems really came from that single misconception about how an agent works. Flow success rate went from 85% to 100%, logs are much better now, long running tasks are behaving. Nice going!
👍 2
k
Glad to hear!