# ask-community
a
Hey everyone, we're using a test setup very similar to the one detailed here. The only differences are that we're not using any labels for the local agents, most flows use LocalDaskExecutor, and we used docker-compose scale to run a total of 10 agents. We have observed a few issues and would like to clear them up before we decide to adopt Prefect.

We noticed several tasks are logging multiple times, in tandem with the number of running agents. When we used 3 agents, we would get 3 log messages for each task status (and for custom messages too, for that matter). When we scaled to 10 agents, we get 10 messages, as shown in the screenshot. Not sure if it's related, but I also noticed that the Docker logs seemed to show that a certain agent that only picked up a single flow nevertheless seemed to deploy every flow that had been scheduled for that 20:00 slot. So it looks like every agent is trying to pick up every flow?

I have also noticed a few cases where the server seemed to show statuses for tasks that would break their dependency chain (example below): earlier tasks killed by the Zombie Killer, but later ones marked as success. Also, many long-running tasks show as failures but still generate their proper outputs. Maybe my basic setup can't handle all the requests being thrown at the API?

For long-running tasks, it looks like the only option would be to turn off the Zombie Killer? From the source code, it appears the 10-minute heartbeat timeout is hard-coded, so if I have a task expected to run for 20 minutes, will it be forced to fail every time? I'd appreciate it if anyone can clear these up and guide me towards a proper configuration. Thanks!
k
Hey @Andre Muraro, I’ll respond more to this tomorrow
m
Hello Andre! According to the screenshots, you're creating several flow runs, and your agents are deploying/registering those flow runs with Server. Can you show a redacted version of a flow you're running? Agents and flow runs both have labels. If the flow's labels match the agent's labels, the agent can deploy the flow run; moreover, an agent can have a superset of labels. So if your agents all have the same labels, any of them can deploy your flow runs.
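For example (a rough sketch assuming Prefect 1.x; the flow name and label names here are made up), labels are attached to the flow through its run config and to the agent when it starts:

from prefect import Flow
from prefect.run_configs import LocalRun
from prefect.agent.local import LocalAgent

# This flow run can only be deployed by agents whose labels are a
# superset of ["etl"] (label name is made up for illustration).
with Flow("example-flow", run_config=LocalRun(labels=["etl"])) as flow:
    ...

# This agent's labels are a superset of the flow's, so it can deploy it.
LocalAgent(labels=["etl", "prod"]).start()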
k
All agents poll for any flow runs they can pick up. In this case, you likely only need 1 agent on the machine: 1 agent is capable of deploying all of the flows. Having multiple agents that can pick up the same flows can lead to these race conditions (they are handled better in Cloud). In the UI, does it only show up as 1 agent? If you do keep multiple agents, labels could help specify where each flow goes, as Mariia is suggesting. The broken dependencies are a side effect of this.
a
Hello @Mariia Kerimova, from what I gather, these were all actual runs that ran concurrently at that scheduled time; they were instances of different flows that run hourly. What confuses me is that agent a5d8 seemed to deploy all of the flows scheduled for 20:00 GMT, but only ever executed "discreet-ermine", according to the API query. The other flow runs were actually run by other agents. Like I said, I'm not sure if that's related to the log duplication, just thought I would mention it. My intended design is that any agent should be capable of running any flow, which is why I haven't used labels. I assumed all the agents would divide the workload among themselves?
k
The agent is not responsible for running the Flow itself; it just kicks off the flow run as a separate process, so the agent is really just a lightweight polling process. Or do your agents live on different machines?
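To illustrate (a minimal sketch; the agent name is made up), whether you start it from the CLI or from Python, the agent is just that small polling loop:

from prefect.agent.local import LocalAgent

# A single lightweight process: it polls for scheduled flow runs and
# spawns each one as a separate subprocess, rather than running tasks itself.
LocalAgent(name="single-local-agent").start()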
a
@Kevin Kho Yes, in the UI it shows only one agent having run that specific flow. But that flow, specifically "Importar Ordens Typeform", is usually very fast (less than 1 minute). However, when run at the hourly schedule together with other long-running flows, it starts to take very long and sometimes gets killed by the Zombie Killer, even in cases where the agent that ran it did not take any other flow.
@Kevin Kho For the moment, all agents live on one machine. So you think a possible solution would involve giving each flow a label that matches one agent only?
Also, I don't think a single agent would be enough. My long-running flows (about 9 of them) can take upwards of 20-30 minutes each right now, before I can optimize them. If I stacked them sequentially they would take more than an hour, causing them to accumulate, since I need to run them hourly.
FYI, all my flows follow this pattern for the moment:
import prefect
import pendulum as pn  # assuming pn refers to pendulum
from prefect import Flow
from prefect.executors import LocalDaskExecutor
from prefect.schedules import Schedule
from prefect.schedules.clocks import CronClock

# teams_handler and tz are defined elsewhere in this project
with Flow('Importar Ordens Typeform', executor=LocalDaskExecutor(), state_handlers=[teams_handler]) as flow:
    ...  # Several tasks here

if prefect.config.debug == False:
    flow.schedule = Schedule(clocks=[
        CronClock("*/15 7-13,16-20 * * 1-5", start_date=pn.now(tz)),
        CronClock("*/5 13-16 * * 1-5", start_date=pn.now(tz)),
    ])
if prefect.config.environment == 'dev':
    flow.visualize(filename='prototyping/boletas_mov')
    flow.run(parameters=dict(since='2021-07-26', until='2021-07-27'))
I deploy them to prod via GitLab, where the script calls the CLI to register.
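(For reference, that registration step is roughly equivalent to the following in Python; the project name is made up and the Server API has to be reachable.)

from prefect import Flow

with Flow("Importar Ordens Typeform") as flow:
    ...

# Rough equivalent of the CLI registration call (hypothetical project name).
flow.register(project_name="orders-etl")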
k
The agent does not run anything. It just deploys the flow run and then continues to listen for future flow runs to deploy, so you get the same effect whether it's 3 agents or 10 agents, since the Flow starts in a different process (although here the effect is the duplicate flow runs). Multiple local agents on the same machine do not give you any additional compute power or resources; the resources are constrained by the machine the agents live on, and one agent will be able to utilize all of them. An agent is not "busy" while a deployed flow is running. The solution is really to reduce to one agent. Labels only make sense if you have multiple agents on different machines, so that you can dictate which flows get sent to which machines.

The Zombie Killer is just reporting problems that happen. I think what is going on here is that you use a LocalDaskExecutor that presumably tries to occupy all of the cores available. When other concurrent Flows run, they won't be able to get more resources, since the previously running flows are already occupying the cores of the machine. In this case, I think the Zombie Killer is being activated because of out-of-memory issues.
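If that contention does turn out to be the cause, one lever is to cap how much of the machine each flow run's LocalDaskExecutor grabs (a rough sketch; the scheduler choice and worker count here are arbitrary, not a recommendation from this thread):

from prefect import Flow
from prefect.executors import LocalDaskExecutor

# Rough sketch: cap each flow run at 4 worker threads instead of letting
# the local Dask scheduler try to use every core on the machine.
with Flow(
    "Importar Ordens Typeform",
    executor=LocalDaskExecutor(scheduler="threads", num_workers=4),
) as flow:
    ...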
a
@Kevin Kho Ah I understand it now. Makes a lot of sense. I'll do that and report later on my results, thanks a lot!
👍 1
So yeah, I've been observing for about a day now, and it seems all my problems really came from that single misconception about how an agent works. Flow success rate went from 85% to 100%, logs are much better now, long running tasks are behaving. Nice going!
👍 2
k
Glad to hear!