# ask-community
j
hi guys, we had a docker agent fail over the weekend, so there were like 100 late scheduled runs. as soon as i restarted the agent, it immediately queued and started all 100 runs. these are pretty expensive tasks, so firing them all up at once immediately ground the server to a halt, and i couldn't ssh in to try killing things. so my questions coming out of this are:
• How do you configure the scheduled services to just give up if they get missed? We run the task every hour, so it's fine to just run it the next hour instead. Having it give up after, say, 30 minutes of waiting would be very elegant, but just failing immediately is perfectly fine.
• How do you mass-cancel scheduled jobs that have missed their time?
• How do you mass-cancel running jobs? I manually clicked open every job and pressed cancel until the system started responding enough that I could ssh in and start killing things via the docker cli. With how unresponsive the UI is in general, this was pretty painful.
• Is there some way to kill the agent to stop everything? The idea is some kind of breaker switch: prefect just started going crazy on this server, shut it all down so we're back to a working state, and then we can sort it out afterwards.
• After all the jobs had been marked as Cancelled or Failed, there were still containers running on the server. Are these just all the Failed containers that lost their heartbeat? Is there a nicer way to clean them up, or are Failed containers just abandoned by Prefect at that point, and the answer is to go in and manually remove them like I did? Does that mean any Failed run should always be manually reviewed for hanging containers that need to be stopped and removed?
• How do you configure a limit for the agent, so it can't get bad enough to make the server inaccessible to repair? I know based on tags that this agent is only going to run jobs of a certain type, so I could set the limit to, say, 5, and it'd just have 5 running and 95 queued, and then I'd just need to mass-cancel those queued jobs.
n
Hi @Jonathan Chu - you've got a lot of questions so I'll tackle them one at a time.
SLAs: If you're on a standard Cloud plan (or above), you can configure a Flow Run SLA automation for any number of your flows (or all of them!). This SLA will state something like "If any run from <<flow>> does not start after <<some time>>, <<do something>>". The "do something" could be anything, from notifications via a number of services to cancelling the run that triggered the event. Runs cancelled this way will end up in a "Cancelled" state with a log to the effect of "failed SLA check". Something that might also be helpful here would be to set up an Agent SLA to notify you if one of your agents goes down; it works in a very similar fashion and could go a long way toward mitigating what you're describing.
Mass cancellation: This is something that we'll be releasing quite soon - for now, you can use the GraphQL API to query for runs that are in a "Scheduled" state and have missed their `scheduled_start_time` by any amount you wish. You can also use the "Clear" button on the Upcoming Runs tile in the UI to remove all late runs for all your flows (from the dashboard) or for a specific flow (from the flow page).
System halt: Another one we're releasing very soon - right now you can also use the GraphQL API to query for runs that fit your criteria and call the `cancel_flow_run` mutation on those.
Still-running containers: The answer to this is trickier - it depends heavily on what infrastructure your flows are running on, why something failed, and, to a lesser degree, which version of Prefect Core you're on. Calling the `cancel_flow_run` mutation should do a pretty good job of cleaning up resources, but there are circumstances where Prefect doesn't have access to, or knowledge of, jobs that have been created (this is true when using Dask).
Agent limits and actions: Another thing we're working on (there's a WIP issue for agent limitations here) - we'd like to introduce both limits and agent-by-agent cancellation in the near future; for now, you can use flow concurrency limits (which work on labels) and task concurrency limits (which work on task tags) to cap the number of things that can run with a given label or tag at a time.
Hopefully that cleared up some of your questions but let me know if that's not the case.
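As a rough sketch of what a pre-canned version of that cleanup script could look like (assuming a Prefect 1.x `prefect.Client` that's already authenticated against Cloud/Server; exact schema fields can vary slightly between versions, and the 30-minute cutoff is just an example):

```python
from datetime import datetime, timedelta, timezone

import prefect

client = prefect.Client()

# Treat anything still "Scheduled" whose scheduled_start_time is more than
# 30 minutes in the past as a missed run.
cutoff = (datetime.now(timezone.utc) - timedelta(minutes=30)).isoformat()

late_runs = client.graphql(
    """
    query($cutoff: timestamptz) {
      flow_run(
        where: {
          state: {_eq: "Scheduled"},
          scheduled_start_time: {_lt: $cutoff}
        }
      ) {
        id
        name
        scheduled_start_time
      }
    }
    """,
    variables={"cutoff": cutoff},
).data.flow_run

for run in late_runs:
    print(f"Cancelling late run {run.name} ({run.id})")
    client.cancel_flow_run(run.id)  # wraps the cancel_flow_run mutation mentioned above
```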
j
nicholas, thank you very much for the detailed answers to all of my questions
could you link me to the documentation for the Flow Run SLA automation? i'm having some trouble finding it
graphQL api makes sense, i assumed that there would be some way to perform the actions through it, it just wasn't something i was looking to invest in in the heat of an outage. Probably then I should work on making some pre-canned scripts that i can use when needed
n
Ahhh, that's because the SLA docs are still being written; you can find the SLA "wizard" on the Automations tab of the UI; that'll let you create all the SLAs I mentioned above. We'll get those docs out as soon as possible.
If you can set up some SLAs for now, I hope to have a lot of the cancellation, work queue halting, and late run removal work out by the end of this week or next; that could save you a bunch of time learning GraphQL/writing scripts. The UI portion of that work is here, the API portion either already exists through the mutations I described above or will be released soon :)
j
is the LocalExecutor single-threaded, as in it'll only run one task at a time? could i use that to run my docker containers, so that the number of agents i run directly determines the maximum number of tasks that will run at once on the server?
n
Hm, yes, the LocalExecutor runs your tasks sequentially, inside the single flow-run process the agent spawns. There may be some nuance here that would achieve what you're thinking, but I would instead recommend you use labels to strictly limit the number of flows or tasks that can run at any given time. This is typically a better choice because 1) you can increase/decrease your concurrency in a moment from the UI or API (including setting your concurrency to 0 to stop work from being submitted) and 2) you have more transparency into why work is or isn't being submitted
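As a rough sketch of that label-based setup (assuming Prefect 0.14+; the flow name, task tag, and labels below are placeholders, and the actual limits are configured against those labels/tags in the Cloud UI or API):

```python
from prefect import Flow, task
from prefect.executors import LocalExecutor
from prefect.run_configs import DockerRun

# Task-concurrency limits key off tags: a limit on the "heavy-query" tag caps
# how many tasks carrying that tag run at once, across all flow runs.
@task(tags=["heavy-query"])
def expensive_step():
    ...

# Flow-concurrency limits key off labels: a limit of, say, 5 on the
# "my-docker-agent" label caps how many flow runs with that label execute at
# once, so a backlog of late runs stays queued instead of all starting together.
with Flow(
    "hourly-job",
    executor=LocalExecutor(),  # tasks within one flow run execute one at a time
    run_config=DockerRun(labels=["my-docker-agent"]),
) as flow:
    expensive_step()
```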
j
hmm, the tags seem a bit awkward for how i've set things up, or maybe it's because my choices aren't aligned with how prefect wants me to do things. i've got my agents tagged as `growers-data, staging` and `growers-data, production`. should each flow be tagged more like `growers-data-staging`, and then search/filtering should be done with wildcards?
n
It depends - if your current labels make sense for your system of agents, you might consider adding some purely infra-related labels instead. I have a slightly odd system that works for how I think about my pipelines: flows belong to streams (`s`) and tasks belong to tributaries (`t`), and I assign labels based on the load or rate I need and its priority. So I've used labels like `s-1`, `s-2`, `s-3` and `t-1`, `t-2`, `t-3`, where `s-*` dictates some workflow load or concurrency requirement, and `t-*` is usually governed by what my infrastructure or rate-limiting can handle (imposed by an external API or limited DB connections). It's not particularly elegant, but some combination of priority/load is what I've found to be the most adaptable
Another benefit of a system like that is that you can create ad hoc runs (or even entire schedules) that run on different priority queues or are governed by different limits
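For example, an ad hoc run of an already-registered flow could be kicked onto a lower-priority label like this (a sketch only: the flow name "hourly-etl" and the `s-3` label are made up, and it assumes a 1.x Client whose `create_flow_run` accepts a `labels` override; if yours doesn't, the same override can be passed through the underlying GraphQL mutation):

```python
import prefect

client = prefect.Client()

# Look up the current (non-archived) version of the flow by name.
flow_id = client.graphql(
    """
    query {
      flow(where: {name: {_eq: "hourly-etl"}, archived: {_eq: false}}) { id }
    }
    """
).data.flow[0].id

# Create an ad hoc run with different labels, so it lands on a different
# priority queue / concurrency limit than the flow's scheduled runs.
run_id = client.create_flow_run(flow_id=flow_id, labels=["s-3"])
print(f"Created ad hoc run {run_id} on the low-priority s-3 label")
```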
j
thanks, i'll give some thought to infra-specific labels for priority and load that should limit the upper bound of trouble i can get into, and the SLA automation should address my original root cause
also, noticed on your advisor board you have Chris and Joy, they're great, i overlapped with them at WePay
x
Hi @nicholas, I'm interested in your label system. So you literally have labels like `s-1` and `s-2`? Doesn't this mean that of all flows labeled `s-1`, only one can run simultaneously? Or is there a way to say "of each flow, only one can run simultaneously"?
n
It's grown a bit more complicated now but that's how it started!
But yes, you're correct, that's what it means, so you'd need to use a more individual label to limit a single flow
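For example, a per-flow limit could come from giving the flow a label that is unique to it (a sketch, assuming Prefect 1.x run configs; the flow name "hourly-etl" and the `hourly-etl-solo` label are made up, and the flow-concurrency limit of 1 on that label would be set in the Cloud UI or API):

```python
from prefect import Flow, task
from prefect.run_configs import UniversalRun

@task
def do_work():
    pass

# "s-1" carries the shared priority/load limit; "hourly-etl-solo" is unique to
# this flow, so a flow-concurrency limit of 1 on that label caps this flow at
# one run at a time without affecting other "s-1" flows. Note that agents only
# pick up runs whose labels are a subset of the agent's labels, so the agent
# needs the extra label too.
with Flow(
    "hourly-etl",
    run_config=UniversalRun(labels=["s-1", "hourly-etl-solo"]),
) as flow:
    do_work()
```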