Monitoring/Logging question - we’ve been seeing so...
# prefect-community
Monitoring/Logging question - we’ve been seeing some odd behaviour that we’re still trying to classify, around parallell mapped tasks. We’ve rolled our own subflow semantics by taking a set of inputs [generally partitioned data files], and for numInputs we map over them and trigger a Flow with that input/file. This allows us to split out the work so we can have a bunch of Fargate nodes attack chunks of the data processing. Once these flows have been triggered, we have another mapped task that polls the flows for completion. By and large it works fine, but we see strange cases where some of the mapped tasks can do the following: • Submit and not start, so polling sits there forever waiting on it, or just fail because it never starts • Some mapped tasks, the pollers in particular, don’t actually get started for a few minutes after the others in the mapped input. For example -> mapped_lows = [flow_id_1, flow_id_2, flow_id_3] => poll_flow(mapped_flows) => poller_1, poller_2 start straightaway, poller_3 not triggered for several minutes with no indication as to why
👀 1
Hi! Your system sounds cool 🙂 Hmm I don’t have a direct answer off the top of my head for why, especially since it seems somewhat intermittent. I think we may need some more logs and potentially a reproducible example if this keeps going on, which might be better served as a github issue, so feel free to open one to continue the conversation there where we have a bit more room :) For bullet 1, I think ‘submit and not start’ is referring to a separate, second flow that was triggered in your first level map. If so a few clarifying questions: is this on Server or Prefect Cloud? What are the logs from the FlowRunner Fargate task of the newly started flow? Or do you not even see the submitted flow run in your UI or otherwise have no evidence that the new flow object has been created and is in a pending or submitted state? For the second one, is it that you don’t see the fargate task running until a few minutes later, or you don’t see the task run set to running until a few minutes later? I want to rule out if we should be investigating logs on the fargate side or the prefect side.
Hey @Laura Lorenz (she/her) Apologies for the delay! Answers below 🙂 Item 1: • Everything is on Prefect Server • It doesn’t run, so we get no logs other than “task submitted” - absolutely no idea why… Item 2: • Good question - we don’t see the Task start in the UI, so our assumption is that the Fargate Task hasn’t started, but it’s possible that it has actually started and we’re assuming it hasn’t because that’s what the UI tells us As an aside, the logs that get generated in CloudWatch under the
- the ID’s on those logs look like this
, but the
doesn’t correspond to the ID of the Flow run itself. Is there way around that? It’s making log correlation a major PITA
What is the exact log output for item 1 when you say “task submitted”? I want to track down from the logs you have where in the process we are. For example from prefect log output I would expect things like “Submitted for execution: {}” when your environment deploys a flow (which would mean the fargate task for the FlowRunner of your secondary flow has been created), or “Task {}: Starting task run…” when the taskrunner starts the tasks’ run method (which would mean the fargate task for a TaskRunner inside that flow has been created). For those at least we can find them in the UI, which may be the easiest place to track them down. Unfortunately I don’t have much experience with CloudWatch 😬 but I would expect that unless you have configured CloudWatch to pull off the logs emitted from the containers’s stdout itself, those logs are only the fargate operational logs and don’t correlate to any IDs in prefect.
Hey Laura - sorry again, crazy day. I’ll get a reproducible flow together for you tomorrow and send it over, is that ok?
No problem, yeah! And if you want to open it as a github issue then it’ll be easier for us to track and get more eyes on too potentially, feel free to link me to it if you go that route 👍 take it easy 🙂
Hey @Laura Lorenz (she/her) Back again! So I’ve reproduced the problem with a stripped down example. The sub-flow shows as “Submitted” in the UI, and that’s the state we get back when polling as well. Looking at the logs in the UI for that sub flow, this is the only entry:
Submitted for execution: Task ARN: arn:aws:ecs:us-east-1:REDACTED:task/c07abccd-7dd4-4148-a55d-d682c4662b60
That sat for about 10 minutes in Submitted before I killed it - what happened then was the retry on the triggering task caught the “Failed” state, restarted the sub flow, and then it then started and completed successfully. My current guess on this is something to due with resource checks or API limits on AWS? But having said that, each
mapped task waits for [ wait period ifor a trigger is a counter up to the number of flows multiplied by 5 seconds, so
(0…20) * 5
] between each mapped task run before it actually triggers the flow, so they should be staggered. It’s trying to create 20 instances of this flow at once, each running as a Fargate Task so it’s plausible that there’s an issue, but looking at it as Le Dumb Ass User there’s nothing to actually tell me that. I’l tidy up the sample code I have and see if I can get it ready for a GitHub issue if we can’t figure it out 🙂
Hi! Ok that is some good info. Based on the last log you see sounds like we are getting trapped after the FlowRunner Fargate task starts but before the TaskRunner Fargate task starts. It definitely could be an AWS limit (which has some credence too for being intermittent), which we probably need to track down somewhere in the Fargate logs. Is there any output in your Fargate logs that map to that task ARN you posted (which should be the FlowRunner arn?) — This might be helped by me getting a basic flow up on Fargate just so I can see where all the operational AWS logs go to in a default setup, too.
Hey! So, in true and irritating AWS style, there’s no entry of the ARN anywhere in cloudwatch, hooray for that. Not sure if there’s any other useful info I have - any suggestions?