
Jason

08/31/2022, 3:21 PM
Hi everyone - we're still on Prefect V1 for Q3. We had a failing heartbeat with a long-running API backfill. I did switch our configuration to 'threads' as documented here, but two backfills failed again last night (while a third smaller one succeeded). What's interesting is that the ECS Cluster is still active but the agent has stopped responding to new incoming requests, and there's likely a zombie job in Prefect due to this. The specific question I have is "Has anyone else had a problem of a new FARGATE job stopping the agent itself from responding?" I thought, based on the configuration, that those would have been on separate FARGATE clusters on ECS.

Bianca Hoch

08/31/2022, 5:10 PM
Hello Jason, just out of curiosity, what do the flow logs look like? Do the failing flows enter a state of Submitted or Scheduled?

Jason

08/31/2022, 5:16 PM
Hi Bianca - they stay within scheduled:
Last State Message
[31 August 2022 9:38am]: Flow run scheduled.
Do you know if the `latest` container is still with Prefect 1? I do see 2-latest as a tag, but I was curious if perhaps some of this is because the ECS agent was upgraded to the 2 branch whereas the submissions are still from the 1-series.
Digging into the CloudWatch logs we are seeing an error that corresponds with the long-running heartbeat failure timing:
ERROR - agent | Failed to infer default networkConfiguration, please explicitly configure using `--run-task-kwargs`
And while there is a GitHub issue for this error, we do not have the same camelCase problem as its author, and this configuration was running successfully for most of the summer.

Bianca Hoch

08/31/2022, 6:00 PM
Hi Jason, since the flows are getting stuck in a Scheduled state, I'd recommend trying out the troubleshooting steps in this article as a first step: https://discourse.prefect.io/t/why-is-my-flow-stuck-in-a-scheduled-state/73
I'll dig a little more to see if something else could be going on here.

Jason

08/31/2022, 8:24 PM
Hi Bianca - we verified that the last successful query came from 1.2.0. ECS updated the container to 1.3.0 and that seems to correspond with the new error because the cluster is flapping.
I think we have a culprit - Infra deleted an old default VPC that the agent was running on, although tasks were run on the data VPC. Is there an example of the YAML file that can be fed into --run-task-kwargs that we can use?

Bianca Hoch

08/31/2022, 9:15 PM
Hey Jason, I think the last comment that Anna made in this article may be of service to you: https://discourse.prefect.io/t/clusternotfoundexception-when-using-ecsagent/733/3
There is an example YAML file there
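For reference, a minimal sketch of what such a file might look like, assuming the agent passes these keys through to the ECS `run_task` call. The file name and the subnet and security group IDs are placeholders, not values from this thread; they would need to point at the VPC the tasks actually run in (the data VPC here).

```yaml
# Hypothetical options.yaml for --run-task-kwargs; all IDs are placeholders.
networkConfiguration:
  awsvpcConfiguration:
    # Subnets in the VPC where the flow-run tasks should launch
    subnets:
      - subnet-0123456789abcdef0
    # Security groups attached to each task's network interface
    securityGroups:
      - sg-0123456789abcdef0
    # ENABLED suits public subnets; DISABLED is typical for private
    # subnets that reach the internet through a NAT gateway
    assignPublicIp: ENABLED
```

The agent would then be started with something like `prefect agent ecs start --run-task-kwargs options.yaml`, so it no longer has to infer the network configuration from a default VPC.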

Jason

08/31/2022, 9:42 PM
Perfect, thanks so much