
Rajvir Jhawar

11/17/2022, 11:57 AM
The ECS task block in my Prefect Cloud is experiencing some strange sizing issues. The rest of the blocks look correct; it is only the ECS block that has the issue. I am on the latest version, Prefect 2.6.7.

Jeff Hale

11/19/2022, 4:53 PM
Thank you, Rajvir. 2.6.8 is out now - maybe upgrading will fix the issue. If you’re still seeing that behavior, could you please fill out a bug report issue with the details?

Rajvir Jhawar

11/20/2022, 12:57 AM
It is a bigger problem than that: my setup has stopped working. All my flow runs are stuck in the "Pending" state, even though the started pods in the cluster indicate successful startup. The agent is able to successfully connect to Prefect Cloud. I updated to 2.6.8 and made sure no other agents were connecting to my Prefect Cloud instance. I even deleted my workspace and the issue persists; it seems like I need my account reset to fix the issue. I am paying for extra users, so who should I reach out to at Prefect to solve this ASAP?

Jeff Hale

11/20/2022, 2:53 AM
Hi Rajvir. Sorry to hear that. You shouldn't need to delete any workspaces. I am not sure what the cause of your issue is. You could downgrade to 2.6.7 if everything was working fine other than the UI issue; that is what I would do. If you have an account-related issue, you could email support@prefect.io

Anna Geller

11/20/2022, 4:38 AM
Could you provide more details? Runs stuck in a pending state indicate that the agent likely doesn't have enough capacity to deploy flow runs (resource starvation). Can you try redeploying your agent to an ECS task with more CPU and memory allocation?
Alternatively, add a concurrency limit to your work queue so that not all runs are deployed at the same time to the agent polling for runs from that work queue
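For reference, a minimal sketch of that second suggestion using the Prefect 2.x CLI; the queue name "my-ecs-queue" is a placeholder, and the commands act against your live workspace:

```shell
# Hypothetical queue name; substitute your own.
# Cap how many runs the agent polling this queue will deploy at once:
prefect work-queue set-concurrency-limit "my-ecs-queue" 2

# Check the queue's current settings, including the limit:
prefect work-queue inspect "my-ecs-queue"
```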

Rajvir Jhawar

11/20/2022, 4:58 AM
@Anna Geller I am running an EKS Fargate cluster:
1. I only had 1 flow run and 1 agent and the issue still occurred. At first I was getting a “Pod never started” error, so I upped the pod timeout from 60 seconds to about 400 seconds, which solved that issue, but then I got another error and the flow run continued to be stuck in a pending state.
2. Even old flows that were previously running have the same issue.
3. I confirmed that the agent running on the cluster can talk to Prefect Cloud.
4. I also double-checked the Kubernetes roles for Prefect, and they are the default ones that Prefect provides.
Not really scientific, but this all started occurring around the time I registered an ECS task block. The plan was to run a hybrid setup, EKS plus ECS, in different regions.

Anna Geller

11/20/2022, 3:40 PM
If you are on Kubernetes, why not use KubernetesJob blocks instead?

Rajvir Jhawar

11/21/2022, 11:00 PM
@Anna Geller @Jeff Hale I resolved the issue by upping the job watch timeout.
The default job watch timeout is really short at 5 seconds - shouldn't this be bumped up?
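For anyone landing here, a minimal sketch of raising those timeouts on a KubernetesJob infrastructure block, assuming Prefect 2.x is installed and you are logged in to your workspace; the block name "k8s-fargate" is a placeholder:

```python
from prefect.infrastructure import KubernetesJob

# EKS Fargate pods can take 2+ minutes to be scheduled, so raise both
# watch timeouts well above the defaults discussed in this thread.
job = KubernetesJob(
    pod_watch_timeout_seconds=400,
    job_watch_timeout_seconds=400,
)

# Persist the block so deployments can reference it by name.
job.save("k8s-fargate", overwrite=True)  # hypothetical block name
```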

Jeff Hale

11/21/2022, 11:10 PM
Great to hear it's working. Thank you for the feedback. I'm not sure if we've had that feedback before, and will try to see if there have been other reports of that issue.

Rajvir Jhawar

11/21/2022, 11:15 PM
It would be good to add a comment in the docs to let users know that if you're having issues with Kubernetes jobs, try upping the pod/job watch timeouts. Even the default pod watch timeout of 60 seconds is really short, especially for an EKS Fargate cluster (which can take 2+ minutes to start up). Here's one of the places where I encountered the same issue: https://prefect-community.slack.com/archives/CL09KU1K7/p1663263241548969
@Jeff Hale I'm still running into a weird issue with one of my flows: for some reason the agent says that it is in the running state, but the UI never updates. Is there any kind of extra debug information I can look into?

Jeff Hale

11/21/2022, 11:54 PM
Yes. You can change the logging level to debug. https://docs.prefect.io/concepts/logs/#logging-configuration
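As a concrete sketch of that suggestion (the queue name "my-queue" is a placeholder):

```shell
# Persist the debug level in the agent's active Prefect profile:
prefect config set PREFECT_LOGGING_LEVEL=DEBUG

# ...or set it as an environment variable for a single run of the agent,
# then restart the agent so the new level takes effect:
PREFECT_LOGGING_LEVEL=DEBUG prefect agent start -q "my-queue"
```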

Anna Geller

11/22/2022, 12:07 AM
however the UI never updates.
can you try a hard refresh? it must be a latency or caching issue

Rajvir Jhawar

11/22/2022, 12:23 AM
@Anna Geller I tried a hard refresh and the flow is still stuck at pending. I will have to dig a little deeper to find out what is going on.
I do get this error, but I am not sure what causes it

Anna Geller

11/22/2022, 12:24 AM
Stuck in pending indicates an issue with the agent, likely not enough resources on the agent infra

Rajvir Jhawar

11/22/2022, 12:25 AM
Typically, how much should I be giving the agent in terms of CPU and memory?
I just set it to the minimum that Fargate provides: 0.25 vCPU and 512 MB of RAM

Anna Geller

11/22/2022, 12:37 AM
Be a little more generous for a production agent - I'd pick 4096 CPU units and 8192 MB of memory
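For an ECS/Fargate agent, that sizing corresponds to a task definition fragment like the following (a sketch showing only the relevant fields; 4096 CPU units with 8192 MB is a valid Fargate combination):

```json
{
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "4096",
  "memory": "8192"
}
```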

Rajvir Jhawar

11/22/2022, 12:59 AM
I bumped up the memory and CPU and still get the same error. I don't think it is because of the resources: I have an HPA set to monitor the CPU and memory on the agent pod, which spawns a new agent to take the burden off the first one. That HPA has not kicked in, so the agent is not being stressed.

Anna Geller

11/22/2022, 2:22 AM
Can you check the agent logs? Did you enable debug logs on the agent to get more info?
I'm confused whether you run the agent on ECS or EKS in the end