# prefect-cloud
leonardoperazzini
Hello ... since the last Prefect update I'm receiving errors on different types of flows that have no correlation with each other ... They get stuck in the Pending state, and after N hours they are cancelled (because I have an automation to cancel flows that stay pending too long). It never happened before ... I've had this issue for 1 month. It's very unusual, but about 1 flow in 100 is affected 😕 I'm using EC2 with the agent inside and the infrastructure as a process, a very simple setup, but the queue doesn't pull the flow to the infrastructure ... And I'm on the paid Prefect Cloud plan. @Nate or someone else can help me?
Nate
hi @leonardoperazzini - it's hard to know what you mean without some logs / a stack trace. can you give more info?
leonardoperazzini
@Nate Yes, I understand. But Prefect just doesn't show me a log in that case, neither in Prefect Cloud nor on the agent. The flow run remains Pending... and is never picked up from the queue by the agent, while the next scheduled runs of the same flow work normally. https://app.prefect.cloud/account/dc1ed6a7-a0e5-4127-8ff1-57045f7ef23a/workspace/dd2e34bf-cd45-4f14-82d5-f5a0c0d361b9/flow-runs/flow-run/1c1f094a-d2e3-44ce-9b55-1b0f4306e271 I already sent an email to Prefect's support address, but I'm trying here too.
Nate
do you have agent logs from your EC2?
leonardoperazzini
I can try to save the logs ... I'm not saving them at the moment, because I start the agent as a subprocess inside tmux with `prefect agent start -q production-dbt-server`
Do you know an easy way to save the agent logs?
Nate
I'm most familiar with running the agent/worker as a `systemd` process, in which case you can just check the log for that systemd process, but if you're using tmux I'm not sure off the top of my head; you'd likely have to redirect the output when you start the process, or maybe there's some tool to attach to / read from the existing process
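For example, one way to keep the tmux setup but still persist the agent output would be to pipe it through `tee` when starting the process (a minimal sketch; the queue name comes from this thread and the log path is an arbitrary example):

```bash
# Sketch: start the agent as before, but duplicate its output into a log file.
# The queue name is the one used in this thread; the log path is an arbitrary example.
prefect agent start -q production-dbt-server 2>&1 | tee -a /home/ec2-user/prefect-agent.log
```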
leonardoperazzini
@Nate I don't know if it's correlated with the problem in this thread ... but since the last Prefect update, some of the flows that run inside ECS have started to receive this error (image) ... Is there some way to retry when I receive a 429? Is this rate limit only mine, or is it a global concurrency limit? Thanks for helping ...
Nate
hi @leonardoperazzini that seems like a Docker rate limit
hmm, `failed to fetch anonymous token` is strange
leonardoperazzini
yeah ... but it's Prefect's default Docker image ... not my personal Docker account 😕 and it doesn't happen often either
Nate
as a sanity check, does returning to the previous version resolve the problem? I suspect the prefect version might be unrelated, but you mentioned this started occurring after an upgrade?
leonardoperazzini
Yeah ... it started with the release from exactly March 7 ... https://github.com/PrefectHQ/prefect/releases/tag/2.16.3 OK, I'll try pinning the previous version of the Prefect agent on EC2 ... and check it for us.
But the Prefect agent or worker version doesn't seem to be the problem... the error started to appear after that day, yet we didn't change the version on the queue/agent side. It seems to be related to the version of Prefect running in Prefect Cloud or in the flow infrastructure, and pinning those two is more complicated for me to test here.
Nate
yeah, I wasn't trying to suggest it could be related to the agent/worker version, rather the version of prefect used for your flow runtime
leonardoperazzini
@Nate Apparently ... it's a problem with versions above 2.16.2; after pinning the Prefect version for the containers and the process infrastructure, the errors and the stuck-in-Pending problems stopped. I haven't been able to save the logs on the server yet, but since the problem was solved by pinning the version, I won't run the newer one again just to capture the logs. If you know what could have changed in these new versions... I would appreciate it
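For reference, "pinning the version" in both places could look roughly like the sketch below; the 2.16.2 image tag and pip pin mirror the versions discussed in this thread, and the exact commands are illustrative rather than a confirmed fix:

```bash
# Sketch: pin the Prefect version everywhere the flow runtime is installed.

# ECS flows: point the job / task definition at a fixed image tag instead of a floating one:
#   docker.io/prefecthq/prefect:2.16.2-python3.10

# EC2 process flows: pin the package in the environment the agent starts flows from:
pip install "prefect==2.16.2"
```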
Nate
2.16.1 had a bug that was fixed in 2.16.2, but without logs I'm not sure what would have been wrong with your agent / flow. It's also possible you had a new version of `typer` that broke the CLI, with a version of prefect that did not pin `typer` - they've released a couple of breaking changes in the recent past; we had to pin it recently
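If a `typer` release really is what broke the CLI, a hedged workaround would be to constrain it in the same environment as Prefect; the version bound below is only an assumed example, not an official pin:

```bash
# Sketch: constrain typer alongside Prefect in the agent / flow runtime environment.
# "typer<0.10" is an assumed example bound; check which typer release actually works
# with your Prefect version before applying it.
pip install "prefect==2.16.2" "typer<0.10"
```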
leonardoperazzini
Today ... I received this error again: `prefect_aws.workers.ecs_worker.TaskFailedToStart: CannotPullContainerError: ref pull has been retried 1 time(s): failed to resolve reference "docker.io/prefecthq/prefect:2.16.2-python3.10": failed to authorize: failed to fetch anonymous token: unexpected status: 429 Too Many Requests` Only on ECS; with the process infrastructure everything is working (obviously ... there we don't use a container). But on the EC2 process, things are no longer getting stuck in Pending
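One common way to sidestep Docker Hub's anonymous-pull rate limit in this situation is to mirror the image into a private ECR repository and point the ECS task definition at that copy; in the sketch below the account ID, region, and repository name are placeholders:

```bash
# Sketch: mirror the Prefect image into ECR so ECS pulls no longer rely on
# anonymous Docker Hub requests. Account ID, region, and repo name are placeholders.
aws ecr create-repository --repository-name prefect
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker pull docker.io/prefecthq/prefect:2.16.2-python3.10
docker tag  docker.io/prefecthq/prefect:2.16.2-python3.10 \
            123456789012.dkr.ecr.us-east-1.amazonaws.com/prefect:2.16.2-python3.10
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/prefect:2.16.2-python3.10

# Then update the ECS work pool / task definition to use the ECR image URI above.
```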
Bianca Hoch
Hey Leo! Just wanted to check in here.
> It's a problem with versions above 2.16.2; after pinning the Prefect version for the containers and the process infrastructure, the errors and the stuck-in-Pending problems stopped.
is this still the case that you're no longer seeing issues with flow runs getting stuck in `Pending`?
leonardoperazzini
@Bianca Hoch Hello! Yeah ... I'm still receiving this error. The error persists... it no longer seems to be a container-version-related error. After I pinned the container (ECS) and the EC2 Prefect version (process) to 2.16.2, the error on the EC2 process stopped, and the error message from ECS changed to the one below: `prefect_aws.workers.ecs_worker.TaskFailedToStart: CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref docker.io/prefecthq/prefect:2.16.2-python3.10: failed to authorize: failed to fetch anonymous token: unexpected status: 429 Too Many Requests`
@Bianca Hoch @Nate ... can someone help with this? 😕 I don't know what else to do about the 429 error
Hello, I know it seems strange... but it looks like the same thing that happened back then, when there was an internal error in the Prefect API. It seems to be the same kind of issue: something is hitting Docker Hub more frequently and we're getting 429 errors. This error didn't exist before, and it started after the release of a new version of Prefect Cloud / Prefect... Wasn't it the same thing that happened before on your side? I'm really desperate to know how to fix this; my hands are tied. You followed that process with me and saw how complicated it was for us: we started from nothing, received an error that seemed to be on our side, but in fact it was Prefect's error... Could you ask internally again for the staff to take a look? @Bianca Hoch
Nate
Hi @leonardoperazzini - I don't think we've received enough information from you to know how we can help. It would make it easier for us to help you if you could open an issue that details the steps you took using prefect and the errors you received when it didn't work, or feel free to list that information out clearly here instead. 429s in general mean rate limits, so someone, whether it's Prefect Cloud, the GitHub registry, AWS, or someone else, is saying "you're sending us too many requests". I'm sorry for any frustration!
👍 1
Bianca Hoch
+1 to Nate's suggestion, an issue would be very helpful. Additionally, I think this post may get you closer to finding a root cause.
Ah - and I think this post from one of our engineers may be helpful to you as well. Someone else reported a similar issue, but it wasn't Prefect-specific.
👍 2
leonardoperazzini
@Nate and @Bianca Hoch just so you know ... I'm receiving this error too; it looks like Prefect is hitting the ECS APIs or the Docker image registry more often than normal, like I said the last few times 😕
It seems things are worse today 😕