Is anyone else getting the rate limit error 429 exactly on t Prefect Community #prefect-cloud

Leonardo Perazzini

12/12/2023, 7:14 PM

Is anyone else getting the rate limit error 429 exactly on the task_runs/bacc484b-9632-4ac9-9616-30e1451205ca/set_state endpoint? I don't hit this endpoint directly... it's the actual execution of the flows and tasks that calls it. It started this afternoon... and I've already checked all the flow runs and task runs... I shouldn't be getting this error. @Nate do you know anything about this?

👀 1

Leonardo Perazzini

12/12/2023, 7:14 PM

@Marvin

Marvin

12/12/2023, 7:14 PM

Hello! How can I assist you today? If you have any questions or need help with something, feel free to ask.

Bianca Hoch

12/12/2023, 7:16 PM

Hey Leonardo, I sent you over an email with some pointers for the 429 errors you're seeing

Bianca Hoch

12/12/2023, 7:17 PM

I'll recap here just so that others can see:
> For the rate limits, any call that is made to the flow_run, task_run, and flow endpoints counts towards the limit. This includes task/flow run state changes, task/flow run creation, deletion, cancellation, etc.
>
> It bears mentioning that the 429 errors are automatically retried by the client a certain number of times. The number of retries is also configurable (see here), which can help with managing smaller spikes in requests. However, a consistently high volume of requests would consume a lot of bandwidth, and retries wouldn't be of much help on that front. At that point, you may consider spreading out the number of flow runs and task runs that run at a given time.
>
> You can add sleeps to your tasks, or set concurrency limits to avoid bumping into the rate limit.
> • Work pool concurrency
> • Task run concurrency

Leonardo Perazzini

12/12/2023, 7:24 PM

@Bianca Hoch Hey, I just saw your email! But something is wrong... we didn't change anything today that would change the number of hits on the API, and the rate limit error is always on task_runs/{id}/set_state, even when only a few flows are running, and I know those flows don't have too many tasks running either. Is there some way to audit the calls that are causing the throttling? I'm really desperate 😕

Leonardo Perazzini

12/13/2023, 12:35 PM

@Nate can you help with this too, please? We didn't change anything in the code, and we haven't added anything lately that runs very frequently or involves many tasks. The rate limit is being triggered exactly in tasks that have existed for months... and that use task concurrency. If it is an instability or a bug on the Prefect Cloud side, we can wait... otherwise, we will need to migrate to another platform, or to Prefect server... At the moment we receive the error, we have a maximum of 20 active tasks and 8 flows... it is very unlikely that something that never gave an error before is now, out of nowhere and without changes, triggering the rate limit, and always when flows that have concurrent tasks are running :/

Nate

12/13/2023, 3:29 PM

hi @Leonardo Perazzini - I'm not aware of an instability in Cloud at the moment that would cause this, there could be one though. would you be able to share an example of one of your flows where you encounter rate limits and a trace where it has failed?

Leonardo Perazzini

12/13/2023, 3:41 PM

@Nate One of my flows that is giving an error is https://app.prefect.cloud/account/dc1ed6a7-a0e5-4127-8ff1-57045f7ef23a/workspace/dd2e34bf-cd45-4f14-82d5-f5a0c0d361b9/deployments/deployment/a3e608ee-7f5e-4075-868b-fea4cfca1ec8. I don't know if you can see it internally or within my UI, but the problem started yesterday. It runs every 30 minutes, for example, and uses a task concurrency limit of 5 so that no more than that run at the same time, and the error is even crashing the container in ECS. Here's one that failed, https://app.prefect.cloud/account/dc1ed6a7-a0e5-4127-8ff1-57045f7ef23a/workspace/dd2e34bf-cd45-4f14-82d5-f5a0c0d361b9/flow-runs/flow-run/5edfbe6c-479d-4539-8c5f-dc57edfa9fc5 if you need a closer look. It's difficult to understand what could be causing this, since other flows could be having an impact of course... and there are some flows that also run concurrent tasks at this time. But as I said before, there was no change in the number of tasks in these flows (I can guarantee it, as the number of tasks that will be submitted is fixed), and no other flows were added that could compete and push us over the rate limit.

Bianca Hoch

12/13/2023, 5:34 PM

Hi Leonardo, we're taking a peek here

Bianca Hoch

12/13/2023, 5:35 PM

so it looks like ~109 task runs were submitted

Bianca Hoch

12/13/2023, 5:35 PM

Does that sound right?

Leonardo Perazzini

12/13/2023, 6:17 PM

@Bianca Hoch Yes! That's it!

Bianca Hoch

12/13/2023, 6:18 PM

sounds like we're getting somewhere then!

Leonardo Perazzini

12/13/2023, 6:18 PM

They are all submitted at the same time with the .submit() method, and then, using task concurrency with tags, they run in batches of 5. There are other flows that also do this. Some with less concurrency, and some with no concurrency at all... they are linear
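A minimal sketch of spreading submissions out instead of firing them all at once. The names `chunked` and `submit_in_batches` are hypothetical helpers, and `submit` is a stand-in for something like a task's `.submit` method. This only staggers the calls made at submission time; whether it lowers the number of state-change requests made during execution is not something this thread confirms.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def submit_in_batches(
    items: Iterable[T],
    submit: Callable[[T], R],
    batch_size: int = 5,
    pause_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Call `submit` for every item, pausing between batches.

    `submit` is a placeholder for whatever creates the task run.
    """
    results: List[R] = []
    for index, batch in enumerate(chunked(items, batch_size)):
        if index > 0:
            sleep(pause_seconds)  # spread the API calls over time
        results.extend(submit(item) for item in batch)
    return results
```

The `sleep` parameter is injectable so the pacing can be checked without actually waiting; the batch size and pause are illustrative values, not recommendations.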

👀 1

Bianca Hoch

12/13/2023, 6:18 PM

ps: the way I figured that out was just taking a look at the "Task runs" tab for the brawny-skink flow run

Leonardo Perazzini

12/13/2023, 6:25 PM

hehe yes! I thought you were talking about tasks submitted in all flows at the same time, not just the one above

😄 1

Bianca Hoch

12/13/2023, 6:26 PM

yup, juuuust the one above

Bianca Hoch

12/13/2023, 6:27 PM

it looks like a majority of the tasks, if not all, were submitted in that 10:03 - 10:04 minute window

Bianca Hoch

12/13/2023, 6:27 PM

and juuuust when everything starts to complete around 10:05, a crash happens at 10:06

Leonardo Perazzini

12/13/2023, 6:32 PM

Yes, I understand they are all being submitted and created in a very close window... and they should count towards the rate limit. But if the rate limit resets every minute, the submissions shouldn't be interfering, since a few minutes have passed by then.

Leonardo Perazzini

12/13/2023, 6:33 PM

One thing I noticed... I'm now getting last month's payment slip to send to the company's controller, and I noticed that November doesn't appear as paid... is there any chance I'm hitting the rate limit of 400 because of that?

Leonardo Perazzini

12/13/2023, 6:33 PM

Sorry for the English, I'm using the translator to help me and to try to speed things up hehe

Bianca Hoch

12/13/2023, 6:34 PM

You can speak Portuguese if you want

Leonardo Perazzini

12/13/2023, 6:35 PM

👀

Bianca Hoch

12/13/2023, 6:35 PM

my grammar is pretty bad, but I can make an effort

Leonardo Perazzini

12/13/2023, 6:35 PM

hehehe thank you!

👍 1

Bianca Hoch

12/13/2023, 6:36 PM

I'll take a look at your account and the payment details, one moment

Bianca Hoch

12/13/2023, 6:38 PM

hmm, I don't think it's about the account. The plan is still "pro-tier".

Leonardo Perazzini

12/13/2023, 6:38 PM

Got it ...

Leonardo Perazzini

12/13/2023, 6:40 PM

About the submissions, I'm not sure I expressed myself well up there. Since the errors happened at minute 05-06, and the 109 submissions were at minute 3... it seems that even with all the submissions happening at the same time, that isn't what is triggering the rate limit, but rather the executions themselves.

Bianca Hoch

12/13/2023, 7:04 PM

I think I understand. Actually, the limit is applied to the whole account, not just one flow. If this flow ran at the same time as another one with its own set of tasks, that would also make hitting the limit more likely.

Bianca Hoch

12/13/2023, 7:09 PM

There are ways to avoid this, such as adding time.sleep(5) inside the tasks, or adjusting the variable I showed a bit earlier: ``PREFECT_CLIENT_MAX_RETRIES``

Bianca Hoch

12/13/2023, 7:14 PM

The value is set to 5; you can increase it to see if it helps. You can also space out the retries with another setting: PREFECT_CLIENT_RETRY_JITTER_FACTOR (here)
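To make the two settings concrete, here is a generic sketch of what a retry count and a jitter factor typically control. It is not Prefect's client code: `RateLimited`, `call_with_retries`, and `base_delay` are hypothetical names, and the exponential backoff schedule is an assumption chosen for illustration.

```python
import random
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical error raised when the server answers 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after


def call_with_retries(
    request: Callable[[], object],
    max_retries: int = 5,
    jitter_factor: float = 0.2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> object:
    """Retry `request` on RateLimited, waiting longer after each failure."""
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimited as exc:
            if attempt == max_retries:
                raise  # out of retries: surface the 429 to the caller
            # Prefer the server's hint; otherwise back off exponentially.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            # Jitter stretches the delay by up to `jitter_factor` so that
            # many clients do not all retry at the same instant.
            delay *= 1 + rng.uniform(0, jitter_factor)
            sleep(delay)
```

A higher retry count lets a request survive a longer burst of 429s, while a higher jitter factor spreads concurrent retries further apart. Neither reduces the total number of requests made.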

Leonardo Perazzini

12/13/2023, 7:14 PM

Would I need to have a retry inside the flows themselves? Or would just changing this parameter already make it retry automatically?

Leonardo Perazzini

12/13/2023, 7:15 PM

https://docs.prefect.io/latest/concepts/flows/ In the part about retries as a flow parameter

Bianca Hoch

12/13/2023, 7:16 PM

PREFECT_CLIENT_MAX_RETRIES above is for the client, not for the flow. Essentially, it defines the number of attempts the client makes to reach the Prefect API if a 429/500 error is hit.

Bianca Hoch

12/13/2023, 7:19 PM

You would need to set this variable in the container/ECS task where your flow runs.
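A small sketch of how those two settings could be expressed as entries for the `environment` list of an ECS container definition, which takes name/value string pairs. The helper `ecs_environment` is hypothetical, and the values 10 and 0.5 are illustrative assumptions, not recommendations from this thread.

```python
import json
from typing import Dict, List


def ecs_environment(settings: Dict[str, object]) -> List[Dict[str, str]]:
    """Convert a mapping into the name/value list used by the
    `environment` field of an ECS container definition."""
    return [
        {"name": name, "value": str(value)}
        for name, value in sorted(settings.items())
    ]


# Illustrative values only; tune them for your own workload.
prefect_client_settings = {
    "PREFECT_CLIENT_MAX_RETRIES": 10,
    "PREFECT_CLIENT_RETRY_JITTER_FACTOR": 0.5,
}

if __name__ == "__main__":
    print(json.dumps(ecs_environment(prefect_client_settings), indent=2))
```

The variables have to be present in the environment of the process that runs the flow, since that is where the API client lives.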

Leonardo Perazzini

12/13/2023, 7:19 PM

Ok! I'll try changing this config!

🍀 1

🙏 1

Bianca Hoch

12/13/2023, 7:22 PM

As a last resort, I can ask my team whether we can offer an annual contract for an account with higher limits. You would pay the same amount, it would just be billed annually.

Bianca Hoch

12/13/2023, 7:22 PM

It's up to you!

Leonardo Perazzini

12/13/2023, 7:50 PM

Thank you Bianca! I'll bring this possibility to the data manager. We'll probably need it in the future! Once the immediate problem is solved, we'll figure out what to do. I'll come back with updates!

Leonardo Perazzini

12/13/2023, 8:27 PM

@Bianca Hoch ... one last piece of information ... I think this is quite important for you to pass on to your technical team... I'm calling some deletion endpoints for tasks that were pending but were neither cancelled nor finished in Prefect Cloud. I'm doing this manually; it's not a recurring process. It's being done sequentially... What's happening, and what I find strange, is that among requests only milliseconds apart, some return 429 and others 204. 204 204 429 204 429 204 429 204 204 429 204 204 204 204 204 204 204 204 204 204 204 204 204 429 204 204

👀 1

Leonardo Perazzini

12/13/2023, 8:29 PM

If the rate limit resets after 1 minute, why is it randomly returning 429 and then returning 204? After the first 429, shouldn't it return a run of 429s until the rate limit resets? Would they be able to explain why? I don't know what the behavior was before, because I had never run this, but couldn't this be what is causing the problem?
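One possible explanation, offered only as an assumption since the thread does not say how Prefect Cloud enforces its limit: a fixed-window counter produces an unbroken run of 429s until the window resets, while a limiter that refills continuously (a token bucket, for instance) lets requests through as soon as budget trickles back, so 204s and 429s interleave. Other clients drawing on the same account-wide budget would mix the results further. The sketch below simulates both; all class names and numbers are hypothetical.

```python
from typing import Callable, List


class FixedWindow:
    """Counter that resets only when a new window starts."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.window_start = 0.0
        self.count = 0

    def allow(self, now: float) -> bool:
        if now - self.window_start >= self.window:
            self.window_start = now  # simplification: window restarts here
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class TokenBucket:
    """Continuously refilling limiter (an assumption, not Prefect Cloud's code)."""

    def __init__(self, rate_per_second: float, capacity: float, tokens: float = 0.0):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = tokens
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the previous request, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(request_times: List[float], allow: Callable[[float], bool]) -> List[int]:
    """Return 204 for accepted requests and 429 for rejected ones."""
    return [204 if allow(t) else 429 for t in request_times]


# One request every 0.125 s against each limiter.
times = [i * 0.125 for i in range(8)]
fixed = simulate(times, FixedWindow(limit=3, window_seconds=60.0).allow)
bucket = simulate(times, TokenBucket(rate_per_second=4.0, capacity=10.0, tokens=1.0).allow)
print(fixed)   # [204, 204, 204, 429, 429, 429, 429, 429]
print(bucket)  # [204, 429, 204, 429, 204, 429, 204, 429]
```

The second pattern resembles the alternating sequence reported above, but that resemblance alone does not establish which scheme the service actually uses.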

Leonardo Perazzini

12/14/2023, 11:07 AM

@Bianca Hoch with the change you proposed, we're no longer getting errors when running the flows. Of course, when we use the UI it still returns rate limit errors, but at least we managed to put out the fire with this solution. Could you pass the scenario I described above on to the technical team? Because even though we've put out the fire, it's very hard to use the UI, with failures when clicking buttons.

👍 1

🎊 1

Leonardo Perazzini

12/14/2023, 11:08 AM

Sem_titulo.py

Leonardo Perazzini

12/14/2023, 11:11 AM

You can pass the code above to them so they can test it and understand why the rate limit isn't blocking everything from the first 429 onward and resetting only at the end of the minute. I ran a test this morning and the same thing happened... I'm raising this, Bianca, because it really was a very sharp change on our side, and we hadn't changed anything. For us it's a game changer, since we moved to the PRO plan a few months ago precisely to solve the rate limit problem... Thank you very much for the support and help you're giving!

👍 1

Leonardo Perazzini

12/14/2023, 3:53 PM

Hii @Bianca Hoch ... is anyone else getting error 500 from the API? 😕 Crash detected! Execution was interrupted by an unexpected exception: PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'https://api.prefect.cloud/api/accounts/dc1ed6a7-a0e5-4127-8ff1-57045f7ef23a/workspaces/dd2e34bf-cd45-4f14-82d5-f5a0c0d361b9/task_runs/108f9fc8-ae4e-4a5b-bee7-244ae69edbdd/set_state' Response: {'exception_message': 'Internal Server Error'} For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500

Leonardo Perazzini

12/14/2023, 3:54 PM

https://app.prefect.cloud/account/dc1ed6a7-a0e5-4127-8ff1-57045f7ef23a/workspace/dd2e[…]61b9/flow-runs/flow-run/9b1c21fa-928f-4fe9-89cf-d348a082deba URL of one of the flows that got error 500

Bianca Hoch

12/14/2023, 4:02 PM

Hi! I'll check internally about the 500 error, one moment

Bianca Hoch

12/14/2023, 4:15 PM

I've let my team know, they are actively working on a fix

🙌 1

Bianca Hoch

12/14/2023, 5:10 PM

The fix has been deployed, is everything OK now? 👀

Leonardo Perazzini

12/14/2023, 5:12 PM

Hi! It seems so, the 500 errors are no longer happening! But there are still a lot of 429s even when few flows are running (I can tell because the UI doesn't work due to these errors 😕). Did the team get to look at the code and at why it returns 429, 200, and keeps alternating?

Leonardo Perazzini

12/14/2023, 5:13 PM

Maybe our account is falling into some edge case and I'm just unlucky hehe (I've seen it happen in projects I worked on) ... that's why I'm hopeful

Bianca Hoch

12/15/2023, 2:38 PM

Good morning Leo, I just wanted you to know that I haven't forgotten about you here. My team is investigating the 429 errors now.

❤️ 1

Bianca Hoch

12/15/2023, 4:17 PM

Thank you for your patience, Leo! My team identified a server-side feature that was potentially causing the increase in 429s. They made some adjustments to resolve the problem. You should see fewer errors showing up on your side.

Leonardo Perazzini

12/15/2023, 4:19 PM

Hi @Bianca Hoch! Wow, that's wonderful, I'm glad there's a chance we really haven't hit the limit for now! I'll keep an eye on it and check! Thank you very much!

Leonardo Perazzini

12/15/2023, 4:22 PM

I imagine your backlog must be quite long, but here's a suggestion: an endpoint to check how many calls have already been made within a time range ... HubSpot has one of these, and it helps a lot. An audit log is harder when it's something that resets after just 1 minute, but maybe with the count per date, already aggregated, it would be possible to see, for example, whether there were sudden changes!
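Until something like that exists server-side, the same aggregate can be approximated on the client. A minimal sketch, assuming you record a timestamp and an endpoint label for each API call yourself (for example in a wrapper or proxy); `calls_per_minute` is a hypothetical helper and the sample events are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple


def calls_per_minute(
    events: Iterable[Tuple[datetime, str]],
) -> Dict[Tuple[str, str], int]:
    """Count (timestamp, endpoint) events per UTC minute and endpoint.

    Timestamps are expected to be timezone-aware.
    """
    counts: Counter = Counter()
    for timestamp, endpoint in events:
        minute = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")
        counts[(minute, endpoint)] += 1
    return dict(counts)


# Invented sample data, not taken from any real account.
sample_events = [
    (datetime(2023, 12, 13, 10, 3, 5, tzinfo=timezone.utc), "task_runs/set_state"),
    (datetime(2023, 12, 13, 10, 3, 40, tzinfo=timezone.utc), "task_runs/set_state"),
    (datetime(2023, 12, 13, 10, 5, 1, tzinfo=timezone.utc), "task_runs/set_state"),
]
summary = calls_per_minute(sample_events)
```

Comparing these per-minute counts across days would show whether the request volume actually changed when the 429s began.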
