# prefect-ui
d
Struggling to understand how to read this, as I am not seeing the output of print statements in the job run in some cases and am wondering if I hit a limit. Does this show log and API limits? What is the difference between usage and granted/requested? Is this showing the Pro limit of 2k even though I am a free user? If the graph shows a data point above the 400 limit with a state of granted, does that mean the limit is a soft limit as long as usage is not much higher than the tier's allocated amount? Thanks
z
Hi @Denver H! Thanks for the feedback here
> Does this show log and API limits?
Currently the graph only shows API limits for flow/task run endpoints. If you share your account id I’m happy to check if you’re running into logging rate limits. We’re working on adding other limits to this chart.
> What is the difference between usage and granted/requested?
• Granted = the number of requests successfully made under the rate limit, which should be equal to usage
• Requested = the total number of requests made
> Is this showing the pro limit of 2k even as a free user?
That looks like a bug in our system that I’ll track down. The free tier has a limit of 400.
> If the graph shows a data point above the 400 limit with the state of granted, does that mean the limit is a soft limit as long as it is not overly higher than the tier allocated amount?
In some cases we may allow traffic slightly in excess of the rate limit, but rate limits will become less “soft” going forward.
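For reference, a minimal sketch of the general client-side pattern for treating the limit as hard: back off when the server answers 429, honoring `Retry-After` if it is sent. The URL, retry count, and helper name below are illustrative, not part of Prefect's client.

```python
import time
import httpx

def get_with_backoff(url: str, max_retries: int = 5) -> httpx.Response:
    """Retry a GET on 429 responses, preferring the server-suggested wait."""
    resp = httpx.get(url)
    for attempt in range(max_retries):
        if resp.status_code != 429:
            return resp
        # Use Retry-After when present, otherwise fall back to exponential backoff.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
        resp = httpx.get(url)
    return resp
```

Prefect's own client already retries 429s internally (as noted further down this thread), so this pattern mainly matters for custom calls made outside it.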
d
Thank you - that all makes sense. Account: <>. I am aware I have sometimes gone over the limit, as I have a lot of small, fast jobs piping out input, so I'm looking for ways to track that down and fix it up.
z
> I am not seeing the output of print statements in the job run in some cases so wondering if I hit a limit
Looking on our side over the past 3 days, I do see that account exceeding our log rate limit. On the free tier we allow users to upload up to 700 logs per minute. The chart below is grouped by minute; the purple bars are the number of logs we rejected due to rate limits. During some peaks we were seeing more than 40k logs per minute, which would result in logs being dropped once we exhaust client retries on 429 API responses.
🙌 1
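A minimal sketch of one way to keep a chatty flow under that per-minute budget: emit periodic summaries instead of one log line per item. The flow name and threshold below are illustrative.

```python
from prefect import flow, get_run_logger

@flow
def noisy_flow(items: list[int], summary_every: int = 500):
    logger = get_run_logger()
    processed = 0
    for item in items:
        processed += 1
        # Log a periodic summary rather than one line per item,
        # keeping the run well under a per-minute log budget.
        if processed % summary_every == 0:
            logger.info("processed %d/%d items", processed, len(items))
    logger.info("finished: %d items processed", processed)
```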
d
Thanks. Digging into this, I believe the issue may have been that a flow had
log_prints=True
in its decorator, but some tasks within did not set it in their own decorators, so they presumably inherited it and recorded prints as Prefect log entries. In any case, this might also explain a more significant issue I have been chasing down for weeks, where I sporadically get
QueueingSpanExporter - Failed to export batch: HTTPSConnectionPool(host='api.prefect.cloud', port=443): Read timed out. (read timeout=10)
Sometimes that is the extent of it; other times it makes the process hang for exactly 15 minutes, after which a db error surfaces and the process continues. I am now wondering if excessive logging results in some temporary ban that breaks the open Prefect communication and/or leaves some db processes open/orphaned. It has been a real blocker, and I have gone down some deep rabbit holes trying to reproduce it consistently and debug it, but this now seems more probable.
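For reference, a small sketch of the inheritance described above: `log_prints` set on the flow decorator carries down to its tasks unless a task sets it explicitly, so a particularly noisy task can opt out. Names below are illustrative.

```python
from prefect import flow, task

@task  # inherits log_prints=True from the parent flow, so prints become Prefect logs
def chatty_task(item: int):
    print(f"working on {item}")

@task(log_prints=False)  # opts out: prints go to stdout only, not the Prefect API
def quiet_task(item: int):
    print(f"working on {item}")

@flow(log_prints=True)
def my_flow(items: list[int]):
    for item in items:
        chatty_task(item)
        quiet_task(item)
```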
z
> QueueingSpanExporter - Failed to export batch: HTTPSConnectionPool(host='api.prefect.cloud', port=443): Read timed out. (read timeout=10)
As you suggested, this error is likely caused by hitting rate limits and being unable to upload logs. It also may delay completion of a process because the prefect client will attempt to upload all logs before exiting a process. I’m less sure about / haven’t seen the db error you’re referencing though. Happy to help debug if you can share a stack trace
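If the end-of-run log flush is what delays exit, one blunt workaround is to turn off API log shipping for an especially noisy run. This is a sketch assuming the `PREFECT_LOGGING_TO_API_ENABLED` setting; confirm the exact name with `prefect config view` on your Prefect version.

```python
import os

# Assumed setting name; verify it with `prefect config view` for your version.
# With API log shipping off, run logs stay local and there is no upload queue
# to drain before the process can exit.
os.environ["PREFECT_LOGGING_TO_API_ENABLED"] = "false"

from prefect import flow, get_run_logger

@flow(log_prints=True)
def very_noisy_flow(n: int = 100_000):
    logger = get_run_logger()
    for i in range(n):
        logger.debug("step %d", i)  # stays local; nothing queued for upload at exit
```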
d
Right, still validating some ideas, as the stack has added complexity in how it wraps some traffic via WireGuard, which might be related if a host running a flow changes IPs during a run. Are there longer-term considerations to have rate-limiting thresholds handled at the HTTP layer rather than seemingly blocking the IP at the infra level, or have I triggered other controls that then consider the excessive traffic malicious?
z
Hmm, today rate-limiting thresholds should only apply at the HTTP layer, not by blocking the IP. Even if you exceed your logging limit, for example, you should still be able to make API calls to run a flow. If you’re reliably seeing other calls fail when logs are rate limited, something unexpected is happening. I suppose it’s possible our GCP infra is doing some automatic IP blocking to prevent DDoS, but we don’t have anything explicitly configured at the infra level to block. It’s possible the shared
httpx
client within the Python process is breaking somehow. Without an MRE it’s difficult to say though
d
Got it - appreciate the speedy responses. Working on slowing down the print statements going to logs, and I will drop more details when I have a clear flow to reproduce.
highfive 1