# ask-community
d
Hi. We are using Prefect agents with Dask executors... after some time (<1 day), log streaming from tasks (logger = prefect.context.get("logger"); logger.info(text)) stops appearing in the Cloud flow log... the log can still be seen in the Dask worker log on the machine... is streaming breaking? Is it a rate limit? Any pointers on how to debug this? (cc @Connor Martin)... though there is a tiny bit of log until it runs on Dask: Flow successfully downloaded. Using commit: YYY Beginning Flow run for 'XXX' Connecting to an existing Dask cluster at tcp://localhost:8786 ... no content here, which is the problem ... Flow run FAILED: some reference tasks failed.
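For context, the tasks log roughly like this (a minimal sketch, not our actual flow; flow/task names are placeholders):
Copy code
import prefect
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def log_something():
    # grab the Prefect logger from context inside the task
    logger = prefect.context.get("logger")
    logger.info("some text that should show up in the Cloud flow log")

with Flow("example-flow") as flow:
    log_something()

# point the executor at the existing Dask cluster mentioned above
flow.executor = DaskExecutor(address="tcp://localhost:8786")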
a
I'm seeing what I can find out from the team!
hi Daniel! I believe this is a limitation of Dask: the workers don't send logs to the scheduler. When the logger is instantiated on the worker, it loses the handler -- you'll need a third-party tool to persist the logs. A bit more info here!
d
Thanks for finding out... I just read the referenced issue and read a bit of the code... the code seems to have changed heavily since the comments in that issue... e.g. there is now an ensure_started() method in the LogManager https://github.com/PrefectHQ/prefect/blob/master/src/prefect/utilities/logging.py#L51 ... any chance to get the logger + handler instantiated on the Dask worker as well?
maybe by re-adding it in the flow via e.g. logger.addHandler(CloudHandler())?
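Something like this is what I had in mind (untested sketch; assumes CloudHandler is still importable from prefect.utilities.logging as in the linked source):
Copy code
import prefect
from prefect import task
from prefect.utilities.logging import CloudHandler

@task
def chatty_task():
    logger = prefect.context.get("logger")
    # hypothetical fix: if the handler was lost when the logger was re-created
    # on the Dask worker, re-attach it so records get shipped to Cloud again
    if not any(isinstance(h, CloudHandler) for h in logger.handlers):
        logger.addHandler(CloudHandler())
    logger.info("this should now reach the Cloud log")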
k
Hey @Daniel Bast, I guess this is worth revisiting. I’ll talk to the engineers about this and get some clarity.
d
Hi Kevin, thanks for taking care of this... we have logging working with Dask workers until it stops after some time... so I guess it should work... when it stops working, we find these messages in the Dask agent log on the machine:
Copy code
.../miniconda3/envs/prefect/lib/python3.9/site-packages/prefect/utilities/logging.py:124: UserWarning: Failed to write logs with error: HTTPError('413 Client Error: Request Entity Too Large for url: https://api.prefect.io/graphql')
warnings.warn(f"Failed to write logs with error: {exc!r}")
k
Talked to the team, and you are right that this is likely due to limiting. Logs from an existing Dask cluster do work (just not from the workers).
d
ok. What does it mean if we see that "413 Client Error: Request Entity Too Large" error? What can we do about it? Does https://api.prefect.io/graphql have different limits compared to what the log client code in https://github.com/PrefectHQ/prefect/blob/master/src/prefect/utilities/logging.py#L36 applies before sending?
The issue is currently not easy to reproduce... I restarted the agent + dask-worker yesterday and so far log sending has not stopped... lots of flow runs since then... when the sending stopped before, restarting the agent + dask-worker did help... but we want to avoid restarting...
added some debugging code to inspect LOG_MANAGER.pending_logs and LOG_MANAGER.pending_length... to see what is going on when that sending error comes again
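The hook is roughly this (a sketch, not our exact code; names taken from prefect/utilities/logging.py in the version linked above):
Copy code
# sketch of the debugging hook: dump the client-side log queue state
from prefect.utilities.logging import (
    LOG_MANAGER,
    MAX_LOG_LENGTH,
    MAX_BATCH_LOG_LENGTH,
)

def dump_log_manager_state():
    print(
        f"pending_logs={len(LOG_MANAGER.pending_logs)} "
        f"pending_length={LOG_MANAGER.pending_length} "
        f"MAX_LOG_LENGTH={MAX_LOG_LENGTH} "
        f"MAX_BATCH_LOG_LENGTH={MAX_BATCH_LOG_LENGTH}"
    )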
but it looked like the Cloud had tighter limits than the client code -> then running into a send failure without recovering...
k
Hey Daniel, you are right that the logs seem too big for the payload here. Our recommendation has been to reduce the logs being sent, but we are seeing this a bunch, so we will explore the way we batch them.
d
ok. Thanks. Please tell me if you need more details or if I should report this as an issue on GitHub etc. Having logging work reliably is very important for us here. Thanks!
k
I chatted with the team, and the initial feedback is that you can try editing 2 values in logging.py (edit the source on your end):
Copy code
MAX_LOG_LENGTH = 1_000_000  # 1 MB - max length of a single log message
MAX_BATCH_LOG_LENGTH = 20_000_000  # 20 MB - max total batch size for log messages
Smaller values might make this work. Would you be willing to try this and tell us if it helps?
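For example, something like this (values purely illustrative):
Copy code
MAX_LOG_LENGTH = 100_000          # e.g. cap a single log message at ~100 KB
MAX_BATCH_LOG_LENGTH = 2_000_000  # e.g. cap a batch of log messages at ~2 MB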
d
ok, thanks, will try that... so far the failure has not happened for 2 days... I'll try it and come back when it happens again
thanks!
@Kevin Kho Now we have actual numbers from when the log sending error happens: len(pending_logs)=24104, pending_length=4668473, queue size=11, MAX_LOG_LENGTH=1000000, MAX_BATCH_LOG_LENGTH=20000000
So pending_length is about 4 times lower than the default MAX_BATCH_LOG_LENGTH... yet sending logs still fails.
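Quick sanity check on those numbers (just the arithmetic):
Copy code
pending_length = 4_668_473    # bytes queued when the 413 appeared
max_batch = 20_000_000        # default MAX_BATCH_LOG_LENGTH
print(f"{pending_length / max_batch:.0%}")  # -> 23%, well below the client-side cap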
k
Forwarding this to the team
f
@Kevin Kho Any updates on this? Is there also another way to set MAX_LOG_LENGTH and MAX_BATCH_LOG_LENGTH without adjusting logging.py?
k
Yeah, the API limit was updated to 5 MB in 0.15.5 or 0.15.6.
f
Is there no way to customise this as well? @Kevin Kho
k
If you're on Cloud, not really. Do you still run into issues at 5 MB?
f
I am using Server, not Cloud; going to upgrade to 0.15.5 / 0.15.6 now. Will get back to you if this does not fix it.