# ask-community
d
Hi. We are using Prefect agents with Dask executors... after some time (<1 day), log streaming from tasks (logger = prefect.context.get("logger"); logger.info(text)) stops appearing in the Cloud flow log... the log can still be seen in the Dask worker log on the machine... is streaming breaking? Is it a rate limit? Any pointers on how to debug this? (cc @Connor Martin)... though there is a tiny bit of log until it runs on Dask: Flow successfully downloaded. Using commit: YYY Beginning Flow run for 'XXX' Connecting to an existing Dask cluster at tcp://localhost:8786 ... no content here, which is the problem ... Flow run FAILED: some reference tasks failed.
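For context, the tasks log roughly like this (a minimal sketch, not our actual flow; flow/task names are placeholders):
Copy code
import prefect
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def log_something():
    # grab the Prefect logger from context inside the task
    logger = prefect.context.get("logger")
    logger.info("some text that should show up in the Cloud flow log")

with Flow("example-flow") as flow:
    log_something()

# point the executor at the existing Dask cluster mentioned above
flow.executor = DaskExecutor(address="tcp://localhost:8786")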
a
I'm seeing what I can find out from the team!
hi Daniel! I believe this is a limitation of Dask: the workers don't send logs to the scheduler. When the logger is instantiated on the worker, it loses the handler -- you'll need a third-party tool to persist the logs. A bit more info here!
d
Thanks for finding out... I just read the referenced issue and read a bit of the code... the code seems to have changed heavily since the comments in that issue... e.g. there is now an ensure_started() method in the LogManager https://github.com/PrefectHQ/prefect/blob/master/src/prefect/utilities/logging.py#L51 ... any chance to get the logger + handler instantiated on the Dask worker as well?
maybe by re-adding it in the flow via e.g. logger.addHandler(CloudHandler())?
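Something like this is what I had in mind (untested sketch; assumes CloudHandler is still importable from prefect.utilities.logging as in the linked source):
Copy code
import prefect
from prefect import task
from prefect.utilities.logging import CloudHandler

@task
def chatty_task():
    logger = prefect.context.get("logger")
    # hypothetical fix: if the handler was lost when the logger was re-created
    # on the Dask worker, re-attach it so records get shipped to Cloud again
    if not any(isinstance(h, CloudHandler) for h in logger.handlers):
        logger.addHandler(CloudHandler())
    logger.info("this should now reach the Cloud log")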
k
Hey @Daniel Bast, I guess this is worth revisiting. I’ll talk to the engineers about this and get some clarity.
d
Hi Kevin, thanks for taking care of this... we have logging working with Dask workers until it stops after some time... so I guess it should work... when it stops working, we find these messages in the Dask agent log on the machine:
Copy code
.../miniconda3/envs/prefect/lib/python3.9/site-packages/prefect/utilities/logging.py:124: UserWarning: Failed to write logs with error: HTTPError('413 Client Error: Request Entity Too Large for url: https://api.prefect.io/graphql')
warnings.warn(f"Failed to write logs with error: {exc!r}")
k
Talked to the team, and you are right that this is likely due to limiting. Logs from an existing Dask cluster do work (just not from the workers).
d
ok. What does it mean if we see that "413 Client Error: Request Entity Too Large" error? What can we do about it? Does https://api.prefect.io/graphql have different limits compared to what the log client code in https://github.com/PrefectHQ/prefect/blob/master/src/prefect/utilities/logging.py#L36 applies before sending?
The issue is currently not easy to reproduce... I restarted the agent + dask-worker yesterday and so far log sending has not stopped... lots of flow runs since then... when the sending stopped before, restarting the agent + dask-worker did help... but we want to avoid restarting...
added some debugging code to inspect LOG_MANAGER.pending_logs and LOG_MANAGER.pending_length... to see what is going on when that sending error comes again
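The hook is roughly this (a sketch, not our exact code; names taken from prefect/utilities/logging.py in the version linked above):
Copy code
# sketch of the debugging hook: dump the client-side log queue state
from prefect.utilities.logging import (
    LOG_MANAGER,
    MAX_LOG_LENGTH,
    MAX_BATCH_LOG_LENGTH,
)

def dump_log_manager_state():
    print(
        f"pending_logs={len(LOG_MANAGER.pending_logs)} "
        f"pending_length={LOG_MANAGER.pending_length} "
        f"MAX_LOG_LENGTH={MAX_LOG_LENGTH} "
        f"MAX_BATCH_LOG_LENGTH={MAX_BATCH_LOG_LENGTH}"
    )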
but it looked like the Cloud had tighter limits than the client code -> then running into a send failure without recovering...
k
Hey Daniel, you are right that the logs seem too big for the payload here. Our recommendation has been to reduce the logs being sent, but we are seeing this a bunch, so we will explore the way we batch them.
d
ok. Thanks. Please tell me if you need more details or if I should report this as an issue on GitHub etc. Having logging work reliably is very important for us here. Thanks!
k
I chatted with the team, and the initial feedback is that you can try editing 2 values in logging.py (edit the source on your end):
Copy code
MAX_LOG_LENGTH = 1_000_000  # 1 MB - max length of a single log message
MAX_BATCH_LOG_LENGTH = 20_000_000  # 20 MB - max total batch size for log messages
Smaller values might make this work. Would you be willing to try this and tell us if it helps?
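For example, something like this (values purely illustrative):
Copy code
MAX_LOG_LENGTH = 100_000          # e.g. cap a single log message at ~100 KB
MAX_BATCH_LOG_LENGTH = 2_000_000  # e.g. cap a batch of log messages at ~2 MB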
d
ok, thanks, will try that... so far the failure has not happened for 2 days... I'll try it and come back when it happens again
thanks!
@Kevin Kho Now we have actual numbers from when the log sending error happens: len(pending_logs)=24104, pending_length=4668473, queue size=11, MAX_LOG_LENGTH=1000000, MAX_BATCH_LOG_LENGTH=20000000
So pending_length is about 4 times lower than the default MAX_BATCH_LOG_LENGTH... yet sending logs still fails.
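Quick sanity check on those numbers (just the arithmetic):
Copy code
pending_length = 4_668_473    # bytes queued when the 413 appeared
max_batch = 20_000_000        # default MAX_BATCH_LOG_LENGTH
print(f"{pending_length / max_batch:.0%}")  # -> 23%, well below the client-side cap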
k
Forwarding this to the team
f
@Kevin Kho Any updates on this? Is there also another way to set MAX_LOG_LENGTH and MAX_BATCH_LOG_LENGTH without adjusting logging.py?
k
Yeah, the API limit was updated to 5 MB in 0.15.5 or 0.15.6.
f
Is there no way to customise this as well? @Kevin Kho
k
If you're on Cloud, not really. Do you still run into issues at 5 MB?
f
I am using Server, not Cloud; going to upgrade to 0.15.5 / 0.15.6 now. Will get back to you if this does not fix it.