# prefect-community
**Jeff:**
Howdy. We have a flow we are running that uses the core server to manage it. The flow hits a certain task and then just stalls out. It does not write any of the logs, it ignores the timeout set in the `@task` settings, and just sits waiting for an hour until the heartbeat dies. We can run it on the same worker server manually with no problems at all. There don't seem to be any logs indicating it is out of memory. This seems pretty squarely on the Prefect side, but I am lost as to how to debug it since only the server-managed run has the issue.
```
Process 27956: /opt/prefect/.venv/bin/python3.8 -m prefect heartbeat flow-run -i 14303ee3-e96a-49c9-87e6-6577fcace7d3
Python v3.8.0 (/usr/bin/python3.8)
Thread 27956 (idle): "MainThread"
    flow_run (prefect/cli/heartbeat.py:98)
    invoke (click/core.py:610)
    invoke (click/core.py:1066)
    invoke (click/core.py:1259)
    invoke (click/core.py:1259)
    main (click/core.py:782)
    __call__ (click/core.py:829)
    <module> (prefect/__main__.py:4)
    _run_code (runpy.py:85)
    _run_module_as_main (runpy.py:192)
```
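For reference, the `@task`-level timeout mentioned above is set directly in the decorator. A minimal sketch, assuming Prefect 1.x / Core Server (which matches the CLI paths in the dump); the task name, body, and limit here are illustrative, not from the thread:

```python
from prefect import task, Flow

# Illustrative task with a 10-minute timeout; in Prefect 1.x the @task
# decorator accepts a timeout in seconds.
@task(timeout=600)
def stalling_task():
    # placeholder body; the real task is whatever stalls under the server-managed run
    ...

with Flow("example-flow") as flow:
    stalling_task()
```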
**Chris White:**
Hi Jeff - there are two large differences between a local run (i.e. `flow.run`) and a server-managed run that hopefully provide a place to begin looking:
• firstly, a server-managed run will be run by an agent. If you are using a `LocalAgent`, that means the flow run will take place in a subprocess managed by the agent
• secondly, a server-managed run will additionally spawn its own subprocess to send regular heartbeats to the server API (which is what the process ID you’ve listed here looks to be - that CLI command does very little other than send a POST request and sleep: https://github.com/PrefectHQ/prefect/blob/master/src/prefect/cli/heartbeat.py#L91-L98)

It’s really hard to say for certain, but my guess would be a resource limitation of some kind caused by these subprocesses, or possibly some locked state caused by that particular task running in a subprocess rather than in the main process. We’ve seen that deployments without a process cleanup service (e.g., tini: https://github.com/krallin/tini) can quickly end up clogged, so I’d start by investigating that (a rough check is sketched below).
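One way to follow up on that suggestion is to look for piled-up heartbeat subprocesses or zombies on the worker box. This is a hedged sketch, not part of the original thread, and assumes the third-party `psutil` package is installed; the string filter is illustrative:

```python
import psutil

# Rough check for accumulating "prefect heartbeat" subprocesses and zombie
# processes on the worker; this is diagnostic only, not a Prefect API.
heartbeats, zombies = [], []
for proc in psutil.process_iter(["pid", "cmdline", "status"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "prefect heartbeat" in cmdline:
        heartbeats.append(proc.info["pid"])
    if proc.info["status"] == psutil.STATUS_ZOMBIE:
        zombies.append(proc.info["pid"])

print(f"heartbeat subprocesses: {len(heartbeats)}, zombies: {len(zombies)}")
```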
**Jeff:**
Awesome, will check and report back.
@Chris White We discovered this in the logs:
```
Feb 08 23:25:25 ip-10-3-14-187 prefect[7845]: graphql_1   | INFO:     172.18.0.3:57846 - "POST /graphql/ HTTP/1.1" 200 OK
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    | PayloadTooLargeError: request entity too large
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at readStream (/apollo/node_modules/raw-body/index.js:155:17)
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at getRawBody (/apollo/node_modules/raw-body/index.js:108:12)
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at read (/apollo/node_modules/body-parser/lib/read.js:77:3)
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at jsonParser (/apollo/node_modules/body-parser/lib/types/json.js:135:5)
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at Layer.handle [as handle_request] (/apollo/node_modules/express/lib/router/layer.js:95:5)
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at trim_prefix (/apollo/node_modules/express/lib/router/index.js:317:13)
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at /apollo/node_modules/express/lib/router/index.js:284:7
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at Function.process_params (/apollo/node_modules/express/lib/router/index.js:335:12)
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at next (/apollo/node_modules/express/lib/router/index.js:275:10)
Feb 08 23:25:29 ip-10-3-14-187 prefect[7845]: apollo_1    |     at cors (/apollo/node_modules/cors/lib/index.js:188:7)
```
GraphQL is erroring with a payload that is too large, which is confusing us as to what would be posting something that big to the server.
**Chris White:**
Ah, very interesting - so there are two things it might be:
• really long logs (we’ve seen this with `ShellTask`s that log their stdout)
• a really large state object - this is usually caused either by a state message being large (unlikely) or a state result being large (more likely if you’re using a `PrefectResult` instead of a `LocalResult` - see the sketch below)

The limit in Server is currently configured to 1 MB: https://github.com/PrefectHQ/server/blob/master/services/apollo/src/index.js#L103
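For context, switching a task from `PrefectResult` (which puts the result value in the state sent to the API) to `LocalResult` (which writes the value to disk and only sends metadata) looks roughly like this in Prefect 1.x; the task name and directory are illustrative:

```python
from prefect import task
from prefect.engine.results import LocalResult

# LocalResult persists the return value to local disk, so only result
# metadata travels to the server API and GraphQL payloads stay small.
@task(result=LocalResult(dir="/opt/prefect/results"))
def transform(data):
    return data
```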
**Jeff:**
The problem with the task stall lies somewhere in the results writing. The task was one of a series of small tasks passing large datasets between them. When I turn off checkpointing for them, it works fine. The disk isn’t full and writing is allowed, so I'm not sure what's going on - do you have an idea?
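For reference, the workaround described here is set per task in Prefect 1.x; a minimal sketch with an illustrative task name:

```python
from prefect import task

# With checkpoint=False the return value is handed to downstream tasks in
# memory and is never written through the configured Result.
@task(checkpoint=False)
def pass_large_dataset(data):
    return data
```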
**Chris White:**
Whoa, that is pretty interesting and surprising to me (unless the stalling is happening while sending the result metadata to the server API?); would you mind opening a GitHub issue describing the situation, including the OS and agent details? I think you’ll be OK in the short term disabling checkpointing, but this sounds like something we can get to the bottom of and rectify with a reproducible example or sufficient detail.
**Jeff:**
Will do for sure.