Alex Furrier - 07/14/2021, 8:25 PM
Heartbeat process died with exit code 1 and the few child tasks left incomplete are unsuccessful.
Any ideas how to debug why the parent mapping task is dying?
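
A minimal sketch of one way to surface more detail before the heartbeat dies, assuming Prefect 0.14/0.15 and a Kubernetes agent (not from this thread): raise the flow-run logging level to DEBUG through the run config's environment variables.

from prefect.run_configs import KubernetesRun

# Illustrative only: PREFECT__LOGGING__LEVEL is Prefect's standard env-var
# override for the [logging] level setting; DEBUG output includes heartbeat
# and API-call detail that can show what the parent mapped task was doing
# before it died.
run_config = KubernetesRun(env={"PREFECT__LOGGING__LEVEL": "DEBUG"})
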
Kevin Kho - 07/14/2021, 8:43 PM

Alex Furrier - 07/14/2021, 8:44 PM

Kevin Kho - 07/14/2021, 8:45 PM

Alex Furrier - 07/14/2021, 8:55 PM
requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql/graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f90cdf864f0>: Failed to establish a new connection: [Errno 111] Connection refused'))
The task eventually fails with
Finished task run for task with final state: 'ClientFailed'
Failed to set task state with error: ConnectionError(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Kevin Kho - 07/15/2021, 1:57 PM

Alex Furrier - 07/15/2021, 5:37 PM
Failed to retrieve task state with error: ClientError([{'message': 'request to http://prefect-graphql.prefect:4201/graphql/ failed, reason: connect ECONNREFUSED 10.0.83.43:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_or_create_task_run_info'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to http://prefect-graphql.prefect:4201/graphql/ failed, reason: connect ECONNREFUSED 10.0.83.43:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}])
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 154, in initialize_run
task_run_info = self.client.get_task_run_info(
File "/opt/conda/lib/python3.8/site-packages/prefect/client/client.py", line 1399, in get_task_run_info
result = self.graphql(mutation) # type: Any
File "/opt/conda/lib/python3.8/site-packages/prefect/client/client.py", line 319, in graphql
raise ClientError(result["errors"])
prefect.utilities.exceptions.ClientError: [{'message': 'request to http://prefect-graphql.prefect:4201/graphql/ failed, reason: connect ECONNREFUSED 10.0.83.43:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_or_create_task_run_info'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to http://prefect-graphql.prefect:4201/graphql/ failed, reason: connect ECONNREFUSED 10.0.83.43:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
And this is the second part, once it's hit max retries:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/opt/conda/lib/python3.8/site-packages/urllib3/util/retry.py", line 446, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql/graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa4041c3f10>: Failed to establish a new connection: [Errno 111] Connection refused'))
So the root issue seems to be whatever is causing that connection refusal between the mapped task and the GraphQL endpoint.
Any ideas on where to look, or whether increasing resources might alleviate that issue?

Kevin Kho - 07/16/2021, 4:32 PM

nicholas - 07/16/2021, 4:44 PM

Alex Furrier - 07/16/2021, 4:47 PM
I tried LocalResult() to serialize to disk and release memory, but that didn't appear to work.
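
For reference, attaching a LocalResult to a mapped task usually looks roughly like the sketch below; the task name and result directory are placeholders, not from this thread. Checkpointing persists each child's return value to disk, but the in-memory copy can still be held by the Dask worker until its future is released.

from prefect import Flow, task
from prefect.engine.results import LocalResult

# Hypothetical mapped task: checkpoint=True plus a Result backend writes each
# child's return value to disk as soon as that child succeeds.
@task(checkpoint=True, result=LocalResult(dir="/tmp/prefect-results"))
def process_record(record):
    ...  # placeholder for the real work

with Flow("example-map") as flow:
    processed = process_record.map(list(range(100)))
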
nicholas - 07/16/2021, 4:50 PM

Alex Furrier - 07/16/2021, 4:51 PM
DaskExecutor()
For this flow I've set that to 12GB with:
import prefect
from prefect import Flow
from prefect.executors import DaskExecutor
from prefect.run_configs import KubernetesRun
from prefect.storage import Docker
from dask_kubernetes import KubeCluster, make_pod_spec

with Flow(
    name="my OOM Flow",
    storage=Docker(
        base_image="container-registry.io/oom-flow:dev",
        registry_url="container-registry.io",
        image_name="my-oom-flow",
        image_tag="latest",
    ),
    executor=DaskExecutor(
        cluster_class=lambda: KubeCluster(
            make_pod_spec(
                image=prefect.context.image,
                memory_limit="12G",
                memory_request="4G",
            ),
            namespace="prefect",
        ),
        adapt_kwargs={"minimum": 2, "maximum": 25},
    ),
    run_config=KubernetesRun(),
    # result=PrefectResult()
) as my_oom_flow:
    ...  # task definitions omitted

When this executes I can view resource consumption on the Dask dashboard. Here it's showing memory consumption across all workers above the 12GB limit.

nicholas - 07/16/2021, 5:06 PM

Alex Furrier - 07/16/2021, 5:10 PM
AzureResult or S3Result?
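
For reference, either of those result types can be set flow-wide (or per task) in roughly this way; the bucket/container names are placeholders, and credentials are assumed to come from the environment or Prefect secrets, depending on the setup.

from prefect import Flow
from prefect.engine.results import S3Result  # AzureResult(container=...) is the Azure Blob equivalent

# Placeholder bucket name; with a flow-level result, every task's return value
# is checkpointed to object storage instead of the local filesystem.
with Flow("example-flow", result=S3Result(bucket="my-results-bucket")) as flow:
    ...
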
nicholas - 07/16/2021, 5:15 PM

Alex Furrier - 07/19/2021, 5:13 PM
Switched to AzureResult for all tasks, but it didn't seem to change much. If anything it seems to have made it worse, as the serialization + deserialization to Azure Blob consumed more memory that wasn't released.
It seems like the main issue is a Dask one. When running a mapping task with a large number of child tasks, the memory consumed by a child task is never released after it reaches a Success state. I may be wrong, but what I think should happen is: a child task reaches a Success state, serializes its result, and releases its memory; once all child tasks have completed, the serialized results are aggregated by the parent task. Is that correct?
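
One mitigation worth noting here, purely as a sketch (task names and paths below are hypothetical): have each mapped child persist its own large output and return only a lightweight reference, so neither the Dask workers nor the downstream reduce step need to hold every full payload in memory at once.

from prefect import Flow, task

@task
def process_and_store(record_id):
    # Hypothetical: do the heavy work, write the big result to external
    # storage yourself, and return only a small reference to it.
    return f"blob://results/{record_id}"

@task
def summarize(paths):
    # The reduce step aggregates references, not full payloads.
    return len(paths)

with Flow("map-with-small-returns") as flow:
    paths = process_and_store.map(list(range(1000)))
    summarize(paths)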