Andrew Druchenko
02/25/2021, 5:47 PMAndrew Druchenko
02/25/2021, 5:47 PMFailed to set task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='10.0.1.160', port=4200): Read timed out. (read timeout=30)"))
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/local/lib/python3.7/http/client.py", line 1369, in getresponse
response.begin()
File "/usr/local/lib/python3.7/http/client.py", line 310, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.7/http/client.py", line 271, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 400, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 423, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 331, in _raise_timeout
self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.0.1.160', port=4200): Read timed out. (read timeout=30)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/task_runner.py", line 128, in call_runner_target_handlers
cache_for=self.task.cache_for,
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 1371, in set_task_run_state
version=version,
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 281, in graphql
retry_on_api_error=retry_on_api_error,
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 237, in post
retry_on_api_error=retry_on_api_error,
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 401, in _request
session=session, method=method, url=url, params=params, headers=headers
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 319, in _send_request
response = <http://session.post|session.post>(url, headers=headers, json=params, timeout=30)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='10.0.1.160', port=4200): Read timed out. (read timeout=30)
Zanie
Zanie
Andrew Druchenko
02/25/2021, 6:01 PMToo many open files
kind of errors. Once I’ve raised ulimit -a
it disappeared, but it seems that it’s the same kind of problem. I tried to get some insights by looking at apollo logs, but they seem to be almost empty. I was thinking about running apollo in some sort of verbose/debug mode, is there any command-line switch for this?Andrew Druchenko
02/25/2021, 6:03 PM> Are you running flows with a lot of mapping?There is definitely some mapping, but I’d say it’s not huge.
Zanie
Zanie
Andrew Druchenko
02/26/2021, 4:48 PMError getting flow run info
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
raise err
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 65, in create_connection
sock = socket.socket(af, socktype, proto)
File "/usr/local/lib/python3.7/socket.py", line 151, in __init__
_socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 24] Too many open files
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
File "/usr/local/lib/python3.7/http/client.py", line 1277, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1323, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1272, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.7/http/client.py", line 1032, in _send_output
self.send(msg)
File "/usr/local/lib/python3.7/http/client.py", line 972, in send
self.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f9ffdef08d0>: Failed to establish a new connection: [Errno 24] Too many open files
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 436, in increment
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.0.0.199', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9ffdef08d0>: Failed to establish a new connection: [Errno 24] Too many open files'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/flow_runner.py", line 179, in interrupt_if_cancelling
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 1001, in get_flow_run_info
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 281, in graphql
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 237, in post
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 401, in _request
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 319, in _send_request
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 581, in post
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.0.0.199', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9ffdef08d0>: Failed to establish a new connection: [Errno 24] Too many open files'))
Andrew Druchenko
02/26/2021, 4:48 PMapollo
service, and I’m back again to too many open files
Andrew Druchenko
02/26/2021, 4:48 PMroot@432c20687fcb:/apollo# ulimit -n
1048576
I am now thinking that this is caused by underlying services apollo depends on, like hasura? Or is this related to ulimits inside Fargate container managed by AWS?Andrew Druchenko
02/26/2021, 4:50 PMFailed to set task state with error: ClientError([{'message': 'State update failed for task run ID 22409b10-9ef0-443e-9394-e2fc6d95b4a3: provided a running state but associated flow run 93d71053-7afc-4bdb-98c1-fdc727417ee1 is not in a running state.', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'State update failed for task run ID 22409b10-9ef0-443e-9394-e2fc6d95b4a3: provided a running state but associated flow run 93d71053-7afc-4bdb-98c1-fdc727417ee1 is not in a running state.'}}}])
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/task_runner.py", line 128, in call_runner_target_handlers
cache_for=self.task.cache_for,
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 1371, in set_task_run_state
version=version,
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 294, in graphql
raise ClientError(result["errors"])
prefect.utilities.exceptions.ClientError: [{'message': 'State update failed for task run ID 22409b10-9ef0-443e-9394-e2fc6d95b4a3: provided a running state but associated flow run 93d71053-7afc-4bdb-98c1-fdc727417ee1 is not in a running state.', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'State update failed for task run ID 22409b10-9ef0-443e-9394-e2fc6d95b4a3: provided a running state but associated flow run 93d71053-7afc-4bdb-98c1-fdc727417ee1 is not in a running state.'}}}]
I guess this is just dependent GraphQL-related stuff fails due to the main OSErrorZanie
Andrew Druchenko
02/26/2021, 4:59 PMcommand: graphql-engine serve
)
• prefecthq/server:core-0.13.10 (command: bash -c "python src/prefect_server/services/graphql/server.py"
)
• prefecthq/apollo:core-0.13.10 (npm run serve
)
• prefecthq/ui:core-0.13.10
• prefecthq/server:core-0.13.10 (command: "python src/prefect_server/services/towel/__main__.py"
)
• prefect-fargate-agent (command: prefect agent start fargate )Andrew Druchenko
02/26/2021, 4:59 PMdocker-compose
Andrew Druchenko
02/26/2021, 5:01 PMUnexpected error: OSError("Timed out trying to connect to '<tcp://127.0.0.1:45401>' after 120 s: Timed out trying to connect to '<tcp://127.0.0.1:45401>' after 120 s: connect() didn't finish in time")
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/prefect/engine/executors/dask.py", line 267, in start
yield
File "/usr/local/lib/python3.7/site-packages/prefect/engine/flow_runner.py", line 603, in get_flow_run_state
for t in final_tasks
File "/usr/local/lib/python3.7/site-packages/prefect/engine/executors/dask.py", line 451, in wait
return self.client.gather(futures)
File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1967, in gather
asynchronous=asynchronous,
File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 816, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 347, in sync
raise exc.with_traceback(tb)
File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 331, in f
result[0] = yield future
File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1826, in _gather
raise exception.with_traceback(traceback)
File "/usr/local/lib/python3.7/site-packages/prefect/engine/flow_runner.py", line 725, in run_task
flow_result=flow_result,
File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/task_runner.py", line 52, in __init__
self.client = Client()
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 124, in __init__
tenant_info = self.graphql({"query": {"tenant": {"id"}}})
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 281, in graphql
retry_on_api_error=retry_on_api_error,
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 237, in post
retry_on_api_error=retry_on_api_error,
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 401, in _request
session=session, method=method, url=url, params=params, headers=headers
File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 319, in _send_request
response = <http://session.post|session.post>(url, headers=headers, json=params, timeout=30)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: None: None
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/prefect/engine/runner.py", line 48, in inner
new_state = method(self, state, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/prefect/engine/flow_runner.py", line 618, in get_flow_run_state
assert isinstance(final_states, dict)
File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.7/site-packages/prefect/engine/executors/dask.py", line 269, in start
self._post_start_yield()
File "/usr/local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 409, in __exit__
super().__exit__(typ, value, traceback)
File "/usr/local/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 364, in __exit__
return self.sync(self.__aexit__, typ, value, traceback)
File "/usr/local/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 163, in sync
return sync(self.loop, func, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 347, in sync
raise exc.with_traceback(tb)
File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 331, in f
result[0] = yield future
File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/usr/local/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 371, in __aexit__
await self.close()
File "/usr/local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 381, in _close
await self._correct_state()
File "/usr/local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 308, in _correct_state_internal
await self.scheduler_comm.retire_workers(workers=list(to_close))
File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 677, in send_recv_from_rpc
comm = await self.live_comm()
File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 639, in live_comm
**self.connection_args,
File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 243, in connect
_raise(error)
File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 213, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to '<tcp://127.0.0.1:45401>' after 120 s: Timed out trying to connect to '<tcp://127.0.0.1:45401>' after 120 s: connect() didn't finish in time
Andrew Druchenko
02/26/2021, 5:01 PMAndrew Druchenko
02/26/2021, 5:13 PMZanie
Zanie
Zanie
Andrew Druchenko
03/15/2021, 10:56 PM0.14.9
in the hopes of resolving above mentioned issues, but it didn’t help.
This is how Prefect fargate is run
prefect-fargate-agent:
image: <http://registry.gitlab.com/mycomp/dev/analytical-pipeline:1.4.8|registry.gitlab.com/mycomp/dev/analytical-pipeline:1.4.8>
restart: always
depends_on:
- apollo
environment:
- PREFECT__BACKEND=server
- PREFECT__SERVER__HOST=<http://10.0.0.199>
- PREFECT__LOGGING__LEVEL=DEBUG
- PREFECT__FLOWS__CHECKPOINTING=true
- PREFECT__CONTEXT__SECRETS__SLACK_WEBHOOK_URL=<https://hooks.slack.com/services/>
- AWS_S3_BUCKET=prod-flow-results
- AWS_ACCESS_KEY_ID=AKIA
- AWS_SECRET_ACCESS_KEY=
- REGION_NAME=eu-central-1
- DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=120
- containerDefinitions_ulimits=[{"softLimit":"2048","hardLimit":"8192","name":"nofile"}]
- containerDefinitions_logConfiguration={"logDriver":"awslogs","options":{"awslogs-group":"prod/prefect","awslogs-region":"eu-central-1","awslogs-stream-prefix":"flow-runs"}}
- containerDefinitions_environment=[{"name":"MYVAR","value":"TRY1"}]
command: >
prefect agent start fargate
--name prod-layout-fargate-agent
--show-flow-logs
--label fargate
memory=4096
cpu=2048
executionRoleArn=arn:aws:iam::76666666666:role/ecsTaskExecutionRole
cluster=prod-ecs1
networkConfiguration="{'awsvpcConfiguration': {'assignPublicIp': 'DISABLED', 'subnets': ['subnet-000000'], 'securityGroups': []}}"
The weird thing is, it does not seem to create a task definition with correct ulimits
as specified here https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html#fargate-resource-limits
The flow is deployed without any problems, task definition is created with custom environment (meaning containerDefinitions_*
vars are working), but ulimits section is empty no matter whatZanie
Zanie
Zanie
container_definitions_kwarg_list = [
"mountPoints",
"secrets",
"environment",
"logConfiguration",
"repositoryCredentials",
]
Andrew Druchenko
03/15/2021, 11:14 PMAndrew Druchenko
03/15/2021, 11:27 PMsrc/prefect/agent/fargate/agent.py
what was the reason to omit ulimits
? Any other ways to circumvent it except upgrading to newer ECS agent?Zanie
ulimits
was excluded because it wasn't used. You could open a PR to add it (or add it to your own code).