# prefect-server
a
Hi, we're running our own Prefect (0.13.10) deployment on AWS using the Fargate agent. Most of the services run on a single EC2 node (t2.medium) via docker-compose. From time to time, some flows randomly fail because they can't connect to the apollo service (timeout). At this point I'm not sure which direction to look in next; any guidance/help would be appreciated.
Stacktrace
Copy code
Failed to set task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='10.0.1.160', port=4200): Read timed out. (read timeout=30)"))
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/http/client.py", line 1369, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 271, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 423, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 331, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.0.1.160', port=4200): Read timed out. (read timeout=30)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/task_runner.py", line 128, in call_runner_target_handlers
    cache_for=self.task.cache_for,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 1371, in set_task_run_state
    version=version,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 281, in graphql
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 237, in post
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 401, in _request
    session=session, method=method, url=url, params=params, headers=headers
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 319, in _send_request
    response = session.post(url, headers=headers, json=params, timeout=30)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 581, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='10.0.1.160', port=4200): Read timed out. (read timeout=30)
z
Hmm, 30 seconds is already quite a long timeout. Perhaps try adjusting it per https://github.com/PrefectHQ/prefect/issues/4118
Worth noting that this is a 'Read' timeout, not a 'Connection' timeout, so it is successfully connecting but the apollo server isn't finishing the request fast enough. You may want to allocate more resources to your server. Are you running flows with a lot of mapping?
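If you want to sanity-check whether your installed version exposes a knob for that yet, something like this would show it (the config key name here is an assumption taken from that issue, so verify it against your version):
Copy code
# Rough check for a configurable client timeout in the installed Prefect version.
# The key name "cloud.request_timeout" is an assumption based on the linked issue;
# confirm it against your version's config before relying on it.
import prefect

print(prefect.__version__)
print(prefect.config.cloud.get("request_timeout", "<no such key in this version>"))

# If the key exists, it can be overridden like any other Prefect setting, e.g. by
# exporting PREFECT__CLOUD__REQUEST_TIMEOUT=60 in the flow-run environment.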
a
Thanks for the quick reply. Before that error there were a lot of "Too many open files" kind of errors. Once I raised the open-files limit (ulimit -n), they disappeared, but it seems it's the same kind of problem. I tried to get some insights by looking at the apollo logs, but they seem to be almost empty. I was thinking about running apollo in some sort of verbose/debug mode; is there any command-line switch for this?
> Are you running flows with a lot of mapping?
There is definitely some mapping, but I’d say it’s not huge.
z
I don't know about flags off the top of my head, I'll ask our devops team.
Perhaps throwing some CloudWatch monitors on your node for CPU/memory consumption would help narrow things down?
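Something like this on the node itself can also show which container/process is actually holding the file descriptors (rough sketch, Linux-only since it walks /proc, best run as root):
Copy code
# Count open file descriptors per process on the host to spot which service
# (apollo, hasura, postgres, the agent, ...) is accumulating sockets/files.
import os

counts = []
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        n_fds = len(os.listdir(f"/proc/{pid}/fd"))
        with open(f"/proc/{pid}/comm") as f:
            name = f.read().strip()
    except (PermissionError, FileNotFoundError):
        continue  # process exited or isn't ours to inspect
    counts.append((n_fds, pid, name))

for n_fds, pid, name in sorted(counts, reverse=True)[:15]:
    print(f"{n_fds:>6}  pid={pid:<7} {name}")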
a
Copy code
Error getting flow run info
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 65, in create_connection
    sock = socket.socket(af, socktype, proto)
  File "/usr/local/lib/python3.7/socket.py", line 151, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
  File "/usr/local/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f9ffdef08d0>: Failed to establish a new connection: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 436, in increment
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.0.0.199', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9ffdef08d0>: Failed to establish a new connection: [Errno 24] Too many open files'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/flow_runner.py", line 179, in interrupt_if_cancelling
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 1001, in get_flow_run_info
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 281, in graphql
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 237, in post
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 401, in _request
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 319, in _send_request
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 581, in post
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.0.0.199', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9ffdef08d0>: Failed to establish a new connection: [Errno 24] Too many open files'))
Upgraded the node running the apollo service, and I'm back again to "too many open files", even though ulimits are maxed out:
Copy code
root@432c20687fcb:/apollo# ulimit -n 
1048576
I'm now thinking this is caused by one of the underlying services apollo depends on, like hasura? Or is it related to ulimits inside the Fargate container managed by AWS?
These OSErrors are followed by another set of errors like this:
Copy code
Failed to set task state with error: ClientError([{'message': 'State update failed for task run ID 22409b10-9ef0-443e-9394-e2fc6d95b4a3: provided a running state but associated flow run 93d71053-7afc-4bdb-98c1-fdc727417ee1 is not in a running state.', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'State update failed for task run ID 22409b10-9ef0-443e-9394-e2fc6d95b4a3: provided a running state but associated flow run 93d71053-7afc-4bdb-98c1-fdc727417ee1 is not in a running state.'}}}])
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/task_runner.py", line 128, in call_runner_target_handlers
    cache_for=self.task.cache_for,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 1371, in set_task_run_state
    version=version,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 294, in graphql
    raise ClientError(result["errors"])
prefect.utilities.exceptions.ClientError: [{'message': 'State update failed for task run ID 22409b10-9ef0-443e-9394-e2fc6d95b4a3: provided a running state but associated flow run 93d71053-7afc-4bdb-98c1-fdc727417ee1 is not in a running state.', 'locations': [{'line': 2, 'column': 5}], 'path': ['set_task_run_states'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'State update failed for task run ID 22409b10-9ef0-443e-9394-e2fc6d95b4a3: provided a running state but associated flow run 93d71053-7afc-4bdb-98c1-fdc727417ee1 is not in a running state.'}}}]
I guess this is just the dependent GraphQL-related stuff failing due to the main OSError.
z
Just to clarify, these errors are coming from your agent? Is the agent running on the same machine as the server?
a
yes, it's all one host with all the services, including:
• hasura/graphql-engine:v1.3.0 (command: graphql-engine serve)
• prefecthq/server:core-0.13.10 (command: bash -c "python src/prefect_server/services/graphql/server.py")
• prefecthq/apollo:core-0.13.10 (command: npm run serve)
• prefecthq/ui:core-0.13.10
• prefecthq/server:core-0.13.10 (command: "python src/prefect_server/services/towel/__main__.py")
• prefect-fargate-agent (command: prefect agent start fargate)
all managed by docker-compose
Copy code
Unexpected error: OSError("Timed out trying to connect to 'tcp://127.0.0.1:45401' after 120 s: Timed out trying to connect to 'tcp://127.0.0.1:45401' after 120 s: connect() didn't finish in time")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/executors/dask.py", line 267, in start
    yield
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/flow_runner.py", line 603, in get_flow_run_state
    for t in final_tasks
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/executors/dask.py", line 451, in wait
    return self.client.gather(futures)
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1967, in gather
    asynchronous=asynchronous,
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 816, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 347, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 331, in f
    result[0] = yield future
  File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1826, in _gather
    raise exception.with_traceback(traceback)
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/flow_runner.py", line 725, in run_task
    flow_result=flow_result,
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/cloud/task_runner.py", line 52, in __init__
    self.client = Client()
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 124, in __init__
    tenant_info = self.graphql({"query": {"tenant": {"id"}}})
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 281, in graphql
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 237, in post
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 401, in _request
    session=session, method=method, url=url, params=params, headers=headers
  File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 319, in _send_request
    response = session.post(url, headers=headers, json=params, timeout=30)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 581, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: None: None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/flow_runner.py", line 618, in get_flow_run_state
    assert isinstance(final_states, dict)
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/prefect/engine/executors/dask.py", line 269, in start
    self._post_start_yield()
  File "/usr/local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 409, in __exit__
    super().__exit__(typ, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 364, in __exit__
    return self.sync(self.__aexit__, typ, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 163, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 347, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 331, in f
    result[0] = yield future
  File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/usr/local/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 371, in __aexit__
    await self.close()
  File "/usr/local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 381, in _close
    await self._correct_state()
  File "/usr/local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 308, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 677, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 639, in live_comm
    **self.connection_args,
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 243, in connect
    _raise(error)
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 213, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:45401' after 120 s: Timed out trying to connect to 'tcp://127.0.0.1:45401' after 120 s: connect() didn't finish in time
And also random timeouts, but this time from the DaskExecutor.
Another thing I've stumbled upon is the duration of the PostgreSQL queries generated by Prefect. It's spiking at around 8-10 seconds per query according to the slow_log from our RDS. So I'm thinking about upgrading the DB instance as well, since it hosts not only our Prefect DB but also our production DB, which is under heavy load around the same time the agent starts running flows.
z
It seems plausible that the timeouts are related to DB performance.
Can you make sure you're not affected by https://github.com/PrefectHQ/prefect/issues/4114 re the open files?
It seems like you'd have to adjust the ulimit inside the agent's Fargate container; see https://docs.docker.com/compose/compose-file/compose-file-v2/#ulimits
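Once the limit is set, a quick way to confirm what a process actually ends up with (you can run this inside any of the containers, or drop it into a flow to check the Fargate task itself):
Copy code
# Print the effective open-files limit for the current process -- useful for
# verifying that compose/Fargate ulimit settings actually reached the container.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")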
a
@Zanie Following up here. I've upgraded to 0.14.9 in the hopes of resolving the above-mentioned issues, but it didn't help. This is how the Prefect Fargate agent is run:
Copy code
prefect-fargate-agent:
    image: registry.gitlab.com/mycomp/dev/analytical-pipeline:1.4.8
    restart: always
    depends_on:
      - apollo
    environment:
      - PREFECT__BACKEND=server
      - PREFECT__SERVER__HOST=http://10.0.0.199
      - PREFECT__LOGGING__LEVEL=DEBUG
      - PREFECT__FLOWS__CHECKPOINTING=true
      - PREFECT__CONTEXT__SECRETS__SLACK_WEBHOOK_URL=https://hooks.slack.com/services/
      - AWS_S3_BUCKET=prod-flow-results
      - AWS_ACCESS_KEY_ID=AKIA
      - AWS_SECRET_ACCESS_KEY=
      - REGION_NAME=eu-central-1
      - DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=120
      - containerDefinitions_ulimits=[{"softLimit":"2048","hardLimit":"8192","name":"nofile"}]
      - containerDefinitions_logConfiguration={"logDriver":"awslogs","options":{"awslogs-group":"prod/prefect","awslogs-region":"eu-central-1","awslogs-stream-prefix":"flow-runs"}}
      - containerDefinitions_environment=[{"name":"MYVAR","value":"TRY1"}]
    command: >
      prefect agent start fargate
        --name prod-layout-fargate-agent
        --show-flow-logs
        --label fargate
        memory=4096
        cpu=2048
        executionRoleArn=arn:aws:iam::76666666666:role/ecsTaskExecutionRole
        cluster=prod-ecs1
        networkConfiguration="{'awsvpcConfiguration': {'assignPublicIp': 'DISABLED', 'subnets': ['subnet-000000'], 'securityGroups': []}}"
The weird thing is, it does not seem to create a task definition with the correct ulimits as specified here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html#fargate-resource-limits The flow is deployed without any problems and the task definition is created with the custom environment (meaning the containerDefinitions_* vars are working), but the ulimits section is empty no matter what.
z
It looks like you're using the deprecated Fargate agent; I'd switch to the ECS agent and provide a custom task definition.
The list of supported variables is hard-coded for the Fargate agent:
Copy code
container_definitions_kwarg_list = [
    "mountPoints",
    "secrets",
    "environment",
    "logConfiguration",
    "repositoryCredentials",
]
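With the ECS agent you own the task definition, so the ulimits can go straight into it. Roughly, with boto3 (names, ARNs, and the container name are placeholders here; check the ECS agent docs for how to point your runs at a pre-registered definition):
Copy code
# Sketch: register a Fargate-compatible task definition that carries a nofile ulimit.
# Everything below (family, role ARN, image, container name) is a placeholder --
# adapt it to your setup and to what your Prefect run config expects.
import boto3

ecs = boto3.client("ecs", region_name="eu-central-1")
ecs.register_task_definition(
    family="prefect-flow-run",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="2048",
    memory="4096",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "flow",  # assumed container name; verify against your agent/run config
            "image": "registry.gitlab.com/mycomp/dev/analytical-pipeline:1.4.8",
            "ulimits": [
                {"name": "nofile", "softLimit": 65536, "hardLimit": 1048576},
            ],
        }
    ],
)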
a
Hmm, seems I was looking in the wrong place. Isn't this the code where ulimits are parsed and passed via boto to ECS: https://github.com/PrefectHQ/prefect/blob/0.14.9/src/prefect/environments/execution/fargate/fargate_task.py#L282 ?
Okay, found the corresponding code in src/prefect/agent/fargate/agent.py. What was the reason to omit ulimits? Is there any other way to work around it besides upgrading to the newer ECS agent?
z
I believe ulimits was excluded because it wasn't used. You could open a PR to add it (or add it to your own code).
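In other words, the change would essentially be extending that whitelist; a sketch against the snippet above (untested, and the surrounding parsing code would need to accept the ulimits structure):
Copy code
# Sketch of the change in src/prefect/agent/fargate/agent.py: extend the hard-coded
# whitelist so a containerDefinitions_ulimits env var is parsed and forwarded to the
# task definition like the other containerDefinitions_* kwargs.
container_definitions_kwarg_list = [
    "mountPoints",
    "secrets",
    "environment",
    "logConfiguration",
    "repositoryCredentials",
    "ulimits",  # added
]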