c
Hi all, I am currently getting a `urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=4200): Read timed out. (read timeout=15)` on my EC2 Prefect Server setup. I stumbled upon this: https://github.com/PrefectHQ/prefect/blob/master/src/prefect/config.toml#L64 although it seems to be a Cloud configuration. Can someone please point me in the right direction?
a
Yes, sure we can! Here is how you can increase the read timeout on Prefect Server:
• Is it possible to increase the GraphQL API request timeout?
• How to check the GraphQL query timeout settings on Server? How to increase that timeout value?
Although it seems to be a Cloud configuration.
I understand your confusion. Even though the variable is called `PREFECT__CLOUD__REQUEST_TIMEOUT`, it's not just for Cloud; it's also valid for Server. Your error may also come from the telemetry requests, which you can disable this way:
export PREFECT__SERVER__TELEMETRY__ENABLED=False
more on that here
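If you want to bump the timeout itself as well, a minimal sketch of the environment setup could look like this (the 60-second value is just an illustration, not a recommendation):
export PREFECT__CLOUD__REQUEST_TIMEOUT=60   # applies to Server too, despite the CLOUD prefix
export PREFECT__SERVER__TELEMETRY__ENABLED=False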
btw are you just getting started with Prefect and setting up Server? if so, you may start directly with Prefect 2.0
c
Hi @Anna Geller, I am sorry about the delay in the reply. I have been working with Prefect for a while, but heavily over the last few weeks. Your suggestions seem to have worked partially: I don't get the errors anymore, but my flow runs get stuck in Submitted if I start too many at the same time. I have already applied this with a few tweaks, but the problem remains. I am running Prefect on AWS EC2 with GitHub storage and DockerRun. My typical use case is fast flow runs (up to 1 min), but I usually have a lot of them running at the same time.
a
thanks for explaining. Regarding flow runs stuck in a Submitted state, this is an issue with the execution layer, not the orchestration backend. Check this page for more details
c
@Anna Geller The only reason I am still in doubt is whether the execution layer really lacks capacity. In my case, that would mean an AWS EC2 instance upgrade (if I am right?). I upgraded over the last weekend, and it seemed like the number of runs I could start before hitting the same problem increased. Still, I am questioning whether that is my only option.
a
hard to say, perhaps checking the resource utilization on your EC2 instance could help? otherwise, it could be that you are spinning up too many runs at once and your Server instance can't handle that. Scaling Server is not easy, that's why many users who need that scale opt for Prefect Cloud. This will be easier to tackle in Prefect 2.0 - have you considered migrating to Orion and investigating if it scales better in your use case?
c
Unfortunately, migration to 2.0 is not an option for now.
I am now facing this error:
UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
I have tried to set the variables on the Agent with `-e`, but it seems to have no effect. Am I doing something wrong?
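Roughly what I tried when starting the agent (the label and timeout value here are just placeholders):
prefect agent docker start \
    --label my-ec2-agent \
    -e PREFECT__CLOUD__REQUEST_TIMEOUT=60 \
    -e PREFECT__SERVER__TELEMETRY__ENABLED=False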
a
Check this thread explaining the timeout settings in Server
c
Hmm, thanks for the quick response. I thought that would only work for the GraphQL API?
a
That's true. Can you provide more info? You kind of just copy-pasted your `TimeoutError` without providing any context, and I did the same with my answer 😄 If you want to dive deeper:
1. When do you get this error exactly?
2. Is this an issue related to Prefect or to your infrastructure? (we generally can't provide much help with infrastructure issues since it's hard to troubleshoot remotely via Slack, but I can still try to help)
3. How often does this error occur?
4. When did you first see this error, and how did you find out about it - Docker Compose logs?
5. Did you try restarting your Docker service as shown in the SO issue?
6. Did you try restarting your Server?
c
I get this error message every time I run more than 25 flow runs. I have tried to restart both the Server and Docker.
k
I am confused why the port is None here. Do you have more logs? That’s a really weird issue. Are the 25 flow runs concurrent?
c
The flow runs shouldn’t be concurrent. I have applied this with a few tweaks. Logs follow:
Exception encountered while deploying flow run 705cf945-26ce-45dc-831e-c65374acec38
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 277, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/util/retry.py", line 532, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/packages/six.py", line 770, in reraise
raise value
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/prefect/agent/agent.py", line 391, in _deploy_flow_run
deployment_info = self.deploy_flow(flow_run)
File "/home/ubuntu/.local/lib/python3.8/site-packages/prefect/agent/docker/agent.py", line 482, in deploy_flow
container = self.docker_client.create_container(
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/api/container.py", line 428, in create_container
return self.create_container_from_config(config, name)
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/api/container.py", line 438, in create_container_from_config
res = self._post_json(u, data=config, params=params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/api/client.py", line 296, in _post_json
return self._post(url, data=json.dumps(data2), **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/api/client.py", line 233, in _post
return self.post(url, **self._set_request_timeout(kwargs))
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/sessions.py", line 590, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
k
This looks like your Docker daemon dying? After this error happens, can you run docker stuff like `docker run` or `docker pull`?
c
Yes, I can.
a
you may also want to increase your Docker CPU and memory allocation, then set those two variables:
export DOCKER_CLIENT_TIMEOUT=120
export COMPOSE_HTTP_TIMEOUT=120
then restart the Docker service and then restart Prefect Server; the order might be important
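A minimal sketch of that sequence, assuming a systemd-managed Docker daemon and a 1.x `prefect server` CLI that supports `stop`:
export DOCKER_CLIENT_TIMEOUT=120
export COMPOSE_HTTP_TIMEOUT=120
sudo systemctl restart docker   # restart the Docker daemon first
prefect server stop             # stop the Server containers
prefect server start            # then bring Server back up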
c
I have already tried it and it didn’t work
a
In that case, maybe it's worth recreating the whole environment? If nothing else works... What Prefect and docker-compose versions are you on?
c
Both are on the latest version
I have no idea why, but the issue seems to be in the concurrent handler. If I turn it off, everything goes smoothly.
from datetime import datetime as dt, timedelta
from typing import Optional

import prefect
import pytz
from prefect import Flow
from prefect.client import Client
from prefect.engine.state import Cancelled, Scheduled, State


def concurrent_handler(flow: Flow, old_state: State, new_state: State) -> Optional[State]:
    if old_state.is_pending() and new_state.is_running():
        client = Client()
        # Replacing microseconds because the GraphQL API can't always handle the number of decimals
        now = dt.now(tz=pytz.UTC).replace(microsecond=0) + timedelta(seconds=1)
        result = client.graphql(
            """{
              flow(where: {
                archived: {_eq: false},
                name: {_eq: "%s"}
              }) {
                name
                archived
                flow_runs (where: {
                  state: {_in: ["Submitted", "Queued", "Scheduled", "Retrying", "Running"]},
                  scheduled_start_time: {_lte: "%s"}
                }) {
                  scheduled_start_time
                  start_time
                  created
                  name
                  state
                  id
                }
              }
            }"""
            % (flow.name, now.isoformat())  # Sorry for the % operator, but those {} make it a pain
        )
        # These flow runs will be everything that's scheduled to start in the past and
        # might have built up.
        logger = prefect.context.get("logger")
        # This might fail if GraphQL can't find anything, but I haven't seen this in practice
        flow_runs = result["data"]["flow"][0]["flow_runs"]
        # I don't want to run another task if there's already more than one flow running.
        # For me, I'm happy to have two running at once, as API issues mean we can get timeouts and
        # hangs that don't terminate easily. For other use cases, I'd generally say to cancel if
        # there's any running.
        num_running = sum(1 if f["state"] in ("Running", "Retrying") else 0 for f in flow_runs)
        if num_running > 1:
            msg = "Existing tasks are already running"
            logger.info(msg)
            return Scheduled(
                msg,
                start_time=dt.strptime(now.isoformat(), "%Y-%m-%dT%H:%M:%S%z") + timedelta(seconds=15),
            )
        # And if there are multiple scheduled, only the latest one should be run
        scheduled = [
            f for f in flow_runs if f["state"] in ("Pending", "Scheduled", "Queued", "Submitted")
        ]
        if len(scheduled) > 1:
            last_scheduled_time = max(
                dt.strptime(f["scheduled_start_time"], "%Y-%m-%dT%H:%M:%S.%f%z") for f in scheduled
            )
            logger.info(scheduled)
            this_flow_run_id = prefect.context.get("flow_run_id")
            matching_runs = [f for f in scheduled if f["id"] == this_flow_run_id]
            if not matching_runs:
                logger.info(f"Current id is {this_flow_run_id}")
                logger.info(f"Flow runs are: {scheduled}")
                return Cancelled("Nope")
            this_run = matching_runs[0]
            this_run_time = dt.strptime(this_run["scheduled_start_time"], "%Y-%m-%dT%H:%M:%S.%f%z")
            if this_run_time != last_scheduled_time:
                msg = "Multiple scheduled tasks, this is not the last one"
                logger.info(msg)
                return Scheduled(msg, start_time=this_run_time + timedelta(seconds=10))
    return new_state
k
Could you move the code off the main channel? We’ll still see the message
Maybe that GraphQL call is putting a heavy load on the API? Could you try with just the query?
c
I am sorry, just removed it. What do you mean by just the query?
k
The `client.graphql()` portion. I think it's overloading the API.
c
What should I replace it with?
k
No, not replace. Just try removing everything but the query. My guess is that the query is overloading the API, so if you remove everything else in the state handler but leave the API call and you still have the same issue, you will know that you need to bump up API resources to handle this query. It might be that you just have a lot of data being pulled when you do this query, which overloads the API.
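Something like this stripped-down handler could work for the test; `QUERY` is a placeholder for the exact query string from your concurrent_handler (it expects the same two %s values):
from datetime import datetime as dt, timedelta
from typing import Optional

import pytz
from prefect import Flow
from prefect.client import Client
from prefect.engine.state import State

# Placeholder: paste the exact GraphQL query string from concurrent_handler here
QUERY = "..."


def query_only_handler(flow: Flow, old_state: State, new_state: State) -> Optional[State]:
    # Keep only the GraphQL call, to check whether the query alone reproduces the timeouts
    if old_state.is_pending() and new_state.is_running():
        now = dt.now(tz=pytz.UTC).replace(microsecond=0) + timedelta(seconds=1)
        Client().graphql(QUERY % (flow.name, now.isoformat()))
    return new_state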
c
It works just with the query
k
I think this has to do with multiple Docker containers starting up at the same time then? When you do `timedelta(seconds=10)`, how many of these are concurrently running (even if just the state handler)? Like, how many containers are up at the same time? It really seems like the Docker daemon is just failing to respond. If bumping up the resources doesn't work, I'm not sure what we can do other than space out the flows a bit more.
c
They are all up. All the containers spin up, and the flow runs stay in Submitted for a while before the runs start to be processed.
k
I think the Submitted for a while is a sign Docker is really struggling to process the volume of requests
Do you have a very large layer in the image like they suggest here?
Uhh, some people just suggest restarting Docker, but you did that, right?
c
But the same thing should happen with just the query, right?
k
When you just did the query, what was your return? Did you do the `Scheduled` also?
c
No, I have removed `Scheduled`.
k
I'm sure you know, but I'm very confused right now as well 😅. I think, though, that the `Scheduled` is just causing a backlog of container spin-ups (10 seconds seems very short), so the Docker daemon requests accumulate and then it struggles. I am positive it's not the API container struggling now, since the state handler with just the GraphQL query worked. Nothing else in the state handler is computationally intensive, so I think it has to do with the `Scheduled`.
c
Thanks for the help @Anna Geller @Kevin Kho! I have solved my problem by changing the Agent and Run Config to ECS (with Fargate).
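For anyone landing here later, a minimal sketch of the setup I switched to (the flow name, repo, path, image, and CPU/memory sizes are all placeholders):
from prefect import Flow
from prefect.run_configs import ECSRun
from prefect.storage import GitHub

with Flow("my-flow") as flow:  # placeholder flow name
    ...

flow.storage = GitHub(repo="my-org/my-flows", path="flows/my_flow.py")  # placeholder repo and path
flow.run_config = ECSRun(
    image="my-registry/my-flow:latest",  # placeholder image containing the flow's dependencies
    cpu="512",
    memory="1024",
)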
a
nice work! thanks for reporting back about that