c
Hi all, I am currently getting a `urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=4200): Read timed out. (read timeout=15)` on my EC2 Prefect Server setup. I stumbled upon this: https://github.com/PrefectHQ/prefect/blob/master/src/prefect/config.toml#L64 although it seems to be a Cloud configuration. Can someone please point me in the right direction?
a
Yes, sure we can! Here is how you can increase the read timeout on Prefect Server:
• Is it possible to increase the GraphQL API request timeout?
• How to check the GraphQL query timeout settings on Server? How to increase that timeout value?
Although it seems to be a Cloud configuration.
I understand your confusion. Even though the variable is called `PREFECT__CLOUD__REQUEST_TIMEOUT`, it's not just for Cloud; it's also valid for Server. Your error may also come from the telemetry requests, which you can disable this way:
export PREFECT__SERVER__TELEMETRY__ENABLED=False
more on that here
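If you want to bump the timeout itself as well, a minimal sketch of the environment setup could look like this (the 60-second value is just an illustration, not a recommendation):
export PREFECT__CLOUD__REQUEST_TIMEOUT=60   # applies to Server too, despite the CLOUD prefix
export PREFECT__SERVER__TELEMETRY__ENABLED=False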
btw are you just getting started with Prefect and setting up Server? if so, you may start directly with Prefect 2.0
c
Hi @Anna Geller, I am sorry about the delay in the reply. I have been working with Prefect for a while, but heavily over the last few weeks. Your suggestions seem to have worked partially: I don't get the errors anymore, but my flow runs get stuck in Submitted if I start too many at the same time. I have already applied this with a few tweaks, but the problem remains. I am running Prefect on AWS EC2 with GitHub storage and DockerRun. My typical use case is fast flow runs (up to 1 min), but I usually have a lot of them running at the same time.
a
thanks for explaining. Regarding flow runs stuck in a Submitted state, this is an issue with the execution layer, not the orchestration backend. Check this page for more details
c
@Anna Geller The only reason I am still in doubt is whether the execution layer really lacks capacity. In my case, that would mean an AWS EC2 instance upgrade (if I am right?). I upgraded over the last weekend, and it seemed like the number of runs I could start before hitting the same problem increased. Still, I am questioning whether that is my only option.
a
hard to say, perhaps checking the resource utilization on your EC2 instance could help? otherwise, it could be that you are spinning up too many runs at once and your Server instance can't handle that. Scaling Server is not easy, that's why many users who need that scale opt for Prefect Cloud. This will be easier to tackle in Prefect 2.0 - have you considered migrating to Orion and investigating if it scales better in your use case?
c
Unfortunately, migration to 2.0 is not an option for now.
I am now facing this error:
UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
I have tried to set the variables on the Agent with `-e`, but it seems to have no effect. Am I doing something wrong?
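Roughly what I tried when starting the agent (the label and timeout value here are just placeholders):
prefect agent docker start \
    --label my-ec2-agent \
    -e PREFECT__CLOUD__REQUEST_TIMEOUT=60 \
    -e PREFECT__SERVER__TELEMETRY__ENABLED=False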
a
Check this thread explaining the timeout settings in Server
c
Hmm, thanks for the quick response. I thought that would only work for the GraphQL API?
a
That's true. Can you provide more info? You kind of just copy-pasted your `TimeoutError` without providing any context, and I did the same with my answer 😄 If you want to dive deeper:
1. When do you get this error exactly?
2. Is this an issue related to Prefect or to your infrastructure? (we generally can't provide much help with infrastructure issues since it's hard to troubleshoot remotely via Slack, but I can still try to help)
3. How often does this error occur?
4. When did you first see this error, and how did you find out about it - Docker Compose logs?
5. Did you try restarting your Docker service as shown in the SO issue?
6. Did you try restarting your Server?
c
I get this error message every time I run more than 25 flow runs. I have tried to restart both the Server and Docker.
k
I am confused why the port is None here. Do you have more logs? That’s a really weird issue. Are the 25 flow runs concurrent?
c
The flow runs shouldn’t be concurrent. I have applied this with a few tweaks. Logs follow:
Exception encountered while deploying flow run 705cf945-26ce-45dc-831e-c65374acec38
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 277, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/util/retry.py", line 532, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/packages/six.py", line 770, in reraise
raise value
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/home/ubuntu/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/prefect/agent/agent.py", line 391, in _deploy_flow_run
deployment_info = self.deploy_flow(flow_run)
File "/home/ubuntu/.local/lib/python3.8/site-packages/prefect/agent/docker/agent.py", line 482, in deploy_flow
container = self.docker_client.create_container(
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/api/container.py", line 428, in create_container
return self.create_container_from_config(config, name)
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/api/container.py", line 438, in create_container_from_config
res = self._post_json(u, data=config, params=params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/api/client.py", line 296, in _post_json
return self._post(url, data=json.dumps(data2), **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/docker/api/client.py", line 233, in _post
return self.post(url, **self._set_request_timeout(kwargs))
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/sessions.py", line 590, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
k
This looks like your Docker daemon dying? After this error happens, can you run docker stuff like `docker run` or `docker pull`?
c
Yes, I can.
a
you may also want to increase your Docker CPU and memory allocation, then set those two variables:
export DOCKER_CLIENT_TIMEOUT=120
export COMPOSE_HTTP_TIMEOUT=120
then restart the Docker service and then restart Prefect Server; the order might be important
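A minimal sketch of that sequence, assuming a systemd-managed Docker daemon and a 1.x `prefect server` CLI that supports `stop`:
export DOCKER_CLIENT_TIMEOUT=120
export COMPOSE_HTTP_TIMEOUT=120
sudo systemctl restart docker   # restart the Docker daemon first
prefect server stop             # stop the Server containers
prefect server start            # then bring Server back up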
c
I have already tried it and it didn’t work
a
In that case, maybe it's worth recreating the whole environment? If nothing else works... What Prefect and docker-compose versions are you on?
c
Both are on the latest version
I have no idea why, but the issue seems to be in the concurrent handler. If I turn it off, everything goes smoothly.
from datetime import datetime as dt, timedelta
from typing import Optional

import prefect
import pytz
from prefect import Flow
from prefect.client import Client
from prefect.engine.state import Cancelled, Scheduled, State


def concurrent_handler(flow: Flow, old_state: State, new_state: State) -> Optional[State]:
    if old_state.is_pending() and new_state.is_running():
        client = Client()
        # Replacing microseconds because the GraphQL API can't always handle the number of decimals
        now = dt.now(tz=pytz.UTC).replace(microsecond=0) + timedelta(seconds=1)
        result = client.graphql(
            """{
              flow(where: {
                archived: {_eq: false},
                name: {_eq: "%s"}
              }) {
                name
                archived
                flow_runs (where: {
                  state: {_in: ["Submitted", "Queued", "Scheduled", "Retrying", "Running"]},
                  scheduled_start_time: {_lte: "%s"}
                }) {
                  scheduled_start_time
                  start_time
                  created
                  name
                  state
                  id
                }
              }
            }"""
            % (flow.name, now.isoformat())  # Sorry for the % operator, but those {} make it a pain
        )
        # These flow runs will be everything that's scheduled to start in the past and
        # might have built up.
        logger = prefect.context.get("logger")
        # This might fail if GraphQL can't find anything, but I haven't seen this in practice
        flow_runs = result["data"]["flow"][0]["flow_runs"]
        # I don't want to run another task if there's already more than one flow running.
        # For me, I'm happy to have two running at once, as API issues mean we can get timeouts and
        # hangs that don't terminate easily. For other use cases, I'd generally say to cancel if
        # there's any running.
        num_running = sum(1 if f["state"] in ("Running", "Retrying") else 0 for f in flow_runs)
        if num_running > 1:
            msg = "Existing tasks are already running"
            logger.info(msg)
            return Scheduled(
                msg,
                start_time=dt.strptime(now.isoformat(), "%Y-%m-%dT%H:%M:%S%z") + timedelta(seconds=15),
            )
        # And if there are multiple scheduled, only the latest one should be run
        scheduled = [
            f for f in flow_runs if f["state"] in ("Pending", "Scheduled", "Queued", "Submitted")
        ]
        if len(scheduled) > 1:
            last_scheduled_time = max(
                dt.strptime(f["scheduled_start_time"], "%Y-%m-%dT%H:%M:%S.%f%z") for f in scheduled
            )
            logger.info(scheduled)
            this_flow_run_id = prefect.context.get("flow_run_id")
            matching_runs = [f for f in scheduled if f["id"] == this_flow_run_id]
            if not matching_runs:
                logger.info(f"Current id is {this_flow_run_id}")
                logger.info(f"Flow runs are: {scheduled}")
                return Cancelled("Nope")
            this_run = matching_runs[0]
            this_run_time = dt.strptime(this_run["scheduled_start_time"], "%Y-%m-%dT%H:%M:%S.%f%z")
            if this_run_time != last_scheduled_time:
                msg = "Multiple scheduled tasks, this is not the last one"
                logger.info(msg)
                return Scheduled(msg, start_time=this_run_time + timedelta(seconds=10))
    return new_state
k
Could you move the code off the main channel? We’ll still see the message
Maybe that GraphQL call is putting a heavy load on the API? Could you try with just the query?
c
I am sorry, just removed it. What do you mean by just the query?
k
The `client.graphql()` portion. I think it's overloading the API.
c
What should I replace it with?
k
No, not replace. Just try removing everything but the query. My guess is that the query is overloading the API, so if you remove everything else in the state handler but leave the API call and you still have the same issue, you will know that you need to bump up API resources to handle this query. It might be that you just have a lot of data being pulled when you do this query, which overloads the API.
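Something like this stripped-down handler could work for the test; `QUERY` is a placeholder for the exact query string from your concurrent_handler (it expects the same two %s values):
from datetime import datetime as dt, timedelta
from typing import Optional

import pytz
from prefect import Flow
from prefect.client import Client
from prefect.engine.state import State

# Placeholder: paste the exact GraphQL query string from concurrent_handler here
QUERY = "..."


def query_only_handler(flow: Flow, old_state: State, new_state: State) -> Optional[State]:
    # Keep only the GraphQL call, to check whether the query alone reproduces the timeouts
    if old_state.is_pending() and new_state.is_running():
        now = dt.now(tz=pytz.UTC).replace(microsecond=0) + timedelta(seconds=1)
        Client().graphql(QUERY % (flow.name, now.isoformat()))
    return new_state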
c
It works just with the query
k
I think this has to do with multiple Docker containers starting up at the same time then? When you do `timedelta(seconds=10)`, how many of these are concurrently running (even if just the state handler)? Like, how many containers are up at the same time? It really seems like the Docker daemon is just failing to respond. If bumping up the resources doesn't work, I'm not sure what we can do other than space out the flows a bit more.
c
They are all up. All the containers spin up, and the flow runs stay in Submitted for a while before the runs start to be processed.
k
I think the Submitted for a while is a sign Docker is really struggling to process the volume of requests
Do you have a very large layer in the image like they suggest here?
Uhh, some people just suggest restarting Docker, but you did that, right?
c
But the same thing should happen with just the query, right?
k
When you just did the query, what was your return? Did you do the `Scheduled` also?
c
No, I have removed `Scheduled`.
k
I'm sure you know, but I'm very confused right now as well 😅. I think, though, that the `Scheduled` is just causing a backlog of container spin-ups (10 seconds seems very short), so the Docker daemon requests accumulate and then it struggles. I am positive it's not the API container struggling now, since the state handler with just the GraphQL query worked. Nothing else in the state handler is computationally intensive, so I think it has to do with the `Scheduled`.
c
Thanks for the help @Anna Geller @Kevin Kho! I have solved my problem by changing the Agent and Run Config to ECS (with Fargate).
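For anyone landing here later, a minimal sketch of the setup I switched to (the flow name, repo, path, image, and CPU/memory sizes are all placeholders):
from prefect import Flow
from prefect.run_configs import ECSRun
from prefect.storage import GitHub

with Flow("my-flow") as flow:  # placeholder flow name
    ...

flow.storage = GitHub(repo="my-org/my-flows", path="flows/my_flow.py")  # placeholder repo and path
flow.run_config = ECSRun(
    image="my-registry/my-flow:latest",  # placeholder image containing the flow's dependencies
    cpu="512",
    memory="1024",
)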
a
nice work! thanks for reporting back about that