# prefect-community
a
hey everyone! I’ve been getting HTTP Read Timeouts on my flow execution from Prefect 2.0 API; just checking if something I can do or if this is a usual issue for everyone?
c
Hi Alexey, what does the error look like? Are there a lot of them, or are they intermittent? Are you running a lot of tasks / flows?
a
hi @Christopher Boyd thank you! Here’s the error snippet:
in fact, here is the fuller one:
so this happens when loading a Secret from Prefect
perhaps the API waited too long to respond; I’m running it from an EC2 box; perhaps the solution is to have the flow retry prior to failing? Or increasing the wait times
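(a rough sketch of what I mean by increasing the wait times, assuming the PREFECT_API_REQUEST_TIMEOUT setting is available in our Prefect 2 version; 120 is just an illustrative value)
Copy code
# Hedged sketch: inspect / raise the client-side API request timeout.
# Assumes PREFECT_API_REQUEST_TIMEOUT exists in this Prefect 2.x install
# (it defaults to 60 seconds); 120 below is an arbitrary illustrative value.
from prefect.settings import PREFECT_API_REQUEST_TIMEOUT

print(PREFECT_API_REQUEST_TIMEOUT.value())  # current client-side read timeout, in seconds

# To raise it persistently on the agent box, one could run in a shell:
#   prefect config set PREFECT_API_REQUEST_TIMEOUT=120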
c
Is it solely when you do a secret load? Would you possibly be able to test a complete standalone minimal flow where you just try to load the secret and print some generic “yay” message upon success? I’m curious whether the issue has to do with the secret loading, the API, the timeout, network considerations; truthfully it could be a number of things
Unfortunately, nothing is really coming to mind off this error alone of what it could be
a
@Christopher Boyd thank you, no, it’s a flow that has been running without change for days or weeks, but every now and then it fails due to a timeout; this is very inconsistent and it works 99% of the time; the problem is, some flows fail and trigger incident response on our end
typically, the next flow will always run successfully
this never happened on Prefect 1.0, but it does happen almost every day with 2.0
(for context, the flow runs every 15 minutes, so it would fail perhaps once or twice in 2 days)
c
hrmm, it’s possible you are encountering some rate limits / concurrency? How many flows / tasks are you running simultaneously? Is it possible that when this one fails there are a number of flows / tasks actively running or just completed?
a
We have a single agent running on an EC2 box, so that box should always be connected to the Internet; my assumption was that the Prefect API sometimes delays the response
and yes, we have (sometimes) 2 flows at the same time
this was the case with the one that failed, we had one agent run 2 flows @Christopher Boyd
the schedule “coincided”
@Christopher Boyd here:
the “weightless-skink” and “flying-ibex” are 2 different flows from 2 different deployments but did coincide w each other
c
that’s a pretty small amount, I was concerned more with hundreds / thousands
a
oh no, we only have very few, approx. 10 runs an hour
c
There’s really nothing that stands out from a Prefect standpoint regarding the timeout error; I guess the only other concern I have is whether the failure always seems to happen at the same spot, loading the secret block?
I’d probably have to try and reproduce that
a
nope, sometimes it happens during other things; for example, here:
p
Is Prefect 1.0 Cloud going to be discontinued?
a
apparently shell_run_command attempted to call the API and got the read timeout
so it just happened again
the flow became “Late” now
c
hrmm, do you have an example of your flow code that you can share?
a
yes
p
@Alexey Stoletny The API on 2.0 seems a little buggy; I have also seen tasks going Late, and sometimes the Prefect agent not staying up in the background either, closing when we end the local session
🙌 1
a
Copy code
AWS_ACCESS_KEY_ID = s3_block.aws_access_key_id.get_secret_value()
AWS_SECRET_KEY_ID = s3_block.aws_secret_access_key.get_secret_value()

CLICKHOUSE_HOST = Secret.load("clickhouse-host").get()
CLICKHOUSE_PORT = Secret.load("clickhouse-port").get()
CLICKHOUSE_PASS = Secret.load("clickhouse-pass").get()

REDACTED_S3_URL = Secret.load("REDACTED-s3-url").get()
REDACTED_FILENAME_REGEXP = String.load("REDACTED-filename-regexp").value
REDACTED_FILENAME_TEMPLATE = String.load("REDACTED-filename-template").value

QUERY_TEMPLATE_PATH = "./sql/REDACTED.sql"

query_template_file = open(QUERY_TEMPLATE_PATH, "r")
query_string_template = query_template_file.read()
query_template_file.close()

query_string = query_string_template.format(
                    AWS_ACCESS_KEY_ID=AWS_ACCESS_KEY_ID, 
                    AWS_SECRET_KEY_ID=AWS_SECRET_KEY_ID,
                    ...)

logger.info(query_string.replace(AWS_ACCESS_KEY_ID, "*****").replace(AWS_SECRET_KEY_ID, "*****"))

query_file_name = "./temp-query-{random_string}.sql"
query_file_name = query_file_name.format(random_string=''.join(random.choice(string.ascii_lowercase) for i in range(10)))
query_file = open(query_file_name, "w")
query_file.write(query_string)
query_file.close()

command_template = "clickhouse-client --host {CLICKHOUSE_HOST} --secure --port {CLICKHOUSE_PORT} --password {CLICKHOUSE_PASS} --queries-file {query_file_name} && rm {query_file_name}"

command_to_execute = command_template.format(CLICKHOUSE_HOST=CLICKHOUSE_HOST, CLICKHOUSE_PORT=CLICKHOUSE_PORT, CLICKHOUSE_PASS=CLICKHOUSE_PASS, query_file_name=query_file_name)
<http://logger.info|logger.info>("About to run: " + command_to_execute.replace(CLICKHOUSE_PASS, "*****"))

logger.info(shell_run_command(command=command_to_execute, return_all=True))
@Christopher Boyd this is the entirety of the flow
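(for completeness, a rough sketch of the surrounding setup the snippet assumes; the imports, the block type behind s3_block, the block names, and the flow name here are illustrative assumptions rather than our exact code)
Copy code
# Illustrative sketch only: the exact block type behind s3_block, the block
# names, and the flow name are assumptions, not the verbatim production code.
import random
import string

from prefect import flow, get_run_logger
from prefect.blocks.system import Secret, String
from prefect.filesystems import S3  # could equally be another AWS credentials block
from prefect_shell import shell_run_command


@flow
def load_to_clickhouse():  # hypothetical flow name
    logger = get_run_logger()
    s3_block = S3.load("redacted-s3")  # hypothetical block name
    ...  # body as posted above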
right now the agent has stopped executing the commands and the flows became late; this happened to us a few times and I had to reboot the agent
to me it seems like the API is sometimes down for a bit too long
or takes a while to respond
it definitely didn’t happen on 1.0, but did happen to us consistently on 2.0 (sometimes when loading a secret, sometimes when something else is reaching to the API)
c
this isn’t really an answer per se to the core problem of what might be timing out; truthfully, I don’t really suspect it’s the API being unresponsive, otherwise we’d have many other reports and the issue would be much more widespread
upvote 1
you can try wrapping the loads in try / catch
with a retry and backoff time
Copy code
import logging
import time

from prefect.blocks.system import Secret

MAX_RETRY = 3            # illustrative values
TIME_BETWEEN_RETRY = 10  # seconds

success = False
counter = 0
while not success and counter < MAX_RETRY:
    try:
        secret = Secret.load("clickhouse-host")  # or whichever load is timing out
        success = True
    except (ConnectionResetError, TimeoutError):
        counter += 1
        if counter >= MAX_RETRY:
            raise
        time.sleep(TIME_BETWEEN_RETRY)
    except Exception as e:
        logging.warning(repr(e))
        raise
I think what could be doable is to add some retry logic and log the time elapsed, something like:
Copy code
tic = time.time()
secret = Secret.load("clickhouse-host")
toc = time.time()
print(toc - tic)
This won’t solve the issue, but at least maybe get some insight into how long this request is taking to get a response
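A small variation on that, if it helps, is to send the timing through the flow’s run logger so it shows up next to the flow run in the UI (a sketch, same illustrative block name as above):
Copy code
# Sketch: log how long the block load takes via the Prefect run logger
# (assumes this runs inside a flow/task so get_run_logger() has a context).
import time

from prefect import get_run_logger
from prefect.blocks.system import Secret

logger = get_run_logger()
tic = time.time()
secret = Secret.load("clickhouse-host")
logger.info("Secret.load took %.2fs", time.time() - tic)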
a
right, I was thinking the same, or perhaps I might try to configure retries on the entire flow?
or is it just the tasks that can have retries? @Christopher Boyd
e.g., to avoid writing extra code, just say “run the flow twice before failing”?
c
you can add retries to both
a
okay, is that via deployment? I will try that
c
@flow(retries=3) @task(retries=3)
You can add it as a decorator to the task / flow definition
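a minimal sketch (the retry counts and delay value are illustrative, not a recommendation):
Copy code
# Minimal sketch of retries on both the flow and a task; the counts and the
# retry_delay_seconds value are illustrative only.
from prefect import flow, task


@task(retries=3, retry_delay_seconds=10)
def load_config():
    ...


@flow(retries=3)
def my_flow():
    load_config()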
a
gotcha, thank you; I believe this will solve the flow issue, and the only thing that will remain is the task sometimes being late (also once or twice in a couple of days), but I will move the “Late” alerts to a separate queue and won’t wake up the team for them
🙌 1
Thank you @Christopher Boyd
Hey @Christopher Boyd apologies for resurfacing this, but the Prefect API continues to give us some trouble; we’ve been seeing HTTP 500 today and a few flows failed: prefect.exceptions.PrefectHTTPStatusError: Server error ‘500 Internal Server Error’ for url ‘https://api.prefect.cloud/api/accounts/43db7ccd-9f39-41f2-8989-000b28747858/workspaces/cedd89e9-9f12-421e-a17b-94045c976a2a/task_runs/bb6b3e1e-e90f-4b72-b020-4ab23e1b410b/set_state’ Response: {‘exception_message’: ‘Internal Server Error’}
r
@Alexey Stoletny @Christopher Boyd Unfortunately, I am also running into those same issues repeatedly :(
c
I’m curious where this syntax came from?
Copy code
CLICKHOUSE_HOST = Secret.load("clickhouse-host").get()
CLICKHOUSE_PORT = Secret.load("clickhouse-port").get()
CLICKHOUSE_PASS = Secret.load("clickhouse-pass").get()

REDACTED_S3_URL = Secret.load("REDACTED-s3-url").get()
There should be a secret_block.load and a separate secret_block.get() https://docs.prefect.io/api-ref/prefect/blocks/system/?h=secret#prefect.blocks.system.Secret
You are trying to chain a load that retrieves over the network with a get into memory simultaneously; let’s start with breaking that into two distinct steps. I know you can chain methods, but chaining a method that both reads across the network and sets a value in memory (considering that’s where your failure occurs) might be a good place to start
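something like this (a sketch, using the clickhouse-host block from your snippet):
Copy code
# Two distinct steps: first load the block over the network, then read the
# value out of the already-loaded object locally.
from prefect.blocks.system import Secret

clickhouse_host_block = Secret.load("clickhouse-host")  # network call to the API
CLICKHOUSE_HOST = clickhouse_host_block.get()            # local, no network call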
a
yeah, we did that; most failures occurred elsewhere, though; this is just one example of where it happened; as you can see above, this one was entirely unrelated to our code
the Prefect API is just returning HTTP 500 sometimes @Christopher Boyd
@Robin Weiß I added retries for now which seems to be helpful
r
Hey @Alexey Stoletny thanks for letting me know! Unfortunately for me the flows end up in a Crashed state, which for some reason means the retries don’t work. For your information: I found that quite a few people seem to be having this problem, and it seems possible that it’s an infrastructure or load problem on the Prefect Cloud side. I will probably try deploying our own Orion setup next week to see if it remedies the issues. Will keep you posted!
a
Thank you @Robin Weiß, have you had any luck with this? The problem continues, I believe it’s the Prefect Cloud too (@Christopher Boyd)
r
Nope, the same issues still persist and we have duct-taped it by making the flow-runs resumable and manually re-triggering the runs. We are currently looking into deploying Orion ourselves on K8s to circumvent these problems
c
Hi All, unfortunately this is still under investigation at this time and I don’t have additional input yet
a
@Christopher Boyd appreciate your quick response in the other thread; this has been a problem for us every now and then, though much improved after retries; maybe you can check in on that one, as this one is now fixed? 🙂 Thank you so much!
thank you @Robin Weiß, we’ve been doing the retries and it worked well, but would love to see how your setup works out!
c
Hi Alexey, this is also still under investigation; there are some internal issues open to tackle what you are seeing, and I am keeping an eye on them, but it unfortunately requires a bit more reproduction before resolution
It’s definitely not unnoticed; I think you were the first I saw to report this issue, but a number of others have hit it as well, so it’s definitely being addressed
a
appreciate it, thank you @Christopher Boyd
have a great weekend
c
You as well!