# prefect-community
a
hey everyone! I’ve been getting HTTP Read Timeouts on my flow execution from Prefect 2.0 API; just checking if something I can do or if this is a usual issue for everyone?
c
Hi Alexey, what does the error look like? Are there a lot of them, or are they intermittent? Are you running a lot of tasks / flows?
a
hi @Christopher Boyd thank you! Here’s the error snippet:
in fact, here is the fuller one:
so this happens when loading a Secret from Prefect
perhaps the API waited too long to respond; I’m running it from an EC2 box; perhaps the solution is to have the flow retry prior to failing? Or increasing the wait times
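(a rough sketch of what I mean by increasing the wait times, assuming the PREFECT_API_REQUEST_TIMEOUT setting is available in our Prefect 2 version; 120 is just an illustrative value)
Copy code
# Hedged sketch: inspect / raise the client-side API request timeout.
# Assumes PREFECT_API_REQUEST_TIMEOUT exists in this Prefect 2.x install
# (it defaults to 60 seconds); 120 below is an arbitrary illustrative value.
from prefect.settings import PREFECT_API_REQUEST_TIMEOUT

print(PREFECT_API_REQUEST_TIMEOUT.value())  # current client-side read timeout, in seconds

# To raise it persistently on the agent box, one could run in a shell:
#   prefect config set PREFECT_API_REQUEST_TIMEOUT=120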
c
Is it solely when you do a secret load? Would you possibly be able to test a complete standalone minimal flow where you just try to load the secret and print some generic “yay” message upon success? I’m curious whether the issue has to do with the secret loading, the API, the timeout, network considerations; truthfully it could be a number of things
Unfortunately, nothing is really coming to mind off this error alone of what it could be
a
@Christopher Boyd thank you, no, it’s a flow that has been running without change for days or weeks, but every now and then it fails due to a timeout; this is very inconsistent and it works 99% of the time; the problem is, some flows fail and trigger incident response on our end
typically, the next flow will always run successfully
this never happened on Prefect 1.0, but it does happen almost every day with 2.0
(for context, the flow runs every 15 minutes, so it would fail perhaps once or twice in 2 days)
c
hrmm, it’s possible you are encountering some rate limits / concurrency? How many flows / tasks are you running simultaneously? Is it possible that when this one fails there are a number of flows / tasks actively running or just completed?
a
We have a single agent running on an EC2 box, so that box should always be connected to the Internet; my assumption was that the Prefect API sometimes delays the response
and yes, we have (sometimes) 2 flows at the same time
this was the case with the one that failed, we had one agent run 2 flows @Christopher Boyd
the schedule “coincided”
@Christopher Boyd here:
the “weightless-skink” and “flying-ibex” are 2 different flows from 2 different deployments but did coincide w each other
c
that’s a pretty small amount, I was concerned more with hundreds / thousands
a
oh no, we only have very few, approx. 10 runs an hour
c
There’s really nothing that stands out from a Prefect standpoint regarding the timeout error; I guess the only other concern I have is whether the failure always seems to happen at the same spot, loading the secret block?
I’d probably have to try and reproduce that
a
nope, sometimes it happens during other things; for example, here:
p
Is Prefect 1.0 Cloud going to be discontinued?
a
apparently shell_run_command attempted to call the API and got the read timeout
so it just happened again
the flow became “Late” now
c
hrmm, do you have an example of your flow code that you can share?
a
yes
p
@Alexey Stoletny The API on 2.0 seems a little buggy; I have also seen tasks going Late, and sometimes the Prefect agent not staying up in the background either, closing when we end the local session
🙌 1
a
Copy code
AWS_ACCESS_KEY_ID = s3_block.aws_access_key_id.get_secret_value()
AWS_SECRET_KEY_ID = s3_block.aws_secret_access_key.get_secret_value()

CLICKHOUSE_HOST = Secret.load("clickhouse-host").get()
CLICKHOUSE_PORT = Secret.load("clickhouse-port").get()
CLICKHOUSE_PASS = Secret.load("clickhouse-pass").get()

REDACTED_S3_URL = Secret.load("REDACTED-s3-url").get()
REDACTED_FILENAME_REGEXP = String.load("REDACTED-filename-regexp").value
REDACTED_FILENAME_TEMPLATE = String.load("REDACTED-filename-template").value

QUERY_TEMPLATE_PATH = "./sql/REDACTED.sql"

query_template_file = open(QUERY_TEMPLATE_PATH, "r")
query_string_template = query_template_file.read()
query_template_file.close()

query_string = query_string_template.format(
                    AWS_ACCESS_KEY_ID=AWS_ACCESS_KEY_ID, 
                    AWS_SECRET_KEY_ID=AWS_SECRET_KEY_ID,
                    ...)

logger.info(query_string.replace(AWS_ACCESS_KEY_ID, "*****").replace(AWS_SECRET_KEY_ID, "*****"))

query_file_name = "./temp-query-{random_string}.sql"
query_file_name = query_file_name.format(random_string=''.join(random.choice(string.ascii_lowercase) for i in range(10)))
query_file = open(query_file_name, "w")
query_file.write(query_string)
query_file.close()

command_template = "clickhouse-client --host {CLICKHOUSE_HOST} --secure --port {CLICKHOUSE_PORT} --password {CLICKHOUSE_PASS} --queries-file {query_file_name} && rm {query_file_name}"

command_to_execute = command_template.format(CLICKHOUSE_HOST=CLICKHOUSE_HOST, CLICKHOUSE_PORT=CLICKHOUSE_PORT, CLICKHOUSE_PASS=CLICKHOUSE_PASS, query_file_name=query_file_name)
<http://logger.info|logger.info>("About to run: " + command_to_execute.replace(CLICKHOUSE_PASS, "*****"))

logger.info(shell_run_command(command=command_to_execute, return_all=True))
@Christopher Boyd this is the entirety of the flow
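(for completeness, a rough sketch of the surrounding setup the snippet assumes; the imports, the block type behind s3_block, the block names, and the flow name here are illustrative assumptions rather than our exact code)
Copy code
# Illustrative sketch only: the exact block type behind s3_block, the block
# names, and the flow name are assumptions, not the verbatim production code.
import random
import string

from prefect import flow, get_run_logger
from prefect.blocks.system import Secret, String
from prefect.filesystems import S3  # could equally be another AWS credentials block
from prefect_shell import shell_run_command


@flow
def load_to_clickhouse():  # hypothetical flow name
    logger = get_run_logger()
    s3_block = S3.load("redacted-s3")  # hypothetical block name
    ...  # body as posted above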
right now the agent has stopped executing the commands and the flows became late; this happened to us a few times and I had to reboot the agent
to me it seems like the API is sometimes down for a bit too long
or takes a while to respond
it definitely didn’t happen on 1.0, but did happen to us consistently on 2.0 (sometimes when loading a secret, sometimes when something else is reaching to the API)
c
this isn’t really an answer per se to the core problem of what might be timing out; truthfully, I don’t really suspect it’s the API being unresponsive, otherwise we’d have many other reports and the issue would be much more widespread
upvote 1
you can try wrapping the loads in try / catch
with a retry and backoff time
Copy code
import logging
import time

from prefect.blocks.system import Secret

MAX_RETRY = 3            # illustrative values
TIME_BETWEEN_RETRY = 10  # seconds

success = False
counter = 0
while not success and counter < MAX_RETRY:
    try:
        secret = Secret.load("clickhouse-host")  # or whichever load is timing out
        success = True
    except (ConnectionResetError, TimeoutError):
        counter += 1
        if counter >= MAX_RETRY:
            raise
        time.sleep(TIME_BETWEEN_RETRY)
    except Exception as e:
        logging.warning(repr(e))
        raise
I think what could be doable is to add some retry logic and log the time elapsed, something like:
Copy code
tic = time.time()
secret = Secret.load("clickhouse-host")
toc = time.time()
print(toc - tic)
This won’t solve the issue, but at least maybe get some insight into how long this request is taking to get a response
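A small variation on that, if it helps, is to send the timing through the flow’s run logger so it shows up next to the flow run in the UI (a sketch, same illustrative block name as above):
Copy code
# Sketch: log how long the block load takes via the Prefect run logger
# (assumes this runs inside a flow/task so get_run_logger() has a context).
import time

from prefect import get_run_logger
from prefect.blocks.system import Secret

logger = get_run_logger()
tic = time.time()
secret = Secret.load("clickhouse-host")
logger.info("Secret.load took %.2fs", time.time() - tic)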
a
right, I was thinking the same, or perhaps I might try to configure retries on the entire flow?
or is it just the tasks that can have retries? @Christopher Boyd
e.g., to avoid writing extra code, just say “run the flow twice before failing”?
c
you can add retries to both
a
okay, is that via deployment? I will try that
c
@flow(retries=3) @task(retries=3)
You can add it as a decorator to the task / flow definition
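a minimal sketch (the retry counts and delay value are illustrative, not a recommendation):
Copy code
# Minimal sketch of retries on both the flow and a task; the counts and the
# retry_delay_seconds value are illustrative only.
from prefect import flow, task


@task(retries=3, retry_delay_seconds=10)
def load_config():
    ...


@flow(retries=3)
def my_flow():
    load_config()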
a
gotcha, thank you; I believe this will solve the flow issue, and the only thing that will remain is the task sometimes being late (also once or twice in a couple of days), but I will move the “Late” alerts to a separate queue and won’t wake up the team for them
🙌 1
Thank you @Christopher Boyd
Hey @Christopher Boyd apologies for resurfacing this, but the Prefect API continues to give us some trouble; we’ve been seeing HTTP 500 today and a few flows failed: prefect.exceptions.PrefectHTTPStatusError: Server error ‘500 Internal Server Error’ for url ‘https://api.prefect.cloud/api/accounts/43db7ccd-9f39-41f2-8989-000b28747858/workspaces/cedd89e9-9f12-421e-a17b-94045c976a2a/task_runs/bb6b3e1e-e90f-4b72-b020-4ab23e1b410b/set_state’ Response: {‘exception_message’: ‘Internal Server Error’}
r
@Alexey Stoletny @Christopher Boyd Unfortunately, I am also running into those same issues repeatedly :(
c
I’m curious where this syntax came from?
Copy code
CLICKHOUSE_HOST = Secret.load("clickhouse-host").get()
CLICKHOUSE_PORT = Secret.load("clickhouse-port").get()
CLICKHOUSE_PASS = Secret.load("clickhouse-pass").get()

REDACTED_S3_URL = Secret.load("REDACTED-s3-url").get()
There should be a secret_block.load and a separate secret_block.get() https://docs.prefect.io/api-ref/prefect/blocks/system/?h=secret#prefect.blocks.system.Secret
You are trying to chain a load that retrieves over the network with a get into memory simultaneously; let’s start with breaking that into two distinct steps. I know you can chain methods, but chaining a method that both reads across the network and sets a value in memory (considering that’s where your failure occurs) might be a good place to start
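something like this (a sketch, using the clickhouse-host block from your snippet):
Copy code
# Two distinct steps: first load the block over the network, then read the
# value out of the already-loaded object locally.
from prefect.blocks.system import Secret

clickhouse_host_block = Secret.load("clickhouse-host")  # network call to the API
CLICKHOUSE_HOST = clickhouse_host_block.get()            # local, no network call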
a
yeah, we did that; most failures occurred elsewhere, though; this is just one example of where it happened; as you can see above, this one was entirely unrelated to our code
the Prefect API is just returning HTTP 500 sometimes @Christopher Boyd
@Robin Weiß I added retries for now which seems to be helpful
r
Hey @Alexey Stoletny thanks for letting me know! Unfortunately for me the flows end up in a Crashed state, which for some reason means the retries don’t work. For your information: I found that quite a few people seem to be having this problem, and it seems possible that it’s an infrastructure or load problem on the Prefect Cloud side. I will probably try deploying our own Orion setup next week to see if it remedies the issues. Will keep you posted!
a
Thank you @Robin Weiß, have you had any luck with this? The problem continues, I believe it’s the Prefect Cloud too (@Christopher Boyd)
r
Nope, the same issues still persist and we have duct-taped it by making the flow-runs resumable and manually re-triggering the runs. We are currently looking into deploying Orion ourselves on K8s to circumvent these problems
c
Hi All, unfortunately this is still under investigation at this time and I don’t have additional input yet
a
@Christopher Boyd appreciate your quick response in the other thread; this has been a problem for us every now and then, though much improved after retries; maybe you can check in on that one, as this one is now fixed? 🙂 Thank you so much!
thank you @Robin Weiß, we’ve been doing the retries and it worked well, but would love to see how your setup works out!
c
Hi Alexey, this is also still under investigation; there are some internal issues open to tackle what you are seeing, and I am keeping an eye on them, but it unfortunately requires a bit more reproduction before resolution
It’s definitely not unnoticed; I think you were the first I saw to report this issue, but a number of others have hit it as well, so it’s definitely being addressed
a
appreciate it, thank you @Christopher Boyd
have a great weekend
c
You as well!