Alexey Stoletny

09/13/2022, 4:37 PM
hey everyone! I've been getting HTTP Read Timeouts on my flow executions from the Prefect 2.0 API; just checking if there's something I can do, or if this is a common issue for everyone?

Christopher Boyd

09/13/2022, 5:28 PM
Hi Alexey, what does the error look like? Are there a lot of them, or are they intermittent? Are you running a lot of tasks / flows?

Alexey Stoletny

09/13/2022, 5:52 PM
hi @Christopher Boyd thank you! Here’s the error snippet:
in fact, here is the fuller one:
so this happens when loading a Secret from Prefect
perhaps the API waited too long to respond; I'm running it from an EC2 box; perhaps the solution is to have the flow retry prior to failing? Or to increase the wait times

Christopher Boyd

09/13/2022, 6:05 PM
Is it solely when you do a secret load? Would you be able to test a complete, standalone minimal flow where you just try to load the secret and print some generic "yay" message upon success? I'm curious whether the issue has to do with the secret loading, the API, the timeout, network considerations - truthfully it could be any number of things
Unfortunately, nothing is really coming to mind off this error alone of what it could be
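Something like this would do - a minimal sketch, assuming a hypothetical Secret block named "my-secret" exists in your workspace:
from prefect import flow, get_run_logger
from prefect.blocks.system import Secret

@flow
def secret_check():
    logger = get_run_logger()
    # the call that appears to time out: fetch the block and read its value
    value = Secret.load("my-secret").get()
    logger.info("yay - loaded a secret with %d characters", len(value))

if __name__ == "__main__":
    secret_check()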

Alexey Stoletny

09/13/2022, 6:07 PM
@Christopher Boyd thank you, no - it's a flow that has been running without changes for days or weeks, but every now and then it fails due to a timeout; this is very inconsistent and it works 99% of the time; the problem is, some flows fail and trigger incident response on our end
typically, the next flow will always run successfully
this never happened on Prefect 1.0, but it does happen almost every day with 2.0
(for context, the flow runs every 15 minutes, so it would fail perhaps once or twice in 2 days)

Christopher Boyd

09/13/2022, 6:09 PM
hrmm, it's possible you are encountering some rate limits / concurrency limits? how many flows / tasks are you running simultaneously? Is it possible that when this one fails there are a number of flows / tasks actively running or just completed?

Alexey Stoletny

09/13/2022, 6:09 PM
We have a single agent running on an EC2 box, so that box should always be connected to the Internet; my assumption was that the Prefect API sometimes delays its response
and yes, we have (sometimes) 2 flows at the same time
this was the case with the one that failed, we had one agent run 2 flows @Christopher Boyd
the schedule “coincided”
@Christopher Boyd here:
the "weightless-skink" and "flying-ibex" are 2 different flows from 2 different deployments, but they did coincide with each other

Christopher Boyd

09/13/2022, 6:12 PM
that's a pretty small number; I was more concerned about hundreds / thousands

Alexey Stoletny

09/13/2022, 6:12 PM
oh no, we only have very few - approx. 10 runs an hour

Christopher Boyd

09/13/2022, 6:13 PM
There's really nothing that stands out from a Prefect standpoint regarding the timeout error; I guess the only other concern I have is whether the failure always seems to happen at the same spot, when loading the secret block?
I’d probably have to try and reproduce that

Alexey Stoletny

09/13/2022, 6:13 PM
nope, sometimes it happens during other things; for example, here:

Pranit

09/13/2022, 6:14 PM
Is Prefect 1.0 Cloud going to be discontinued?

Alexey Stoletny

09/13/2022, 6:15 PM
apparently shell_run_command attempted to call the API and got the read timeout
so it just happened again
the flow has now become "Late"

Christopher Boyd

09/13/2022, 6:16 PM
hrmm, do you have an example of your flow code that you can share?

Alexey Stoletny

09/13/2022, 6:16 PM
yes

Pranit

09/13/2022, 6:17 PM
@Alexey Stoletny The API on 2.0 seems a little buggy; I have also seen tasks going late, and sometimes the Prefect agent not running in the background as well, closing as we end the local session
🙌 1

Alexey Stoletny

09/13/2022, 6:18 PM
AWS_ACCESS_KEY_ID = s3_block.aws_access_key_id.get_secret_value()
AWS_SECRET_KEY_ID = s3_block.aws_secret_access_key.get_secret_value()

CLICKHOUSE_HOST = Secret.load("clickhouse-host").get()
CLICKHOUSE_PORT = Secret.load("clickhouse-port").get()
CLICKHOUSE_PASS = Secret.load("clickhouse-pass").get()

REDACTED_S3_URL = Secret.load("REDACTED-s3-url").get()
REDACTED_FILENAME_REGEXP = String.load("REDACTED-filename-regexp").value
REDACTED_FILENAME_TEMPLATE = String.load("REDACTED-filename-template").value

QUERY_TEMPLATE_PATH = "./sql/REDACTED.sql"

query_template_file = open(QUERY_TEMPLATE_PATH, "r")
query_string_template = query_template_file.read()
query_template_file.close()

query_string = query_string_template.format(
                    AWS_ACCESS_KEY_ID=AWS_ACCESS_KEY_ID, 
                    AWS_SECRET_KEY_ID=AWS_SECRET_KEY_ID,
                    ...)

logger.info(query_string.replace(AWS_ACCESS_KEY_ID, "*****").replace(AWS_SECRET_KEY_ID, "*****"))

query_file_name = "./temp-query-{random_string}.sql"
query_file_name = query_file_name.format(random_string=''.join(random.choice(string.ascii_lowercase) for i in range(10)))
query_file = open(query_file_name, "w")
query_file.write(query_string)
query_file.close()

command_template = "clickhouse-client --host {CLICKHOUSE_HOST} --secure --port {CLICKHOUSE_PORT} --password {CLICKHOUSE_PASS} --queries-file {query_file_name} && rm {query_file_name}"

command_to_execute = command_template.format(CLICKHOUSE_HOST=CLICKHOUSE_HOST, CLICKHOUSE_PORT=CLICKHOUSE_PORT, CLICKHOUSE_PASS=CLICKHOUSE_PASS, query_file_name=query_file_name)
logger.info("About to run: " + command_to_execute.replace(CLICKHOUSE_PASS, "*****"))

logger.info(shell_run_command(command=command_to_execute, return_all=True))
@Christopher Boyd this is the entirety of the flow
right now the agent has stopped executing the commands and the flows have become late; this has happened to us a few times and I had to reboot the agent
to me it seems like the API is sometimes down for a bit too long
or takes a while to respond
it definitely didn't happen on 1.0, but it has happened to us consistently on 2.0 (sometimes when loading a secret, sometimes when something else is reaching out to the API)

Christopher Boyd

09/13/2022, 6:33 PM
this isn't really an answer per se to address the core problem of what might be timing out - truthfully, I don't really suspect it's the API being unresponsive, otherwise we'd have many other reports of that issue and it would be much more widespread
:upvote: 1
you can try wrapping the loads in try / except
with a retry and backoff time
import logging
import time

# MAX_RETRY and TIME_BETWEEN_RETRY are constants you'd define yourself
success = False
counter = 0
while not success and counter < MAX_RETRY:
    try:
        secret.load()
        success = True
    except (ConnectionResetError, TimeoutError):
        counter += 1
        if counter >= MAX_RETRY:
            raise
        time.sleep(TIME_BETWEEN_RETRY)
    except Exception as e:
        logging.warning(repr(e))
        raise
I think what could be doable is to add some retry logic and log the elapsed time, something like:
tic = time.time()
secret.load()
toc = time.time()
print(toc - tic)
This won’t solve the issue, but at least maybe get some insight into how long this request is taking to get a response

Alexey Stoletny

09/13/2022, 6:38 PM
right, I was thinking the same, or perhaps I might try to configure retries on the entire flow?
or is it just the tasks that can have retries? @Christopher Boyd
e.g., to avoid writing extra code, just say “run the flow twice before failing”?

Christopher Boyd

09/13/2022, 6:39 PM
you can add retries to both

Alexey Stoletny

09/13/2022, 6:39 PM
okay, is that via deployment? I will try that

Christopher Boyd

09/13/2022, 6:39 PM
@flow(retries=3) @task(retries=3)
You can add it as an argument to the task / flow decorator
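for example, a quick sketch - the function names and retry values here are just placeholders, and retry_delay_seconds is optional if you want a pause between attempts:
from prefect import flow, task

@task(retries=3, retry_delay_seconds=10)
def run_query():
    ...

@flow(retries=3, retry_delay_seconds=60)
def clickhouse_export():
    run_query()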

Alexey Stoletny

09/13/2022, 6:41 PM
gotcha, thank you; I believe this will solve the flow issue, and the only thing that will remain is tasks sometimes being late (also once or twice in a couple of days), but I will move the "Late" alerts to a separate queue and won't wake up the team for them
🙌 1
Thank you @Christopher Boyd
Hey @Christopher Boyd apologies for resurfacing this, but the Prefect API continues to give us some trouble; we've been seeing HTTP 500s today and a few flows failed: prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'https://api.prefect.cloud/api/accounts/43db7ccd-9f39-41f2-8989-000b28747858/workspaces/cedd89e9-9f12-421e-a17b-94045c976a2a/task_runs/bb6b3e1e-e90f-4b72-b020-4ab23e1b410b/set_state' Response: {'exception_message': 'Internal Server Error'}

Robin Weiß

09/15/2022, 7:38 AM
@Alexey Stoletny @Christopher Boyd Unfortunately, I am also running into those same issues repeatedly :(

Christopher Boyd

09/15/2022, 12:08 PM
I'm curious where this syntax came from?
CLICKHOUSE_HOST = Secret.load("clickhouse-host").get()
CLICKHOUSE_PORT = Secret.load("clickhouse-port").get()
CLICKHOUSE_PASS = Secret.load("clickhouse-pass").get()

REDACTED_S3_URL = Secret.load("REDACTED-s3-url").get()
There should be a secret_block.load() and a separate secret_block.get() https://docs.prefect.io/api-ref/prefect/blocks/system/?h=secret#prefect.blocks.system.Secret
You are trying to chain a load (a read across the network) and a get (setting the value in memory) simultaneously - let's start with breaking that into two distinct steps? I know you can chain methods, but since that's where your failure occurs, separating the two might be a good start
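something like this sketch, using the block names from your snippet:
from prefect.blocks.system import Secret

# step 1: load the blocks across the network
clickhouse_host_block = Secret.load("clickhouse-host")
clickhouse_port_block = Secret.load("clickhouse-port")
clickhouse_pass_block = Secret.load("clickhouse-pass")

# step 2: read the values already held in memory
CLICKHOUSE_HOST = clickhouse_host_block.get()
CLICKHOUSE_PORT = clickhouse_port_block.get()
CLICKHOUSE_PASS = clickhouse_pass_block.get()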

Alexey Stoletny

09/17/2022, 4:19 AM
yeah, we did that; most failures did not occur there - this is just one example of where it happened; as you can see above, that failure was entirely unrelated to our code
the Prefect API is just returning HTTP 500 sometimes @Christopher Boyd
@Robin Weiß I added retries for now which seems to be helpful

Robin Weiß

09/17/2022, 11:19 AM
Hey @Alexey Stoletny thanks for letting me know! Unfortunately for me the flows end up in a Crashed state, which for some reason means the retries don't work. For your information: I found that quite a few people seem to be having this problem, and it seems possible that it's an infrastructure or load problem on the Prefect Cloud side. I will probably try deploying our own Orion setup next week to see if it remedies the issues. Will keep you posted!

Alexey Stoletny

09/21/2022, 7:01 PM
Thank you @Robin Weiß, have you had any luck with this? The problem continues; I believe it's Prefect Cloud too (@Christopher Boyd)

Robin Weiß

09/22/2022, 8:22 AM
Nope, the same issues still persist and we have duct-taped it by making the flow-runs resumable and manually re-triggering the runs. We are currently looking into deploying Orion ourselves on K8s to circumvent these problems

Christopher Boyd

09/22/2022, 12:48 PM
Hi All, unfortunately this is still under investigation at this time and I don’t have additional input yet

Alexey Stoletny

09/23/2022, 8:45 PM
@Christopher Boyd appreciate your quick response in the other thread; this has been a problem for us every now and then, though much improved after retries; maybe you can check in on that one, since this one is now fixed? 🙂 Thank you so much!
thank you @Robin Weiß, we've been doing the retries and they've worked well, but would love to see how your setup works out!

Christopher Boyd

09/23/2022, 8:48 PM
Hi Alexey, this is also still under investigation - there are some internal issues open to tackle the problems you are seeing, and I am keeping an eye on them, but unfortunately they require a bit more reproduction before resolution
It's definitely not unnoticed - I think you were the first I saw to report this issue, but a number of others have hit it as well, so it's definitely being addressed

Alexey Stoletny

09/23/2022, 8:49 PM
appreciate it, thank you @Christopher Boyd
have a great weekend

Christopher Boyd

09/23/2022, 8:54 PM
You as well!