# ask-community
d
Don’t know if you want a full issue raised for this specifically, but I just ran into the error
Task run 'xxxx' received abort during orchestration
which I note has a TODO against it RE discovering why it happens. Details in 🧵
I’m reasonably sure it happened due to an overload of the Prefect Cloud Postgres backend (too many task run state updates at once), per the attached error message:
```
Task run '6f4c3941-1a44-4ed4-8c05-f5646c7ef8e7' received abort during orchestration: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout") Task run is in RUNNING state.
```
I hit 10 of these in my large flow run just now: I ran 3600 tasks in about 30 minutes, with a max concurrency of 100 tasks at any given time (roughly the workload shape sketched after this message).
```
Version:             2.7.11
API version:         0.8.4
Python version:      3.8.7
Git commit:          6b27b476
Built:               Thu, Feb 2, 2023 7:22 PM
OS/Arch:             darwin/x86_64
Profile:             bi-p
Server type:         cloud
```
Note: it also meant the above task runs were left orphaned in a Running state (they were trying to update to Completed), and the downstream tasks ended up in NotReady.
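For reference, a minimal sketch of that kind of workload, assuming Prefect 2 with a tag-based concurrency limit; the task body, tag name, and counts here are illustrative rather than the actual flow:

```python
from prefect import flow, task

@task(tags=["backend-heavy"])
def process(item: int) -> int:
    # Placeholder for whatever work each task run actually does
    return item * 2

@flow
def big_flow(n: int = 3600) -> list:
    # Each submitted task run reports several state transitions
    # (Pending -> Running -> Completed) to the API, which is the kind of
    # update volume suspected of tripping the statement timeout above.
    futures = [process.submit(i) for i in range(n)]
    return [future.result() for future in futures]

# Concurrency per tag can be capped from the CLI, e.g.:
#   prefect concurrency-limit create backend-heavy 100
```

With thousands of task runs and ~100 in flight at once, every transition lands as a write against the backing database, which lines up with the overload theory above.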
z
Ah interesting thanks for the details, cc @Chris Guidry
🙏 3
👍 1
Regarding the TODO, we determined this usually occurs when the flow run is in a terminal state and a task attempts to run. This is another great case to handle better client-side though.
d
Oddly enough, I’ve just seen this again on our evening run, where I’d reduced the concurrency back down to 64 (from the stretch of 100 earlier). Same issue as previously:
```
Task run '80dc471b-bd97-47d2-9f54-fb8707551b39' received abort during orchestration: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout") Task run is in RUNNING state.
```
It’s mid flow run (the flow is still running) and was thrown at task run end time. Wonder if it’d be worth adding a retry for the PG timeout error? Also, is the PG timeout common / expected?
@Zanie qq - would setting a higher value for ORION_DATABASE_TIMEOUT help with avoiding the above timeout issue? (If not, is there a different variable that would?) If yes, what’s the highest I can safely set it to? I’m going to raise an issue for the above anyway, as it’s happening to 1 or 2 tasks on most of our production runs (which is blocking our 2.0 migration), but I’m hoping this var might be a quick fix in the meantime? (Note: we’re on Prefect Cloud, not self-hosted.)
z
@David Elliott all the PREFECT_ORION_-prefixed settings only affect the server (i.e. setting them client-side has no effect)
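To illustrate that point, a hedged sketch using the real Prefect 2 setting object; the takeaway from the answer above is that this value is consumed by a server process, not by a client talking to Cloud:

```python
# Sketch only: PREFECT_ORION_DATABASE_TIMEOUT is a Prefect 2 setting object, but
# it is read by the API server process (e.g. one launched with `prefect orion
# start`), so changing it in a client profile pointed at Prefect Cloud has no
# effect on Cloud's database timeouts.
from prefect.settings import PREFECT_ORION_DATABASE_TIMEOUT

# Inspect what the *local* environment/profile resolves the setting to;
# this says nothing about what Prefect Cloud itself is running with.
print(PREFECT_ORION_DATABASE_TIMEOUT.value())
```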
This is a high-priority bug though; we’re working on a fix right now.
🙏 2
🙌 3
d
Ahh ok fair enough! Thank you 🙏 - do you need me to raise an issue, or do you already have one?
z
Someone is working on writing the internal issue still, I might have a fix up before then 😄
🙌 1
👍 1
d
Amazing, thanks so much 🙂 I’ve literally just watched another one appear in our evening prod run 😅
z
Here’s a draft in the open-source repo: https://github.com/PrefectHQ/prefect/pull/8425. There’s another in progress for Cloud.
🙌 2
a
@Zanie, @David Elliott could it be related to my issue? I have 20 flows (with 2 subflows each), using the Dask task runner. So ~1360 tasks are going to be executed around the same time, but only ~50% of them actually run concurrently. https://prefect-community.slack.com/archives/CL09KU1K7/p1675851326761279
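For context, a rough sketch of the setup described in that message, assuming the prefect-dask collection; the cluster defaults, counts, and function names are illustrative:

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner

@task
def work(x: int) -> int:
    return x + 1

@flow(task_runner=DaskTaskRunner())
def subflow(n: int = 34) -> list:
    # Fan tasks out onto the Dask cluster
    futures = [work.submit(i) for i in range(n)]
    return [future.result() for future in futures]

@flow
def parent_flow() -> None:
    # Each of the ~20 parent flows runs two subflows like this,
    # giving roughly 20 * 2 * 34 ≈ 1360 task runs overall.
    subflow()
    subflow()
```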
z
Sorry Aleksandr that looks like a separate issue
a
@Zanie It seems to me that I now have the same bug:
```
Task run 'c34bdce0-743d-45ec-a387-33e7f17464be' received abort during orchestration: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout") Task run is in RUNNING state.
```
z
👍 we’re still working on addressing this
👀 1
Unfortunately I’m having a very hard time getting the reproduction to behave in the test suite, but I do have the fix written already so it shouldn’t be long until we release it
This fix is going out to production now.
🙏 3
🚀 2
d
Thanks! I’m curious, what was the fix in Cloud, adding retries? I’ll give it a whirl later today 👍
z
The fix in Cloud was to restore previous behavior where a 503 was returned instead of a 200 (ABORT); the client retries on 503s.
🙌 2
gratitude thank you 2
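For anyone curious about the mechanics, a minimal sketch of that retry-on-503 pattern, assuming httpx; this is illustrative, not the actual Prefect client code:

```python
import time

import httpx

RETRYABLE_STATUS_CODES = {503}

def post_with_retry(
    client: httpx.Client,
    url: str,
    json: dict,
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> httpx.Response:
    """POST, retrying with exponential backoff while the server returns 503."""
    for attempt in range(1, max_attempts + 1):
        response = client.post(url, json=json)
        if response.status_code not in RETRYABLE_STATUS_CODES:
            return response
        if attempt < max_attempts:
            # Back off before the next attempt: 1s, 2s, 4s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    return response
```

The key point from the explanation above is that a 503 tells the client the request is safe to retry, whereas the earlier 200 (ABORT) response told it to abandon the state transition entirely.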
d
@Zanie did you change the Cloud rate limits as part of this fix, or might that be unrelated? I’m hitting a tonne of 429s when trying to send logs up to Cloud in today’s run, which we weren’t seeing last week. Obviously I’ll raise it separately if it’s unrelated; it just feels like it might be part of the same fix?
s
That said, what are the rate limits? 🤔
z
It should not be related.
👍 1
I don’t know the rate limits, actually. We use a couple of different algorithms and I believe it differs for some specific routes.
🙏 1
@David Elliott it looks like your rate limit was lowered by accident (?), we’ve raised it to the normal level for your org again.
d
Ah ok cool good to know, thanks for checking!
We’ll be tentatively running production from tomorrow on P2 (now that the above timeout issue is fixed) - will report back if I see anything else weird 👍
k
The rate limits for cloud are listed here: https://docs.prefect.io/ui/rate-limits/?h=rate+li
🙏 1