David Elliott

02/03/2023, 6:32 PM
Don’t know if you want a full issue raised for this specifically, but I just ran into the error
Task run 'xxxx' received abort during orchestration
which I note has a TODO against it re: discovering why it happens. Details in 🧵
I’m reasonably sure it happened due to an overload of the Prefect Cloud Postgres backend (too many task run state updates at once), per the error message below:
Task run '6f4c3941-1a44-4ed4-8c05-f5646c7ef8e7' received abort during orchestration: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout") Task run is in RUNNING state.
I hit 10 of these in my large flow run just now - I ran 3600 tasks in about 30mins, max concurrency of 100 tasks at a given time.
Version:             2.7.11
API version:         0.8.4
Python version:      3.8.7
Git commit:          6b27b476
Built:               Thu, Feb 2, 2023 7:22 PM
OS/Arch:             darwin/x86_64
Profile:             bi-p
Server type:         cloud
Note - it also meant the above tasks were left orphaned in a Running state (they were trying to update to Completed) and the downstream tasks ended up in NotReady
Zanie

02/03/2023, 7:14 PM
Ah interesting thanks for the details, cc @Chris Guidry
:thank-you: 3
👍 1
Regarding the TODO, we determined this usually occurs when the flow run is in a terminal state and a task attempts to run. This is another great case to handle better client-side though.
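(For readers following along: a rough sketch of the handshake being described. When the client proposes a task-run state transition, the orchestrator can answer with an accept, reject, wait, or abort decision, and an abort is what surfaces as the error above. The names below are illustrative, not Prefect’s internal API.)
```python
from enum import Enum, auto


class OrchestrationResult(Enum):
    """Possible answers to a proposed state transition (illustrative names)."""
    ACCEPT = auto()
    REJECT = auto()
    WAIT = auto()
    ABORT = auto()  # e.g. the parent flow run is already in a terminal state


def handle_proposed_state(result: OrchestrationResult, details: str) -> None:
    # Hypothetical client-side handling; Prefect's real engine differs.
    if result is OrchestrationResult.ABORT:
        # Today this is what surfaces as
        # "Task run ... received abort during orchestration: <details>"
        print(f"Task run aborted during orchestration: {details}")
    elif result is OrchestrationResult.WAIT:
        print("Orchestrator asked the client to delay and re-propose the state")
    else:
        print(f"State proposal result: {result.name}")


handle_proposed_state(OrchestrationResult.ABORT, "flow run is in a terminal state")
```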
David Elliott

02/03/2023, 8:04 PM
Oddly enough, I’ve just seen this again on our evening run, where I’d reduced the concurrency back down to 64 (from the stretch of 100 earlier). Same issue as previously:
Task run '80dc471b-bd97-47d2-9f54-fb8707551b39' received abort during orchestration: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout") Task run is in RUNNING state.
It’s mid flow-run (the flow is still running) and the error was thrown at task run end time. Wonder if it’d be worth adding a retry for a PG timeout error? Also, is the PG timeout common / expected?
@Zanie qq - would setting a higher value for ORION_DATABASE_TIMEOUT help me avoid the above timeout issue? (If not, is there a different variable that would?) If yes, what’s the highest I can safely set it to? I’m going to raise an issue for the above anyway, as it’s hitting 1 or 2 tasks on most of our production runs (which is blocking our 2.0 migration), but I’m hoping this variable might be a quick fix in the meantime. (Note - we’re on Prefect Cloud, not self-hosted.)
Zanie

02/06/2023, 6:31 PM
@David Elliott all the PREFECT_ORION_ prefixed settings only affect the server (i.e. setting them client-side has no effect)
This is a high priority bug though, we’re working on a fix right now.
🙌 3
:thank-you: 2
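(For anyone self-hosting who finds this thread: a minimal sketch of inspecting and raising the server-side database timeout, assuming the Prefect 2.x setting name discussed above; on Prefect Cloud it has no effect, per Zanie’s note.)
```python
# A minimal sketch, assuming a self-hosted Prefect 2.x (Orion) server; on
# Prefect Cloud this setting has no effect.
from prefect.settings import PREFECT_ORION_DATABASE_TIMEOUT, temporary_settings

# Current statement timeout (in seconds) that a server started from this
# profile would apply to database queries.
print(PREFECT_ORION_DATABASE_TIMEOUT.value())

# Raise it temporarily; the safe upper bound is an assumption and depends on
# your database and workload.
with temporary_settings({PREFECT_ORION_DATABASE_TIMEOUT: 60}):
    print(PREFECT_ORION_DATABASE_TIMEOUT.value())
```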
David Elliott

02/06/2023, 6:33 PM
Ahh ok fair enough! Thank you 🙏 - do you need me to raise an issue, or do you already have one?
Zanie

02/06/2023, 6:44 PM
Someone is working on writing the internal issue still, I might have a fix up before then 😄
🙌 1
👍 1
David Elliott

02/06/2023, 6:45 PM
Amazing, thanks so much 🙂 I’ve literally just watched another one appear in our evening prod run 😅
Zanie

02/07/2023, 3:50 PM
Here’s a draft in the open-source repo: https://github.com/PrefectHQ/prefect/pull/8425. There’s another in progress for Cloud.
🙌 2
Aleksandr Liadov

02/08/2023, 10:28 AM
@Zanie, @David Elliott could this be related to my issue? I have 20 flows (each with 2 subflows), using the Dask task runner, so ~1360 tasks are due to be executed, but only ~50% of them run at the same time. https://prefect-community.slack.com/archives/CL09KU1K7/p1675851326761279
Zanie

02/08/2023, 3:51 PM
Sorry Aleksandr, that looks like a separate issue.
Aleksandr Liadov

02/09/2023, 4:18 PM
@Zanie It seems that I’m now hitting the same bug:
Task run 'c34bdce0-743d-45ec-a387-33e7f17464be' received abort during orchestration: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout") Task run is in RUNNING state.
Zanie

02/09/2023, 4:30 PM
👍 we’re still working on addressing this
👀 1
Unfortunately I’m having a very hard time getting the reproduction to behave in the test suite, but I do have the fix written already so it shouldn’t be long until we release it
This fix is going out to production now.
:thank-you: 3
🚀 2
David Elliott

02/13/2023, 9:40 AM
Thanks! I’m curious, what was the fix in Cloud - adding retries? I’ll give it a whirl later today 👍
Zanie

02/13/2023, 3:32 PM
The fix in Cloud was to restore previous behavior where a 503 was returned instead of a 200 (ABORT); the client retries on 503s.
🙌 2
:gratitude-thank-you: 2
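(A minimal sketch of what “the client retries on 503s” can look like in practice: re-post the state update with exponential backoff while the server answers 503. The endpoint and payload below are placeholders, not Prefect Cloud’s actual API.)
```python
# Generic sketch of retry-on-503 with exponential backoff; the endpoint and
# payload are placeholders, not Prefect Cloud's actual API surface.
import time

import httpx


def post_with_retries(url: str, payload: dict, max_attempts: int = 5) -> httpx.Response:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        response = httpx.post(url, json=payload)
        if response.status_code != 503 or attempt == max_attempts:
            return response
        # The server is temporarily overloaded; back off before re-proposing.
        time.sleep(delay)
        delay *= 2


# Hypothetical usage:
# post_with_retries("https://api.example.com/task_runs/<id>/set_state",
#                   {"state": {"type": "COMPLETED"}})
```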
David Elliott

02/13/2023, 6:36 PM
@Zanie did you change the Cloud rate limits as part of this fix, or might that be unrelated? I’m hitting a tonne of 429s when trying to send logs up to Cloud in today’s run, which we weren’t seeing last week. Obviously I’ll raise it separately if it’s unrelated - it just feels like it might be part of the same fix?
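(For anyone else seeing 429s: a small, generic sketch of backing off when rate limited, sleeping for the standard Retry-After header if present. Whether Prefect Cloud sends that header, and in what form, is an assumption here, not something confirmed in this thread.)
```python
# Generic sketch of honoring HTTP 429 responses; the Retry-After handling is
# an assumption about the server's behavior, not confirmed above.
import time

import httpx


def post_respecting_rate_limits(url: str, payload: dict, max_attempts: int = 5) -> httpx.Response:
    for attempt in range(1, max_attempts + 1):
        response = httpx.post(url, json=payload)
        if response.status_code != 429 or attempt == max_attempts:
            return response
        # Sleep for the server-suggested interval, falling back to 1s per wait.
        retry_after = float(response.headers.get("Retry-After", "1"))
        time.sleep(retry_after)
```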
Stéphan Taljaard

02/13/2023, 6:43 PM
That said, what are the rate limits? 🤔
Zanie

02/13/2023, 7:21 PM
It should not be related.
👍 1
I don’t know the rate limits, actually. We use a couple of different algorithms and I believe it differs for some specific routes.
:thank-you: 1
@David Elliott it looks like your rate limit was lowered by accident (?); we’ve raised it back to the normal level for your org.
David Elliott

02/13/2023, 7:55 PM
Ah ok cool good to know, thanks for checking!
We’ll be tentatively running production from tomorrow on P2 (now that the above timeout issue is fixed) - will report back if I see anything else weird 👍
Kalise Richmond

02/23/2023, 5:34 PM
The rate limits for cloud are listed here: https://docs.prefect.io/ui/rate-limits/?h=rate+li
:thank-you: 1