# prefect-community
t
Hi 👋 We encountered some interesting performance bottlenecks when throwing hundreds of tasks at Prefect. At some point, task execution simply stops and the flow run freezes. To rule out capacity constraints, we ran the same test on a Ray cluster with autoscaling enabled. Even with 90 CPUs and 75 GB of memory the behavior was the same (the Ray cluster `address` parameter was removed from the code snippet for readability). Resetting the Prefect database seems to fix the issue. Could the tracking of tasks overload the DB? Is this edge case known, and are there workarounds for it?
j
Thank you for the report, Tony. What Prefect version are you using?
t
Thanks for the quick response! Prefect 2.4.0, Python 3.8. I'll post the `pip freeze` output and the Poetry `pyproject.toml` below.
(attachments: `pip freeze.cpp`, `pyproject.toml`)
j
Does upgrading to 2.4.1 (released yesterday) help? There is work being done in this area and improvements were made in 2.4.1.
t
2.4.1, unfortunately, introduced the `failed to parse annotation from 'Name' node: 'NoneType' object has no attribute 'resolve'` bug that Oscar Björhn reported for me (Slack ref. )
"There is work being done in this area and improvements were made": does that mean this bottleneck is known, and that the assumption that the database is the root cause is correct?
j
A fix for `failed to parse annotation from 'Name' node: 'NoneType' object has no attribute 'resolve'` should be forthcoming very soon. The improvements in 2.4.1 (presumably 2.4.2 after the fix above) might eliminate the issue entirely for you, but yes, I think that's correct.
t
OK, got 2.4.1 to work. It runs through, but when we increase it to 4,000 tasks and run with the default task runner, the same behavior comes up. Is there documentation on how the database is used for flow/task tracking, and at what number of tasks a degradation of service could set in?
z
From what I’ve seen, Ray is by far the slowest task runner. We’re working with the Ray team so hopefully we can find some ways to improve that in the long run!
Can you share the output of `prefect version`?
t
```
❯ prefect --version
2.4.1
```
j
```
prefect version
```
From the CLI, no dashes; that should give a bunch more output.
t
@Zanie Can you share under which circumstances Ray is the slowest TaskRunner? At least with "mock" tasks (hundreds/thousands of tasks that only execute `time.sleep`), Ray outperformed Dask by far, due to the memory congestion that happens on the Dask head node when managing over ~1000 tasks. It also outperformed the ConcurrentTaskRunner in our tests. "We're working with the Ray team": that is amazing to hear! Is there a way to follow this effort? Please also let me know if we can support it in any way.
```
❯ prefect version
Version:             2.4.1
API version:         0.8.0
Python version:      3.8.10
Git commit:          e941523d
Built:               Thu, Sep 22, 2022 12:26 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.31.1
```
z
Thanks! It looks like you’re using the ephemeral server with SQLite, which is going to be the least performant option (but the easiest to get started with).
You’ll want to move to a hosted server (i.e. with `prefect orion start`) and use Postgres once you’re at scale: we have many tasks performing writes in parallel from separate clients, which is not SQLite’s forte.
t
Sorry, that output was still from the local OSS debugging setup:
```
❯ prefect version
Version:             2.4.1
API version:         0.8.0
Python version:      3.8.10
Git commit:          e941523d
Built:               Thu, Sep 22, 2022 12:26 PM
OS/Arch:             linux/x86_64
Profile:             dev
Server type:         cloud
```
z
> Can you share under which circumstances Ray is the slowest TaskRunner?
That’s interesting to hear! I just noticed it was slow when I was building it out, aka in unit/integration tests.
Ah haha, I see. If you’re using Cloud, that definitely doesn’t apply.
t
Please excuse the confusion. Bottom line: the tests were run in the normal Prefect 2.0 Enterprise Cloud environment; we switched to the local OSS Orion to check whether it might be API throttling on the Cloud side (that does not seem to be the case).
z
I see. Yeah, there are a few things we’re working on to increase the scale that the client and server can handle.
Here’s one issue we’re tracking for this: https://github.com/PrefectHQ/prefect/issues/6492
I’m not leading the work with Ray, but you can see some of the collaboration in https://github.com/PrefectHQ/prefect-ray/issues/33#issuecomment-1220836688
t
That's great to hear! So the constraint is at the server (== Prefect Cloud workspace?) level? Would 1 flow with 4k tasks have the same performance impact as 4k flows with 1 task each, under the assumption that the TaskRunner has infinite resources?
z
I’m not sure if the constraint is specifically server-side. I imagine that your second case would work better 🙂
We’re looking at improvements on both sides. For example, we recently updated the client to allow for something like a 100x improvement in submission of futures. However, this increased the load on the server especially for bursts of small tasks. So with the client optimization we need to improve server-side handling of a new pattern of traffic.
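As an aside on burstiness: a common client-side pattern for keeping a flood of submissions from hitting a server all at once is to bound the number of in-flight requests. This is only an illustrative sketch of that general technique (the function names and the `fake_submit` stand-in are hypothetical, not Prefect internals):

```python
import asyncio

async def fake_submit(payload):
    # Hypothetical stand-in for an API call that records a task run.
    await asyncio.sleep(0)
    return payload * 2

async def submit_with_limit(payloads, submit, max_in_flight=25):
    # Cap how many submissions are in flight at once, so a burst of
    # thousands of items becomes a steadier, bounded stream of requests.
    sem = asyncio.Semaphore(max_in_flight)

    async def one(payload):
        async with sem:
            return await submit(payload)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(one(p) for p in payloads))

if __name__ == "__main__":
    out = asyncio.run(submit_with_limit(range(10), fake_submit, max_in_flight=3))
    print(out)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Raising `max_in_flight` trades server load for client throughput, which mirrors the trade-off described above: faster submission shifts pressure onto server-side handling.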
t
Interesting. So would that mean that, as a workaround, bunching tasks together as subflows could improve performance? (Although subflow execution is currently only possible in a sequential mode, not concurrently, right?)
z
You can run subflows concurrently with async directly (i.e. `asyncio.gather` or `anyio.TaskGroup`).
I imagine it might reduce overall performance, but you’d see fewer failures at scale? It’s really hard to say what different execution patterns will do to load. Generally, we don’t expect users to run thousands of tasks that do nothing, so benchmarks with trivial tasks will always push the system in a weird way.
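The concurrent-subflow pattern mentioned above can be sketched with plain coroutines; here `process_batch` stands in for an async subflow (in Prefect 2.x it would be an `async def` decorated with `@flow`), and the names are illustrative:

```python
import asyncio

async def process_batch(batch_id, items):
    # Stand-in for an async subflow that runs a bunch of tasks;
    # here it just simulates a little work and reports batch size.
    await asyncio.sleep(0.01)
    return len(items)

async def parent_flow(batches):
    # Kick off all subflows at once and wait for every result.
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(
        *(process_batch(i, batch) for i, batch in enumerate(batches))
    )

if __name__ == "__main__":
    results = asyncio.run(parent_flow([[1, 2], [3, 4, 5], [6]]))
    print(results)  # [2, 3, 1]
```

The same shape works with `anyio.TaskGroup` if you prefer structured concurrency over a flat `gather`.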
t
100% understood. Most of our workloads involve long-running geospatial analysis of satellite images that regularly scales to 250–1000 concurrent tasks and uses quite a lot of compute power. We wanted to test some mock scenarios before we spin up hundreds of CPUs and melt our credit card with the cloud bill 😄 So you're saying that this specific bottleneck could potentially disappear once we put serious, "production-like" code/load into the tasks?
z
Once the code doesn’t just `sleep(1)`, the orchestration requests will be more spread out 🙂 It’s weird that the flow run is freezing, though; there’s no case where that should happen.
I’ll likely spend more time investigating this once I finish result handling; I’ve really got to get that done ASAP, though.
t
Thanks for your support! Please let me know if I can assist in any way