# prefect-getting-started
c
What’s the standard approach to running thousands of tasks without crashing due to rate limits? I see that the rate limit is 400-2,000 requests per minute. Are you just supposed to throttle your jobs down to this rate? I saw that Prefect has a global concurrency limit feature, but it seems like a weird workaround. To use it effectively I think I’d need to write wrappers around all Prefect API functions to respect the Prefect rate limits. Has someone already done this? Is there a mode I can switch into where the Prefect client throttles itself on all API requests, perhaps respecting a Retry-After header on the 429s? Please let me know if there’s a better place to direct these questions.
👀 1
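For context, the global concurrency limit feature mentioned above gates how many task executions can hold a slot at once, so it throttles your own task bodies rather than the Prefect client's orchestration calls, which is why it can feel like a workaround for API rate limits. A minimal sketch, assuming a limit named "api-calls" has already been created separately with a slot-decay rate; the names and the flow below are placeholders:
```python
from prefect import flow, task
from prefect.concurrency.sync import rate_limit  # global concurrency / rate limit helper

@task
def call_external_api(item: str) -> None:
    # Block until a slot frees up on the "api-calls" global concurrency limit.
    # The limit itself must be created beforehand (with a slot-decay rate);
    # "api-calls" is an illustrative name, not something Prefect provides.
    rate_limit("api-calls")
    ...  # the rate-limited work goes here

@flow
def throttled_flow(items: list[str]) -> None:
    call_external_api.map(items)
```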
k
Hey Cole! You're definitely asking in the right place. I've got some suggestions for your first question, and I'll get them written out as soon as I have a moment!
c
Thanks so much!
The problem seems to go away when self-hosting with `prefect server start`. This unblocks me for now.
e
@Cole Erickson The Prefect client does indeed respect Retry-After and will retry all requests up to a configurable number of times (default 5).
There’s no concept of rate limiting when self-hosting… you have to make sure you run the service at a scale that can handle the requests, otherwise it may fall over.
c
Thanks Emil, good to see that there are retries. It seems the server isn’t responding with a conservative enough Retry-After delay, though, since the workflows are still failing with the default Prefect settings.
e
@Cole Erickson With enough volume of requests, client retries won’t solve the problem. Think of them as a buffer for smaller spikes or transient issues. The Retry-After sent by the server reflects when the next requests would be allowed… that bandwidth can be consumed quickly depending on how many requests are trying to get through.
👍 1
I definitely see your point, but there’s a tradeoff in how long you’re willing to wait on a request. For some users, waiting a long time isn’t desirable, though that might be something nice to consider.
I would recommend setting a high value for `PREFECT_CLIENT_MAX_RETRIES`. I guess another opportunity could be introducing a way to wait for longer than the minimum retry period.
👀 1
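A minimal sketch of that first suggestion, assuming the setting is read from the environment before Prefect is imported; the value 10 is only illustrative, and the same setting can also be stored in a profile with `prefect config set`:
```python
import os

# Raise the client's retry budget for every Prefect API call in this process.
# The default is 5; 10 is an arbitrary example value.
os.environ["PREFECT_CLIENT_MAX_RETRIES"] = "10"

from prefect import flow

@flow
def my_flow() -> None:
    ...  # existing flow code, unchanged
```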
c
I see what you’re saying. Thanks. I think I have three workarounds to pursue:
1. Increase retries as you suggest
2. Application-level code to throttle my `submit` calls
3. Self-host
e
What kind of volume are you planning for? What’s the order of magnitude of flows/tasks and what’s the pattern (bursty, sustained, or something else)?
k
Just throwing this out there: you could combine the methods explained above with the somewhat stricter approach of concurrency limits at the work pool or work queue level. If you have a decent estimate of the orchestration and logging API calls in your flows, you at least have a mechanism for limiting how often retries will be needed.
👀 1
c
That’s an interesting idea, thanks
@Emil Christensen It’s an offline batch job that runs 10k+ tasks per job. Today, we usually run them with 300-1000 workers in GCP Dataflow. First we run one job to download and preprocess satellite images. Each worker node can process only one task at a time due to memory requirements. After the download job is done, we run another Dataflow job with GPUs attached to run an ML model on each image. Naively written, it’s extremely bursty because we just want to run two `task.map`s with a big list of image IDs. However, the speed of the job isn’t critical, so it’s fine to throttle it. We could also reframe it as a streaming computation instead of batch, but I don’t think I’ll need to
Something like:
```python
image_ids = [...]  # 10,000+ image IDs
preprocessed_images = preprocess.map(image_ids)
ml_images = run_ml_model.map(preprocessed_images)
```
👀 1
e
Ah I see… and I’m guessing the processing of each image takes a while? As in… the problem is upfront submission of tasks, not necessarily the required requests over the whole lifespan of the job.
👍 1
c
Yeah, the image processing takes 10 seconds to 10 minutes
I think if I just write a `slow_map` with some sleeping in there, it could take care of this in an acceptable way.
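One possible shape for that slow_map idea, as a sketch only: submit the mapped tasks in chunks and sleep between chunks so the burst of submissions, and the orchestration API calls behind them, gets spread out. The chunk size and pause below are placeholders that would need tuning.
```python
import time
from prefect import flow, task

@task
def preprocess(image_id: int) -> None:
    ...  # download and preprocess one image

@flow
def preprocess_all(image_ids: list[int], batch_size: int = 200, pause_seconds: float = 30.0):
    # Submit in chunks and sleep between chunks instead of mapping everything at once.
    futures = []
    for start in range(0, len(image_ids), batch_size):
        futures.extend(preprocess.map(image_ids[start : start + batch_size]))
        time.sleep(pause_seconds)
    # Wait for all results before the flow finishes.
    return [f.result() for f in futures]
```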
Signing off for today. Thanks for all the help - have a nice weekend!
e
Sounds like a plan! Happy to discuss this more next week. Have a great weekend