
Andrey Tatarinov

04/25/2022, 8:54 PM
Hi, we're looking into ways to reduce the latency between flow run submission and getting results back. At the moment we're using a KubernetesAgent, which spawns a k8s Job for each flow run. Job initialization is quite slow in our case due to image size. Question: if we set up a permanent Dask cluster with the appropriate image and set the executor to DaskExecutor, would we be able to skip the Job initialization step? i.e. is it the Agent that sends specific commands to the Dask cluster, or can the k8s Job not be avoided?

Kevin Kho

04/25/2022, 8:57 PM
I think there is a difference, right? You are combining flow job setup and executor job setup into one. I thought the default of KubernetesRun was to pull an image `IfNotPresent`, which has caused some members of the community to not pull updated images because they already had one with the same name. Yes, you can skip executor initialization: the Agent spins up the FlowRunner (flow setup), and then the flow connects to the cluster and sends work. I think for either, you should be able to bring latency down by caching Kubernetes images on the cluster? I don't know a lot about it, but I know it can be done.
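For reference, a minimal sketch of what that could look like in Prefect 1.x, assuming a long-lived Dask scheduler reachable inside the cluster (the scheduler address, service name, and flow/task names are placeholders):

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def say_hello():
    print("hello from an existing Dask worker")

with Flow("static-dask-flow") as flow:
    say_hello()

# Connect to an existing, long-running Dask cluster instead of letting
# Prefect create temporary executor infrastructure; the address is a placeholder.
flow.executor = DaskExecutor(address="tcp://dask-scheduler.default.svc:8786")
```

With this, the flow-run process (wherever the agent starts it) only ships task work to the existing Dask workers; no new executor infrastructure is created per run.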

Andrey Tatarinov

04/25/2022, 9:00 PM
Image caching - yes, but it's not exactly what we're looking for. We have quite a volume of Prefect flow runs, so shaving a bit off each Job initialization is a significant thing for us.
If a new Job is inevitable with the KubernetesAgent, we will run a LocalAgent in a k8s pod and point it at a predefined static Dask cluster 🙂

Kevin Kho

04/25/2022, 9:03 PM
Wow, ok, yeah, a local agent would be faster than a new pod, as long as it can pull the flow from storage

Andrey Tatarinov

04/25/2022, 9:06 PM
Good point, there should be something tricky about the labels, right?

Kevin Kho

04/25/2022, 9:08 PM
Default labels are associated with storage. You can just turn off the default labels of Local storage though, like this:
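A minimal sketch, assuming Prefect 1.x and Local storage (the flow name is a placeholder); `add_default_labels=False` stops the storage from attaching the hostname label, so an agent with matching (or no) labels can pick the run up:

```python
from prefect import Flow
from prefect.storage import Local

with Flow(
    "static-dask-flow",
    storage=Local(add_default_labels=False),  # skip the auto-added hostname label
) as flow:
    ...
```

If the LocalAgent adds its own hostname label, that can be disabled too (I believe via a `--no-hostname-label` flag on `prefect agent local start`).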

Andrey Tatarinov

04/25/2022, 9:08 PM
exactly what we need, thanks 😉
👍 1

Matthias

04/26/2022, 8:05 AM
Do you know which specific part of flow initialization is slow? Is it pod startup or something else?

Andrey Tatarinov

04/26/2022, 2:02 PM
@Matthias mostly resource allocation, pod initialization. All the k8s stuff

Matthias

04/26/2022, 2:34 PM
That might be mostly related to the container images being too large. There are several ways to slim them down.
What could also help (but that depends) is setting resource requests/limits for each job so that the scheduler can make better-informed decisions about where to place pods. And lastly, creating dedicated worker groups (or node pools) and adding node selectors to your job manifests can also help to spread the load more evenly.
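For the requests/limits part, a minimal sketch using Prefect 1.x's KubernetesRun run config (the image name and resource values are placeholders; a node selector would go into a custom `job_template`):

```python
from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("gpu-flow") as flow:
    ...

# Explicit requests/limits give the k8s scheduler better placement information.
# Image and values are placeholders; adjust to the actual workload.
flow.run_config = KubernetesRun(
    image="registry.example.com/pytorch-gpu:latest",
    cpu_request="2",
    cpu_limit="4",
    memory_request="8Gi",
    memory_limit="16Gi",
)
```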

Andrey Tatarinov

04/26/2022, 3:37 PM
nothing that contains pytorch gpu can be slimmed down :)))

Matthias

04/26/2022, 4:57 PM
What you could do is work with a flow of flows, where you only use the PyTorch container (in a subflow) when it is strictly necessary.
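A minimal sketch of that pattern, assuming Prefect 1.x and a separately registered GPU subflow (the flow and project names are placeholders):

```python
from prefect import Flow
from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

with Flow("orchestrator") as flow:
    # Kick off the heavy PyTorch flow as its own flow run, so only that
    # subflow needs the large GPU image; names here are placeholders.
    gpu_run = create_flow_run(flow_name="pytorch-gpu-subflow", project_name="ml")
    wait_for_flow_run(gpu_run)
```

The orchestrator flow itself can then run on a small image, and only the subflow's run config needs the large GPU image and node selector.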