Hi, A general question about how dask is used unde...
# prefect-community
t
Hi, A general question about how dask is used under the hood. In "regular" dask code there's usually a client (script or from REPL) which connects to a cluster (ephemeral or long-running) and then it can submit work to that cluster. How this is done in Prefect? Assuming I am using DaskExecutor and running on k8s with kubernetes agent - does the client code run in the agents itself? As part of the KubeCluster? Where should I look for the code running it - k8s agent?
a
Oh this is actually a common question, let me write this up on Discourse - I'll share the link soon
t
Thanks for this! In the discord link you sent, the first item "Prefect initializes the Dask *client*" - I assume Prefect here is referring to the agent that runs the flow? So if I am considering using DaskExecutor with a k8s agent, the "client" code would be running in the agent, and in case I handle large data, it can potentially affect the agent? Do you know if it is possible to use dask's scatter (as in here) to keep the data in dask?
a
if I am considering using DaskExecutor with a k8s agent, the "client" code would be running in the agent
on K8s it depends ā€¢ if you use LocalDaskExecutor, all execution happens within flow run pod running independently of Kubernetes agent's pod ā€¢ if you run DaskExecutor e.g. with KubeCluster, then it runs on the Dask cluster spun up o K8s (a full cluster with many pods, likely across multiple distributed compute nodes) - regarding KubeCluster, check out this blog post
Do you know if it is possible to use dask's scatter (as in here) to keep the data in dask?
For such Dask questions, I recommend posting it on Dask Discourse
t
Regarding dask/client - when running k8s, it is using KubernetesRun and the "client" code actually runs in that pod - then, in case of local cluster, the dask worker runs on the same resources, and in case of KubeCluster, it runs on additional resources created for the KubeCluster dask cluster. Did I get you right? As for the scatter - I wonder if they would know since Prefect is handling the orchestration. I guess I could give it a try.
a
yup, you're correct! ā€¢ LocalDaskExecutor = all task runs are executed within the flow run pod ā€¢ DaskExecutor(cluster_class=KubeCluster()) = flow run is executed within the flow run pod, but task runs are executed on a distributed Kube Dask cluster across many pods/nodes
šŸ™Œ 1
t
Maybe add this to the discourse article you made? I think that reading it explicitly would have helped me understanding what's going on faster... šŸ™‚
šŸ‘ 1