# ask-community
d
Hey all, I've been looking at upgrading my Prefect version, and also my distributed, dask & dask-kubernetes versions, for our production pipeline, and just wanted to clarify a change in behaviour that I've noticed:
• Previously when I ran a flow, the k8s agent would create a job which was in effect the dask scheduler, creating and retiring pods as it needed to. In my case that `prefect-job-xxxxx` would create 4 ephemeral dask workers (named something like `dask-root-xxxx`)
• Now the behaviour I'm seeing is:
  ◦ The k8s agent creates the `prefect-job-xxx`
  ◦ In the `prefect-job` logs, it gives me _prefect.DaskExecutor | Creating a new Dask cluster with `__main__.make_cluster`. Creating scheduler pod on cluster. This may take some time._
  ◦ There are then 5x `dask-root-xxx` pods created, where 1 of them is a dask scheduler - i.e. the scheduler no longer sits within the `prefect-job-xx`?
Just wanted to check if this was expected/intended behaviour - I couldn't see any reference to it in the Prefect release notes.
• In addition (and this is more a side note - I think the Prefect k8s RBAC docs need updating), I've had to add 2 more rulesets to my k8s RBAC to make it work - see these docs for what's now required. Here is specifically what's changed vs the Prefect docs.
Thanks!
My versions have gone from -> to:
• Prefect: `0.14.19` --> `0.15.3`
• Dask: `2021.2.0` --> `2021.7.2`
• Distributed: `2020.12.0` --> `2021.7.2`
• dask-kubernetes: `0.11.0` --> `2021.3.1`
I think the main change has been in the upgrade of `dask-kubernetes`, but their changelog is non-existent.
• I've mainly been looking at the git diff here
• Line 263 of this PR I think also references the creation of a pod with a scheduler running - not sure if that explains this change though
My run config is as follows, if helpful (attached as a .py file):
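(For context, a rough sketch of the kind of setup described above - this is not the actual attached file, which isn't preserved here. It just assumes Prefect 0.15.x with a `KubernetesRun` run config and a `DaskExecutor` that builds an ephemeral `dask_kubernetes.KubeCluster` via a module-level `make_cluster`, matching the `__main__.make_cluster` log line; image names and sizing are placeholders.)

```python
# Hypothetical sketch only - not the original attachment.
# Assumes Prefect 0.15.x, dask-kubernetes, and placeholder image/sizing values.
from prefect import Flow, task
from prefect.run_configs import KubernetesRun
from prefect.executors import DaskExecutor


def make_cluster():
    # Ephemeral Dask cluster for the flow run; with dask-kubernetes 0.11.0 the
    # scheduler runs inside the prefect-job pod, with 2021.3.x it gets its own pod.
    from dask_kubernetes import KubeCluster, make_pod_spec

    pod_spec = make_pod_spec(
        image="my-registry/flow-image:latest",  # placeholder image
        memory_limit="2G",
        cpu_limit=1,
    )
    return KubeCluster(pod_spec, n_workers=4)


@task
def say_hello():
    print("hello")


with Flow("example-flow") as flow:
    say_hello()

# The k8s agent creates the flow-run job from this run config, and the executor
# then builds the Dask cluster from make_cluster at runtime.
flow.run_config = KubernetesRun(image="my-registry/flow-image:latest")
flow.executor = DaskExecutor(cluster_class=make_cluster)
```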
k
@Marvin archive “Dask-kubernetes Upgrade to 2021.3.1 creates Dask Scheduler in New Pod”
k
Hey @David Elliott, thanks for the detailed writeup. I don't think anything on the Prefect side changed that would lead to this behavior. I agree with your thoughts that it seems to stem from dask-kubernetes being upgraded. With that said, I don't have any other advice than to downgrade for now if this is breaking. Also, I think our max supported version for distributed and dask is 2021.5.0.
👍 1
Does this break your setup in any way?
d
Cool, all good - really I just wanted to check if you had visibility over the change (or if it was intended), and also to flag the RBAC change for the docs. I should be able to pin at a lower version for now - will give that a go and report back if any issues! Also, btw, I just ran `pip install "prefect[aws,kubernetes]"==0.15.3` and it installed distributed + dask version `2021.7.1` - I think it just takes the latest atm.
k
Gotcha. Thanks for mentioning!
d
OK, so here's what I've found (mainly in case it's helpful for anyone else..!)
• In distributed `2021.1.0` they introduced this change, which causes this issue where Prefect can't create an ephemeral pod due to a name attribute error
• That bug got fixed in dask-kubernetes `2021.3.0` (it handles the new name attribute properly), but that's also the version of dask-kubernetes which splits the dask scheduler out into its own pod (as I described in my original post)
  ◦ So we have to keep dask-kubernetes pinned to `0.11.0` (the prior version) to keep the scheduler within the prefect job
• And the fix for the above change is to keep distributed pinned to `2020.12.0`, prior to the name attribute change
• However, in pinning distributed to `2020.12.0` we get a few dask compatibility issues (one is this, but there are others) with newer versions of dask, meaning we have to pin dask to `2021.2.0`
  ◦ (i.e. any version of dask > `2021.2.0` doesn't work with distributed `2020.12.0`)
So in summary, the latest working versions I've found which keep the scheduler in the prefect-job are:
• `prefect==0.15.3`
• `distributed==2020.12.0`
• `dask-kubernetes==0.11.0`
• `dask==2021.2.0`
I've run a test flow on this setup - the scheduler is still in the prefect-job, the pods get spawned properly, and the scheduler shuts down properly 👌
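For anyone wanting to reproduce that pinned setup, the install should look something like this (adjust the extras to your own needs):

```
pip install "prefect[aws,kubernetes]==0.15.3" "dask==2021.2.0" "distributed==2020.12.0" "dask-kubernetes==0.11.0"
```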
k
Just wondering what issues having the scheduler out of the prefect-job brings? The pod won’t shut down? (worker pods also?)
d
In fairness I've not extensively tested it, but I'm pretty sure the scheduler used to be separate and then it got moved into the prefect-job a few months back, so it seemed like unexpected behaviour...
This is minor, but from a debug perspective it's also much harder to find which pod is the scheduler vs the worker pods, as they're all named the same - for me I have 4 workers, so trial and error looking into up to 5 pods is OK, but if you had tonnes of workers it'd be a real pain to find your scheduler.
Oh, and yes, you're right - when I did run it with the scheduler in its own pod, the `prefect-job-xx` threw an error on shutdown - the flow still completed, but it wasn't a graceful shutdown.
👍 1