# ask-community
c
Hey folks! Seeing a strange issue when using `KubernetesRun` and `DaskExecutor` (`KubeCluster`) on AKS. I submit my flow and it runs, I can see the Dask scheduler pod stand up, but then nothing happens. For example, the flow image provided shows a simple flow that has been running for 17 (now 20+) hours 😮 You can find the flow definition here https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/add-k8s-cluster/flow_test/manual_flow.py and the agent config here https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/add-k8s-cluster/prefect_agent_conf.yaml
👀 1
cc. @Kevin Kho
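For reference, the core of that flow looks roughly like this (paraphrasing rather than quoting the file, so treat the exact names as approximate):

import os

import prefect
from prefect import task, Flow
from prefect.executors import DaskExecutor
from prefect.run_configs import KubernetesRun
from dask_kubernetes import KubeCluster, make_pod_spec

flow_name = "manual-flow"  # placeholder; the real name lives in the repo


@task
def say_hello():
    prefect.context.get("logger").info("Hello, Cloud")
    return "hello result"


def make_cluster():
    # scheduler/worker pods use the same image as the flow itself
    return KubeCluster(
        make_pod_spec(
            image=os.environ["AZURE_BAKERY_IMAGE"],
            labels={"flow": flow_name},
        )
    )


with Flow(flow_name) as flow:
    say_hello()

flow.run_config = KubernetesRun(image=os.environ["AZURE_BAKERY_IMAGE"])
flow.executor = DaskExecutor(cluster_class=make_cluster)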
r
Maybe the requested resources are not available? For example, if you request 4GB of RAM but the node the Prefect agent runs on provides less than that per pod, then Prefect might wait until resources become available, which never happens 🤷
c
I thought that, but it actually spins off a new pod to be able to run the scheduler
So unless there's a funky rule where the scheduler has to be in the same pod as the tasks?
Which seems a bit weird
r
can you try it again with:
import os

from dask_kubernetes import KubeCluster, make_pod_spec


def make_cluster():
    # no memory request/limit, to rule out pods stuck waiting on resources
    # (flow_name comes from your flow module)
    return KubeCluster(
        make_pod_spec(
            image=os.environ["AZURE_BAKERY_IMAGE"],
            labels={"flow": flow_name},
            memory_limit=None,
            memory_request=None,
        )
    )
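and then presumably wire it into the flow with something like this (assuming the Prefect 0.14+ flow-level executor API):

from prefect.executors import DaskExecutor

# pass the factory so the cluster is only created at flow-run time
flow.executor = DaskExecutor(cluster_class=make_cluster)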
c
Can certainly give it a go
r
I have no experience with AKS, but removing all resource requests would exclude that root cause from the problem space.
c
Still similar behaviour, or at least for now. I've included the events for my workspace; you can see me cancel the long-running one at the bottom, with the new Dask worker & Prefect job being spun up nearer the top.
It gets far enough to log out the Dask dashboard address and then seemingly does the square root of jack 💩
marvin 1
r
Hmm, I think prefect support would be appreciated here 🤷
c
Yeah haha, I think someone is on the case, but thanks for giving it a go!
🙂 1
m
Hello @ciaran! Do you see any logs for this flow?
c
Hey @Mariia Kerimova here's the dump of them 🙂
👀 1
Nothing that screams 'Oh no' 🤣
r
Do you use `dask` with e.g. `da.arange` or something similar, by any chance?
c
Nope, just a
logger = prefect.context.get("logger")
logger.info("Hello, Cloud")
return "hello result"
🤔 1
m
Just letting you know that I could reproduce your issue, but still investigating what's going on.
c
Well that's always a good sign if it can be reproduced 😅
Okay, @Sean Harkins and I just sat down for an hour or so going over this in our little Development Seed bubble. A few things have changed:
executor=DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",  # the class referenced by its str name
    cluster_kwargs={
        "pod_template": make_pod_spec(  # passed as pod_template inside cluster_kwargs
            image=os.environ["AZURE_BAKERY_IMAGE"],
            labels={
                "flow": flow_name
            },
            memory_limit=None,
            memory_request=None,
        )
    },
    adapt_kwargs={"maximum": 10}  # adaptive scaling, so workers actually get created
)
Namely: passing `KubeCluster` by its `str` name, putting `pod_template` inside the `cluster_kwargs` (this was really not an obvious one), and finally the sprinkle of magic at the end was `adapt_kwargs`. So maybe not having listed `n_workers` meant that Dask just never started to schedule work. Which is a bit funky. It's not a required parameter, so you'd assume it'd have some kind of OK default.
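The other thing we could try instead of (or alongside) `adapt_kwargs` would be an explicit worker count in `cluster_kwargs`, something like this (untested sketch, assuming `KubeCluster` accepts `n_workers`):

executor=DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "pod_template": make_pod_spec(
            image=os.environ["AZURE_BAKERY_IMAGE"],
            labels={"flow": flow_name},
            memory_limit=None,
            memory_request=None,
        ),
        "n_workers": 2,  # any fixed count > 0 instead of relying on adaptive scaling
    },
)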
@Mariia Kerimova just an fyi 😄
m
Wow, it wasn't obvious to me. Thank you so much for sharing. I'm going to test it, and we should mention this in the docs.
✅ 1
k
I opened an issue and tagged you. Thanks @ciaran and feel free to chime in there!
✅ 1
j
Same issue. I had workflows running until the 29th of April. A workflow as simple as this is not working:
import prefect
from prefect import task, Flow
from prefect.storage import GitLab
from prefect.run_configs import KubernetesRun


@task
def say_hello():
    logger = prefect.context.get("logger")
    logger.info("Hello, Cloud!")
    print('Hello, Cloud!')


with Flow("hello-world") as flow:
    say_hello()

flow.storage = GitLab(
    repo="repo/location",
    path="path_to_workflow.py",
    access_token_secret="secret"
)
flow.run_config = KubernetesRun(labels=['some-label'])