# ask-community
r
Hello, we are facing an annoying issue while using Prefect with Prefect Server. Here is the precise stack:
• prefect 0.15.12 (but the issue was also often observed with previous versions)
• prefect server 2022.01.12
• executor: DaskExecutor with cluster_class being KubeCluster
• using the Kubernetes runner (KubernetesRun)
• Azure as storage
What we observe is that sometimes the temporary Dask cluster started by the Prefect run is not torn down at the end of the run. It's quite annoying, because if we don't pay attention we keep using the VM resources for nothing. Has anyone faced a similar issue?
a
It's the first time I see this. Since the KubeCluster configuration seems to be an issue here, can you share how you defined it? The storage and run config definitions may be helpful as well (just redact any private info).
r
Executor:
DaskExecutor(
    cluster_class='dask_kubernetes.KubeCluster',
    cluster_kwargs={
        'pod_template': 'dask_worker_pod_template.yaml',
        'scheduler_pod_template': scheduler_pod_template,
        'namespace': 'namespace',
        'n_workers': 1,
    },
    adapt_kwargs={
        'minimum': 3,
        'maximum': 10,
        'interval': '60000 ms',
    },
)
Runner:
KubernetesRun(env=envs, image=image, job_template_path=job_template_path)
Storage:
Azure(container='flows')
a
Do you attach the executor directly to the Flow object? Perhaps the delete permission is missing in the RBAC? I have an example here
r
Yeah I attached the executor to the flow before registering it
I would have to check the RBAC, but my experience is that sometimes the cluster is torn down properly and sometimes not.
a
I think attaching it directly to the Flow object is better - at least when using HPC this was an issue, I can send a thread if you’re interested
with Flow("name", executor=DaskExecutor(...)) as flow:
Mysterious 😄 Permissions could be the issue, so cross-checking RBAC might be helpful - you can also check this blog post.
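If it helps, here is a minimal sketch (just an illustration, not from the Prefect docs) of how you could check from inside the cluster whether the flow run's service account is allowed to delete pods in the namespace, using the official kubernetes Python client; the namespace value is a placeholder:
```python
# Sketch: ask the API server whether the current service account may delete pods.
# Assumes the official `kubernetes` client and that this runs inside the cluster
# (e.g. from the flow-run pod); "namespace" is a placeholder.
from kubernetes import client, config

config.load_incluster_config()

review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="namespace",
            verb="delete",
            resource="pods",
        )
    )
)
response = client.AuthorizationV1Api().create_self_subject_access_review(review)
print("can delete pods:", response.status.allowed)
```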
r
Well, I don't want to attach my executor in the with statement, because for local development and tests we are using a LocalExecutor.
a
Maybe you can set it via a function?
...
executor = get_executor(local=False) # or True for local
with Flow("name", executor=executor) as flow:
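For example, a minimal sketch of such a helper (get_executor is just a name made up here, and the cluster kwargs are placeholders mirroring the config above), assuming Prefect 1.x:
```python
# Sketch of a hypothetical get_executor() helper: LocalExecutor for local
# development and tests, DaskExecutor on a temporary KubeCluster otherwise.
# The cluster kwargs are placeholders, not a recommended configuration.
from prefect.executors import DaskExecutor, LocalExecutor


def get_executor(local: bool = True):
    if local:
        return LocalExecutor()
    return DaskExecutor(
        cluster_class='dask_kubernetes.KubeCluster',
        cluster_kwargs={
            'pod_template': 'dask_worker_pod_template.yaml',
            'namespace': 'namespace',
        },
        adapt_kwargs={'minimum': 3, 'maximum': 10},
    )
```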
r
So FYI I checked and we have all the verbs for pods
a
@Romain do you happen to have some (potentially unclosed) database connections in your flow? We saw some users were facing exactly the same issue when the database connection was not properly handled within the flow. To solve that, make sure that you use database connections only within your tasks and that you close those, and if you want to share database connection between tasks, you can leverage the resource manager as described in this post.
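In case it's useful, a minimal sketch of that resource manager pattern (the DSN and query are placeholders), assuming Prefect 1.x and psycopg2:
```python
# Sketch: share one database connection across tasks via a resource manager,
# so the connection is opened in setup() and always closed in cleanup(),
# even if a task fails. The DSN and query are placeholders.
import psycopg2
from prefect import Flow, task, resource_manager


@resource_manager
class PostgresConnection:
    def __init__(self, dsn):
        self.dsn = dsn

    def setup(self):
        # open the connection once for all tasks that need it
        return psycopg2.connect(self.dsn)

    def cleanup(self, conn):
        # always runs at the end of the flow run, so the connection never leaks
        conn.close()


@task
def fetch_rows(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        return cur.fetchall()


with Flow("db-flow") as flow:
    with PostgresConnection("postgresql://user:***@host:5432/db") as conn:
        rows = fetch_rows(conn)
```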
r
@Ana I reviewed our flows, and we use the Prefect task PostgresFetch for database queries, so according to the source the db connection is handled right there, without our having to do anything else. That said, I wonder how you figured this issue out? It might help us identify our problem next time it happens.
a
I walked through the history of similar requests in the past, and the database connection was the problem there. Btw, it's a best practice to avoid tagging users directly :) I think you might have just tagged the wrong person.
So if you ensure the database connection gets properly closed, the cluster cleanup issue should get resolved.