# ask-community
r
Hello, we are facing an annoying issue while using Prefect with Prefect Server. Here is the precise stack:
• prefect 0.15.12 (but the issue was also often observed with previous versions)
• prefect server 2022.01.12
• executor: DaskExecutor with cluster_class being KubeCluster
• using the Kubernetes runner (KubernetesRun)
• Azure as storage
What we observe is that sometimes the temporary Dask cluster started by the Prefect run is not torn down at the end of the run. It's quite annoying, because if we don't pay attention we keep using the VM resources for nothing. Has anyone faced a similar issue?
a
It's the first time I see this. Since the KubeCluster configuration seems to be an issue here, can you share how you defined it? The storage and run config definitions may be helpful as well (just redact any private info).
r
Executor:
DaskExecutor(
    cluster_class='dask_kubernetes.KubeCluster',
    cluster_kwargs={
        'pod_template': 'dask_worker_pod_template.yaml',
        'scheduler_pod_template': scheduler_pod_template,
        'namespace': 'namespace',
        'n_workers': 1,
    },
    adapt_kwargs={
        'minimum': 3,
        'maximum': 10,
        'interval': '60000 ms',
    },
)
Runner:
KubernetesRun(env=envs, image=image, job_template_path=job_template_path)
Storage:
Azure(container='flows')
a
Do you attach the executor directly to the Flow object? Perhaps the delete permission is missing in the RBAC? I have an example here
r
Yeah I attached the executor to the flow before registering it
I would have to check the RBAC, but my experience is that sometimes the cluster is torn down properly and sometimes not.
a
I think attaching it directly to the Flow object is better - at least when using HPC this was an issue, I can send a thread if you’re interested
with Flow("name", executor=DaskExecutor(...)) as flow:
Mysterious 😄 Permissions could be the issue, so cross-checking RBAC might be helpful - you can also check this blog post.
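If it helps, here is a minimal sketch (just an illustration, not from the Prefect docs) of how you could check from inside the cluster whether the flow run's service account is allowed to delete pods in the namespace, using the official kubernetes Python client; the namespace value is a placeholder:
```python
# Sketch: ask the API server whether the current service account may delete pods.
# Assumes the official `kubernetes` client and that this runs inside the cluster
# (e.g. from the flow-run pod); "namespace" is a placeholder.
from kubernetes import client, config

config.load_incluster_config()

review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="namespace",
            verb="delete",
            resource="pods",
        )
    )
)
response = client.AuthorizationV1Api().create_self_subject_access_review(review)
print("can delete pods:", response.status.allowed)
```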
r
Well, I don't want to attach my executor in the with statement, because for local development and tests we are using a LocalExecutor.
a
Maybe you can set it via a function?
...
executor = get_executor(local=False) # or True for local
with Flow("name", executor=executor) as flow:
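For example, a minimal sketch of such a helper (get_executor is just a name made up here, and the cluster kwargs are placeholders mirroring the config above), assuming Prefect 1.x:
```python
# Sketch of a hypothetical get_executor() helper: LocalExecutor for local
# development and tests, DaskExecutor on a temporary KubeCluster otherwise.
# The cluster kwargs are placeholders, not a recommended configuration.
from prefect.executors import DaskExecutor, LocalExecutor


def get_executor(local: bool = True):
    if local:
        return LocalExecutor()
    return DaskExecutor(
        cluster_class='dask_kubernetes.KubeCluster',
        cluster_kwargs={
            'pod_template': 'dask_worker_pod_template.yaml',
            'namespace': 'namespace',
        },
        adapt_kwargs={'minimum': 3, 'maximum': 10},
    )
```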
r
So FYI I checked and we have all the verbs for pods
a
@Romain do you happen to have some (potentially unclosed) database connections in your flow? We saw some users were facing exactly the same issue when the database connection was not properly handled within the flow. To solve that, make sure that you use database connections only within your tasks and that you close those, and if you want to share database connection between tasks, you can leverage the resource manager as described in this post.
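In case it's useful, a minimal sketch of that resource manager pattern (the DSN and query are placeholders), assuming Prefect 1.x and psycopg2:
```python
# Sketch: share one database connection across tasks via a resource manager,
# so the connection is opened in setup() and always closed in cleanup(),
# even if a task fails. The DSN and query are placeholders.
import psycopg2
from prefect import Flow, task, resource_manager


@resource_manager
class PostgresConnection:
    def __init__(self, dsn):
        self.dsn = dsn

    def setup(self):
        # open the connection once for all tasks that need it
        return psycopg2.connect(self.dsn)

    def cleanup(self, conn):
        # always runs at the end of the flow run, so the connection never leaks
        conn.close()


@task
def fetch_rows(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        return cur.fetchall()


with Flow("db-flow") as flow:
    with PostgresConnection("postgresql://user:***@host:5432/db") as conn:
        rows = fetch_rows(conn)
```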
r
@Ana I reviewed our flows, and we use the Prefect task PostgresFetch for database queries, so according to the source the db connection is handled right there, without our having to do anything else. That said, I wonder how you figured this issue out? It might help us identify our problem next time it happens.
a
I walked through the history of similar requests in the past, and the database connection was the problem there. Btw, it's a best practice to avoid tagging users directly :) I think you might have just tagged the wrong person.
So if you ensure the database connection gets properly closed, the cluster cleanup issue should get resolved.