# ask-community
d
Hi guys, we are experiencing problems with the Kubernetes agent and deploying flows, due to a timeout in the k8s agent client. It's related to https://github.com/PrefectHQ/prefect/pull/5066. We are still encountering this error, which means that the affected flows end up being delayed by approximately 15 minutes; that is critical in our use case and not a viable situation. I moved the trace log into the thread with some more info. Hope somebody has some ideas 🙏
I have also followed this thread: https://prefect-community.slack.com/archives/CL09KU1K7/p1635843898280300. The issue also looks closely related to this one in AKS: https://github.com/Azure/AKS/issues/1052, but that was solved a while ago, so I would assume the fix is already in place for the k8s agent.
a
I believe the issue is not related to the AKS load-balancing timeout, but rather to the Kubernetes job specification. Did you check why you are getting this error?
Copy code
AttributeError: 'V1Job' object has no attribute 'name'
Prefect can deploy a namespaced job if the job's kind is a plain Kubernetes Job; it cannot deploy kind V1Job. Can you share the flow definition that generated this log, and the job spec, so I can check it?
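For illustration, here is a minimal sketch (not from the thread; the image name is a placeholder) of a job template the agent can deploy: it must be a plain manifest with kind "Job", not a serialized V1Job object.
Copy code
# Minimal sketch of a valid job template passed to KubernetesRun.
# Assumes Prefect 1.x; "<image>" is a placeholder.
from prefect.run_configs import KubernetesRun

job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",  # must be a plain Kubernetes Job, not V1Job
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "flow", "image": "<image>"}]
            }
        }
    },
}

run_config = KubernetesRun(job_template=job_template)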
d
Thank you for taking the time to reply 👏 It's pretty much the simplest flow I could come up with:
Copy code
import requests
import pendulum
from prefect import Flow, task
from prefect.storage.docker import Docker
from prefect.schedules import Schedule, clocks

@task
def fetch():
    r = requests.get('http://google.com')
    if not r.ok:
        raise Exception(r.text)
    return r.text

# Run every 5 minutes, starting from the given date
schedule = Schedule(clocks=[clocks.CronClock("0/5 * * * *", pendulum.parse('2022-01-07T00:00'))])

with Flow("test_google", schedule) as flow:
    _ = fetch()

flow.storage = Docker(
    registry_url='<url>',
    image_name='<name>',
    dockerfile='<Dockerfile>',
)
flow.storage.add_flow(flow)
flow.storage.build(push=True)
flow.register(project_name='<Project>', labels=['<Label>'])
a
thanks for sharing, but I was mostly interested in the Kubernetes part. How do you define your KubernetesRun run configuration?
d
Ahh, gotcha! It's as follows:
Copy code
from os import path
from typing import List, Union

from prefect.run_configs import KubernetesRun

def get_default_run_config(
    image: str = '<Image>',
    labels: List[str] = ['<Label>'],
    job_spec_filename: str = 'job_spec_default.yaml',
    env: Union[dict, None] = None,
):
    # Resolve the job template relative to this module; the original code
    # concatenated the os.path module itself with the filename.
    return KubernetesRun(
        job_template_path=path.join(path.dirname(__file__), job_spec_filename),
        image=image,
        labels=labels,
        env=env,
    )

flow.run_config = get_default_run_config(env=env)
k
Did setting cloud.agent.kubernetes_keep_alive in the PR you linked not work for you?
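For reference, Prefect 1.x config keys map to environment variables with double-underscore separators, so this setting can be enabled on the agent without editing config.toml; a minimal sketch (the check below is illustrative, not from the thread):
Copy code
# Hedged sketch: set this in the agent's environment before starting it:
#   PREFECT__CLOUD__AGENT__KUBERNETES_KEEP_ALIVE=true
# Inside that environment you can confirm the setting was picked up:
import prefect

print(prefect.config.cloud.agent.get("kubernetes_keep_alive", False))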
i
(I work with Dave.) We hadn't realized we needed to enable this setting for the fix from 0.15.8 to start working. Thank you! We have enabled it and so far it looks very good!
d
The error Anna is talking about is indeed something else: it looks like a flow run that was created incorrectly a long time ago is stuck. The flow can't be found and therefore can't be deleted via GraphQL. Do you guys know how we can clean the agent's state, or simply remove this not-found flow run?
Copy code
mutation {
  delete_flow_run(input: {flow_run_id: "a27fe499-4287-4ebe-86d0-c38543a2577a"}) {
    success
    error
  }
}
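If it helps, the same mutation can be issued from Python; a minimal sketch assuming the Prefect 1.x Client and valid Cloud credentials:
Copy code
# Sketch: run the delete_flow_run mutation via the Prefect Client.
from prefect import Client

client = Client()
result = client.graphql(
    """
    mutation {
      delete_flow_run(input: {flow_run_id: "a27fe499-4287-4ebe-86d0-c38543a2577a"}) {
        success
        error
      }
    }
    """
)
print(result)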
k
I think you can just cancel the flow run in the UI and then delete the associated job manually if it doesn't get cleaned up. The agent doesn't hold state, so if the flow is no longer running, the agent should be fine.
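And if the Kubernetes job itself is left behind, here is a minimal sketch of deleting it with the official kubernetes Python client (the job name and namespace below are hypothetical):
Copy code
# Sketch: manually delete a leftover Prefect job and its pods.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
batch = client.BatchV1Api()
batch.delete_namespaced_job(
    name="prefect-job-abc123",        # hypothetical job name
    namespace="default",              # hypothetical namespace
    propagation_policy="Foreground",  # also delete the job's pods
)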
a
the flow run URL should be
Copy code
https://cloud.prefect.io/yourteamname/flow-run/a27fe499-4287-4ebe-86d0-c38543a2577a
d
Thank you! Unfortunately we can't locate the flow run, but we will try to figure it out.
a
you’re very welcome!