# ask-community
d
Hi guys, we are experiencing problems with the Kubernetes agent and deploying flows, due to a timeout in the k8s agent client. It's related to https://github.com/PrefectHQ/prefect/pull/5066. We are still encountering this error, which means that the affected flows end up being delayed by approximately 15 minutes; that is critical in our use case and not a viable situation. I moved the trace log into the thread with some more info. Hope somebody has some ideas 🙏
I have also followed this thread: https://prefect-community.slack.com/archives/CL09KU1K7/p1635843898280300. The issue also looks closely related to this one in AKS: https://github.com/Azure/AKS/issues/1052, but that was solved a while ago, so I would assume the fix is already in place for the k8s agent.
a
I believe the issue is not related to the AKS load-balancing timeout, but rather to the Kubernetes job specification. Did you check why you are getting this error?
Copy code
AttributeError: 'V1Job' object has no attribute 'name'
Prefect can deploy a namespaced job if the job's kind is a plain Kubernetes Job; it cannot deploy kind V1Job. Can you share the flow definition that generated this log, and the job spec, so I can check it?
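For illustration, here is a minimal sketch (not from the thread; the image name is a placeholder) of a job template the agent can deploy: it must be a plain manifest with kind "Job", not a serialized V1Job object.
Copy code
# Minimal sketch of a valid job template passed to KubernetesRun.
# Assumes Prefect 1.x; "<image>" is a placeholder.
from prefect.run_configs import KubernetesRun

job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",  # must be a plain Kubernetes Job, not V1Job
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "flow", "image": "<image>"}]
            }
        }
    },
}

run_config = KubernetesRun(job_template=job_template)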
d
Thank you for taking the time to reply 👏 It's pretty much the simplest flow I could come up with:
Copy code
import requests
import pendulum
from prefect import Flow, task
from prefect.storage.docker import Docker
from prefect.schedules import Schedule, clocks

@task
def fetch():
    r = requests.get('http://google.com')
    if not r.ok:
        raise Exception(r.text)
    return r.text

# Run every 5 minutes, starting from the given date
schedule = Schedule(clocks=[clocks.CronClock("0/5 * * * *", pendulum.parse('2022-01-07T00:00'))])

with Flow("test_google", schedule) as flow:
    _ = fetch()

flow.storage = Docker(
    registry_url='<url>',
    image_name='<name>',
    dockerfile='<Dockerfile>',
)
flow.storage.add_flow(flow)
flow.storage.build(push=True)
flow.register(project_name='<Project>', labels=['<Label>'])
a
thanks for sharing, but I was mostly interested in the Kubernetes part. How do you define your KubernetesRun run configuration?
d
Ahh, gotcha! It's as follows:
Copy code
from os import path
from typing import List, Union

from prefect.run_configs import KubernetesRun

def get_default_run_config(
    image: str = '<Image>',
    labels: List[str] = ['<Label>'],
    job_spec_filename: str = 'job_spec_default.yaml',
    env: Union[dict, None] = None,
):
    # Resolve the job template relative to this module; the original code
    # concatenated the os.path module itself with the filename.
    return KubernetesRun(
        job_template_path=path.join(path.dirname(__file__), job_spec_filename),
        image=image,
        labels=labels,
        env=env,
    )

flow.run_config = get_default_run_config(env=env)
k
Did setting cloud.agent.kubernetes_keep_alive in the PR you linked not work for you?
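For reference, Prefect 1.x config keys map to environment variables with double-underscore separators, so this setting can be enabled on the agent without editing config.toml; a minimal sketch (the check below is illustrative, not from the thread):
Copy code
# Hedged sketch: set this in the agent's environment before starting it:
#   PREFECT__CLOUD__AGENT__KUBERNETES_KEEP_ALIVE=true
# Inside that environment you can confirm the setting was picked up:
import prefect

print(prefect.config.cloud.agent.get("kubernetes_keep_alive", False))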
i
(I work with Dave.) We hadn't realized we needed to enable this setting for the fix from 0.15.8 to start working. Thank you! We have enabled it and so far it looks very good!
d
The error Anna is talking about is indeed something else: it looks like a flow run that was created incorrectly a long time ago is stuck. The flow can't be found and therefore can't be deleted via GraphQL. Do you guys know how we can clean the agent's state, or simply remove this not-found flow run?
Copy code
mutation {
  delete_flow_run(input: {flow_run_id: "a27fe499-4287-4ebe-86d0-c38543a2577a"}) {
    success
    error
  }
}
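If it helps, the same mutation can be issued from Python; a minimal sketch assuming the Prefect 1.x Client and valid Cloud credentials:
Copy code
# Sketch: run the delete_flow_run mutation via the Prefect Client.
from prefect import Client

client = Client()
result = client.graphql(
    """
    mutation {
      delete_flow_run(input: {flow_run_id: "a27fe499-4287-4ebe-86d0-c38543a2577a"}) {
        success
        error
      }
    }
    """
)
print(result)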
k
I think you can just cancel the flow run in the UI and then delete the associated job manually if it doesn't get cleaned up. The agent doesn't hold state, so if the flow is no longer running, the agent should be fine.
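And if the Kubernetes job itself is left behind, here is a minimal sketch of deleting it with the official kubernetes Python client (the job name and namespace below are hypothetical):
Copy code
# Sketch: manually delete a leftover Prefect job and its pods.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
batch = client.BatchV1Api()
batch.delete_namespaced_job(
    name="prefect-job-abc123",        # hypothetical job name
    namespace="default",              # hypothetical namespace
    propagation_policy="Foreground",  # also delete the job's pods
)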
a
the flow run URL should be
Copy code
https://cloud.prefect.io/yourteamname/flow-run/a27fe499-4287-4ebe-86d0-c38543a2577a
d
Thank you! Unfortunately we can't locate the flow run, but we will try to figure it out.
a
you’re very welcome!