# ask-community
c
Hey folks 👋 Does anyone have any recommendations/good practices for debugging/monitoring Prefect & K8s? I'm in a world of hurt at the mo where if a Prefect Job fails, I lose the pods and then I cannot access the logs. On AKS I can use Log Workspaces, but even then it's an awful experience as you have to try and query via time ranges and labels to identify the containers that just ran, then get the logs that way. On something like ECS and non-K8s, I can see all my terminated tasks and immediately click through to CloudWatch
👀 1
f
We have our Kubernetes cluster hosted on GCP and can view/access the logs there via our Prefect instance's Postgres DB
c
Ah, we're not deploying Prefect, just the agent. We're using Prefect Cloud
f
My bad 🙂
c
No worries, I should have been more explicit!
f
Usually @Kevin Kho helps me with my many questions 😂, maybe he can help you out here
k
My K8s knowledge is not strong so we have to wait for Tyler 😅
c
🤣 Haha I can wait!
t
Hiya Ciaran! You can use the `--disable-job-deletion` flag if you don't want the agent to clean up jobs. I wonder: if you added a preStop hook to sleep 200 seconds on the job config using a run_config job template, maybe you could add an "afterlife" to the jobs for debugging purposes. If that works, it gives me some more ideas
it looks like that's a flag on the start command, not the install command, which means you'd have to edit the manifest generated by install. Alternatively, you can add an env var `DELETE_FINISHED_JOBS=False` to your agent containers
i'm not sure how/if Log Workspaces translates Kubernetes labels, but there are certain labels on the job pods that represent Prefect attributes; you might want to use those in log hunting if they're accessible (`prefect.io/flow_id` and `prefect.io/flow_run_id` particularly)
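For reference, a minimal sketch of the preStop "afterlife" idea via a custom job template on the flow's run config. This is untested; the `job_template` skeleton and the `flow` container name are assumptions based on Prefect's default Kubernetes job template, and the `sleep 200` value just mirrors the suggestion above:
```python
# Sketch only: keep the flow-run container around briefly when its pod is
# deleted, so its logs can still be pulled before the job disappears.
from prefect import Flow
from prefect.run_configs import KubernetesRun

job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        # "flow" is the container name in Prefect's default
                        # Kubernetes job template (assumption).
                        "name": "flow",
                        "lifecycle": {
                            "preStop": {"exec": {"command": ["sleep", "200"]}}
                        },
                    }
                ]
            }
        }
    },
}

flow = Flow("debug-afterlife")  # stand-in for your real flow
flow.run_config = KubernetesRun(job_template=job_template)
```
The lower-effort route is still the agent-side `DELETE_FINISHED_JOBS=False` env var mentioned above, which leaves finished jobs (and their pods) in place so `kubectl logs` keeps working.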
c
Ah, that `DELETE_FINISHED_JOBS` env var does the ticket for the Prefect Job, @Tyler Wanner, thanks! My next challenge, should you wish to lose some hair, is how do I do this for the Dask containers my flow spins off? That's where our real pain points are: trying to find the containers for the jobs that ran, since our errors are usually with Dask.
🚀 1
Looks like the Dask worker pods don't get assigned the `flow_run_id` label
Turns out that `flow` tag is one I made. I wonder if I can access the flow run's ID to define this in my `DaskExecutor` `KubeCluster` config
Is there a way I can access the `flow_run_id` for something like this?
```python
import os

from dask_kubernetes import make_pod_spec
from prefect.executors import DaskExecutor

DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "pod_template": make_pod_spec(
            image=os.environ["AZURE_BAKERY_IMAGE"],
            # flow_name is defined elsewhere in the flow's .py file
            labels={"flow": flow_name},
            memory_limit=None,
            memory_request=None,
            env={
                "AZURE_STORAGE_CONNECTION_STRING": os.environ[
                    "FLOW_STORAGE_CONNECTION_STRING"
                ]
            },
        )
    },
    adapt_kwargs={"maximum": 10},
)
```
Currently that `flow` tag is defined in the flow's `.py` file, but I guess trying to resolve `flow_run_id` is another kettle of fish?
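One possible pattern, sketched here and not something confirmed in the thread: assuming `cluster_class` can also be a callable (the `DaskExecutor` docstring describes it as a string or callable), the `KubeCluster` can be built lazily at flow-run time, when `prefect.context` should already contain a `flow_run_id` to put into the worker pod labels. The `flow_name` placeholder stands in for the variable used in the snippet above; treat the whole thing as untested:
```python
# Sketch: build the KubeCluster at flow-run time so flow_run_id from
# prefect.context can be attached as a worker pod label.
import os

import prefect
from dask_kubernetes import KubeCluster, make_pod_spec
from prefect.executors import DaskExecutor

flow_name = "my-flow"  # placeholder for the flow_name variable used above


def make_cluster() -> KubeCluster:
    # prefect.context only carries flow_run_id while the flow is running,
    # which is exactly when DaskExecutor invokes this callable.
    flow_run_id = prefect.context.get("flow_run_id", "unknown")
    return KubeCluster(
        make_pod_spec(
            image=os.environ["AZURE_BAKERY_IMAGE"],
            labels={"flow": flow_name, "flow_run_id": flow_run_id},
            memory_limit=None,
            memory_request=None,
            env={
                "AZURE_STORAGE_CONNECTION_STRING": os.environ[
                    "FLOW_STORAGE_CONNECTION_STRING"
                ]
            },
        )
    )


executor = DaskExecutor(cluster_class=make_cluster, adapt_kwargs={"maximum": 10})
```
If that works, the Dask worker pods would carry a `flow_run_id` label to filter on, analogous to the `prefect.io/flow_run_id` label on the flow-run job pod itself.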
t
hmm, not sure how to get that flow_run_id into the executor definition
c
Hmmm. It'd be super handy 😅 The most granular label I can use so far for getting the Dask logs of a flow run is the flow name
But if I've got 100s, that's gonna be a struggle
t
Sorry I haven't been able to provide an update here 😞 Could you possibly open an issue so we don't lose this thread?
c
Sure, will do!
🙏 1
t
thank you very much, that's a very helpful issue