# ask-community
c
Hey folks 👋 Does anyone have any recommendations/good practices for debugging/monitoring Prefect & K8s? I'm in a world of hurt at the mo where if a Prefect Job fails, I lose the pods and then I cannot access the logs. On AKS I can use Log Workspaces, but even then it's an awful experience as you have to try and query via time ranges and labels to identify the containers that just ran, then get the logs that way. On something like ECS and non-K8s, I can see all my terminated tasks and immediately click through to CloudWatch
👀 1
f
We have our Kubernetes cluster hosted on GCP and can view/access the logs there via our Prefect instance's Postgres DB
c
Ah, we're not deploying Prefect, just the agent. We're using Prefect Cloud
f
My bad 🙂
c
No worries, I should have been more explicit!
f
Usually @Kevin Kho helps me with my many questions 😂, maybe he can help you out here
k
My K8s knowledge is not strong so we have to wait for Tyler 😅
c
🤣 Haha I can wait!
t
Hiya Ciaran! You can use the `--disable-job-deletion` flag if you don't want the agent to clean up jobs. I wonder: if you added a preStop hook to sleep 200 seconds on the job config using a run_config job template, maybe you could add an "afterlife" to the jobs for debugging purposes. If that works, it gives me some more ideas
it looks like that's a flag on the start command, not the install command, which means you'd have to edit the manifest generated by install. Alternatively, you can add an env var `DELETE_FINISHED_JOBS=False` to your agent containers
i'm not sure how/if Log Workspaces translates Kubernetes labels, but there are certain labels on the job pods that represent Prefect attributes; you might want to use those in log hunting if they're accessible (`prefect.io/flow_id` and `prefect.io/flow_run_id` particularly)
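For reference, a minimal sketch of the preStop "afterlife" idea via a custom job template on the flow's run config. This is untested; the `job_template` skeleton and the `flow` container name are assumptions based on Prefect's default Kubernetes job template, and the `sleep 200` value just mirrors the suggestion above:
```python
# Sketch only: keep the flow-run container around briefly when its pod is
# deleted, so its logs can still be pulled before the job disappears.
from prefect import Flow
from prefect.run_configs import KubernetesRun

job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        # "flow" is the container name in Prefect's default
                        # Kubernetes job template (assumption).
                        "name": "flow",
                        "lifecycle": {
                            "preStop": {"exec": {"command": ["sleep", "200"]}}
                        },
                    }
                ]
            }
        }
    },
}

flow = Flow("debug-afterlife")  # stand-in for your real flow
flow.run_config = KubernetesRun(job_template=job_template)
```
The lower-effort route is still the agent-side `DELETE_FINISHED_JOBS=False` env var mentioned above, which leaves finished jobs (and their pods) in place so `kubectl logs` keeps working.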
c
Ah, that `DELETE_FINISHED_JOBS` env var does the ticket for the Prefect Job, @Tyler Wanner, thanks! My next challenge, should you wish to lose some hair, is how do I do this for the Dask containers my flow spins off? That's where our real pain points are: trying to find the containers for the jobs that ran, since our errors are usually with Dask.
🚀 1
Looks like the Dask worker pods don't get assigned the `flow_run_id` label
Turns out that `flow` tag is one I made. I wonder if I can access the flow run's ID to define this in my `DaskExecutor` `KubeCluster` config
Is there a way I can access the `flow_run_id` for something like this?
```python
import os

from dask_kubernetes import make_pod_spec
from prefect.executors import DaskExecutor

DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "pod_template": make_pod_spec(
            image=os.environ["AZURE_BAKERY_IMAGE"],
            # flow_name is defined elsewhere in the flow's .py file
            labels={"flow": flow_name},
            memory_limit=None,
            memory_request=None,
            env={
                "AZURE_STORAGE_CONNECTION_STRING": os.environ[
                    "FLOW_STORAGE_CONNECTION_STRING"
                ]
            },
        )
    },
    adapt_kwargs={"maximum": 10},
)
```
Currently that `flow` tag is defined in the flow's `.py` file, but I guess trying to resolve `flow_run_id` is another kettle of fish?
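One possible pattern, sketched here and not something confirmed in the thread: assuming `cluster_class` can also be a callable (the `DaskExecutor` docstring describes it as a string or callable), the `KubeCluster` can be built lazily at flow-run time, when `prefect.context` should already contain a `flow_run_id` to put into the worker pod labels. The `flow_name` placeholder stands in for the variable used in the snippet above; treat the whole thing as untested:
```python
# Sketch: build the KubeCluster at flow-run time so flow_run_id from
# prefect.context can be attached as a worker pod label.
import os

import prefect
from dask_kubernetes import KubeCluster, make_pod_spec
from prefect.executors import DaskExecutor

flow_name = "my-flow"  # placeholder for the flow_name variable used above


def make_cluster() -> KubeCluster:
    # prefect.context only carries flow_run_id while the flow is running,
    # which is exactly when DaskExecutor invokes this callable.
    flow_run_id = prefect.context.get("flow_run_id", "unknown")
    return KubeCluster(
        make_pod_spec(
            image=os.environ["AZURE_BAKERY_IMAGE"],
            labels={"flow": flow_name, "flow_run_id": flow_run_id},
            memory_limit=None,
            memory_request=None,
            env={
                "AZURE_STORAGE_CONNECTION_STRING": os.environ[
                    "FLOW_STORAGE_CONNECTION_STRING"
                ]
            },
        )
    )


executor = DaskExecutor(cluster_class=make_cluster, adapt_kwargs={"maximum": 10})
```
If that works, the Dask worker pods would carry a `flow_run_id` label to filter on, analogous to the `prefect.io/flow_run_id` label on the flow-run job pod itself.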
t
hmm, not sure how to get that flow_run_id into the executor definition
c
Hmmm. It'd be super handy 😅 The most granular label I can use so far for getting the Dask logs of a flow run is the flow name
But if I've got 100s, that's gonna be a struggle
t
Sorry I haven't been able to provide an update here 😞 Could you possibly open an issue so we don't lose this thread?
c
Sure, will do!
🙏 1
t
thank you very much, that's a very helpful issue