    ciaran

    1 year ago
    Hey folks 👋 Does anyone have any recommendations/good practices for debugging/monitoring Prefect & K8s? I'm in a world of hurt at the mo: if a Prefect Job fails, I lose the pods and then can't access the logs. On AKS I can use Log Analytics workspaces, but even then it's an awful experience, as you have to query by time ranges and labels to identify the containers that just ran, then pull the logs that way. On something like ECS (non-K8s), I can see all my terminated tasks and immediately click through to CloudWatch.
    Fabrice Toussaint

    1 year ago
    We have our Kubernetes cluster hosted on GCP and can view/access the logs there and in our Prefect server's Postgres DB.
    ciaran

    1 year ago
    Ah, we're not deploying Prefect, just the agent. We're using Prefect Cloud
    Fabrice Toussaint

    1 year ago
    My bad 🙂
    ciaran

    1 year ago
    No worries, I should have been more explicit!
    Fabrice Toussaint

    1 year ago
    Usually @Kevin Kho helps me with my many questions 😂 Maybe he can help you out here.
    Kevin Kho

    1 year ago
    My K8s knowledge is not strong so we have to wait for Tyler 😅
    ciaran

    1 year ago
    🤣 Haha I can wait!
    Tyler Wanner

    1 year ago
    Hiya Ciaran! You can use the --disable-job-deletion flag if you don't want the agent to clean up jobs. I wonder: if you added a preStop hook that sleeps 200 seconds to the job config using a run_config job template, maybe you could give the jobs an "afterlife" for debugging purposes. If that works, it gives me some more ideas.
    It looks like that's a flag on the start command, not the install command, which means you'd have to edit the manifest generated by install. Alternatively, you can add the env var DELETE_FINISHED_JOBS=False to your agent containers.
    I'm not sure how/if Log Workspaces translates Kubernetes labels, but there are certain labels on the job pods that represent Prefect attributes; you might want to use those in log hunting if they're accessible (prefect.io/flow_id and prefect.io/flow_run_id in particular).
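A rough sketch of that label-based log hunting with the official kubernetes Python client (the namespace and flow run ID below are placeholders, and it assumes the job pods still exist, i.e. job deletion has been disabled as described above):

    # List the pods that belong to one Prefect flow run via the
    # prefect.io/flow_run_id label, then print each pod's logs.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()

    namespace = "default"               # placeholder namespace
    flow_run_id = "<your-flow-run-id>"  # placeholder flow run ID

    pods = v1.list_namespaced_pod(
        namespace,
        label_selector=f"prefect.io/flow_run_id={flow_run_id}",
    )
    for pod in pods.items:
        print(f"--- {pod.metadata.name} ---")
        print(v1.read_namespaced_pod_log(pod.metadata.name, namespace))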
    ciaran

    1 year ago
    Ah, that DELETE_FINISHED_JOBS flag does the trick for the Prefect Job, @Tyler Wanner, thanks! My next challenge, should you wish to lose some hair, is how do I do this for the Dask containers my flow spins up? That's where our real pain points are: trying to find the containers for the jobs that ran, as our errors are usually with Dask.
    Looks like the Dask worker pods don't get assigned the flow_run_id label.
    Turns out that flow tag is one I made; I wonder if I can access the flow run's ID to define this in my DaskExecutor KubeCluster config.
    Is there a way I can access the flow_run_id for something like:
    DaskExecutor(
        cluster_class="dask_kubernetes.KubeCluster",
        cluster_kwargs={
            "pod_template": make_pod_spec(
                image=os.environ["AZURE_BAKERY_IMAGE"],
                labels={"flow": flow_name},
                memory_limit=None,
                memory_request=None,
                env={
                    "AZURE_STORAGE_CONNECTION_STRING": os.environ[
                        "FLOW_STORAGE_CONNECTION_STRING"
                    ]
                },
            )
        },
        adapt_kwargs={"maximum": 10},
    )
    ? Currently that flow tag is defined in the flow's .py file, but I guess trying to resolve flow_run_id is another kettle of fish?
    Tyler Wanner

    1 year ago
    Hmm, not sure how to get that flow_run_id into the executor definition.
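One possible approach, sketched here but not confirmed anywhere in this thread: Prefect 1.x's cluster_class parameter can also be given a callable, and the cluster is only created once the flow run starts, by which point prefect.context should already carry flow_run_id, so a small factory function could stamp the run ID onto the Dask worker pods as a label. The flow_name value and the prefect.io/flow_run_id label key are assumptions carried over from the snippets above:

    import os

    import prefect
    from dask_kubernetes import KubeCluster, make_pod_spec
    from prefect.executors import DaskExecutor

    flow_name = "my-flow"  # placeholder; the thread defines this in the flow's .py file


    def make_cluster(**kwargs):
        # prefect.context is populated for the duration of the flow run, which is
        # also when the executor builds the Dask cluster, so the flow run ID
        # should be readable here (assumption based on Prefect 1.x behaviour).
        flow_run_id = prefect.context.get("flow_run_id", "unknown")
        pod_template = make_pod_spec(
            image=os.environ["AZURE_BAKERY_IMAGE"],
            labels={"flow": flow_name, "prefect.io/flow_run_id": flow_run_id},
            memory_limit=None,
            memory_request=None,
            env={
                "AZURE_STORAGE_CONNECTION_STRING": os.environ[
                    "FLOW_STORAGE_CONNECTION_STRING"
                ]
            },
        )
        return KubeCluster(pod_template, **kwargs)


    executor = DaskExecutor(cluster_class=make_cluster, adapt_kwargs={"maximum": 10})

Whether Log Workspaces then surfaces that label is a separate question, but it would at least give kubectl and the kubernetes client something precise to select on.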
    ciaran

    1 year ago
    Hmmm. It'd be super handy 😅 The most granular label selection I can use so far for getting the Dask logs of a flow run is the flow name.
    But if I've got 100s of runs, that's gonna be a struggle.
    Tyler Wanner

    1 year ago
    Sorry I haven't been able to provide an update here 😞 Could you possibly open an issue so we don't lose this thread?
    ciaran

    1 year ago
    Sure, will do!
    Tyler Wanner

    1 year ago
    Thank you very much, that's a very helpful issue.