# prefect-server
d
Hi, I've got a problem running flows from Prefect Server in Kubernetes. It was working at one point, but now the flows get stuck in a
Submitted for execution
state. I found these docs and tried debugging/restarting services with them, but no luck:
Copy code
https://docs.prefect.io/orchestration/faq/debug.html#my-flow-is-stuck-in-a-submitted-state
https://discourse.prefect.io/t/why-is-my-flow-stuck-in-a-submitted-state/201
I also added these flags to the Kubernetes Prefect agent:
--log-level DEBUG --disable-job-deletion
That gave me more detail but still no indication of what the problem is. Any help would be appreciated, thanks!
This is the log from the agent after submitting a flow:
Copy code
[2022-06-22 21:00:01,562] DEBUG - agent | Querying for ready flow runs...
DEBUG:agent:Found 1 ready flow run(s): {'26c959b4-9edd-47c9-98e7-e3997f2dc186'}
[2022-06-22 21:00:01,601] DEBUG - agent | Found 1 ready flow run(s): {'26c959b4-9edd-47c9-98e7-e3997f2dc186'}
[2022-06-22 21:00:01,601] DEBUG - agent | Retrieving metadata for 1 flow run(s)...
DEBUG:agent:Retrieving metadata for 1 flow run(s)...
DEBUG:agent:Submitting flow run 26c959b4-9edd-47c9-98e7-e3997f2dc186 for deployment...
[2022-06-22 21:00:01,629] DEBUG - agent | Submitting flow run 26c959b4-9edd-47c9-98e7-e3997f2dc186 for deployment...
[2022-06-22 21:00:01,629] DEBUG - agent | Sleeping flow run poller for 0.25 seconds...
DEBUG:agent:Sleeping flow run poller for 0.25 seconds...
[2022-06-22 21:00:01,631] INFO - agent | Deploying flow run 26c959b4-9edd-47c9-98e7-e3997f2dc186 to execution environment...
INFO:agent:Deploying flow run 26c959b4-9edd-47c9-98e7-e3997f2dc186 to execution environment...
[2022-06-22 21:00:01,631] DEBUG - agent | Updating flow run 26c959b4-9edd-47c9-98e7-e3997f2dc186 state from Scheduled -> Submitted...
DEBUG:agent:Updating flow run 26c959b4-9edd-47c9-98e7-e3997f2dc186 state from Scheduled -> Submitted...
DEBUG:agent:Creating namespaced job prefect-job-c7b44f14
[2022-06-22 21:00:01,815] DEBUG - agent | Creating namespaced job prefect-job-c7b44f14
DEBUG:agent:Job prefect-job-c7b44f14 created
INFO:agent:Completed deployment of flow run 26c959b4-9edd-47c9-98e7-e3997f2dc186
[2022-06-22 21:00:01,851] DEBUG - agent | Job prefect-job-c7b44f14 created
[2022-06-22 21:00:01,851] INFO - agent | Completed deployment of flow run 26c959b4-9edd-47c9-98e7-e3997f2dc186
[2022-06-22 21:00:01,892] DEBUG - agent | Querying for ready flow runs...
Reading that log, it almost seems like the flow is working correctly and completes, but it never gets 'marked' as complete in the UI?
Completed deployment of flow
k
Wondering if it’s pointing at the right endpoint then
d
which endpoint should it be pointing at?
this is from the agent startup:
Copy code
dflake@dflake-thinkpad:~/$ kubectl logs arte-prefect-agent-7d94bfd6b4-tsxql -n arte-prefect 

[2022-06-22 20:57:48,996] DEBUG - agent | Environment variables: []
[2022-06-22 20:57:48,996] DEBUG - agent | Max polls: None
[2022-06-22 20:57:48,996] DEBUG - agent | Agent address: <http://0.0.0.0:8080>
[2022-06-22 20:57:48,996] DEBUG - agent | Log to Cloud: True
[2022-06-22 20:57:48,996] DEBUG - agent | Prefect backend: server
[2022-06-22 20:57:48,998] DEBUG - agent | Namespace: arte-prefect
[2022-06-22 20:57:48,998] INFO - agent | Registering agent...
[2022-06-22 20:57:49,213] INFO - agent | Registration successful!
[2022-06-22 20:57:49,213] DEBUG - agent | Assigned agent id: c419c65a-a229-49eb-91ee-6b17b22ac252
[2022-06-22 20:57:49,213] DEBUG - agent | Sending test query to API at '<http://arte-prefect-apollo.arte-prefect:4200/graphql>'...
[2022-06-22 20:57:49,227] DEBUG - agent | Test query successful!

 ____            __           _        _                    _
|  _ \ _ __ ___ / _| ___  ___| |_     / \   __ _  ___ _ __ | |_
| |_) | '__/ _ \ |_ / _ \/ __| __|   / _ \ / _` |/ _ \ '_ \| __|
|  __/| | |  __/  _|  __/ (__| |_   / ___ \ (_| |  __/ | | | |_
|_|   |_|  \___|_|  \___|\___|\__| /_/   \_\__, |\___|_| |_|\__|
                                           |___/

[2022-06-22 20:57:49,227] INFO - agent | Starting KubernetesAgent with labels []
[2022-06-22 20:57:49,227] INFO - agent | Agent documentation can be found at <https://docs.prefect.io/orchestration/>
[2022-06-22 20:57:49,227] INFO - agent | Waiting for flow runs...
[2022-06-22 20:57:49,227] DEBUG - agent | Sending agent heartbeat...
[2022-06-22 20:57:49,231] DEBUG - agent | Retrieving information of jobs that are currently in the cluster...
[2022-06-22 20:57:49,290] DEBUG - agent | Running thread pool with 6 workers to handle flow deployment
[2022-06-22 20:57:49,290] DEBUG - agent | Querying for ready flow runs...
[2022-06-22 20:57:49,294] DEBUG - agent | Agent API server listening on port <http://0.0.0.0:8080>
[2022-06-22 20:57:49,391] DEBUG - agent | Heartbeat succesful! Sleeping for 60.0 seconds...
[2022-06-22 20:57:49,393] DEBUG - agent | No ready flow runs found.
[2022-06-22 20:57:49,393] DEBUG - agent | Sleeping flow run poller for 0.5 seconds...
k
I believe the agent is right since it picked up the Flow. But the Flow must have an endpoint configured too, right? Does it?
d
I'm not sure; is that defined in the flow itself? It's a pretty basic hello_world flow that I'm trying to run:
Copy code
from prefect import Flow, task
from prefect.storage import Azure
from prefect.run_configs import KubernetesRun


FLOW_NAME = "azure_k8s"
STORAGE = Azure(
        container="arte-prefect",
        connection_string_secret="AZURE_STORAGE_CONNECTION_STRING"
)

@task(log_stdout=True)
def hello_world():
    text = f"hello from {FLOW_NAME}"
    print(text)
    return text


with Flow(
    FLOW_NAME, storage=STORAGE, run_config=KubernetesRun(job_template_path="/root/arte-tasks/k8s/prefect-job-template.yaml"),
) as flow:
    hw = hello_world()
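(For context: the server endpoint isn't defined in the flow code itself; it comes from the Prefect client config on the machine that registers the flow, and from the environment the agent passes to the flow run job. A minimal sketch of how a flow like this is typically registered against a server backend, where the project name "arte" and the file name flow.py are made up for illustration:)
Copy code
# switch the local Prefect client to the server backend
prefect backend server

# register the flow into a project on the server; the endpoint is taken from
# ~/.prefect/config.toml or the PREFECT__SERVER__ENDPOINT env var, not from the flow code
prefect register --project "arte" -p flow.py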
k
More like in the image or with an env var:
Copy code
PREFECT__SERVER__ENDPOINT=...
d
ah - checking
I added in the
PREFECT__SERVER__ENDPOINT
env var; now the k8s agent spec looks like this:
Copy code
spec:
      containers:
      - command:
        - bash
        - -c
        - prefect agent kubernetes start --log-level DEBUG --disable-job-deletion
        env:
        - name: PREFECT__CLOUD__API
          value: http://arte-prefect-apollo.arte-prefect:4200/graphql
        - name: PREFECT__SERVER__ENDPOINT
          value: https://prefect-api.arte.adobe.net/graphql
        - name: NAMESPACE
          value: arte-prefect
        - name: IMAGE_PULL_SECRETS
        - name: PREFECT__CLOUD__AGENT__LABELS
          value: '[]'
        - name: JOB_MEM_REQUEST
        - name: JOB_MEM_LIMIT
        - name: JOB_CPU_REQUEST
        - name: JOB_CPU_LIMIT
        - name: IMAGE_PULL_POLICY
        - name: SERVICE_ACCOUNT_NAME
          value: arte-prefect-serviceaccount
        - name: PREFECT__BACKEND
          value: server
        - name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
          value: http://0.0.0.0:8080
        image: prefecthq/prefect:1.1.0-python3.8
running another flow still has no luck though 😞
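(One way to sanity-check which API the agent actually resolves after those env var changes is to print its effective Prefect config from inside the agent pod. A sketch; the deployment name arte-prefect-agent is assumed from the pod names above:)
Copy code
# print the backend and the endpoints the agent's Prefect config resolves to
kubectl exec -n arte-prefect deploy/arte-prefect-agent -- \
  python -c "import prefect; print(prefect.config.backend, prefect.config.cloud.api, prefect.config.server.endpoint)"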
k
That looks pretty simple already. Do you have access to pod logs?
d
yes - I'm not seeing any change from before in the agent logs:
Copy code
dflake@dflake-thinkpad:~/adobe/git_repos/arte-tasks$ kubectl logs pod/arte-prefect-agent-5dcd9b5df9-pmkcq -n arte-prefect 

[2022-06-22 22:17:27,198] DEBUG - agent | Environment variables: []
[2022-06-22 22:17:27,199] DEBUG - agent | Max polls: None
[2022-06-22 22:17:27,199] DEBUG - agent | Agent address: <http://0.0.0.0:8080>
[2022-06-22 22:17:27,199] DEBUG - agent | Log to Cloud: True
[2022-06-22 22:17:27,199] DEBUG - agent | Prefect backend: server
[2022-06-22 22:17:27,291] DEBUG - agent | Namespace: arte-prefect
[2022-06-22 22:17:27,292] INFO - agent | Registering agent...
[2022-06-22 22:17:27,416] INFO - agent | Registration successful!
[2022-06-22 22:17:27,416] DEBUG - agent | Assigned agent id: c419c65a-a229-49eb-91ee-6b17b22ac252
[2022-06-22 22:17:27,417] DEBUG - agent | Sending test query to API at '<http://arte-prefect-apollo.arte-prefect:4200/graphql>'...
[2022-06-22 22:17:27,431] DEBUG - agent | Test query successful!

 ____            __           _        _                    _
|  _ \ _ __ ___ / _| ___  ___| |_     / \   __ _  ___ _ __ | |_
| |_) | '__/ _ \ |_ / _ \/ __| __|   / _ \ / _` |/ _ \ '_ \| __|
|  __/| | |  __/  _|  __/ (__| |_   / ___ \ (_| |  __/ | | | |_
|_|   |_|  \___|_|  \___|\___|\__| /_/   \_\__, |\___|_| |_|\__|
                                           |___/

[2022-06-22 22:17:27,431] INFO - agent | Starting KubernetesAgent with labels []
[2022-06-22 22:17:27,431] INFO - agent | Agent documentation can be found at <https://docs.prefect.io/orchestration/>
[2022-06-22 22:17:27,431] INFO - agent | Waiting for flow runs...
[2022-06-22 22:17:27,433] DEBUG - agent | Sending agent heartbeat...
[2022-06-22 22:17:27,491] DEBUG - agent | Retrieving information of jobs that are currently in the cluster...
[2022-06-22 22:17:27,492] DEBUG - agent | Running thread pool with 6 workers to handle flow deployment
[2022-06-22 22:17:27,493] DEBUG - agent | Querying for ready flow runs...
[2022-06-22 22:17:27,496] DEBUG - agent | Agent API server listening on port <http://0.0.0.0:8080>
[2022-06-22 22:17:27,591] DEBUG - agent | Heartbeat succesful! Sleeping for 60.0 seconds...
[2022-06-22 22:17:27,592] DEBUG - agent | No ready flow runs found.
[2022-06-22 22:17:27,593] DEBUG - agent | Sleeping flow run poller for 0.5 seconds...
k
Can you try adding
--show-flow-logs
to the agent start maybe?
d
I tried to do that but it didn't work; it seems like that flag is only for a local agent, not k8s
k
Ah, that makes sense; it would be hard for k8s. But yeah, you need to catch the logs of the Flow pod here, not the agent pod; I don't think the agent logs will be helpful
d
Yeah, that seems right, and it's also what I've been struggling with 🙂 Do you have any ideas/tips for getting the logs from the flow pods in k8s?
anything in the flow itself I could add? or maybe in the flow image?
k
Grabbing the flow pod's logs should be right, but you've got to do it while the pod is alive
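(Since the agent was started with --disable-job-deletion, the job it creates should stick around, so something like this can be used to find and follow the flow run pod; the job name is taken from the agent log above:)
Copy code
# list the prefect-job pods the agent created in the namespace
kubectl get pods -n arte-prefect | grep prefect-job

# follow the logs of the flow run job while its pod is alive
kubectl logs -f job/prefect-job-c7b44f14 -n arte-prefect

# if the pod never starts (e.g. an image pull problem), describe it to see the events
kubectl describe pod -n arte-prefect -l job-name=prefect-job-c7b44f14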
d
thanks I'll give it a try
@Kevin Kho, FYI: the problem turned out to be the custom image we were using for our job template
thanks so much for helping track this down!
k
Thanks but I didn’t do anything lol