alex

03/21/2023, 9:45 PM
Hello! I am still running into an issue where some of my Prefect 1.0 Cloud flow runs are stuck in a Scheduled state. How can I begin triaging this? It seems that some other users were having the same issue a few days back: https://prefect-community.slack.com/archives/CM28LL405/p1678385552717259

Mason Menges

03/21/2023, 9:49 PM
A build-up of scheduled runs could be due to an issue on the agent. Do you see anything in the agent logs? It's also sometimes helpful to toggle the schedule after clearing any late runs; if the scheduler has a significantly large queue to work through, it's possible it could be stuck. I'd start with the agent, though, and see if anything shows up in the logs there.
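If it helps, something along these lines (rough sketch, assumes your Prefect Cloud auth is already configured) should show how many Scheduled runs are already past their start time:
from datetime import datetime, timezone
from prefect import Client  # Prefect 1.x client

client = Client()  # assumes auth is already set up
now = datetime.now(timezone.utc).isoformat()

# List Scheduled runs whose start time has already passed ("late" runs)
query = f"""
query {{
  flow_run(where: {{state: {{_eq: "Scheduled"}}, scheduled_start_time: {{_lte: "{now}"}}}}) {{
    id
    name
    labels
    scheduled_start_time
  }}
}}
"""
result = client.graphql(query)
runs = result["data"]["flow_run"]
print(f"{len(runs)} late scheduled runs")
for run in runs:
    print(run["id"], run["name"], run["scheduled_start_time"])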

alex

03/21/2023, 9:52 PM
I see a lot of logs like this:
in _submit_deploy_flow_run_jobs
prefect-agent-54974798cd-z9rnr agent     flow_runs = self._get_flow_run_metadata(flow_run_ids)
prefect-agent-54974798cd-z9rnr agent   File "/usr/local/lib/python3.7/site-packages/prefect/agent/agent.py", line 688, in _get_flow_run_metadata
prefect-agent-54974798cd-z9rnr agent     result = self.client.graphql(query)
prefect-agent-54974798cd-z9rnr agent   File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 464, in graphql
prefect-agent-54974798cd-z9rnr agent     raise ClientError(result["errors"])
prefect-agent-54974798cd-z9rnr agent prefect.exceptions.ClientError: [{'path': ['flow_run', 0, 'id'], 'message': 'Cannot return null for non-nullable field flow_run.id.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
prefect-agent-2343242-kvsrn agent [2023-03-21 21:50:32,993] WARNING - kube-agent | Job 'prefect-job-231a6946' is for flow run '6640af73-b5d2-4c15-a925-17edd8a1e144' which does not exist. It will be ignored.
I have been cancelling the backlogged flows using the UI, so I am assuming the logs are related to that. I have scaled up the number of agents and also restarted them, but it hasn't helped.
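For what it's worth, cancelling through the API instead of the UI would look roughly like this (untested sketch; I'm assuming the cancel_flow_run mutation and its state field here, and the run id is a placeholder):
from prefect import Client  # Prefect 1.x client

client = Client()

# Placeholder id of a backlogged run to cancel
flow_run_id = "00000000-0000-0000-0000-000000000000"

# Assumes the cancel_flow_run mutation exposes a `state` field in its payload
mutation = f"""
mutation {{
  cancel_flow_run(input: {{flow_run_id: "{flow_run_id}"}}) {{
    state
  }}
}}
"""
print(client.graphql(mutation))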

Mason Menges

03/21/2023, 10:03 PM
Hmm, are you able to find that job (prefect-job-231a6946) on your k8s cluster, and does it have any pods running/associated with it? I'd check the pod logs if possible to see if they're erroring out or crashing. Also, what version of Prefect is your agent running?
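Something like this (rough sketch with the kubernetes Python client; the namespace is a guess, adjust to wherever the agent submits jobs) should tell you whether that job and its pods still exist and what phase they're in:
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() from inside the cluster
namespace = "default"      # adjust to the namespace your agent submits jobs to
job_name = "prefect-job-231a6946"

batch = client.BatchV1Api()
core = client.CoreV1Api()

# Does the job still exist?
jobs = batch.list_namespaced_job(namespace, field_selector=f"metadata.name={job_name}")
print("job found:", bool(jobs.items))

# Any pods created for it, and what phase are they in?
pods = core.list_namespaced_pod(namespace, label_selector=f"job-name={job_name}")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)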

alex

03/21/2023, 10:18 PM
The agent is running 1.2.2.
I don't see any jobs or pods associated with that job. I tried to use the GraphQL API to get more information on a "Scheduled" flow run, and this is what I see. This flow actually ran successfully yesterday, but it is stuck as pending when I manually triggered it.
query check_flow_run_ids {
  flow_run(where: {id: {_in: ["edb2bb1e-43be-4d05-9f10-875bff72afab"]}}) {
    id
    state
    created
    end_time
    state_message
    name
    labels
    agent {
      id
    }
    flow_id
    times_resurrected
    
  }
}
{
  "data": {
    "flow_run": [
      {
        "id": "edb2bb1e-43be-4d05-9f10-875bff72afab",
        "state": "Scheduled",
        "created": "2023-03-21T22:07:13.959532+00:00",
        "end_time": null,
        "state_message": "Flow run scheduled.",
        "name": "fancy-carp",
        "labels": [
          "a",
          "b",
          "c"
        ],
        "agent": null,
        "flow_id": "a166f70b-cbf1-4d4c-9858-8d1bd4401d82",
        "times_resurrected": 0
      }
    ]
  }
}
I can see that an agent whose labels are a superset of the run's labels is active:
{
        "id": "6c6461de-659c-447d-9c08-432fc47d4773",
        "name": "mall-data-kube-agent",
        "labels": [
          "label2",
          "a",
          "b",
          "na-build-index",
          "na-build-index-dev",
          "c",
          "label1",
        ],
        "last_queried": "2023-03-21T22:17:50.783869+00:00"
      },
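(That agent record came from querying the agent table, roughly like this sketch:)
from prefect import Client  # Prefect 1.x client

client = Client()

# Sketch of the lookup; the fields match the agent record above
query = """
query {
  agent {
    id
    name
    labels
    last_queried
  }
}
"""
print(client.graphql(query)["data"]["agent"])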

Matt Conger

03/22/2023, 12:57 AM
Hey @alex, just want to confirm: have you cycled the schedule? Did you happen to change anything (labels, agents, config, run config, re-deployed flows, the flow group, etc.) the day the flows began getting stuck?

alex

03/22/2023, 4:14 PM
I have not cycled the schedules, but this impacts multiple flows with different schedules. I can give that a try. I did not make any changes to the flows; the flow version has been the same for a few months, and the agent has had the same labels too (it has been restarted, scaled up, etc.).
This issue began around March 9.
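Since several flows are affected, I'd probably cycle the schedules through the API rather than the UI; a rough sketch (assuming the set_schedule_inactive / set_schedule_active mutations, using the flow_id from the run queried above) would be:
from prefect import Client  # Prefect 1.x client

client = Client()

# flow_id of one of the affected flows (from the flow run queried earlier)
flow_id = "a166f70b-cbf1-4d4c-9858-8d1bd4401d82"

# Assumes the set_schedule_inactive / set_schedule_active mutations; toggle off, then back on
for mutation_name in ("set_schedule_inactive", "set_schedule_active"):
    mutation = f"""
    mutation {{
      {mutation_name}(input: {{flow_id: "{flow_id}"}}) {{
        success
      }}
    }}
    """
    print(mutation_name, client.graphql(mutation))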

Matt Conger

03/22/2023, 7:36 PM
Hey Alex, I think it would be beneficial to set up a call to go over this and troubleshoot, if you are interested. Would you be able to send me an email at matthew@prefect.io, and I can send you a link with some possible meeting times? Thanks!
🙏 1

alex

03/23/2023, 10:09 PM
I managed to resolve this issue. This error:
prefect-agent-747c65c767-hvc4q agent Traceback (most recent call last):
prefect-agent-747c65c767-hvc4q agent   File "/usr/local/lib/python3.9/site-packages/prefect/agent/agent.py", line 328, in _submit_deploy_flow_run_jobs
prefect-agent-747c65c767-hvc4q agent     flow_runs = self._get_flow_run_metadata(flow_run_ids)
prefect-agent-747c65c767-hvc4q agent   File "/usr/local/lib/python3.9/site-packages/prefect/agent/agent.py", line 688, in _get_flow_run_metadata
prefect-agent-747c65c767-hvc4q agent     result = self.client.graphql(query)
prefect-agent-747c65c767-hvc4q agent   File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 465, in graphql
prefect-agent-747c65c767-hvc4q agent     raise ClientError(result["errors"])
prefect-agent-747c65c767-hvc4q agent prefect.exceptions.ClientError: [{'path': ['flow_run', 0, 'id'], 'message': 'Cannot return null for non-nullable field flow_run.id.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
meant that the agent was unable to execute any flow runs. This is the query that the agent executes:
query oof  {
    flow_run(where: { id: { _in: ["my-flow-ids"...] }, _or: [{ state: { _eq: "Scheduled" } }, { state: { _eq: "Running" }, task_runs: { state_start_time: { _lte: "2023-03-23T21:41:38.836741+00:00" } } }] }) {
        id
        version
        state
        serialized_state
        parameters
        scheduled_start_time
        run_config
        name
        flow {
            storage
            version
            environment
            core_version
            id
            name
        }
        task_runs(where: { state_start_time: { _lte: "2023-03-23T21:41:38.836741+00:00" } }) {
            serialized_state
            version
            id
            task_id
        }
    }
}
Two of the flow_ids passed to the query were leading to the error above. When I removed the
flow {
            storage
            version
            environment
            core_version
            id
            name
        }
clause from the query, it actually worked fine, including returning ids for the troublesome flow runs. I used the delete_flow_run mutation to delete the two runs (the cancel_flow_run mutation was failing with another id error), and my agent is working fine now. Hopefully this can help your team identify the root cause of what looks like a data inconsistency or API issue and prevent it in the future.
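For reference, the cleanup was roughly this (sketch; the ids below are placeholders for the two problematic runs):
from prefect import Client  # Prefect 1.x client

client = Client()

# Placeholder ids for the two flow runs that broke the agent's query
bad_run_ids = [
    "00000000-0000-0000-0000-000000000001",
    "00000000-0000-0000-0000-000000000002",
]

for run_id in bad_run_ids:
    # delete_flow_run removes the run entirely (cancel_flow_run kept failing with an id error)
    mutation = f"""
    mutation {{
      delete_flow_run(input: {{flow_run_id: "{run_id}"}}) {{
        success
      }}
    }}
    """
    print(run_id, client.graphql(mutation))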
I am facing this again now.