alex
03/21/2023, 9:45 PM
Mason Menges
03/21/2023, 9:49 PM
alex
03/21/2023, 9:52 PM
in _submit_deploy_flow_run_jobs
prefect-agent-54974798cd-z9rnr agent flow_runs = self._get_flow_run_metadata(flow_run_ids)
prefect-agent-54974798cd-z9rnr agent File "/usr/local/lib/python3.7/site-packages/prefect/agent/agent.py", line 688, in _get_flow_run_metadata
prefect-agent-54974798cd-z9rnr agent result = self.client.graphql(query)
prefect-agent-54974798cd-z9rnr agent File "/usr/local/lib/python3.7/site-packages/prefect/client/client.py", line 464, in graphql
prefect-agent-54974798cd-z9rnr agent raise ClientError(result["errors"])
prefect-agent-54974798cd-z9rnr agent prefect.exceptions.ClientError: [{'path': ['flow_run', 0, 'id'], 'message': 'Cannot return null for non-nullable field flow_run.id.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
prefect-agent-2343242-kvsrn agent [2023-03-21 21:50:32,993] WARNING - kube-agent | Job 'prefect-job-231a6946' is for flow run '6640af73-b5d2-4c15-a925-17edd8a1e144' which does not exist. It will be ignored.
I have been cancelling the backlogged flows using the UI so I am assuming the logs are related to that. I have scaled up the number of agents and also restarted them but it hasn't helped.
Mason Menges
03/21/2023, 10:03 PM
alex
03/21/2023, 10:18 PM
1.2.2
I don't see any jobs or pods associated with that job.
I tried to use the GQL API to get more information on a "Scheduled" flow run and this is what I see. This flow actually ran successfully yesterday but is stuck as pending when I manually triggered it.
query check_flow_run_ids {
  flow_run(where: { id: { _in: ["edb2bb1e-43be-4d05-9f10-875bff72afab"] } }) {
    id
    state
    created
    end_time
    state_message
    name
    labels
    agent {
      id
    }
    flow_id
    times_resurrected
  }
}
{
  "data": {
    "flow_run": [
      {
        "id": "edb2bb1e-43be-4d05-9f10-875bff72afab",
        "state": "Scheduled",
        "created": "2023-03-21T22:07:13.959532+00:00",
        "end_time": null,
        "state_message": "Flow run scheduled.",
        "name": "fancy-carp",
        "labels": [
          "a",
          "b",
          "c"
        ],
        "agent": null,
        "flow_id": "a166f70b-cbf1-4d4c-9858-8d1bd4401d82",
        "times_resurrected": 0
      }
    ]
  }
}
I can see that an agent whose labels are a superset of the flow run's labels is active:
{
  "id": "6c6461de-659c-447d-9c08-432fc47d4773",
  "name": "mall-data-kube-agent",
  "labels": [
    "label2",
    "a",
    "b",
    "na-build-index",
    "na-build-index-dev",
    "c",
    "label1"
  ],
  "last_queried": "2023-03-21T22:17:50.783869+00:00"
},
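The agent info above came from a query roughly along these lines (just a sketch; the field names are the ones visible in the output, and I'm assuming the agent table exposes them directly):
query check_agents {
  # assuming the agent table exposes the same fields shown in the output above
  agent {
    id
    name
    labels
    last_queried
  }
}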
Matt Conger
03/22/2023, 12:57 AM
alex
03/22/2023, 4:14 PM
Matt Conger
03/22/2023, 7:36 PM
alex
03/23/2023, 10:09 PM
prefect-agent-747c65c767-hvc4q agent Traceback (most recent call last):
prefect-agent-747c65c767-hvc4q agent File "/usr/local/lib/python3.9/site-packages/prefect/agent/agent.py", line 328, in _submit_deploy_flow_run_jobs
prefect-agent-747c65c767-hvc4q agent flow_runs = self._get_flow_run_metadata(flow_run_ids)
prefect-agent-747c65c767-hvc4q agent File "/usr/local/lib/python3.9/site-packages/prefect/agent/agent.py", line 688, in _get_flow_run_metadata
prefect-agent-747c65c767-hvc4q agent result = self.client.graphql(query)
prefect-agent-747c65c767-hvc4q agent File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 465, in graphql
prefect-agent-747c65c767-hvc4q agent raise ClientError(result["errors"])
prefect-agent-747c65c767-hvc4q agent prefect.exceptions.ClientError: [{'path': ['flow_run', 0, 'id'], 'message': 'Cannot return null for non-nullable field flow_run.id.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
This error meant that the agent was unable to execute any flow runs.
This is the query that the agent executes.
query oof {
  flow_run(
    where: {
      id: { _in: ["my-flow-ids"...] }
      _or: [
        { state: { _eq: "Scheduled" } }
        { state: { _eq: "Running" }, task_runs: { state_start_time: { _lte: "2023-03-23T21:41:38.836741+00:00" } } }
      ]
    }
  ) {
    id
    version
    state
    serialized_state
    parameters
    scheduled_start_time
    run_config
    name
    flow {
      storage
      version
      environment
      core_version
      id
      name
    }
    task_runs(where: { state_start_time: { _lte: "2023-03-23T21:41:38.836741+00:00" } }) {
      serialized_state
      version
      id
      task_id
    }
  }
}
Two of the flow_ids passed to the query were leading to the error above. When I removed the
flow {
storage
version
environment
core_version
id
name
}
clause from the query, it actually worked fine, including returning ids for the troublesome flow runs.
I used the delete_flow_run mutation to delete the two runs (cancelling the flow run was failing with another id error) and my agent is working fine now.
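For anyone who hits this, the delete looked roughly like this (a sketch; I'm assuming the usual input / success payload shape for the mutation, and "<flow-run-id>" is a placeholder for one of the problematic flow run ids):
mutation delete_stuck_run {
  # "<flow-run-id>" is a placeholder for one of the problematic flow run ids
  # assuming the usual input / success payload shape for this mutation
  delete_flow_run(input: { flow_run_id: "<flow-run-id>" }) {
    success
  }
}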
Hopefully this can help your team identify the root cause of what looks like a data inconsistency or API issue and prevent it in the future.