# prefect-community
a
I ran a flow-of-flows process that was supposed to kick off 12,000 flows on GKE. It started 10,000 flows and then failed with the message:
```
Pod prefect-job-5e3af599-tl2xs failed. No container statuses found for pod
```
Where can I look for a more detailed message? Any idea what actually caused it to fail? Also, of those jobs, all but one passed. The one that failed had the same message (no container statuses found).
k
Is the pod still up? You could look for the pod logs.
a
It's not. I think it went down right away when it failed. Any way to see logs on old pods?
k
I think the pod still needs to exist. I assume it doesn't?
a
No, it doesn't.
k
I know this is hard to pair, but does that have a corresponding run in Prefect Cloud?
So this happens when there is a Failed pod - Prefect is the one that emits that log here. For example here, there is an underlying issue with the pod.
a
Hmm, OK. I'm not seeing other errors like he is, but it does seem pretty similar.
a
It looks like your flow run was Submitted to your infrastructure and Prefect:
• submitted a Kubernetes job for the flow run,
• the flow run started (moved to a Running state),
• but then something failed - it could be that the container image couldn't be pulled from the container registry, the Prefect flow couldn't be pulled from storage, or there was some issue allocating resources for the run.
As a result, the flow run was marked as Failed - this is my understanding. You may tackle this issue using a flow-level state handler - if you see this specific type of error, create a new flow run of this flow to sort of "restart/retrigger" the entire process. Are you on Prefect Cloud? Can you send an example flow run ID so I can check the logs and confirm?
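For reference, a minimal sketch of such a flow-level state handler in Prefect 1.x, assuming the goal is simply to create a fresh run of the same flow whenever it ends in a Failed state (the flow name and run name are illustrative, not from the thread):
```python
# Minimal sketch (Prefect 1.x): a flow-level state handler that retriggers the
# flow by creating a brand-new flow run whenever the run ends in a Failed state.
# Assumes the flow runs against Prefect Cloud so "flow_id" is present in context.
import prefect
from prefect import Flow
from prefect.client import Client


def retrigger_on_failure(flow, old_state, new_state):
    if new_state.is_failed():
        # Create a fresh flow run of the same flow via the Prefect backend.
        Client().create_flow_run(
            flow_id=prefect.context.get("flow_id"),
            run_name="retriggered-run",  # illustrative run name
        )
    return new_state


with Flow("parent-flow", state_handlers=[retrigger_on_failure]) as flow:
    ...  # tasks go here
```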
a
I am on Prefect Cloud. 94fb6788-3052-4c41-9a38-579d584c6fd7 is a flow ID.
And it did start: it ran some tasks successfully, then ran a mapped task 10,000 times successfully before it failed. I would rather not restart the entire process, but I would like to be able to restart from the point of failure.
a
Thanks for providing more info. I guess I was confused when you said in the original message that you triggered 10,000 flow runs, but it looks like it's rather a single flow run with 10,000 mapped task runs, correct? Let me check the logs.
a
The mapped task is a create_flow_run task.
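For context, a minimal sketch of that flow-of-flows pattern in Prefect 1.x - a parent flow mapping the create_flow_run task over many parameter sets; the flow name, project name, and parameters are illustrative assumptions:
```python
# Minimal sketch (Prefect 1.x) of a parent flow that maps create_flow_run over
# many parameter sets to kick off child flow runs.
from prefect import Flow, unmapped
from prefect.tasks.prefect import create_flow_run

with Flow("parent-flow") as parent_flow:
    # One child flow run per parameter set (~10,000 in the scenario above).
    child_run_ids = create_flow_run.map(
        parameters=[{"n": i} for i in range(10_000)],
        flow_name=unmapped("child-flow"),
        project_name=unmapped("my-project"),
    )
```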
a
I see, so the ID you sent me is the flow run ID of the parent flow run that triggered 10,000 child flow runs via a mapped create_flow_run task?
a
yes
a
Thanks, the logs are helpful:
```
Failed to set task state with error: ClientError([{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID dbd483e5-a5f9-4155-b0a5-a6e96e6e8c2b: provided a running state but associated flow run 94fb6788-3052-4c41-9a38-579d584c6fd7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}])
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
    state = self.client.set_task_run_state(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 1598, in set_task_run_state
    result = self.graphql(
  File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID dbd483e5-a5f9-4155-b0a5-a6e96e6e8c2b: provided a running state but associated flow run 94fb6788-3052-4c41-9a38-579d584c6fd7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
```
It looks like there are too many flow runs queued up and the execution layer cannot process them all at once. I wonder if you could try setting a concurrency limit to mitigate this? E.g. perhaps a concurrency limit of 500 for this child flow, to ensure that your execution layer can handle the load better rather than having all those child flow runs created at once?
a
I thought I was seeing those after the pod failed - so the tasks had been queued up, but the flow had already failed.
But yeah, that does make sense. Is there a limit on what the execution layer can process at once? There were never more than 30 runs going at once. (I actually was going to ask about that too - it took 4 hours to create all those flow runs, and I was wondering why it took so long.)
a
> Is there a limit on what the execution layer can process at once?
There isn't - it depends on what infrastructure (your K8s cluster) can handle.
> I was wondering why it took so long
You had the right intuition, it takes some time to:
• create the underlying K8s job for each flow run,
• and communicate state updates with the Prefect backend.
Queuing them using concurrency limits could help prevent them from being submitted all at once (it would kind of batch-submit them), which could mitigate some flow runs getting stuck.
a
How do I queue with concurrency limits?
k
You can add a limit on the label of the flow runs like this, so that there is a maximum number of them executing at once.
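A minimal sketch of that approach, assuming the child flow is registered with a dedicated label and a flow-run concurrency limit (e.g. 500) is then configured on that label in Prefect Cloud; the label name and limit are illustrative assumptions:
```python
# Minimal sketch (Prefect 1.x): attach a dedicated label to the child flow's
# run config. Any flow run carrying this label counts against a concurrency
# limit set on that label in Prefect Cloud, which caps how many run at once.
from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("child-flow") as child_flow:
    ...  # child flow tasks

child_flow.run_config = KubernetesRun(labels=["child-flows"])
```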
a
Is it the same thing? It's not really a problem that so many flows are executing at once - it seems like the issue is with creating the flow runs (if I'm understanding correctly).
k
This would block new ones from being created if there are no concurrency slots open, so your burst is reduced.