Andrew Lawlor (05/03/2022, 6:25 PM):
Pod prefect-job-5e3af599-tl2xs failed. No container statuses found for pod
where can i look for a more detailed message? any idea what actually caused it to fail?
also, of those jobs, all but one passed. the one that failed had the same message (no container statuses found).
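More detail than "No container statuses found for pod" usually lives in the pod's Kubernetes events and status. A minimal sketch with the kubernetes Python client (the namespace and kubeconfig access are assumptions; the pod name comes from the message above):

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

pod_name = "prefect-job-5e3af599-tl2xs"  # from the failure message above
namespace = "default"                    # placeholder; use the Prefect agent's namespace

# Events usually explain why no container ever started (image pull errors,
# scheduling failures, resource quotas, evictions, ...).
events = v1.list_namespaced_event(
    namespace, field_selector=f"involvedObject.name={pod_name}"
)
for event in events.items:
    print(event.reason, event.message)

# If the pod object still exists, its status carries the same information.
try:
    pod = v1.read_namespaced_pod(pod_name, namespace)
    print(pod.status.phase, pod.status.container_statuses, pod.status.conditions)
except ApiException as exc:
    print(f"pod no longer available: {exc.status}")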
[Follow-up messages exchanged by Kevin Kho, Andrew Lawlor, and Anna Geller between 6:26 PM and 7:29 PM were not captured in this archive.]
Anna Geller (05/03/2022, 7:36 PM):
> Failed to set task state with error: ClientError([{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID dbd483e5-a5f9-4155-b0a5-a6e96e6e8c2b: provided a running state but associated flow run 94fb6788-3052-4c41-9a38-579d584c6fd7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}])
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.9/site-packages/prefect/engine/cloud/task_runner.py", line 91, in call_runner_target_handlers
>     state = self.client.set_task_run_state(
>   File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 1598, in set_task_run_state
>     result = self.graphql(
>   File "/usr/local/lib/python3.9/site-packages/prefect/client/client.py", line 473, in graphql
>     raise ClientError(result["errors"])
> prefect.exceptions.ClientError: [{'path': ['set_task_run_states'], 'message': 'State update failed for task run ID dbd483e5-a5f9-4155-b0a5-a6e96e6e8c2b: provided a running state but associated flow run 94fb6788-3052-4c41-9a38-579d584c6fd7 is not in a running state.', 'extensions': {'code': 'INTERNAL_SERVER_ERROR'}}]
It looks like there are too many flow runs queued up and the execution layer cannot process them all at once. I wonder if you could try setting a concurrency limit to mitigate this? E.g. perhaps setting a concurrency limit of 500 for this child flow run to ensure that your execution layer can better handle this, rather than having all those child flow runs created all at once?
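One way to keep the cluster from being asked to start all the child runs at once, in Prefect 1.x, is to create the child flow runs from the parent in bounded batches via the Client API. This is a rough sketch, not Anna's exact suggestion (she is presumably referring to Prefect Cloud's flow-run concurrency limits); the batch size, polling interval, and the choice to wait for each batch to finish are illustrative assumptions:

import time

from prefect import Client

def run_children_in_batches(flow_id, parameter_sets, batch_size=50, poll_seconds=30):
    # Create child flow runs in fixed-size batches and wait for each batch to
    # finish before starting the next, so the K8s cluster and the Prefect
    # backend never see hundreds of new runs at the same time.
    client = Client()
    for i in range(0, len(parameter_sets), batch_size):
        batch = parameter_sets[i : i + batch_size]
        run_ids = [
            client.create_flow_run(flow_id=flow_id, parameters=params)
            for params in batch
        ]
        # Naive wait: poll until every run in this batch reaches a finished state.
        while True:
            states = [client.get_flow_run_info(run_id).state for run_id in run_ids]
            if all(state.is_finished() for state in states):
                break
            time.sleep(poll_seconds)

A flow-run concurrency limit configured in Prefect Cloud (attached to the flow's labels in 1.x) achieves a similar throttling effect without changing the parent flow, which is likely what the "limit of 500" above refers to.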
[Andrew Lawlor's reply at 7:38 PM was not captured in this archive.]
Anna Geller (05/03/2022, 8:57 PM):
> is there a limit on what the execution layer can process at once?
There isn't; it depends on what infrastructure (your K8s cluster) can handle.
> i was wondering why it took so long
You had the right intuition. It takes some time to:
• create the underlying K8s job for each flow run
• communicate state updates with the Prefect backend.
Queuing the runs using concurrency limits could help prevent them from being submitted all at once (it would effectively batch-submit them), which could mitigate some flow runs getting stuck.
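Since each flow run on a Kubernetes agent becomes its own K8s job, what the cluster "can handle" also depends on the resources each job asks for. Purely as illustration (the image name, label, and resource values are placeholders, not taken from this thread), a Prefect 1.x run config with explicit requests and limits looks roughly like this:

from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("child-flow") as flow:
    ...  # tasks go here

# Explicit requests/limits let the K8s scheduler place (or refuse) each
# flow-run job predictably instead of overcommitting nodes.
flow.run_config = KubernetesRun(
    image="my-registry/child-flow:latest",  # placeholder image
    labels=["k8s"],                         # agent label; Cloud flow-run concurrency limits attach to labels
    cpu_request="250m",
    cpu_limit="500m",
    memory_request="256Mi",
    memory_limit="512Mi",
)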
[The thread continued on 05/04/2022 with messages from Andrew Lawlor (2:39 PM, 2:57 PM) and Kevin Kho (2:43 PM, 3:05 PM) that were not captured in this archive.]