I am having a sporadic issue. I can't create an MRE because it seems to occur randomly, so I am mostly hoping that someone can point me to other discussions or GitHub issues where this has come up. I have been searching through Discourse, Slack, and GitHub issues for a possible solution.
We run a dozen or so flows at the same time. Sometimes one or two are marked as failed because a number of tasks crashed with no error. Sometimes a flow is marked as crashed even though I can see in GKE that it is still running. Has anyone encountered similar issues?
Setup: Prefect 2 on GKE, with flows running as Kubernetes jobs and using the Dask task runner. A parent flow starts the deployments and waits for them to report a status.
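To make the parent/child pattern concrete, here is a rough, stdlib-only sketch of the "start the runs, then wait for a status" loop. It is not our actual code and it does not call Prefect: `wait_for_runs` and the `get_state` callable are hypothetical names, and the set of terminal state names is an assumption meant to mirror Prefect 2's terminal states.

```python
import time
from typing import Callable, Dict, Iterable

# Assumption: state names that count as final, mirroring Prefect 2's terminal states.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CRASHED", "CANCELLED"}


def wait_for_runs(
    run_ids: Iterable[str],
    get_state: Callable[[str], str],  # hypothetical: returns the current state name of a run
    poll_interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll every child run until it is terminal or the timeout expires."""
    pending = set(run_ids)
    final: Dict[str, str] = {}
    deadline = time.monotonic() + timeout

    while pending:
        # Iterate over a sorted copy so the set can be updated afterwards.
        for run_id in sorted(pending):
            state = get_state(run_id).upper()
            if state in TERMINAL_STATES:
                final[run_id] = state
        pending.difference_update(final)

        if not pending:
            break
        if time.monotonic() >= deadline:
            # Runs that never reached a terminal state are reported separately.
            for run_id in pending:
                final[run_id] = "TIMED_OUT"
            break
        sleep(poll_interval)

    return final
```

In a loop like this, a child reported as crashed is accepted as final on the first poll, so a run that is marked crashed while its Kubernetes job is still alive would be recorded as crashed by the parent.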
Is our agent potentially too small? It has 0.5 vCPU and 2Gi of memory.