# prefect-community

Nick Coy

09/19/2022, 4:12 PM
Hello, I cannot seem to run more than one k8s job at a time. If I try to kick off two jobs, the first runs but the second fails with a 500 error.

Nate

09/19/2022, 4:17 PM
Hi @Nick Coy, it's certainly possible to run more than one k8s job at a time. Can you give a minimal example of what's failing in your case? It would also be helpful to know whether you're talking about Prefect 1 or 2.

Nick Coy

09/19/2022, 4:24 PM
Hi @Nate, I am on Prefect 2.4.0 and pretty much followed this guide, except for adding a non-default service account and namespace. I am running on GKE. Here is the traceback:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 227, in _submit_run_and_capture_errors
    result = await infrastructure.run(task_status=task_status)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 230, in run
    job_name = await run_sync_in_worker_thread(self._create_job, manifest)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 57, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(call, cancellable=True)
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 442, in _create_job
    job = batch_client.create_namespaced_job(self.namespace, job_manifest)
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 275, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '2d947d59-9c20-4c13-b486-ad6dd5d0ca81', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '533151b7-d867-4fe4-b076-855b6b27cb51', 'X-Kubernetes-Pf-Prioritylevel-Uid': '239c3359-fc1f-4d28-a5a1-0648fe373cef', 'Date': 'Mon, 19 Sep 2022 16:10:35 GMT', 'Content-Length': '264'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"The POST operation against Job.batch could not be completed at this time, please try again.","reason":"ServerTimeout","details":{"name":"POST","group":"batch","kind":"Job"},"code":500}
```

Nate

09/19/2022, 4:33 PM
hmm, I would recommend using the helm chart to deploy the agent on k8s (we'll try to get a recipe / guide for this soon). I can't say I know why you'd be getting timeouts on your cluster here (it could be because the recipe you linked references old Prefect image versions?), but using the helm chart has worked well for me and it minimizes opportunities for misconfiguration of your cluster. If you go this route, feel free to ask any clarifying questions here!
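For reference, a minimal sketch of the helm-based agent install suggested above; the repo URL and chart name are assumptions based on PrefectHQ's helm repository, so inspect the chart's values before relying on any specific option names:

```shell
# Add PrefectHQ's chart repository (assumed URL) and refresh the local index
helm repo add prefect https://prefecthq.github.io/prefect-helm
helm repo update

# Inspect the chart's configurable values (API URL/key, work queue, etc.)
helm show values prefect/prefect-agent

# Install the agent into the "prefect" namespace
helm install prefect-agent prefect/prefect-agent \
  --namespace prefect --create-namespace
```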

Nick Coy

09/19/2022, 4:59 PM
ok, I will try out the helm chart. I'll let you know if I have any questions.
👍 1
@Nate I wanted to circle back on this. I found someone with a similar issue. It was caused by hardcoding the job name in the `base_manifest` for the k8s job. Once I removed `name` from `metadata`, I have no issue running multiple jobs at once.
```python
from prefect.infrastructure import KubernetesJob
from prefect.blocks.system import Secret

secret_prefect_url = Secret.load("prefect-api-url")
secret_prefect_api = Secret.load("prefect-api-key")

# Note: no "name" key under "metadata" -- hardcoding it was the cause of the
# 500s, since concurrent flow runs tried to create Jobs with the same name.
# "parallelism" and "completions" are Job spec fields, so they live under
# "spec" rather than "template.spec".
base_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        "namespace": "prefect",
        "labels": {},
    },
    "spec": {
        "ttlSecondsAfterFinished": 180,
        "parallelism": 1,
        "completions": 1,
        "template": {
            "spec": {
                "serviceAccountName": "prefect2",
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "prefect-job",
                        "env": [],
                    }
                ],
            }
        },
    },
}

k8_job = KubernetesJob(
    image="us.gcr.io/hl-database/prefect_docker:latest",
    job=base_manifest,
    env={
        "PREFECT_API_URL": secret_prefect_url.get(),
        "PREFECT_API_KEY": secret_prefect_api.get(),
    },
    image_pull_policy="IfNotPresent",
    name="k8-job-flow",
    namespace="prefect",
    service_account_name="prefect2",
    command=["python", "-m", "prefect.engine"],
    pod_watch_timeout_seconds=300,
    # finished_job_ttl=300 is not in the current Prefect release; it should
    # land in the next release after 2.4.0
)

if __name__ == "__main__":
    k8_job.save(name="k8-job-flow", overwrite=True)
```
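The collision behind the original 500s can be sketched in plain Python. `unique_job_name` below is a hypothetical helper, not Prefect code; as I understand it, Prefect generates a unique job name per flow run when `metadata.name` is left out of the manifest, which is why removing the hardcoded name fixes concurrent runs:

```python
import uuid

def unique_job_name(base: str) -> str:
    # Hypothetical helper: append a short random suffix so two concurrent
    # flow runs never submit Jobs with the same metadata.name
    return f"{base}-{uuid.uuid4().hex[:8]}"

base_metadata = {"namespace": "prefect", "labels": {}}  # no hardcoded "name"

# Simulate two concurrent flow runs building their Job manifests
job_a = {"metadata": {**base_metadata, "name": unique_job_name("k8-job-flow")}}
job_b = {"metadata": {**base_metadata, "name": unique_job_name("k8-job-flow")}}

# With a hardcoded name both manifests would target the same Job object;
# with generated names they are always distinct
print(job_a["metadata"]["name"] != job_b["metadata"]["name"])  # True
```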

Nate

09/19/2022, 8:43 PM
ahh nice catch - thanks for surfacing that here!
🙂 1