# ask-community
t
I am wondering if someone can help me. I was interested in running some distributed model training on Kubernetes using Horovod, and I was wondering how I would go about incorporating that into a Prefect flow. I am assuming I would use something like RunNamespacedJob (i.e., https://docs.prefect.io/api/latest/tasks/kubernetes.html#runnamespacedjob) with the Kubernetes manifest for an MPIJob (from the https://github.com/kubeflow/mpi-operator library) given in the body of the task. Is that roughly correct?
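Concretely, I was picturing something like this (an untested sketch; the manifest is trimmed from the mpi-operator README example, and the image and names are just placeholders):
```python
# Untested sketch: pass an MPIJob manifest as the body of RunNamespacedJob.
# The manifest is trimmed from the mpi-operator README example; the image
# and resource names are placeholders.
from prefect import Flow
from prefect.tasks.kubernetes import RunNamespacedJob

mpi_job = {
    "apiVersion": "kubeflow.org/v2beta1",
    "kind": "MPIJob",
    "metadata": {"name": "horovod-train"},
    "spec": {
        "slotsPerWorker": 1,
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "launcher",
                                "image": "my-registry/horovod-train:latest",
                                "command": ["mpirun", "python", "train.py"],
                            }
                        ]
                    }
                },
            },
            "Worker": {
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "worker",
                                "image": "my-registry/horovod-train:latest",
                            }
                        ]
                    }
                },
            },
        },
    },
}

with Flow("horovod-training") as flow:
    RunNamespacedJob(body=mpi_job, namespace="default")()
```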
a
@Thomas Furmston it looks like this operator is not even a Kubernetes job but rather a resource of a CustomResourceDefinition kind, so I would suspect that RunNamespacedJob wouldn't work. But perhaps you can wrap this operator into a deployment and use CreateNamespacedDeployment? Just an idea
t
Sorry, that was my mistake; I meant an MPIJob resource, e.g., https://github.com/kubeflow/mpi-operator#creating-an-mpi-job, and putting that in the body argument.
It seems like I am on the right lines though, right?
a
you can try, but I’m not 100% sure because it’s also a custom resource, not a “normal” Kubernetes job:
```yaml
kind: MPIJob
```
t
I believe it would be necessary for me to have the controller for the mpi-operator installed on the Kubernetes cluster.
Is this what you mean?
a
What RunNamespacedJob does is connect to the "job" client API and create a namespaced job:
```python
api_client_job = cast(
    client.BatchV1Api, get_kubernetes_client("job", kubernetes_api_key_secret)
)
...
api_client_job.create_namespaced_job(
    namespace=namespace, body=body, **kube_kwargs
)
```
But MPIJob is not a normal job, so I'm not sure whether this task will work, but you can try and report back.
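For contrast, a custom resource would normally be created through the CustomObjectsApi of the kubernetes client, roughly like this (untested sketch; the group/version/plural values are read off the MPIJob manifest):
```python
# Untested sketch: a custom resource such as MPIJob goes through the
# CustomObjectsApi, not the BatchV1Api that RunNamespacedJob uses.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",  # from the MPIJob apiVersion
    version="v2beta1",
    namespace="default",
    plural="mpijobs",      # the plural name registered by the CRD
    body=mpi_job,          # the MPIJob manifest as a dict
)
```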
t
ok, I see
a
if you need some examples of how this task can be used in a flow, here are two: https://github.com/anna-geller/packaging-prefect-flows/tree/master/flows_task_library
t
ok, great.
That helps a lot!
ok, it seems it does not work. Installing the mpi-operator into the cluster and changing your sample script to an MPIJob results in:
```
HTTP response headers: HTTPHeaderDict({'Audit-Id': '6510c974-e9b3-4808-86df-ae627a4e3a9b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '78f8cc9e-4703-43f5-a87d-fdddab35f617', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e8404112-029b-4412-aa5c-3b75d0022519', 'Date': 'Wed, 15 Dec 2021 15:12:07 GMT', 'Content-Length': '291'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"MPIJob in version \"v1\" cannot be handled as a Job: no kind \"MPIJob\" is registered for version \"batch/v1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30\"","reason":"BadRequest","code":400}
```
Basically as you expected @Anna Geller
wait, I didn't change the apiVersion 🤦‍♂️
I'll try again
👍 1
```
kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '3c22aa77-2d6e-4524-94d1-1b5e418f5388', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '78f8cc9e-4703-43f5-a87d-fdddab35f617', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e8404112-029b-4412-aa5c-3b75d0022519', 'Date': 'Wed, 15 Dec 2021 15:18:40 GMT', 'Content-Length': '308'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"MPIJob in version \"v2beta1\" cannot be handled as a Job: no kind \"MPIJob\" is registered for version \"kubeflow.org/v2beta1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30\"","reason":"BadRequest","code":400}
```
Seemingly it still does not work though
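I guess the workaround would be a small custom task that calls the CustomObjectsApi directly, something like this (untested sketch):
```python
# Untested sketch: a custom Prefect task that submits the MPIJob via the
# CustomObjectsApi instead of RunNamespacedJob's BatchV1Api.
from kubernetes import client, config
from prefect import Flow, task

@task
def create_mpi_job(body: dict, namespace: str = "default"):
    # Assumes the flow runs inside the cluster; use load_kube_config() locally.
    config.load_incluster_config()
    return client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org",
        version="v2beta1",
        namespace=namespace,
        plural="mpijobs",
        body=body,
    )

with Flow("horovod-training") as flow:
    create_mpi_job(mpi_job)  # mpi_job: the MPIJob manifest dict from earlier
```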
a
I would be interested to hear from the community whether anyone has tried running Kubeflow jobs from Prefect. So far I have only heard stories like this from people who replaced Kubeflow with Prefect.