Hello I have a flow that I run on my laptop using DockerCont Prefect Community #ask-community

Paco Ibañez

10/28/2022, 2:23 PM

Hello, I have a flow that I run on my laptop using DockerContainer infrastructure, and it takes around 4 minutes to complete. When I run that same flow on a Kubernetes cluster using KubernetesJob, it times out after 30 minutes (the limit I set in my flow). Any tips on how I could troubleshoot what is going on? I am using the default job template.

Christopher Boyd

10/28/2022, 3:08 PM

Does it run, or does it stay in pending? Is it actually executing? What size nodes is your cluster using? With the default job template, there should be no default CPU / memory parameters. When you run it locally, how many resources are being given to the Docker container?

Paco Ibañez

10/28/2022, 3:20 PM

Yeah, it runs until it times out:

Finished in state TimedOut('Flow run exceeded timeout of 1800.0 seconds

The nodes are 8 cores, 32 GB. The flow calls a DS model to get some predictions, and it looks like that is what takes most of the time. The model is ~500 MB.

Paco Ibañez

10/28/2022, 3:21 PM

Do you think it is a matter of hardware resources? Are you aware of any Prefect-specific performance issues when running on k8s?

Christopher Boyd

10/28/2022, 3:37 PM

I’d look at the metrics of the cluster and see what kind of utilization / load it’s under when this flow executes. There are no “prefect performance” issues specifically, it’s just an execution environment for a job. A good recommendation though if you know the size / resource constraints, would be to use a customized job template and assign min/max cpu/memory constraints

Christopher Boyd

10/28/2022, 3:38 PM

There’s no specific / intrinsic reason it would run slower, I think it would be a matter of looking at the cluster utilization when it runs, and look at logs

Christopher Boyd

10/28/2022, 3:38 PM

When you say it’s running, is it actually computing / doing something

Christopher Boyd

10/28/2022, 3:39 PM

and just not finishing

Paco Ibañez

10/28/2022, 3:39 PM

yeah it is computing, I added additional logging to confirm
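The extra logging mentioned here is not shown in the thread. A minimal sketch of one way to do it with only the standard library, so the slow step shows up in the pod logs with a duration attached. The names `timed`, `fake_predict`, and the logger name are hypothetical, and `fake_predict` only stands in for the real model call.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("flow_timing")  # hypothetical logger name


@contextmanager
def timed(step: str, results: dict):
    """Record wall-clock seconds for `step` in `results` and log it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[step] = time.perf_counter() - start
        logger.info("%s took %.2f s", step, results[step])


def fake_predict(rows):
    # Placeholder for the real model call, which is not shown in the thread.
    return [r * 2 for r in rows]


timings: dict = {}
with timed("load_inputs", timings):
    rows = list(range(1000))
with timed("predict", timings):
    predictions = fake_predict(rows)
```

Comparing the per-step durations from the laptop run and the cluster run would show whether the slowdown is in the prediction step or somewhere else.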

Paco Ibañez

10/28/2022, 3:40 PM

thanks a lot for your help I will dig around!

Christopher Boyd

10/28/2022, 3:49 PM

If you aren’t already using it, you could consider installing Prometheus / Grafana, so you can track performance / cluster health.

Paco Ibañez

10/28/2022, 4:12 PM

Would this be the right way of adding requests to my jobs without defining a custom manifest?

customizations = [
    {
        "op": "add",
        "path": "/spec/template/spec/resources",
        "value": {"requests": {"memory": "2Gi", "cpu": "2"}},
    }
]
infrastructure = KubernetesJob(
    image=image,
    customizations=customizations,
    finished_job_ttl=1 * 60 * 60,  # one hour
)

Paco Ibañez

10/28/2022, 4:28 PM

I also tried this, but I can't see the requests in the pod:

k8s_job = KubernetesJob.base_job_manifest()
k8s_job["spec"]["template"]["spec"]["resources"] = {"requests": {"memory": "8Gi", "cpu": "2"}}
infrastructure = KubernetesJob(
    image=image,
    job=k8s_job,
)
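One likely reason the requests do not show up: in the Kubernetes Job API, `resources` is a field of each entry in `containers`, not of the pod spec itself, so a `resources` key set directly under `spec.template.spec` is not a recognized field. A minimal sketch of setting it on the first container instead. The helper `with_container_resources` is hypothetical, and `base_manifest` is an assumed stand-in for what `KubernetesJob.base_job_manifest()` returns, so the snippet runs without Prefect installed.

```python
import copy


def with_container_resources(manifest: dict, resources: dict, index: int = 0) -> dict:
    """Return a copy of a Job manifest with `resources` set on one container."""
    patched = copy.deepcopy(manifest)
    containers = patched["spec"]["template"]["spec"]["containers"]
    containers[index]["resources"] = resources
    return patched


# Assumed shape of the default manifest; with Prefect it would come from
# KubernetesJob.base_job_manifest() as in the message above.
base_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"labels": {}},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "prefect-job", "env": []}],
            }
        }
    },
}

k8s_job = with_container_resources(
    base_manifest, {"requests": {"memory": "8Gi", "cpu": "2"}}
)
```

The resulting `k8s_job` dict could then be passed as `job=k8s_job`, and `kubectl describe pod` should list the requests under the `prefect-job` container if the change took effect.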

Christopher Boyd

10/28/2022, 4:29 PM

I have some notes and examples on this , but I need a few minutes to get them back to you

Christopher Boyd

10/28/2022, 4:29 PM

I’ll update here shortly with some working examples

Paco Ibañez

10/28/2022, 4:30 PM

No rush, thanks for your help. You guys are awesome! Great community!

Christopher Boyd

10/28/2022, 5:08 PM

I think you got it right, this is what I have in my notes for the customization:

customizations=[
    {
        "op": "add",
        "path": "/spec/imagePullSecrets",
        "value": [{"name": "dockerhub"}],
    },
    {
        "op": "add",
        "path": "/spec/template/spec/resources",
        "value": {"limits": {"memory": "8Gi", "cpu": "4000m"}},
    },
],
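A caveat on the paths in these notes: in the Kubernetes Job API, `resources` belongs to a container and `imagePullSecrets` belongs to the pod spec, so the patch paths would normally be `/spec/template/spec/containers/0/resources` and `/spec/template/spec/imagePullSecrets`. The sketch below uses those paths and checks where the values land. `apply_add` is a hypothetical, simplified stand-in for a JSON Patch "add" operation (no `-` index or `~` escapes), and `manifest` is an assumed minimal Job manifest, not Prefect's own code.

```python
import copy


def apply_add(doc: dict, path: str, value):
    """Apply one JSON Patch "add" op to a copy of `doc` (simplified)."""
    result = copy.deepcopy(doc)
    *parents, last = path.lstrip("/").split("/")
    node = result
    for key in parents:
        # List segments are numeric indexes; dict segments are keys.
        node = node[int(key)] if isinstance(node, list) else node[key]
    if isinstance(node, list):
        node.insert(int(last), value)
    else:
        node[last] = value
    return result


customizations = [
    {
        "op": "add",
        "path": "/spec/template/spec/imagePullSecrets",
        "value": [{"name": "dockerhub"}],
    },
    {
        "op": "add",
        "path": "/spec/template/spec/containers/0/resources",
        "value": {"limits": {"memory": "8Gi", "cpu": "4000m"}},
    },
]

# Assumed minimal Job manifest used only to check the paths.
manifest = {
    "spec": {
        "template": {
            "spec": {"containers": [{"name": "prefect-job", "env": []}]}
        }
    }
}
for op in customizations:
    manifest = apply_add(manifest, op["path"], op["value"])
```

The same `customizations` list is what would be passed to `KubernetesJob(customizations=...)`. The `0` in the path selects the first container, which is the `prefect-job` container in the default template.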

Christopher Boyd

10/28/2022, 5:09 PM

Alternatively, if you are building it from a job template as part of a deployment: https://discourse.prefect.io/t/creating-and-deploying-a-custom-kubernetes-infrastructure-block/1531

Christopher Boyd

10/28/2022, 5:09 PM

spec:
  template:
    spec:
      completions: 1
      containers: # the first container is required
        - env: []
          name: prefect-job
          image: prefecthq/prefect:2.3.0-python3.9
          imagePullPolicy: "IfNotPresent"
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"
      parallelism: 1
      restartPolicy: Never

Paco Ibañez

10/28/2022, 5:11 PM

thanks!
