Paco Ibañez

10/28/2022, 2:23 PM
Hello, I have a flow that I run on my laptop using DockerContainer infrastructure, and it takes around 4 minutes to complete. When I run that same flow on a Kubernetes cluster using KubernetesJob, it times out after 30 minutes (the limit I set in my flow). Any tips on how I could troubleshoot what is going on? I am using the default job template.

Christopher Boyd

10/28/2022, 3:08 PM
Does it run, or does it stay in pending? Is it actually executing? What size nodes is your cluster using? With the default job template, there should be no default CPU / memory parameters. When you run it locally, how many resources are being given to the Docker container?

Paco Ibañez

10/28/2022, 3:20 PM
yeah, it runs until it times out
Finished in state TimedOut('Flow run exceeded timeout of 1800.0 seconds
The nodes have 8 cores and 32 GB of memory. The flow calls a DS model to get some predictions, and it looks like that is what is taking most of the time. The model is ~500 MB.
Do you think it is a matter of hardware resources? Are you aware of any Prefect-specific performance issues when running on k8s?

Christopher Boyd

10/28/2022, 3:37 PM
I’d look at the metrics of the cluster and see what kind of utilization / load it’s under when this flow executes. There are no “prefect performance” issues specifically, it’s just an execution environment for a job. A good recommendation though if you know the size / resource constraints, would be to use a customized job template and assign min/max cpu/memory constraints
There’s no specific / intrinsic reason it would run slower, I think it would be a matter of looking at the cluster utilization when it runs, and look at logs
When you say it’s running, is it actually computing / doing something and just not finishing?

Paco Ibañez

10/28/2022, 3:39 PM
Yeah, it is computing. I added additional logging to confirm.
Thanks a lot for your help, I will dig around!

Christopher Boyd

10/28/2022, 3:49 PM
If you aren’t already using them, you could consider installing Prometheus / Grafana, so you can track performance / cluster health.

Paco Ibañez

10/28/2022, 4:12 PM
Would this be the right way of adding requests to my jobs without defining a custom manifest?
customizations = [{
    "op": "add",
    "path": "/spec/template/spec/resources",
    "value": {"requests": {"memory": "2Gi", "cpu": "2"}},
}]
infrastructure = KubernetesJob(
    image=image,
    customizations=customizations,
    finished_job_ttl=1 * 60 * 60,  # one hour
)
I also tried this, but I can't see the requests in the pod:
k8s_job = KubernetesJob.base_job_manifest()
k8s_job['spec']['template']['spec']['resources'] = {"requests": {"memory": "8Gi", "cpu": "2"}}
infrastructure = KubernetesJob(
    image=image,
    job=k8s_job,
)

Christopher Boyd

10/28/2022, 4:29 PM
I have some notes and examples on this, but I need a few minutes to get them back to you.
I’ll update here shortly with some working examples.

Paco Ibañez

10/28/2022, 4:30 PM
No rush, thanks for your help. You guys are awesome! Great community!

Christopher Boyd

10/28/2022, 5:08 PM
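You're close, but `resources` is a per-container field, so the path has to go through `containers/0` (and `imagePullSecrets` sits under the pod spec). This is roughly what I have in my notes for the customization:
customizations=[
    {
        "op": "add",
        "path": "/spec/template/spec/imagePullSecrets",
        "value": [{"name": "dockerhub"}],
    },
    {
        "op": "add",
        "path": "/spec/template/spec/containers/0/resources",
        "value": {"limits": {"memory": "8Gi", "cpu": "4000m"}},
    },
],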
Alternatively, if you are building it from a job template as part of a deployment: https://discourse.prefect.io/t/creating-and-deploying-a-custom-kubernetes-infrastructure-block/1531
spec:
  template:
    spec:
      completions: 1
      containers: # the first container is required
        - env: []
          name: prefect-job
          image: prefecthq/prefect:2.3.0-python3.9
          imagePullPolicy: "IfNotPresent"
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"
      parallelism: 1
      restartPolicy: Never
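A quick way to sanity-check a customization before deploying is to apply it to a plain dict and see where the values land. In the Kubernetes Job schema, `resources` belongs to each entry in `containers` (as in the YAML above), so a patch aimed at `/spec/template/spec/resources` would likely not show up on the pod. The sketch below is only an illustration: `BASE_JOB` is a stand-in shaped like the manifest in this thread, not a copy of Prefect's default, and `apply_add_ops` is a hypothetical helper that handles just the JSON Patch `add` case used here.

```python
import copy

# Assumption: a minimal stand-in with the same shape as the job manifest
# discussed above. It is not taken from Prefect's source.
BASE_JOB = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"labels": {}},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "prefect-job", "env": []}],
            }
        }
    },
}

# Requests/limits values reuse the numbers mentioned in the thread.
customizations = [
    {
        "op": "add",
        # Note the containers/0 segment: resources are set per container.
        "path": "/spec/template/spec/containers/0/resources",
        "value": {
            "requests": {"memory": "2Gi", "cpu": "2"},
            "limits": {"memory": "8Gi", "cpu": "4000m"},
        },
    },
]


def apply_add_ops(manifest, ops):
    """Hypothetical helper: apply JSON Patch 'add' ops to a copy of manifest.

    Supports only what this example needs: dict keys and existing list
    indices along the path, with the last path segment being a dict key.
    """
    patched = copy.deepcopy(manifest)
    for op in ops:
        if op["op"] != "add":
            raise ValueError(f"unsupported op: {op['op']}")
        *parents, leaf = op["path"].lstrip("/").split("/")
        node = patched
        for key in parents:
            node = node[int(key)] if isinstance(node, list) else node[key]
        node[leaf] = op["value"]
    return patched


patched_job = apply_add_ops(BASE_JOB, customizations)
container = patched_job["spec"]["template"]["spec"]["containers"][0]
print(container["resources"])
```

The same `customizations` list could then be passed to `KubernetesJob(image=image, customizations=customizations)` as in the messages above. Once a flow run starts, `kubectl get pod <pod-name> -o yaml` should show the requests and limits under the `prefect-job` container if the patch was applied.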

Paco Ibañez

10/28/2022, 5:11 PM
thanks!