# prefect-kubernetes
e
There has to be someone who was able to set tolerations / nodeSelectors in their deployment; can someone explain where they put that? I currently put it in my `prefect.yaml` file to create a deployment, and in my Deployment's configuration tab I see:
```json
{
  ...
  "job_manifest": {
    "spec": {
      "template": {
        "spec": {
          "tolerations": [
            {
              "key": "dedicated",
              "value": "asyncjobs",
              "effect": "NoSchedule",
              "operator": "Equal"
            }
          ],
          "nodeSelector": {
            "kube/nodetype": "asyncjobs"
          }
        }
      }
    }
  }
}
```
However, the flow run pods created with this deployment don't have any of these values propagated. Not sure if I'm just setting this incorrectly.
k
Hey Eric, what does your work pool's advanced tab look like on the Edit page?
e
Yeap, not seeing the tolerations or nodeSelector in that tab!
k
In order to override the default job template, variables need to be added to the template. Let me grab an example for you.
e
Would love an example, thank you
k
```json
{
  "variables": {
    "type": "object",
    "properties": {
      "tolerations": {
        "type": "array",
        "title": "Tolerations"
      },
      "env": {
        "type": "object",
        "title": "Environment Variables",
        "description": "Environment variables to set when starting a flow run.",
        "additionalProperties": {
          "type": "string"
        }
      },
      "name": {
        "type": "string",
        "title": "Name",
        "description": "Name given to infrastructure created by a worker."
      },


...

  "job_configuration": {
    "env": "{{ env }}",
    "name": "{{ name }}",
    "labels": "{{ labels }}",
    "command": "{{ command }}",
    "namespace": "{{ namespace }}",
    "job_manifest": {
      "kind": "Job",
      "spec": {
        "template": {
          "spec": {
            "tolerations": "{{ tolerations }}",
            "containers": [
              {
                "env": "{{ env }}",
                "args": "{{ command }}",
                "name": "prefect-job",
                "image": "{{ image }}",
                "imagePullPolicy": "{{ image_pull_policy }}"
              }
            ],
            "completions": 1,
            "parallelism": 1,
            "restartPolicy": "Never",
            "serviceAccountName": "{{ service_account_name }}"
          }
        },
        "backoffLimit": 0,
        "ttlSecondsAfterFinished": "{{ finished_job_ttl }}"
      },
      "metadata": {
        "labels": "{{ labels }}",
        "namespace": "{{ namespace }}",
        "generateName": "{{ name }}-"
      },
      "apiVersion": "batch/v1"
    },
    "stream_output": "{{ stream_output }}",
    "cluster_config": "{{ cluster_config }}",
    "job_watch_timeout_seconds": "{{ job_watch_timeout_seconds }}",
    "pod_watch_timeout_seconds": "{{ pod_watch_timeout_seconds }}"
  }
}
```
So I've added a `tolerations` variable that will appear in the UI as Tolerations, and a place in the template to override from my deployment in `"{{ tolerations }}"`.
e
So this is the state I want to get to, but where do I set things in my `prefect.yaml` file to propagate these configurations?
j
following up on this thread - Kevin is 100% right, variables will be needed if you want to provide these non-default values to the job template
if these values won't change between deployments in a single work pool, you can hard-code them on your Advanced tab
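For instance, a rough sketch of what hard-coding Eric's toleration and node selector straight into the work pool's base job template (the Advanced tab JSON) might look like - trimmed to the relevant keys, with no variables or placeholders involved; every deployment on that pool would then get these values:
```json
"job_configuration": {
  "job_manifest": {
    "kind": "Job",
    "spec": {
      "template": {
        "spec": {
          "tolerations": [
            {
              "key": "dedicated",
              "value": "asyncjobs",
              "effect": "NoSchedule",
              "operator": "Equal"
            }
          ],
          "nodeSelector": {
            "kube/nodetype": "asyncjobs"
          }
        }
      }
    }
  }
}
```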
k
You have to edit your work pool Advanced tab to look like this first. Once you save it, you'll get a place in the Defaults tab to enter default tolerations in your work pool.
e
I really, really don't want to be making config edits via the UI if I can help it. Is there a way of programmatically doing it?
I need to repeat this for different work pools
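For the record, this doesn't have to go through the UI: the work pool can be created (or, on newer 2.x CLIs, updated) from a base job template JSON file kept in version control. A sketch, with placeholder pool and file names - double-check the flags against `prefect work-pool create --help` on your version:
```bash
# Create a Kubernetes work pool whose base job template (variables + job_configuration)
# comes from a JSON file checked into the repo. Names here are placeholders.
prefect work-pool create "async-jobs-pool" \
  --type kubernetes \
  --base-job-template ./base-job-template.json
```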
k
Then, you can override them from a deployment too:
```yaml
deployments:
- name: demo
  version: null
  tags: []
  description: null
  schedule: {}
  flow_name: null
  entrypoint: flow.py:hello
  parameters: {}
  work_pool:
    name: k8s-demo
    work_queue_name: null
    job_variables:
      tolerations:
        - key: dedicated
          value: asyncjobs
          effect: NoSchedule
          operator: Equal
```
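With that in place, running the usual deploy from the project root should pick up the override - roughly:
```bash
# Deploy the entry named "demo" from prefect.yaml; its job_variables.tolerations
# value is substituted into the {{ tolerations }} placeholder when flow runs start.
prefect deploy --name demo
```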
j
^
e
So do I have to create a Prefect Variable? I haven't touched that much
I wanted to just see one job actually run, so I added the node selector and toleration in the UI and it worked. So now I just need to figure out how to templatize this!
k
Nope, just make the work pool advanced page look like what I shared and then save it, and you can start overriding the values via your deployments.
e
Ah, I see. So you're setting the `{{ tolerations }}` variable in the work pool template and then setting its value in the deployment config file
k
Exactly!
e
Thank you, will give that a shot! Thanks to all the people who helped out; I don't think I would have ever figured that out on my own
I tried adding
```json
"tolerations": "{{ tolerations }}",
"nodeSelector": "{{ node_selector }}",
```
to my work pool's base JSON, but I keep getting an error from the UI saying it failed to update the work pool, and it doesn't say why
In my browser's network tab I see
```
The variables specified in the job configuration template must be present as properties in the variables schema. Your job configuration uses the following undeclared variable(s): node_selector ,tolerations.
```
I see, they have to be defined above
k
Yep, you can see in the example for tolerations I posted earlier, I added it to the variables object at the very beginning of the json.
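In other words, the two pieces have to line up by name - a trimmed sketch of just the relevant parts of the base job template (everything else stays as it was; `node_selector` is simply the variable name chosen here):
```json
{
  "variables": {
    "type": "object",
    "properties": {
      "tolerations": {
        "type": "array",
        "title": "Tolerations"
      },
      "node_selector": {
        "type": "object",
        "title": "Node Selector",
        "additionalProperties": {
          "type": "string"
        }
      }
    }
  },
  "job_configuration": {
    "job_manifest": {
      "spec": {
        "template": {
          "spec": {
            "tolerations": "{{ tolerations }}",
            "nodeSelector": "{{ node_selector }}"
          }
        }
      }
    }
  }
}
```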
e
I think I broke something, because I can't access the Advanced configuration tab from my work pool now. I wonder if it's because I set tolerations as a list? I also get a 500 error when I try to create a deployment using that work pool
I tried removing the deployment and the work pool; I was able to recreate a work pool and see that it's getting pinged, but when I try to create a deployment I get:
```
? Would you like to build a custom Docker image for this deployment? [y/n] (n): 
Traceback (most recent call last):
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/cli/_utilities.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/utilities/asyncutils.py", line 255, in coroutine_wrapper
    return call()
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/_internal/concurrency/calls.py", line 382, in __call__
    return self.result()
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/_internal/concurrency/calls.py", line 282, in result
    return self.future.result(timeout=timeout)
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/_internal/concurrency/calls.py", line 168, in result
    return self.__get_result()
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/_internal/concurrency/calls.py", line 345, in _run_async
    result = await coro
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/cli/deploy.py", line 249, in deploy
    await _run_single_deploy(
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/client/utilities.py", line 51, in with_injected_client
    return await fn(*args, **kwargs)
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/cli/deploy.py", line 550, in _run_single_deploy
    deployment_id = await client.create_deployment(
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/client/orchestration.py", line 1479, in create_deployment
    response = await self._client.post(
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/httpx/_client.py", line 1848, in post
    return await self.request(
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/httpx/_client.py", line 1530, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/client/base.py", line 285, in send
    response.raise_for_status()
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/client/base.py", line 138, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url
```
Ah! I used type "list" instead of "array" and that was accepted by the UI but then broke something because it wasn't the right type. Deleted and replaced both the work pool and the deployment, and now am able to kick off a flow run
k
Hello, I am currently trying to do the same thing with Prefect to add a toleration to the worker pod, but I'm running into some issues. My template has `"tolerations": "{{ tolerations }}",` and my prefect.yaml file looks like this:
```yaml
deployments:
  - name: "testing"
    schedule: null
    entrypoint: "flows/test.py:test"
    work_pool:
      name: "gpu-work-pool"
      job_variables:
        image: xxx
        tolerations:
          - effect: NoSchedule
            key: gpu
            operator: Exists
        nodeSelector:
          karpenter.sh/provisioner-name: gpu
```
However, when I do a deploy, I constantly get this error:
```
Response: {'detail': 'Error creating deployment: <ValidationError: "[{\'effect\': \'NoSchedule\', \'key\': \'gpu\', \'operator\': \'Exists\'}] is not of type \'object\'">'}
```
How can I resolve this please? @Kevin Grismore
k
Can you share the relevant parts of your work pool config? My original example was incorrect, so I've gone back and fixed it. Under the variables section, I believe the type of tolerations should be `array`.
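i.e., the tolerations entry in the work pool's variables section should declare an `array` rather than an `object`, roughly like this, so the list passed via `job_variables` validates against it:
```json
"tolerations": {
  "type": "array",
  "title": "Tolerations"
}
```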
k
@Kevin Grismore Ah amazing, once I changed it to array it worked. I have another question. So I tried to redeploy the worker by passing a different template, but it seems like if the work pool already exists, the config doesn't get overwritten. I am using Helm to deploy the Prefect worker. Is there a way for me to force a reapply of the work pool config even if it exists? Thanks
k
The only thing you need to do to change a work pool config is edit it in the UI and save it. Then your worker will grab and use it for subsequent runs of deployments. If that's not what you're trying to achieve then I'm not sure I understand the question.
k
Yeah, changing it in the UI works, but is there any way I can avoid doing that? We are trying to follow a CI/CD flow where all the configuration of the work pools is done via Helm
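One pattern that should work regardless of what the worker chart itself supports (a sketch, not something settled in this thread; the `--base-job-template` flag on `update` depends on the Prefect 2.x CLI version, so verify with `prefect work-pool update --help`): keep the base job template JSON next to the Helm values and have the pipeline push it to the pool after the chart is applied.
```bash
# CI step run after `helm upgrade --install` of the prefect-worker chart;
# the pool name and template path are placeholders.
prefect work-pool update "gpu-work-pool" \
  --base-job-template ./base-job-template.json
```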