# prefect-kubernetes
e
There has to be someone who was able to set tolerations / nodeSelectors in their deployment; can someone explain where they put that? I currently put it in my `prefect.yaml` file to create a deployment, and in my Deployment's configuration tab I see:
```json
{
  ...
  "job_manifest": {
    "spec": {
      "template": {
        "spec": {
          "tolerations": [
            {
              "key": "dedicated",
              "value": "asyncjobs",
              "effect": "NoSchedule",
              "operator": "Equal"
            }
          ],
          "nodeSelector": {
            "kube/nodetype": "asyncjobs"
          }
        }
      }
    }
  }
}
```
However, the flow run pods created with this deployment don't have any of these values propagated. Not sure if I'm just setting this incorrectly.
k
Hey Eric, what does your work pool's advanced tab look like on the Edit page?
e
Yeap, not seeing the tolerations or nodeSelector in that tab!
k
In order to override the default job template, variables need to be added to the template. Let me grab an example for you.
e
Would love an example, thank you
k
```json
{
  "variables": {
    "type": "object",
    "properties": {
      "tolerations": {
        "type": "array",
        "title": "Tolerations"
      },
      "env": {
        "type": "object",
        "title": "Environment Variables",
        "description": "Environment variables to set when starting a flow run.",
        "additionalProperties": {
          "type": "string"
        }
      },
      "name": {
        "type": "string",
        "title": "Name",
        "description": "Name given to infrastructure created by a worker."
      },


...

  "job_configuration": {
    "env": "{{ env }}",
    "name": "{{ name }}",
    "labels": "{{ labels }}",
    "command": "{{ command }}",
    "namespace": "{{ namespace }}",
    "job_manifest": {
      "kind": "Job",
      "spec": {
        "template": {
          "spec": {
            "tolerations": "{{ tolerations }}",
            "containers": [
              {
                "env": "{{ env }}",
                "args": "{{ command }}",
                "name": "prefect-job",
                "image": "{{ image }}",
                "imagePullPolicy": "{{ image_pull_policy }}"
              }
            ],
            "completions": 1,
            "parallelism": 1,
            "restartPolicy": "Never",
            "serviceAccountName": "{{ service_account_name }}"
          }
        },
        "backoffLimit": 0,
        "ttlSecondsAfterFinished": "{{ finished_job_ttl }}"
      },
      "metadata": {
        "labels": "{{ labels }}",
        "namespace": "{{ namespace }}",
        "generateName": "{{ name }}-"
      },
      "apiVersion": "batch/v1"
    },
    "stream_output": "{{ stream_output }}",
    "cluster_config": "{{ cluster_config }}",
    "job_watch_timeout_seconds": "{{ job_watch_timeout_seconds }}",
    "pod_watch_timeout_seconds": "{{ pod_watch_timeout_seconds }}"
  }
}
```
So I've added a `tolerations` variable that will appear in the UI as Tolerations, and a place in the template to override from my deployment in `"{{ tolerations }}"`.
e
So this is the state I want to get to, but where do I set things in my `prefect.yaml` file to propagate these configurations?
j
following up on this thread - Kevin is 100% right, variables will be needed if you want to provide these non-default values to the job template
if these values won't change between deployments in a single work pool, you can hard-code them on your Advanced tab
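For instance, a rough sketch of what hard-coding Eric's toleration and node selector straight into the work pool's base job template (the Advanced tab JSON) might look like - trimmed to the relevant keys, with no variables or placeholders involved; every deployment on that pool would then get these values:
```json
"job_configuration": {
  "job_manifest": {
    "kind": "Job",
    "spec": {
      "template": {
        "spec": {
          "tolerations": [
            {
              "key": "dedicated",
              "value": "asyncjobs",
              "effect": "NoSchedule",
              "operator": "Equal"
            }
          ],
          "nodeSelector": {
            "kube/nodetype": "asyncjobs"
          }
        }
      }
    }
  }
}
```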
k
You have to edit your work pool Advanced tab to look like this first. Once you save it, you'll get a place in the Defaults tab to enter default tolerations in your work pool.
e
I really, really don't want to be making config edits via the UI if I can help it. Is there a way of programmatically doing it?
I need to repeat this for different work pools
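For the record, this doesn't have to go through the UI: the work pool can be created (or, on newer 2.x CLIs, updated) from a base job template JSON file kept in version control. A sketch, with placeholder pool and file names - double-check the flags against `prefect work-pool create --help` on your version:
```bash
# Create a Kubernetes work pool whose base job template (variables + job_configuration)
# comes from a JSON file checked into the repo. Names here are placeholders.
prefect work-pool create "async-jobs-pool" \
  --type kubernetes \
  --base-job-template ./base-job-template.json
```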
k
Then, you can override them from a deployment too:
```yaml
deployments:
- name: demo
  version: null
  tags: []
  description: null
  schedule: {}
  flow_name: null
  entrypoint: flow.py:hello
  parameters: {}
  work_pool:
    name: k8s-demo
    work_queue_name: null
    job_variables:
      tolerations:
        - key: dedicated
          value: asyncjobs
          effect: NoSchedule
          operator: Equal
```
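With that in place, running the usual deploy from the project root should pick up the override - roughly:
```bash
# Deploy the entry named "demo" from prefect.yaml; its job_variables.tolerations
# value is substituted into the {{ tolerations }} placeholder when flow runs start.
prefect deploy --name demo
```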
j
^
e
So do I have to create a Prefect Variable? I haven't touched that much
I wanted to just see one job actually run, so I added the node selector and toleration in the UI and it worked. So now I just need to figure out how to templatize this!
k
Nope, just make the work pool advanced page look like what I shared and then save it, and you can start overriding the values via your deployments.
e
Ah, I see. So you're setting the `{{ tolerations }}` variable in the work pool template and then setting its value in the deployment config file
k
Exactly!
e
Thank you, will give that a shot! Thanks to all the people who helped out; I don't think I would have ever figured that out on my own
I tried adding
```json
"tolerations": "{{ tolerations }}",
"nodeSelector": "{{ node_selector }}",
```
to my work pool's base JSON, but I keep getting an error from the UI saying it failed to update the work pool, and it doesn't say why
In my browser's network tab I see
```
The variables specified in the job configuration template must be present as properties in the variables schema. Your job configuration uses the following undeclared variable(s): node_selector ,tolerations.
```
I see, they have to be defined above
k
Yep, you can see in the example for tolerations I posted earlier, I added it to the variables object at the very beginning of the json.
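In other words, the two pieces have to line up by name - a trimmed sketch of just the relevant parts of the base job template (everything else stays as it was; `node_selector` is simply the variable name chosen here):
```json
{
  "variables": {
    "type": "object",
    "properties": {
      "tolerations": {
        "type": "array",
        "title": "Tolerations"
      },
      "node_selector": {
        "type": "object",
        "title": "Node Selector",
        "additionalProperties": {
          "type": "string"
        }
      }
    }
  },
  "job_configuration": {
    "job_manifest": {
      "spec": {
        "template": {
          "spec": {
            "tolerations": "{{ tolerations }}",
            "nodeSelector": "{{ node_selector }}"
          }
        }
      }
    }
  }
}
```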
e
I think I broke something, because I can't access the Advanced configuration tab from my work pool now. I wonder if it's because I set tolerations as a list? I also get a 500 error when I try to create a deployment using that work pool
I tried removing the deployment and the work pool; I was able to recreate a work pool and see that it's getting pinged, but when I try to create a deployment I get:
```
? Would you like to build a custom Docker image for this deployment? [y/n] (n): 
Traceback (most recent call last):
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/cli/_utilities.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/utilities/asyncutils.py", line 255, in coroutine_wrapper
    return call()
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/_internal/concurrency/calls.py", line 382, in __call__
    return self.result()
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/_internal/concurrency/calls.py", line 282, in result
    return self.future.result(timeout=timeout)
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/_internal/concurrency/calls.py", line 168, in result
    return self.__get_result()
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/_internal/concurrency/calls.py", line 345, in _run_async
    result = await coro
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/cli/deploy.py", line 249, in deploy
    await _run_single_deploy(
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/client/utilities.py", line 51, in with_injected_client
    return await fn(*args, **kwargs)
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/cli/deploy.py", line 550, in _run_single_deploy
    deployment_id = await client.create_deployment(
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/client/orchestration.py", line 1479, in create_deployment
    response = await self._client.post(
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/httpx/_client.py", line 1848, in post
    return await self.request(
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/httpx/_client.py", line 1530, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/client/base.py", line 285, in send
    response.raise_for_status()
  File "/Users/erickim/venv/inari_py/lib/python3.9/site-packages/prefect/client/base.py", line 138, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url
```
Ah! I used type "list" instead of "array" and that was accepted by the UI but then broke something because it wasn't the right type. Deleted and replaced both the work pool and the deployment, and now am able to kick off a flow run
k
Hello, I am currently trying to do the same thing with Prefect to add a toleration to the worker pod, but I'm running into some issues. My template has `"tolerations": "{{ tolerations }}",` and my prefect.yaml file looks like this:
```yaml
deployments:
  - name: "testing"
    schedule: null
    entrypoint: "flows/test.py:test"
    work_pool:
      name: "gpu-work-pool"
      job_variables:
        image: xxx
        tolerations:
          - effect: NoSchedule
            key: gpu
            operator: Exists
        nodeSelector:
          karpenter.sh/provisioner-name: gpu
```
However, when I do a deploy, I constantly get this error:
```
Response: {'detail': 'Error creating deployment: <ValidationError: "[{\'effect\': \'NoSchedule\', \'key\': \'gpu\', \'operator\': \'Exists\'}] is not of type \'object\'">'}
```
How can I resolve this please? @Kevin Grismore
k
Can you share the relevant parts of your work pool config? My original example was incorrect, so I've gone back and fixed it. Under the variables section, I believe the type of tolerations should be `array`.
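i.e., the tolerations entry in the work pool's variables section should declare an `array` rather than an `object`, roughly like this, so the list passed via `job_variables` validates against it:
```json
"tolerations": {
  "type": "array",
  "title": "Tolerations"
}
```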
k
@Kevin Grismore Ah amazing, once I changed it to array it worked. I have another question. So I tried to redeploy the worker by passing a different template, but it seems like if the work pool already exists, the config doesn't get overwritten. I am using Helm to deploy the Prefect worker. Is there a way for me to force a reapply of the work pool config even if it exists? Thanks
k
The only thing you need to do to change a work pool config is edit it in the UI and save it. Then your worker will grab and use it for subsequent runs of deployments. If that's not what you're trying to achieve then I'm not sure I understand the question.
k
Yeah, changing it in the UI works, but is there any way I can avoid doing that? We are trying to follow a CI/CD flow where all the configuration of the work pools is done via Helm
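One pattern that should work regardless of what the worker chart itself supports (a sketch, not something settled in this thread; the `--base-job-template` flag on `update` depends on the Prefect 2.x CLI version, so verify with `prefect work-pool update --help`): keep the base job template JSON next to the Helm values and have the pipeline push it to the pool after the chart is applied.
```bash
# CI step run after `helm upgrade --install` of the prefect-worker chart;
# the pool name and template path are placeholders.
prefect work-pool update "gpu-work-pool" \
  --base-job-template ./base-job-template.json
```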