# prefect-community
g
Hello, I have successfully deployed a Kubernetes Prefect Agent on an Azure Kubernetes Service (AKS) cluster. I am trying to run a simple flow that utilizes a LocalDaskExecutor on the AKS virtual nodes. For this, I am using a custom job template for the pod, because it needs some customized node selectors and tolerations that Azure publishes. The following is a snippet of my job_template:
job_template={
            "apiVersion": "batch/v1",   
            "kind": "Job",
            "spec": {
                "template": {
                    "metadata": {
                        "labels": {
                            "execution-model": "serverless"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "name": "flow"
                            }
                        ],
                        "nodeSelector": {
                            "execution-model": "serverless"
                        },
                        "tolerations": [
                            {
                                "key": "virtual-kubelet.io/provider",
                                "operator": "Exists"
                            }
                        ]
                    }
                }
            }
}
However, the flow fails. When I ran kubectl get events, I noticed the following output:
Warning   ProviderCreateFailed   pod/prefect-job-XXXXX-XXXXX   ACI does not support providing args without specifying the command. Please supply both command and args to the pod spec.
Just some more information - I also ran the same flow successfully on an alternate deployment on AWS EKS Fargate, using an AWS Kubernetes Agent. Any guidance is really appreciated :)
discourse 1
k
Not sure if this is directly a Prefect issue; it seems like maybe some kind of k8s “protection” logic. It doesn’t like when you pass args but don’t specify a command: https://stackoverflow.com/questions/62997153/dask-kubernetes-aks-azure-virtual-nodes
upvote 1
It may be cloud-specific (i.e. not enabled in AWS)
g
Interesting, thanks for your prompt response!
k
Yeah, Kyle looks to be right here; this is not necessarily a Prefect thing
m
This is not related to Prefect, at least not directly. This is a specific issue that pops up when using Azure AKS virtual nodes (see the link posted by Kyle above). The reason it works on AWS Fargate is that the two services are different under the hood.
upvote 1
a
There are some great answers here already, but if you still haven't solved it yet, Gaurav, could you share the Dockerfile of the image you use, as well as your flow object configuration, i.e. storage, run config, and the executor? The SO answer doesn't seem right to me, as it discusses an issue with a dask KubeCluster, while you mentioned you are using a LocalDaskExecutor, which should just use local threads and processes rather than a KubeCluster. For troubleshooting, I'd recommend taking it more step by step: try a simple hello-world flow on AKS with no custom job template first (just using the Prefect base image) to ensure your Prefect AKS setup is working fine before moving to custom job templates and Dask.
m
The SO answer discusses the underlying issue, but indeed, the issue popped up in a different context there (it just happens to involve a dask KubeCluster). The real problem is that if you want to use AKS with virtual nodes, not only do you have to add custom annotations/node selectors, but you also have to supply both a command and args to the pod spec (which is the exact error message you got). Apparently, ACI uses the combination under the hood, and it would lead to unexpected behaviour if you supply one without the other; see https://github.com/virtual-kubelet/azure-aci/pull/11
upvote 1
So you have to add these to the custom job template
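To make that concrete, here's a minimal sketch of the flow container entry with both fields set explicitly. The /bin/sh -c wrapper and the prefect execute flow-run args are illustrative assumptions based on what the agent normally injects, not values taken from the agent code:

```python
# Sketch: the flow container entry from the custom job template, extended
# with an explicit command next to the args the agent normally injects.
# The concrete values are illustrative.
flow_container = {
    "name": "flow",
    "command": ["/bin/sh", "-c"],           # explicit, so ACI is satisfied
    "args": ["prefect execute flow-run"],   # what the agent sets by default
}

# ACI only admits the pod when both keys are present.
assert "command" in flow_container and "args" in flow_container
```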
g
Hello Anna and team! I have been following your amazing tutorial series on Medium, and to highlight my issue in a simpler way I made a very basic version of the repo myself to follow along. It is hosted at: https://github.com/k1ngAlakazam/prefect-k8s I ran this flow on (1) AWS EKS provisioned nodes, (2) AWS EKS Fargate and also (3) Azure AKS provisioned nodes. It works everywhere apart from (4) Azure AKS virtual nodes. I also ran the test with a LocalExecutor only; it worked on Azure AKS provisioned nodes but failed on virtual nodes. Matthias - I did come across that GH link above, but apologies, I really could not figure out what values are needed for the command parameter. Is it some sort of no-op command like "/bin/bash -c" ? The template I tried to use is here: https://github.com/k1ngAlakazam/prefect-k8s/blob/main/flow_utilities/prefect_configs.py Thank you
🙌 1
m
Looking at the job spec in the repo, I think you have to add these (not entirely sure about the args, though): https://github.com/PrefectHQ/prefect/blob/master/src/prefect/agent/kubernetes/job_spec.yaml#L15-L16
🙏 1
Been doing a little digging in the code, and it turns out that when you supply a custom job spec, the agent only sets args but no command (see https://github.com/PrefectHQ/prefect/blob/master/src/prefect/agent/kubernetes/agent.py#L550). With the custom job spec you supplied, ACI fails because no command was set
upvote 1
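A simplified model of that behaviour may help; this is not the actual agent code, just an illustration of the merge described above, with the injected args value assumed for illustration:

```python
def apply_agent_overrides(container: dict) -> dict:
    """Rough model of how the Kubernetes agent treats the flow container
    of a custom job spec: args are always overwritten, while command is
    left alone (so it stays unset unless the template sets it)."""
    merged = dict(container)
    merged["args"] = ["prefect execute flow-run"]  # injected by the agent
    return merged

# A bare container spec ends up with args but no command, which is
# exactly the combination that ACI rejects on virtual nodes.
result = apply_agent_overrides({"name": "flow"})
print("command" in result, "args" in result)  # -> False True
```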
a
First of all, thank you so much for sharing your repository; it's immensely valuable for troubleshooting. I think you've implemented everything correctly, and usually you shouldn't have to set the command and args on the flow container spec. The way you did it is correct, and it should work IMO; the fact that it works perfectly fine on AWS EKS and on AKS without virtual nodes only confirms that. It could be that there are some quirks on AKS with virtual nodes, so it's worth trying to uncomment those lines (even though you probably have already tried that 😅)
What also comes to mind is that perhaps you need to add the image and imagePullPolicy to your job template spec instead of on the normal KubernetesRun args, basically moving those two values after line 25, because I saw it being set this way here. If you do this, the spec should have all the info it needs:
def set_run_config() -> RunConfig:
    return KubernetesRun(
        labels=["azure"],
        cpu_request="2",
        memory_request="4G",
        job_template={
            "apiVersion": "batch/v1",
            "kind": "Job",
            "spec": {
                "template": {
                    "metadata": {
                        "labels": {
                            "execution-model": "serverless"
                        }
                    },
                    "spec": {
                        "containers": [
                            {
                                "name": "flow",
                                "image": "purplebeast786/dummy:latest",
                                "imagePullPolicy": "IfNotPresent",
                                "command": ["/bin/sh", "-c"],
                                "args": ["prefect execute flow-run"]
                            }
                        ],
                        "nodeSelector": {
                            "execution-model": "serverless"
                        },
                        "tolerations": [
                            {
                                "key": "virtual-kubelet.io/provider",
                                "operator": "Exists"
                            }
                        ]
                    }
                }
            }
        }
    )
m
@Anna Geller there are indeed some quirks when using AKS virtual nodes, namely that when you set spec.container.args, which is done by the agent when deploying the flow (see my comment above), then you have to specify spec.container.command too (which doesn't happen by default). So if you uncomment that one alone, it should be fine…
upvote 1
🙏 1
g
Thanks for the responses! I added the command, image and imagePullPolicy in my job spec and ran the container (https://github.com/k1ngAlakazam/prefect-k8s/blob/main/flow_utilities/prefect_configs.py). However, what's interesting is that it tries to create the pod, the pod terminates itself after a few minutes, and I don't see any logs in the Prefect job UI either. It's still stuck in a Running state in the UI.
Here are the results of kubectl get events and the UI:
It also seems to reschedule the job again via Lazarus after a while:
m
And are you able to inspect the logs of the pod?
g
I am trying to do just that 😕 It stays in the waiting state for a long time and then goes directly to terminated.
I ran the command kubectl logs -c flow --follow <pod name> while it was in the running state, and I see the following output (looks like just the default CLI help that shows up when running a wrong command?). It got terminated shortly after.
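That symptom matches how Kubernetes combines the two fields: command replaces the image's ENTRYPOINT and args replaces CMD, and the container runs their concatenation as one argv list. A quick local illustration of the /bin/sh -c pattern, with a placeholder echo standing in for the real Prefect command:

```python
import subprocess

# Kubernetes effectively runs command + args as a single argv list.
command = ["/bin/sh", "-c"]                      # pod spec "command"
args = ["echo hello from the flow container"]    # pod spec "args"

completed = subprocess.run(command + args, capture_output=True, text=True)
print(completed.stdout.strip())  # -> hello from the flow container
```

If the concatenated argv is not something the image can actually execute, the container exits immediately, which would explain seeing only CLI help output before termination.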
m
Indeed, there must be something wrong with the command and args then…
Could you perhaps try to uncomment the args too? Just to see what happens…
g
Just tried that, with the same result as above 😞
m
If you managed to make it work on EKS or on AKS without virtual nodes, can you retrieve the exact job manifest from those jobs? What are the commands and args?
upvote 1
Another thing you could try is to replace the command with what’s supplied in the entrypoint https://github.com/PrefectHQ/prefect/blob/master/Dockerfile#L35
g
(1) When I ran the above-linked entrypoint as the command, without args, on virtual nodes, it created the container, but when I tried to view the logs, nothing came up as output. The job terminated after a few minutes and the Prefect logs reported:
Pod prefect-job-13744367-thlwn failed.
	Container 'flow' state: waiting
(2) I'm not sure what you meant by getting the manifest of the job when running somewhere else. I ran the job on the AKS provisioned node pool by commenting out the command, args and the taints. It worked successfully as expected, but the logs of the pod were just the output that shows in Prefect:
a
I think Matthias meant: since running the same flow on AKS without virtual nodes has worked fine, you could run it there again, check the exact manifest of the Kubernetes job that led to a successful flow run pod, and use it as a basis for AKS with virtual nodes. Honestly, I don't have any other recommendations, since you seem to be doing everything correctly. Do you happen to have a support plan on Azure? Maybe that way you could ask them if they see anything weird here or if they have any recommendations. Is there an option for you to not use virtual nodes? Given that everything else is working fine, perhaps giving up on virtual nodes is the best option? 😄 Just mentioning this as a valid possibility
g
Ah, I see what he meant now 🙂 I have raised a support ticket with Azure already, but they mostly seem to indicate that if you can run a pod on the virtual nodes successfully, then it's most likely a problem with the software/third-party tool I am using (which is why I'm bouncing back and forth :\). I would definitely use provisioned nodes, but our use case prefers serverless to save costs on client environments.
👍 1
They also asked whether we are trying to get access to the Docker API on the host itself by any chance, which might have some limitations/problems around it.