# prefect-server
c
We have batch jobs written in other languages and packed into containers, and we write our tasks mainly using
RunKubernetesJob
. But every time I run the flow using
KubernetesAgent
, the agent always spins up a new job, which in my case is useless and a waste of resources. Since the flow is mainly about spinning up new k8s jobs, I want
KubernetesAgent
to run it directly. Please advise me on how to do that.
k
The agents never run things directly except for Local Agent. You could try using a Local Agent inside a Kubernetes pod and letting it run inside but I don’t think that set-up is right.
RunNamespacedJob
is really intended to spin up a new pod. If you have jobs in other languages, you can try calling them directly inside your Flow with the
ShellTask
from prefect import Flow
from prefect.tasks.shell import ShellTask

shell = ShellTask()  # runs a shell command in a subprocess
with Flow(...) as flow:
    shell(...)
so you can call them through the command line. I think running them all on the agent is less efficient right? Because that means your agent pod needs all of the resources to run the batch jobs. If you leave the agent lightweight, then you can just create a pod to run the Flow. And then instead of
RunNamespacedJob
, just use the
ShellTask
to invoke all those programs inside the container.
c
ShellTask
doesn't help much. 100% of the tasks in my flow are
RunKubernetesJob
, which is why I believe the agent will have no workload at all, and spinning up a kuber job just to run that kind of flow is a waste of resources
k
When you say
RunKubernetesJob
, are you referring to
KubernetesRun
or
RunNamespacedJob
? Could you tell me why the
ShellTask
doesn't help?
c
My code is as follows:
flow = Flow("test", tasks=[RunKubernetesJob(), RunKubernetesJob(), RunKubernetesJob()], run_config=KubernetesRun())
As you can see, my flow is all about spinning up other kuber jobs; there is no computation there. I cannot pack every batch job into a single container because that is inefficient and may cause conflicts
k
Ah ok I think I understand what you are saying. You are saying that each task needs its own job, but why even have the Flow pod if the Agent can just kick off jobs directly? It’s just the Flow pod that you are saying is not efficient, right?
c
yup, absolutely
k
I understand what you are saying, but the Agent and the Flow just have different concerns. The Agent is programmed to kick off Flow Runs while the Flow is made to submit tasks. So in order for the agent to kick off these processes, you need them to be Flows with one task (maybe ShellTask) to start off the job
🙌 1
c
I believe I can build on LocalAgent and KubernetesAgent to create a new Agent class that does the job like I said. Can @Anna Geller advise me more about this?
a
@Chu Lục Ninh Kevin provided an excellent explanation but I can try to clarify more. Prefect has this separation of concerns that each agent has a method called
deploy_flow
. This method decides how the compute infrastructure for the flow run should be deployed. Then, when it comes to where and how your task runs get executed, this is what the executor decides. If you use the default
LocalExecutor
, then all your task runs execute within the same environment as the flow run, which here is the flow run pod deployed as a Kubernetes job. If you used e.g. a
DaskExecutor
, then your task runs would be shipped to Dask workers for execution. When running Dask on Kubernetes, you could e.g. use the
KubeCluster
class to spin up a Dask cluster on Kubernetes. In your use case, each of your flow runs gets deployed as a Kubernetes job, and since you designed your tasks to run as separate Kubernetes jobs via a Kubernetes task, each of your tasks also gets deployed into a separate pod. But there is no way around still having a flow run pod. This is an entirely separate concern from what your tasks are doing. Your task could run on Databricks if you wish, or could execute some in-warehouse SQL transformation, but the Prefect flow run needs its own process (a subprocess, a Docker container, a Kubernetes job, an ECS task)
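To make the split above concrete, here is a toy model in plain Python. It is only a sketch: `AgentSketch`, `FlowSketch`, and `LocalExecutorSketch` are hypothetical names and are not Prefect's classes. It just mirrors the idea that `deploy_flow` picks the infrastructure for the flow run, while the executor decides where each task run executes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class LocalExecutorSketch:
    """Hypothetical executor: runs every task inside the flow-run process."""

    def submit(self, task: Callable[[], str]) -> str:
        return task()


@dataclass
class FlowSketch:
    """Hypothetical flow: an ordered list of tasks plus an executor."""

    name: str
    tasks: List[Callable[[], str]]
    executor: LocalExecutorSketch = field(default_factory=LocalExecutorSketch)

    def run(self) -> List[str]:
        # The executor, not the agent, decides where each task run executes.
        return [self.executor.submit(task) for task in self.tasks]


@dataclass
class AgentSketch:
    """Hypothetical agent: only decides how the flow run itself is deployed."""

    infrastructure: str  # e.g. "kubernetes job" or "subprocess"

    def deploy_flow(self, flow: FlowSketch) -> dict:
        # A real agent would create the job or process and return;
        # this toy version runs the flow inline so the result is visible.
        return {"flow_run_on": self.infrastructure, "task_results": flow.run()}


def submit_k8s_job() -> str:
    # Stand-in for a task whose only work is asking Kubernetes to start a job.
    return "job submitted"


result = AgentSketch("kubernetes job").deploy_flow(
    FlowSketch("test", [submit_k8s_job, submit_k8s_job, submit_k8s_job])
)
print(result)
```

Even when every task only submits another job, something still has to host `FlowSketch.run()` and track task order, which is the role the flow run pod plays.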
🙌 1
c
I got it. Should I try to be creative and create a new Agent that can talk directly to Kuber like
KubernetesAgent
but can
popen
to execute the flow in a subprocess like LocalAgent?
And I don't mean to remove the flow. I need that flow to orchestrate tasks, since my tasks still depend on each other. I just want to customize
KubernetesAgent
so it runs the flow directly in a subprocess instead of spawning a new pod for the flow
a
well, if you do that, you are kind of creating your own Prefect, right? 🙂 this is the kind of negative engineering we try to eliminate.
I just want to customize
KubernetesAgent
so it runs the flow directly in a subprocess instead of spawning a new pod for the flow
If that’s the case, you should use
ShellTask
rather than
RunKubernetesJob
.
ShellTask
creates a subprocess and runs some custom Linux shell command within that subprocess.
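As a rough illustration of "runs a shell command in a subprocess", here is a standard-library sketch. It is not ShellTask's implementation; `run_shell` is a hypothetical helper, and it assumes a POSIX `sh` is available on the machine.

```python
import subprocess


def run_shell(command: str) -> str:
    """Run a shell command in a child process and return its stdout."""
    completed = subprocess.run(
        ["sh", "-c", command],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()


print(run_shell("echo hello from a subprocess"))
```

The child process shares the machine (here, the flow run pod) with the caller, so any binary it invokes has to exist in that same container image.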
c
not really like that, I just think I can customize it a little bit to support my new use case. And I think that doesn't break the Prefect model at all, since we can have many types of
Agent
right?
a
We can, but I'm not sure what you will accomplish this way. Let’s take a step back. I know you somehow don’t like the fact that the flow run needs its own pod. Can you explain what the problem with that is? Do you have some budget or resource constraints? The flow run itself shouldn’t consume that many resources, and your solution with a separate Kubernetes job per task to manage custom non-Python dependencies sounds like the right approach. Not sure what’s not working as you would want it to. So far it looks like you implemented it the right way; I would do the same in a use case where each task requires different libraries/dependencies
the only alternative would be a Docker agent instead of a Kubernetes agent and doing it this way: https://discourse.prefect.io/t/can-prefect-run-each-task-in-a-different-docker-container/434
c
thanks, I will look into that
Can you explain what is the problem with that? Do you have some budget or resource constraints?
We have a devops team to manage and monitor the system, and we just don't want to pollute the logging system with so many pods spawning just to spawn other pods. Another thing is that we have many other kuber batch jobs in the same namespace, and we control those jobs separately from Prefect, so we want to minimize the effort of monitoring unnecessary jobs
a
hmm maybe a label selector can help you organize those jobs?
kubectl get pods -l environment=prefect
c
sure, but that only helps a little since some Prefect jobs have to run in a specific namespace, and I have to deal with the kubernetes team who manage and monitor my jobs along with other teams' non-Prefect jobs too
a
not sure I understand, you can use both namespaces AND labels, they are not mutually exclusive
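A small sketch of combining the two, in plain Python. The helper names and the `batch-jobs` namespace are made up for illustration; `-n` and `-l` are standard kubectl flags. The manifest fragment shows where pod labels live in a Job. Whether and how you hand such a template to Prefect (for example through a job template on the run config) is an assumption to check against the docs for your Prefect version.

```python
from typing import Dict, List


def pod_query(namespace: str, labels: Dict[str, str]) -> List[str]:
    """Build a kubectl command that filters by namespace AND label selector."""
    selector = ",".join(f"{key}={value}" for key, value in labels.items())
    return ["kubectl", "get", "pods", "-n", namespace, "-l", selector]


def labelled_job_fragment(labels: Dict[str, str]) -> dict:
    """Minimal Job manifest fragment that puts the labels on the job's pods."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"labels": dict(labels)},
        # Pod-template labels are what `kubectl get pods -l ...` matches on.
        "spec": {"template": {"metadata": {"labels": dict(labels)}}},
    }


prefect_labels = {"environment": "prefect"}
cmd = pod_query("batch-jobs", prefect_labels)  # "batch-jobs" is a hypothetical namespace
template = labelled_job_fragment(prefect_labels)
print(" ".join(cmd))
```

With labels on the pods, the other teams can also exclude Prefect pods in the same namespace with a negated selector such as `-l environment!=prefect`.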
c
After discussing today with the team, we decided to accept the fact that Prefect will launch a kuber job to orchestrate other kuber jobs.
👍 1