# prefect-getting-started
dan
I seem to not fully understand workers yet. This is what I have understood so far:

- Workers are where "flow runs" execute. They can run locally or remotely (in the cloud).
- Workers need access to code and data, either through local storage or a git repository.

I have some questions:

- What is the typical workflow for a data engineer using Prefect as a workflow orchestration tool together with a cloud provider (e.g. Google Cloud)? For example: first you code your flows, then you write the desired deployment configuration, then you commit your project to Git and let CI/CD build images and deploy your flows (see the sketch after this list). Then what happens?
- Why do I need to run `prefect worker start --pool POOL_NAME` to start a Docker worker? Ultimately, I need Prefect installed on the machine running that command to create a worker. I would prefer for a Docker container to subscribe itself to a work queue instead of having to pull the image.
- How do I handle automatic horizontal scaling for workers?
- What is the reasoning behind some of Prefect's design decisions (e.g. flows, tasks, workers, etc.)?
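To make that first question concrete, here is a minimal sketch of the "code your flow, then deploy it" step, assuming Prefect 2.x's Python deployment API (`flow.from_source(...).deploy(...)`). The repository URL, entrypoint, deployment name, and work pool name are hypothetical placeholders:

```python
from prefect import flow

if __name__ == "__main__":
    # Pull the flow code from a git repository at run time, so the worker's
    # infrastructure doesn't need the code baked in. All names below are
    # hypothetical placeholders.
    flow.from_source(
        source="https://github.com/my-org/my-repo.git",
        entrypoint="flows/etl.py:etl",  # path/to/file.py:flow_function
    ).deploy(
        name="etl-prod",
        work_pool_name="my-docker-pool",
    )
```

After this script (or a CI/CD job running it) registers the deployment, scheduled or manually triggered runs wait in the work pool until a worker picks them up.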
k
Hi dan, most of your questions are related, so I'll try to answer them in one go.

A worker can be where your flow run executes if you're using a process worker, but a process worker's capacity is limited by the hardware it's running on. A better way to think of workers is as infrastructure managers: they listen for scheduled flow runs and then spin up the type of infrastructure they're associated with for your flow to run on. Other types of workers, like Kubernetes or Cloud Run workers, run flows in separate containers as temporary jobs and can handle additional customizations related to that infrastructure, like CPU and memory requests.

Since you mentioned GCP, let's use Cloud Run as an example. Say you have a Cloud Run worker connected to one of your work pools, `cloud-run-pool`, and you also have a deployment with `cloud-run-pool` as its assigned work pool. Let's also assume you have an image built with your code in it, and your deployment specifies that image name and the flow function's entrypoint. When that deployment is run, the worker picks up the scheduled run and submits it to Cloud Run: it hits a Cloud Run endpoint affiliated with your GCP project with all the information from your work pool and your deployment needed to start a Cloud Run job with your image and any other customizations.

One of the main benefits of using a worker in a scenario like this is that you can run it in your secure environment (like one of the compute services in your GCP project), and you don't need to store any credentials in your Prefect Cloud workspace: the worker reaches out to the Prefect API and starts work in an environment where it's already authenticated to do so.
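As a rough sketch of the deployment side of that Cloud Run example, assuming Prefect 2.x's `Flow.deploy` API and a Cloud Run work pool named `cloud-run-pool` as above. The flow, deployment name, and image tag are hypothetical, and the image is assumed to have been built and pushed already (e.g. by CI/CD):

```python
from prefect import flow

@flow
def my_etl():
    print("running inside a Cloud Run job")

if __name__ == "__main__":
    # A worker watching "cloud-run-pool" will submit runs of this deployment
    # as Cloud Run jobs using the image below. build=False / push=False
    # assume CI/CD already built and pushed an image containing this code.
    my_etl.deploy(
        name="cloud-run-etl",
        work_pool_name="cloud-run-pool",
        image="us-docker.pkg.dev/my-project/prefect/my-etl:latest",
        build=False,
        push=False,
    )
```

The worker itself would run `prefect worker start --pool cloud-run-pool` somewhere that is already authenticated to GCP, which is why no GCP credentials need to live in your Prefect Cloud workspace.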
Here's a diagram demonstrating a similar setup in Kubernetes.