# prefect-getting-started
h
Hi, I have some problems understanding the concept of work pools, queues, and agents. After reading the documentation several times, I now interpret it as follows: a pool defines the infrastructure and enables the execution of flows on distributed infrastructure. Agents can be assigned to specific queues to respond to flow requests. However, the execution of flows also works just by starting the Prefect server, which contains an unhealthy default work queue in the default pool. Would this already be a minimal setup? I have two virtual machines and would like to be able to run flows on both. Do I need a new pool for this? I guess I would have to add the second server to a work queue via an agent? How could I create a new pool, since I don't get any options in the UI list (e.g. Docker)? Do I need to create an infrastructure block? Sorry for the stupid questions.
j
Hi Crunch. Not stupid questions. With a new setup, forget about agents and infrastructure blocks. They are older concepts that we plan to deprecate soon (with at least a six-month phase-out period).

With Prefect Cloud or a Prefect server instance running in a location accessible by both VMs, you can make a single work pool in the UI. Then start a worker on each VM that polls that work pool. Leave those workers running. Each worker polls the work pool it is tied to. Workers and work pools are typed by infrastructure (e.g. Docker), so if your infrastructure for your flow runs is Docker on one VM and a subprocess on the other, you would create two work pools. When a deployment is scheduled to run, the worker will see that and kick off the flow run in your infrastructure. If you're using a Docker work pool, the flow will run in a Docker container on the VM.

Work queues can be used for prioritizing or limiting concurrent work. You may not need to think about them at all. However, if you want one VM to get more of the work, you can create two work queues in the UI and specify a priority or concurrency limit. Then, when you start your worker, specify that work queue in addition to the work pool. The default queue is created automatically so that many users don't need to think about work queues if they don't want to. It shows as unhealthy if it hasn't been polled by a worker recently.

Does that all make sense?
Diagram that might be helpful:
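To make that concrete, here is a minimal sketch, assuming a Docker-type work pool named `my-docker-pool` created in the UI; the pool, deployment, and image names are all placeholders:

```python
from prefect import flow

@flow(log_prints=True)
def my_etl():
    print("Running on whichever VM's worker picked up this run")

if __name__ == "__main__":
    # Create a deployment that targets the Docker-type work pool.
    # On each VM, leave a worker running that polls the same pool:
    #   prefect worker start --pool my-docker-pool
    my_etl.deploy(
        name="my-etl",
        work_pool_name="my-docker-pool",    # placeholder: pool created in the UI
        image="my-registry/my-etl:latest",  # placeholder image name
    )
```

With a worker started against the same pool on each VM, either machine can pick up scheduled runs of this deployment.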
h
@Jeff Hale thanks for the detailed reply. Things are becoming much clearer now and I will test this setup. One last question: we mainly have flows that can be deployed or served while running in the background. We have one flow (a preparation flow) that needs to be associated with an API call and also needs to respond with a data result once it is finished. We have found that calling a deployment within the Flask API takes too long to respond once it is ready. Currently we only call the native flow function in Flask, but we have issues with crashes when they are called at the same time. What would be the best practice to achieve an API-bound flow with a result return? I am aware that I am mixing a bit of ETL with API, but we need the dashboard information from the prepare flow for the operating units. Is it e.g. possible to assign a flow to a worker without a deployment? Thank you very much!
j
The short answer is that you need a deployment to associate a flow with a worker. It sounds like you could probably use an automation that runs a deployment when an event is triggered (maybe through a webhook). It might be best to book a rubber duck session with a Prefect engineer to talk through the specifics of your use case.
👀 1
FYI, I just made a video of an automation that runs a deployment when an event is triggered through a webhook that should be on YouTube in a few days.
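For the event-triggered pattern, one way to fire a matching event directly from Python (for example, from the Flask handler) is the events client. A minimal sketch; the event and resource names here are made up, and the automation that runs the deployment would be configured separately to match them:

```python
from prefect.events import emit_event

# Emit a custom event; an automation with a matching custom-event trigger
# can then run the preparation-flow deployment.
emit_event(
    event="prepare-flow.requested",                   # made-up event name
    resource={"prefect.resource.id": "api.prepare"},  # made-up resource id
)
```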
h
Great, thanks for the info. I will collect some questions and come back to the Prefect experts.
🙌 1
j

The video is up on YouTube.

👀 1
i
Hi @Jeff Hale, I have a question regarding the worker pool type `subprocess`. As the agent will be deprecated soon, I need to migrate to workers. However, I am facing challenges because the new deployment approach requires a Docker image, and my current setup (VM and agent running locally) doesn't allow for Docker use for specific reasons. I tried to use the local `subprocess` type, but `flow.deploy` requires a Docker image for deploying flows. I'm a bit confused why a `subprocess` worker can be created but can't be used with the new deployment approach. Can you clarify what I am missing here, please? Additionally, I would greatly appreciate any guidance or recommendations on how to navigate this challenge. Thanks.
j
Hi Iryna! Using `flow.deploy` you can specify flow code storage on a git-based cloud option such as GitHub, GitLab, or Bitbucket, or a cloud-based storage provider such as AWS S3, GCS, or Azure Blob Storage. In my quick check, it doesn't look like a process block can be set in `flow.from_source`, but I'm doing some more digging.
Using `flow.serve` might be the best option for your use case. You can see the docs here.
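For reference, a minimal `flow.serve` sketch with placeholder names: `serve` registers the deployment and executes its runs in the same long-lived process, so no work pool, worker, or Docker image is involved.

```python
from prefect import flow

@flow
def my_flow():
    ...

if __name__ == "__main__":
    # Registers a deployment and polls for its scheduled runs in this process.
    my_flow.serve(
        name="my-served-deployment",  # placeholder deployment name
        cron="0 * * * *",             # placeholder hourly schedule
    )
```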
i
Hi @Jeff Hale, thanks for the answer. My flows are scheduled and deployed from GitHub Actions, but execution happens on a VM, so serve probably does not suit the need. Is there another way to execute/deploy flows with the new deployment approach? Thanks
j
Ah, if you’re using GitHub Actions, you might want to store your flow code on GitHub. Would that work?
i
Yes, I currently store flow code in git and use GitHub storage as a source. How can this help with my requirement? Can you explain, please? Thanks a lot
j
Cool. You can pass `flow.from_source` the GitHub repository URL and any credentials if it’s a private repo. See examples here.
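A minimal sketch of that pattern, assuming a Docker-type work pool; the repo URL, entrypoint, and other names are placeholders:

```python
from prefect import flow

if __name__ == "__main__":
    flow.from_source(
        source="https://github.com/org/repo.git",  # placeholder repo URL
        entrypoint="flows/my_flow.py:my_flow",     # placeholder "path:function"
    ).deploy(
        name="my-deployment",
        work_pool_name="my-docker-pool",           # placeholder Docker-type pool
        image="my-registry/my-image:latest",       # placeholder image name
    )
```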
i
I tried this approach and got the error `ValueError: Work pool 'default-work-pool' does not support custom Docker images. Please use a work pool with an 'image' variable in its base job template.` The `default-work-pool` is process type. My code:
```python
from prefect import flow
from prefect.client.schemas.schedules import CronSchedule
from prefect.filesystems import GitHub

# Snippet from a larger deployment helper; cron, timezone, flow_entrypoint,
# depl_name, work_pool_name, params, and tags come from the surrounding scope.
cron = None if cron is None or cron == "" else CronSchedule(cron=cron, timezone=timezone)
storage = GitHub.load("my-repo")

flow.from_source(
    source=storage,
    entrypoint=flow_entrypoint
).deploy(
    name=depl_name,
    work_pool_name=work_pool_name,
    parameters=params,
    tags=tags,
    schedule=cron,
    build=False
)
```
What am I doing wrong? Thanks
j
I would put the GitHub URL in directly, like the example in the doc, instead of loading a GitHub storage block.
i
I changed the code but got the same error `ValueError: Work pool 'default-work-pool' does not support custom Docker images. Please use a work pool with an 'image' variable in its base job template.` Code:
```python
from prefect import flow
from prefect.blocks.system import Secret
from prefect.runner.storage import GitRepository

# Snippet from a larger deployment helper; flow_entrypoint, depl_name,
# work_pool_name, params, tags, and cron come from the surrounding scope.
flow.from_source(
    source=GitRepository(
        url="https://github.com/xx/my-repo.git",
        branch="dev",
        credentials={
            "access_token": Secret.load("github-personal-access-token").get()
        }
    ),
    entrypoint=flow_entrypoint
).deploy(
    name=depl_name,
    work_pool_name=work_pool_name,
    parameters=params,
    tags=tags,
    schedule=cron,
    build=False
)
```
j
I can reproduce. Apologies, I was not aware that it wouldn’t work with a subprocess work pool. I created a feature request issue if you want to add to it or follow it.
👀 1
i
Thanks Jeff!