Samuel Hinton
02/08/2021, 4:37 PM

Josh Greenhalgh
02/08/2021, 4:39 PM
files={pathlib.Path(__file__).parent.absolute() / "tasks.py": "/src/tasks.py"}
- there's a files arg which lets you copy files into the container and I believe adds them to PYTHONPATH
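A sketch of how that files argument fits into Docker storage (Prefect 0.x API; the /src path mirrors the snippet above, and making the copied file importable via a PYTHONPATH entry in env_vars is an assumption, not something Docker storage is confirmed to do automatically):

```python
import pathlib

from prefect.storage import Docker

# `files` maps local paths to destination paths inside the built image;
# `env_vars` here puts /src on PYTHONPATH so tasks.py is importable.
storage = Docker(
    files={pathlib.Path(__file__).parent.absolute() / "tasks.py": "/src/tasks.py"},
    env_vars={"PYTHONPATH": "$PYTHONPATH:/src"},
)
```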
Samuel Hinton
02/08/2021, 4:43 PM

Jim Crist-Harif
02/08/2021, 5:05 PM
You can use a storage class (GitHub, S3, etc.) to host your flow source itself, allowing for quick updating of the flow; you'd only need to update the docker image if the dependencies or common functions changed.
flow.run_config = DockerRun(image="my-shared-image")
flow.storage = GitHub("my/repo", "flows/myflow.py")
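For context, a self-contained version of the two lines above (a sketch assuming the Prefect 0.x API; the task body and names are placeholders):

```python
from prefect import Flow, task
from prefect.run_configs import DockerRun
from prefect.storage import GitHub

@task
def say_hello():
    print("hello")

with Flow("myflow") as flow:
    say_hello()

# The shared image holds the dependencies; GitHub hosts the flow source,
# so pushing to the repo updates the flow without rebuilding the image.
flow.run_config = DockerRun(image="my-shared-image")
flow.storage = GitHub(repo="my/repo", path="flows/myflow.py")
```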
Or use Docker storage, but use a base image with all your dependencies/common files present already. Then all your flows will share the layers in the base image and only add a layer holding the flow source itself (which should be quick, and won't take up that much space).

Samuel Hinton
02/08/2021, 5:09 PM

Jim Crist-Harif
02/08/2021, 5:10 PM
image kwarg somewhere).

Samuel Hinton
02/08/2021, 5:13 PM

Jim Crist-Harif
02/08/2021, 5:13 PM

Samuel Hinton
02/08/2021, 5:16 PM

Jim Crist-Harif
02/08/2021, 5:18 PM
Dask uses cloudpickle to distribute tasks around, which can distribute code defined in the main flow runner alone to the workers, without requiring the source be available.
What this means for you as a user:
• Usually you want all containers started for a flow run to run the same image. This simplifies debugging and removes one more thing to think about. Not strictly required, but it's best practice.
• If your task makes use of an external library (anything that you import into your script defining your flow), that library needs to be available in the image.
• Any code written in the script defining your flow can be pulled and distributed by prefect's storage mechanism, and thus doesn't have to be part of the image.

from prefect import Flow, task

from myutils.stuff import a_helper  # myutils needs to be in the image
# This script itself doesn't need to be in the image

@task
def mytask(a):
    return a_helper(a)

with Flow("example") as flow:
    ...
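A small stdlib-only demonstration of the serialization point (this is not Prefect code; it just shows why plain pickle would not be enough and a by-value serializer like cloudpickle is needed):

```python
import pickle

# Plain pickle serializes a function *by reference* (module name plus
# qualified name), so whoever unpickles it must be able to import the same
# source.  A lambda defined here has no importable name, so pickling fails:
double = lambda x: 2 * x

try:
    pickle.dumps(double)
    result = "pickled"
except (pickle.PicklingError, AttributeError):
    result = "failed"

print(result)  # plain pickle cannot ship the code itself

# cloudpickle (what dask uses under the hood) serializes the function *by
# value* (bytecode, defaults, and closure included), so workers can run it
# without having the flow script on disk.  With the third-party package:
#
#   import cloudpickle
#   restored = pickle.loads(cloudpickle.dumps(double))
#   assert restored(21) == 42
```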
> On second thought, given the import is in the main flow itself, I'm guessing the file would have to be present on both the dask worker image (so the tasks can use it) and the agent that runs the flow itself (as it's imported when running the flow). Would that be correct?
Agents never import or run user code; the agent only needs access to prefect itself.
When an agent receives a flow to run, it will start a new process (referred to as the flow-runner process) that will run the flow. This runs in the image you provided and loads your flow from storage. If you're using a dask executor, this may also start (or connect to) other processes to distribute your flow run. In that case, those other processes should also use the same image.

Samuel Hinton
02/08/2021, 5:24 PM

Jim Crist-Harif
02/08/2021, 5:32 PM
• Your flow has a DockerRun run config describing details of the container to use for that flow run, and a Storage object describing where to load the flow source from.
• The docker agent kicks off a new docker container using the config in the DockerRun
object (this may specify an image to use, etc...).
• This container starts the flow runner process. It pulls the flow source from the provided Storage
object and loads the flow.
• The flow runner sees it has a DaskExecutor
configured. By default this starts a local dask cluster (in the same container), parallelizing your flow across several processes. No extra config is needed here. If you passed in a cluster_class
to create a temporary external dask cluster, you'll also need to pass in the image to use, to ensure the environments match. Likewise, if you passed in an address
to connect to an existing external cluster, you'll want to make sure the existing cluster has a compatible environment. Lots of flexibility here, which makes it hard to describe. See https://docs.prefect.io/orchestration/flow_config/executors.html#daskexecutor for more info.
• The flow runner starts processing the flow. Tasks will be sent out to the dask workers (which run either in the same container or in external containers, depending on your config). Imports used by these tasks need to be available wherever the dask workers are running, but anything defined in the flow script file itself doesn't need to be.

Samuel Hinton
02/08/2021, 5:37 PM

Sean Talia
02/08/2021, 5:39 PM
We use a DockerRun config and S3 for storage; we're recommending that everyone who wants to use prefect build their runconfig image using the common boilerplate one I've created as their base image, so that all of the behind-the-scenes stuff that can (and should) be abstracted away from them will immediately be available.

Jim Crist-Harif
02/08/2021, 5:46 PM
> what is the difference between DaskExecutor (with no external address) and LocalDaskExecutor?
Dask supports a few different scheduling backends (the original local scheduler we wrote, and the later distributed backend, which can also run locally). LocalDaskExecutor uses the original local scheduler; DaskExecutor uses the distributed backend.
For prefect users, I'd hope that the description here (https://docs.prefect.io/orchestration/flow_config/executors.html#choosing-an-executor) is sufficient for choosing an executor; are there more details we should add here?
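The distinction might look like this in flow config (a sketch against the Prefect 0.x API current at the time; the scheduler address and worker count are made up):

```python
from prefect import Flow
from prefect.executors import DaskExecutor, LocalDaskExecutor

flow = Flow("example")

# LocalDaskExecutor: dask's original local scheduler, a thread (or
# process) pool inside the flow-runner process.  Lightweight.
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=8)

# DaskExecutor: dask's distributed backend.  With no arguments it starts
# a temporary local cluster in the same container:
flow.executor = DaskExecutor()

# ...or it can point at an existing external cluster instead:
flow.executor = DaskExecutor(address="tcp://dask-scheduler:8786")
```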
> And a Docker RunConfig won't have any issues with the whole DockerInDocker thing?
The docker agent needs to run somewhere it can kick off docker containers (so if you're running it in a docker container itself, you'd need a docker-in-docker setup). The prefect flow runs themselves don't know they're in docker, so no docker-in-docker issues there.
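For reference, the usual way to run the docker agent itself in a container without true docker-in-docker is to mount the host's docker socket so flow-run containers start as siblings (a sketch; the image tag is an assumption, and auth config is omitted):

```shell
# Give the containerized agent access to the host's docker daemon via the
# mounted socket; containers it launches are siblings, not children.
docker run -d \
  -v /var/run/docker.sock:/var/run/docker.sock \
  prefecthq/prefect:latest \
  prefect agent docker start
```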
Sean Talia
02/08/2021, 5:48 PM

Samuel Hinton
02/08/2021, 5:57 PM

Jim Crist-Harif
02/08/2021, 6:15 PM
A LocalDaskExecutor is a good fit for running all your tasks in a thread pool if you do have opportunities for parallelism within a flow. This works great for network-heavy code, while being lighter weight than a full DaskExecutor. If you don't have opportunities for parallelism within a single flow, though, then a LocalExecutor makes sense.

Samuel Hinton
02/08/2021, 6:19 PM

Jim Crist-Harif
02/08/2021, 6:21 PM

Samuel Hinton
02/08/2021, 6:43 PM

Karolína Bzdušek
02/09/2021, 7:55 AM

Samuel Hinton
02/09/2021, 9:03 AM

Jim Crist-Harif
02/09/2021, 6:01 PM
You can start the agent without the hostname label (prefect agent local start --no-hostname-label), or add a matching label to your local run and the agent:
flow.run_config = LocalRun(labels=["my-local-label"])
and to start the agent:
prefect agent local start --label my-local-label
See https://docs.prefect.io/orchestration/flow_config/run_configs.html#labels for more info.

Karolína Bzdušek
02/10/2021, 12:09 PM