# ask-community
t
Hey again - yesterday I presented the results of my Prefect PoC to my team, and my team lead said they think we should wrap all DS code in Docker containers and use those as "blackbox" steps instead of directly invoking Python code from the flow -- am I right in my understanding that if we do that, we lose some of the advantages of Prefect, like being able to easily map the output of one docker run to the input of the next task, or caching/persistence of results, etc., and we'll need to do all these things manually ourselves?
a
It's up to your team to decide. You're 100% correct that using containers as black boxes doesn't take advantage of Prefect's granular visibility into your workflow and the ability to react to specific states - e.g. getting notified when a specific task fails, attaching results, caching, retries, restarts, etc. You would lose not just some advantages, but MOST of the advantages of Prefect. If our docs and blog posts don't explain it well enough, perhaps you can share these videos with your team so that they can understand the problem a bit better:

https://www.youtube.com/watch?v=TlawR_gi8-Y&t=1s

https://www.youtube.com/watch?v=wejJzGQ4XDo
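(For concreteness, this is the kind of per-task behavior the thread is about - a minimal Prefect 1.x sketch with hypothetical task names; with a blackbox container, retries and caching could only apply to the container as a whole:)

```python
from datetime import timedelta

from prefect import Flow, task

# Hypothetical tasks - the point is that retries, caching, and state
# handling are configured per task, not per container
@task
def load_features():
    return [[1.0, 2.0], [3.0, 4.0]]

@task(max_retries=3, retry_delay=timedelta(seconds=30), cache_for=timedelta(hours=6))
def train_model(features):
    return len(features)

with Flow("granular-ml-example") as flow:
    model = train_model(load_features())
```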

t
@Anna Geller yes, we're well aware of the fact that a blackbox wouldn't offer visibility in Prefect into what's happening inside it -- but that's not exactly my question. My question has more to do with the fact that a Docker container in itself does not have any input or output interfaces, and the Prefect Docker tasks don't add that functionality when they wrap them. Our flows can have many other tasks besides the ML task itself; my team lead suggested that only the ML tasks be wrapped in containers -- not that everything around them (e.g. pulling from Snowflake, sending results to some other service) also necessarily happens in containers. What I'm trying to verify is that if a task lives in a Docker container, then - as it's currently designed - Prefect offers only low-level interaction with that Docker container and no functional interaction with whatever happens inside it.
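(To illustrate: with the Docker tasks, any data handoff in or out of the container has to be wired by hand - a rough sketch, assuming `CreateContainer` accepts docker-py-style `command`/`environment` arguments; the image name and environment variable are hypothetical:)

```python
from prefect import Flow
from prefect.tasks.docker import (
    CreateContainer,
    GetContainerLogs,
    StartContainer,
    WaitOnContainer,
)

# Inputs can only go in via the command line or environment variables;
# outputs come back out only as raw log text you parse yourself
create = CreateContainer(
    image_name="my-ml-image:latest",
    command="python train.py",
    environment=["DATASET_PATH=s3://bucket/train.csv"],
)
start = StartContainer()
wait = WaitOnContainer()
logs = GetContainerLogs()

with Flow("blackbox-io") as flow:
    container_id = create()
    started = start(container_id=container_id)
    status = wait(container_id=container_id, upstream_tasks=[started])
    # `output` is just the container's stdout; mapping it to the next
    # task's input is entirely manual
    output = logs(container_id=container_id, upstream_tasks=[status])
```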
a
that’s correct
t
ok - and another concern was raised just now in a meeting with our DevOps: might the default container image used for the `KubernetesRun` jobs contain too many dependencies? i.e. does it need to contain, for example, a Docker daemon by default in order to be able to run `docker` tasks, etc.? (they were wondering if we'd need to define a different image per flow in order to save resources) or are the dependencies dynamically inferred somehow?
a
In general, installing packages at runtime will slow down all your flow runs, because the install has to happen every time before the actual flow run can start. We have the option to add extra pip packages, but again, this may slow down all your flow runs, so baking all your dependencies into your image is more advisable for performance reasons:
```python
from prefect.run_configs import KubernetesRun

flow.run_config = KubernetesRun(env={"EXTRA_PIP_PACKAGES": "scikit-learn matplotlib"})
```
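(The baked-image alternative would look something like this - the image name is hypothetical, passed via `KubernetesRun`'s `image` argument:)

```python
from prefect.run_configs import KubernetesRun

# All dependencies are baked into this (hypothetical) image at build time,
# so nothing has to be installed when the flow run starts
flow.run_config = KubernetesRun(image="my-registry/my-flow-image:1.0")
```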
t
ah ok, I was talking not about pip dependencies, but rather about dependencies external to Python, like a Docker daemon (from what you're saying and from the docs, I understand that the general recommendation is to use dedicated images for flows based on the dependencies they need). but I'm still curious whether every image that inherits from the default Prefect image would also run a Docker daemon -- and I just remembered (from when I was working on the PoC) that you said the image doesn't actually run Docker at all and instead only interfaces with it via the socket (and relies on there being some daemon running on the machine). so, in general, if we wanted to run Docker tasks (e.g. pull image, run container, wait on container, etc.), we'd have to launch the daemon alongside the flow's image ourselves, right? something like the sketch below?
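(i.e. pairing the flow's job with a `docker:dind` sidecar and pointing `DOCKER_HOST` at it - roughly sketched here as a custom job template, assuming `KubernetesRun` accepts a `job_template` dict; all field values are illustrative:)

```python
from prefect.run_configs import KubernetesRun

# Illustrative job template adding a docker:dind sidecar, so the flow's
# Docker tasks have a daemon to talk to over TCP
job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        # Prefect fills in the flow container's image/command
                        "name": "flow",
                        "env": [
                            {"name": "DOCKER_HOST", "value": "tcp://localhost:2375"}
                        ],
                    },
                    {
                        "name": "dind-daemon",
                        "image": "docker:dind",
                        "securityContext": {"privileged": True},
                    },
                ]
            }
        }
    },
}

flow.run_config = KubernetesRun(job_template=job_template)
```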
also, something that's not clear to me about this doc: https://docs.prefect.io/orchestration/recipes/k8s_docker_sidecar.html -- isn't the `PullImage` task never actually invoked in the flow in the example code?
a
It is invoked here.
t
wait, I'm confused about the syntax here 🤔 isn't the Python var `image` a reference to a task (rather than the result of one)? and doesn't it need to be executed itself, the way `create_container` is invoked with parentheses - `create_container(…)` - in the body of the `flow` block?
a
test it for yourself - replace it with your image and try it out
I think the syntax from the docs should work just fine, because the `image` task is passed and called via data dependencies
t
even if it would work, I don't understand why it does 🙂 why does invoking `CreateContainer(...)` yield a task reference which can then be placed in a var named `create_container` (this isn't the result of executing the task, it's just a reference to the task that was just defined), while for `PullImage`, doing the exact same thing supposedly immediately yields a result -- even though it seems to be executed outside of the `flow` block? I feel like I'm missing something about how Prefect derives the actual DAG from the `create_container(image)` invocation, i.e. that it implicitly understands it needs to execute some unnamed task that was created using `PullImage(…)` even though it's never explicitly invoked within the flow block….
i.e. would doing:
```python
pull_image = PullImage(...)

with Flow(...) as flow:
    image = pull_image()
    container_id = create_container(image)
    ...
```
be 100% equivalent?
a
yes, this should be equivalent. The easiest way to see it is to visualize your flow - note that both 1 and 2 result in an identical computational graph:
```python
from prefect import Flow
from prefect.tasks.docker import (
    CreateContainer,
    GetContainerLogs,
    PullImage,
    StartContainer,
    WaitOnContainer,
)
from prefect.triggers import always_run

pull_image = PullImage(
    docker_server_url="tcp://localhost:2375",
    repository="prefecthq/prefect",
    tag="latest",
)
create_container = CreateContainer(
    docker_server_url="tcp://localhost:2375",
    image_name="prefecthq/prefect:latest",
    command='''python -c "from prefect import Flow; f = Flow('empty'); f.run()"''',
)
start_container = StartContainer(docker_server_url="tcp://localhost:2375")
wait_on_container = WaitOnContainer(docker_server_url="tcp://localhost:2375")
# We pass `trigger=always_run` here so the logs will always be retrieved, even
# if upstream tasks fail
get_logs = GetContainerLogs(
    docker_server_url="tcp://localhost:2375", trigger=always_run
)

with Flow("Docker sidecar example") as flow:
    # Create and start the docker container; `pull_image` is passed as an
    # implicit upstream data dependency here
    container_id = create_container(pull_image)
    started = start_container(container_id=container_id)
    # Once the docker container has started, wait until it's completed and get the status
    status_code = wait_on_container(container_id=container_id, upstream_tasks=[started])
    # Once the status code has been retrieved, retrieve the logs
    logs = get_logs(container_id=container_id, upstream_tasks=[status_code])

flow.visualize()  # 1.


with Flow("Docker sidecar example") as flow:
    # Create and start the docker container; here `pull_image` is called
    # explicitly inside the flow block instead
    image = pull_image()
    container_id = create_container(image)
    started = start_container(container_id=container_id)
    # Once the docker container has started, wait until it's completed and get the status
    status_code = wait_on_container(container_id=container_id, upstream_tasks=[started])
    # Once the status code has been retrieved, retrieve the logs
    logs = get_logs(container_id=container_id, upstream_tasks=[status_code])

flow.visualize()  # 2.
```
t
hmm, interesting 🤔 thx for the example. I have to admit I prefer the explicit style of describing the flow; the implicit style is not 100% clear to me (e.g. are there edge cases where it would break, etc.), so I guess we'll just use the explicit style for now
a
The Zen of Python: “Explicit is better than implicit.”
t
zen
and just to close out on the actual topic of the thread -- I guess this recipe in your docs: https://docs.prefect.io/orchestration/recipes/k8s_docker_sidecar.html describes how one should go about running Docker tasks for flows executed in K8s. And - it's not strictly a Prefect-specific question, but - I wonder if there's any advantage/disadvantage you identify in doing this rather than, for example, launching external jobs using the Kubernetes `RunNamespacedJob` task? (assuming I could just launch the exact same image as a job in K8s instead of using the sidecar)
a
I wouldn't view this sidecar recipe as an example of best practices; it's just one of many possible implementations. If you want to run a Kubernetes job, you should either put your logic into your flow (which itself ends up being deployed as a Kubernetes job), or use `RunNamespacedJob` - this will make things easier than the sidecar, imo.
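(A rough sketch of the `RunNamespacedJob` route - the job spec below is illustrative, and the name and image are placeholders:)

```python
from prefect import Flow
from prefect.tasks.kubernetes import RunNamespacedJob

# Illustrative k8s Job spec - the metadata name and image are placeholders
job_body = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "heavy-nodejs-task"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "main", "image": "my-registry/nodejs-task:1.0"}
                ],
                "restartPolicy": "Never",
            }
        }
    },
}

run_job = RunNamespacedJob(body=job_body, namespace="default")

with Flow("external-k8s-job") as flow:
    run_job()
```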
t
@Anna Geller the assumption is that that logic already has to run as a Docker container (e.g. because it's a NodeJS script that requires a lot of resources, and so on), so it's basically a question of whether to run that container in the same job as the flow (using the sidecar) or as another K8s job. My main obstacle is just that Prefect doesn't have any native interaction with these (e.g. in Airflow, the `KubernetesPodOperator` allows for "native" [XCom] communication to and from the K8s pod via a `json` file in the container). I wish there was similar functionality for the `Docker` or `Kubernetes` run tasks 🤔
a
If you want to push a small amount of data from one pod to another, you can leverage the KV Store; this will have the same effect as XCom. You can send your key-value pairs using Python, as well as the CLI or API.
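(A minimal sketch of the KV Store idea in Python - the key and value are hypothetical, and this assumes a Prefect Cloud backend, since the KV Store is a Cloud feature:)

```python
from prefect.backend import get_key_value, set_key_value

# Producer side: one flow/pod stores a small result
set_key_value(key="latest_model_path", value="s3://bucket/models/v42")

# Consumer side: another flow/pod reads it back
model_path = get_key_value(key="latest_model_path")
```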