# ask-community
t
Hey again - yesterday I presented the results of my Prefect PoC to my team, and my team lead said they think we should wrap all DS code in Docker containers and use those as "blackbox" steps instead of directly invoking Python code from the flow -- am I right in my understanding that if we do that, we lose some of the advantages of Prefect, like being able to easily map the output of one docker run to the input of the next task, or caching/persistence of results, etc., and we'll need to do all these things manually ourselves?
a
It's up to your team to decide. You're 100% correct that using containers as black boxes doesn't take advantage of Prefect's granular visibility into your workflow and the ability to react to specific states - e.g. getting notified when a specific task fails, attaching results, caching, retries, restarts, etc. You would lose not just some advantages, but MOST of the advantages of Prefect. If our docs and blog posts don't explain it well enough, perhaps you can share these videos with your team so that they can understand the problem a bit better:

https://www.youtube.com/watch?v=TlawR_gi8-Y&t=1s

https://www.youtube.com/watch?v=wejJzGQ4XDo
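(For concreteness, this is the kind of per-task behavior the thread is about - a minimal Prefect 1.x sketch with hypothetical task names; with a blackbox container, retries and caching could only apply to the container as a whole:)

```python
from datetime import timedelta

from prefect import Flow, task

# Hypothetical tasks - the point is that retries, caching, and state
# handling are configured per task, not per container
@task
def load_features():
    return [[1.0, 2.0], [3.0, 4.0]]

@task(max_retries=3, retry_delay=timedelta(seconds=30), cache_for=timedelta(hours=6))
def train_model(features):
    return len(features)

with Flow("granular-ml-example") as flow:
    model = train_model(load_features())
```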

t
@Anna Geller yes, we're well aware of the fact that a blackbox wouldn't offer visibility in Prefect into what's happening inside it -- but that's not exactly my question. My question has more to do with the fact that a Docker container in itself does not have any input or output interfaces, and the Prefect Docker tasks don't add that functionality when they wrap them. Our flows can have many other tasks besides the ML task itself; my team lead suggested that only the ML tasks be wrapped in containers -- not that everything around them (e.g. pulling from Snowflake, sending results to some other service) also necessarily happens in containers. What I'm trying to verify is that if a task lives in a Docker container, then - as it's currently designed - Prefect offers only low-level interaction with that Docker container and no functional interaction with whatever happens inside it.
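(To illustrate: with the Docker tasks, any data handoff in or out of the container has to be wired by hand - a rough sketch, assuming `CreateContainer` accepts docker-py-style `command`/`environment` arguments; the image name and environment variable are hypothetical:)

```python
from prefect import Flow
from prefect.tasks.docker import (
    CreateContainer,
    GetContainerLogs,
    StartContainer,
    WaitOnContainer,
)

# Inputs can only go in via the command line or environment variables;
# outputs come back out only as raw log text you parse yourself
create = CreateContainer(
    image_name="my-ml-image:latest",
    command="python train.py",
    environment=["DATASET_PATH=s3://bucket/train.csv"],
)
start = StartContainer()
wait = WaitOnContainer()
logs = GetContainerLogs()

with Flow("blackbox-io") as flow:
    container_id = create()
    started = start(container_id=container_id)
    status = wait(container_id=container_id, upstream_tasks=[started])
    # `output` is just the container's stdout; mapping it to the next
    # task's input is entirely manual
    output = logs(container_id=container_id, upstream_tasks=[status])
```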
a
that’s correct
t
ok - and another concern was raised just now in a meeting with our DevOps: might the default container image used for the `KubernetesRun` jobs contain too many dependencies? i.e. does it need to contain, for example, a Docker daemon by default in order to be able to run `docker` tasks, etc.? (they were wondering if we'd need to define a different image per flow in order to save resources) or are the dependencies dynamically inferred somehow?
a
In general, installing packages at runtime will slow down all your flow runs, because the install has to happen every time before the actual flow run can start. We have the option to add extra pip packages, but again, this may slow down all your flow runs, so baking all your dependencies into your image is more advisable for performance reasons:
```python
from prefect.run_configs import KubernetesRun

flow.run_config = KubernetesRun(env={"EXTRA_PIP_PACKAGES": "scikit-learn matplotlib"})
```
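(The baked-image alternative would look something like this - the image name is hypothetical, passed via `KubernetesRun`'s `image` argument:)

```python
from prefect.run_configs import KubernetesRun

# All dependencies are baked into this (hypothetical) image at build time,
# so nothing has to be installed when the flow run starts
flow.run_config = KubernetesRun(image="my-registry/my-flow-image:1.0")
```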
t
ah ok, I was talking not about pip dependencies, but rather about dependencies external to Python, like a Docker daemon (from what you're saying and from the docs, I understand that the general recommendation is to use dedicated images for flows based on the dependencies they need). but I'm still curious whether every image that inherits from the default Prefect image would also run a Docker daemon -- and I just remembered (from when I was working on the PoC) that you said the image doesn't actually run Docker at all and instead only interfaces with it via the socket (and relies on there being some daemon running on the machine). so, in general, if we wanted to run Docker tasks (e.g. pull image, run container, wait on container, etc.), we'd have to launch the daemon alongside the flow's image ourselves, right? something like the sketch below?
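(i.e. pairing the flow's job with a `docker:dind` sidecar and pointing `DOCKER_HOST` at it - roughly sketched here as a custom job template, assuming `KubernetesRun` accepts a `job_template` dict; all field values are illustrative:)

```python
from prefect.run_configs import KubernetesRun

# Illustrative job template adding a docker:dind sidecar, so the flow's
# Docker tasks have a daemon to talk to over TCP
job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        # Prefect fills in the flow container's image/command
                        "name": "flow",
                        "env": [
                            {"name": "DOCKER_HOST", "value": "tcp://localhost:2375"}
                        ],
                    },
                    {
                        "name": "dind-daemon",
                        "image": "docker:dind",
                        "securityContext": {"privileged": True},
                    },
                ]
            }
        }
    },
}

flow.run_config = KubernetesRun(job_template=job_template)
```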
also, something that's not clear to me about this doc: https://docs.prefect.io/orchestration/recipes/k8s_docker_sidecar.html -- isn't the `PullImage` task never actually invoked in the flow in the example code?
a
It is invoked here.
t
wait, I'm confused about the syntax here 🤔 isn't the Python var `image` a reference to a task (rather than the result of one)? and doesn't it need to be executed itself, the way `create_container` is invoked with parentheses - `create_container(…)` - in the body of the `flow` block?
a
test it for yourself - replace it with your image and try it out
I think the syntax from the docs should work just fine, because the `image` task is passed and called via data dependencies
t
even if it would work, I don't understand why it does 🙂 why does invoking `CreateContainer(...)` yield a task reference which can then be placed in a var named `create_container` (this isn't the result of executing the task, it's just a reference to the task that was just defined), while for `PullImage`, doing the exact same thing supposedly immediately yields a result -- even though it seems to be executed outside of the `flow` block? I feel like I'm missing something about how Prefect derives the actual DAG from the `create_container(image)` invocation, i.e. that it implicitly understands it needs to execute some unnamed task that was created using `PullImage(…)` even though it's never explicitly invoked within the flow block….
i.e. would doing:
```python
pull_image = PullImage(...)

with Flow(...) as flow:
    image = pull_image()
    container_id = create_container(image)
    ...
```
be 100% equivalent?
a
yes, this should be equivalent. The easiest way to see it is to visualize your flow - note that both 1 and 2 result in an identical computational graph:
```python
from prefect import Flow
from prefect.tasks.docker import (
    CreateContainer,
    GetContainerLogs,
    PullImage,
    StartContainer,
    WaitOnContainer,
)
from prefect.triggers import always_run

pull_image = PullImage(
    docker_server_url="tcp://localhost:2375",
    repository="prefecthq/prefect",
    tag="latest",
)
create_container = CreateContainer(
    docker_server_url="tcp://localhost:2375",
    image_name="prefecthq/prefect:latest",
    command='''python -c "from prefect import Flow; f = Flow('empty'); f.run()"''',
)
start_container = StartContainer(docker_server_url="tcp://localhost:2375")
wait_on_container = WaitOnContainer(docker_server_url="tcp://localhost:2375")
# We pass `trigger=always_run` here so the logs will always be retrieved, even
# if upstream tasks fail
get_logs = GetContainerLogs(
    docker_server_url="tcp://localhost:2375", trigger=always_run
)

with Flow("Docker sidecar example") as flow:
    # Create and start the docker container; `pull_image` is passed as an
    # implicit upstream data dependency here
    container_id = create_container(pull_image)
    started = start_container(container_id=container_id)
    # Once the docker container has started, wait until it's completed and get the status
    status_code = wait_on_container(container_id=container_id, upstream_tasks=[started])
    # Once the status code has been retrieved, retrieve the logs
    logs = get_logs(container_id=container_id, upstream_tasks=[status_code])

flow.visualize()  # 1.


with Flow("Docker sidecar example") as flow:
    # Create and start the docker container; here `pull_image` is called
    # explicitly inside the flow block instead
    image = pull_image()
    container_id = create_container(image)
    started = start_container(container_id=container_id)
    # Once the docker container has started, wait until it's completed and get the status
    status_code = wait_on_container(container_id=container_id, upstream_tasks=[started])
    # Once the status code has been retrieved, retrieve the logs
    logs = get_logs(container_id=container_id, upstream_tasks=[status_code])

flow.visualize()  # 2.
```
t
hmm, interesting 🤔 thx for the example. I have to admit I prefer the explicit style of describing the flow; the implicit style is not 100% clear to me (e.g. are there edge cases where it would break, etc.), so I guess we'll just use the explicit style for now
a
The Zen of Python: “Explicit is better than implicit.”
t
zen
and just to close out on the actual topic of the thread -- I guess this recipe in your docs: https://docs.prefect.io/orchestration/recipes/k8s_docker_sidecar.html describes how one should go about running Docker tasks for flows executed in K8s. And - it's not strictly a Prefect-specific question, but - I wonder if there's any advantage/disadvantage you identify in doing this rather than, for example, launching external jobs using the Kubernetes `RunNamespacedJob` task? (assuming I could just launch the exact same image as a job in K8s instead of using the sidecar)
a
I wouldn't view this sidecar recipe as an example of best practices; it's just one of many possible implementations. If you want to run a Kubernetes job, you should either put your logic into your flow (which itself ends up being deployed as a Kubernetes job), or use `RunNamespacedJob` - this will make things easier than the sidecar, imo.
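(A rough sketch of the `RunNamespacedJob` route - the job spec below is illustrative, and the name and image are placeholders:)

```python
from prefect import Flow
from prefect.tasks.kubernetes import RunNamespacedJob

# Illustrative k8s Job spec - the metadata name and image are placeholders
job_body = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "heavy-nodejs-task"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "main", "image": "my-registry/nodejs-task:1.0"}
                ],
                "restartPolicy": "Never",
            }
        }
    },
}

run_job = RunNamespacedJob(body=job_body, namespace="default")

with Flow("external-k8s-job") as flow:
    run_job()
```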
t
@Anna Geller the assumption is that that logic already has to run as a Docker container (e.g. because it's a NodeJS script that requires a lot of resources, and so on), so it's basically a question of whether to run that container in the same job as the flow (using the sidecar) or as another K8s job. My main obstacle is just that Prefect doesn't have any native interaction with these (e.g. in Airflow, the `KubernetesPodOperator` allows for "native" [XCom] communication to and from the K8s pod via a `json` file in the container). I wish there was similar functionality for the `Docker` or `Kubernetes` run tasks 🤔
a
If you want to push a small amount of data from one pod to another, you can leverage the KV Store; this will have the same effect as XCom. You can send your key-value pairs using Python, as well as the CLI or API.
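(A minimal sketch of the KV Store idea in Python - the key and value are hypothetical, and this assumes a Prefect Cloud backend, since the KV Store is a Cloud feature:)

```python
from prefect.backend import get_key_value, set_key_value

# Producer side: one flow/pod stores a small result
set_key_value(key="latest_model_path", value="s3://bucket/models/v42")

# Consumer side: another flow/pod reads it back
model_path = get_key_value(key="latest_model_path")
```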