# ask-community
Tom Klein:
Hey all, question here about data-related tasks that are not Python-based, e.g. written in NodeJS. Today we have these wrapped as Docker images (using a Dockerfile we made) and deployed to K8s as scheduled jobs via our general CI/CD system. I started toying with the idea of breaking them into components and migrating them to Prefect for the various benefits it could offer. I saw that there's support for "docker" and "kubernetes" tasks, but since my DevOps knowledge is kind of limited, I was wondering if (by any chance) there are some examples of that kind of usage lying around somewhere, or if you could at least give your thoughts on whether what I'm thinking even makes sense?
Alex:
Hey Tom, great question! Prefect does have the ability to run tasks that interact with Docker images and containers. There's a collection of Docker tasks in the Prefect task library; the documentation for that is here: https://docs.prefect.io/api/latest/tasks/docker.html. In particular, you could use the `StartContainer` task to run a container from a pre-built Docker image.
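For example, a minimal sketch of chaining those Docker tasks (assuming Prefect 1.x; the image name and flow name are hypothetical placeholders):

```python
# Hedged sketch: run a pre-built image via Prefect 1.x Docker tasks.
# "my-registry/node-etl:latest" is a hypothetical placeholder image.
from prefect import Flow
from prefect.tasks.docker import (
    CreateContainer,
    StartContainer,
    WaitOnContainer,
    GetContainerLogs,
)

create = CreateContainer(image_name="my-registry/node-etl:latest")
start = StartContainer()
wait = WaitOnContainer()
logs = GetContainerLogs()

with Flow("node-etl-docker") as flow:
    container_id = create()
    started = start(container_id=container_id)
    finished = wait(container_id=container_id, upstream_tasks=[started])
    log_output = logs(container_id=container_id, upstream_tasks=[finished])
```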
upvote 1
Anna Geller:
@Tom Klein adding to Alex's answer: for Kubernetes, there is a `RunNamespacedJob` task, and here are two examples of how it can be used: https://github.com/anna-geller/packaging-prefect-flows/tree/master/flows_task_library
Here is an example for the Docker tasks Alex mentioned: https://docs.prefect.io/orchestration/recipes/k8s_docker_sidecar.html
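As a rough sketch (not taken verbatim from those examples), running a pre-built ECR image as a Kubernetes job from a flow could look roughly like this; the job spec, namespace, and image URL below are made-up placeholders:

```python
# Hedged sketch: RunNamespacedJob with a hypothetical job spec and image URL.
from prefect import Flow
from prefect.tasks.kubernetes import RunNamespacedJob

job_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "node-etl"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "node-etl",
                        # placeholder, not a real ECR path
                        "image": "XXXX.dkr.ecr.us-east-1.amazonaws.com/node-etl:latest",
                    }
                ],
                "restartPolicy": "Never",
            }
        },
        "backoffLimit": 2,
    },
}

run_job = RunNamespacedJob(
    body=job_spec,
    namespace="default",
    delete_job_after_completion=True,
    kubernetes_api_key_secret=None,  # assumption: in-cluster service account
)

with Flow("node-etl-k8s") as flow:
    run_job()
```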
Tom Klein:
Hmm, if I understand correctly: assuming our (Docker) images are already being built and stored automatically (using our CI/CD) in the image registry, we need to use this, no? (We are in fact using ECR for all our "official" service deployments in the company.) https://github.com/anna-geller/packaging-prefect-flows/blob/master/flows_no_build/docker_script_kubernetes_run_custom_ecr_image.py
Basically, all I'm trying to do is move the orchestration (scheduling, execution context within a larger DAG, retries, etc.) from the built-in K8s/Helm/whatever one to Prefect, but I'm fine with the build process being handled by a CI/CD system unrelated to Prefect and wired into our GitHub account.
But then I'm left with the problem of knowing where exactly the image is in ECR, right? So, from my understanding of the "philosophy" here, I would need to add (or rather, ask our DevOps to add) a step in our CI/CD system that reflects the image path/ID back to Prefect somehow, e.g. to the KV store? Or am I thinking about it all wrong?
Kevin Kho:
I think you would know the location ahead of time, because when you push it, you can specify the name and tag of that image in ECR.
Tom Klein:
@Kevin Kho but it's done automatically in our CI/CD system (CodeFresh), which I'm not really involved in. I.e., our DevOps (manually) wire some GitHub repo to be built on every commit and automatically pushed to ECR upon success. They then (manually) create (upon request from us) some K8s job (or long-running service, or whatever) based on said image (this is done once per job/service/etc., as part of its initial setup). And then when we "deploy" (using our deploy dashboard), the latest image is taken from ECR and pushed to K8s using their DevOps magic. At which point is the "ahead of time" you're referring to, relative to this process?
Kevin Kho:
When DevOps pushes the image to ECR, they have to push it under a certain registry, image name, and tag, so you just need to find the location they are pushing to, right?
But yeah, if you really don't know it, the KV store is certainly an option to retrieve the image address.
Tom Klein:
@Kevin Kho right, but it's done automatically, so we would have to "intervene" in the CI/CD process and add a fourth step (the first three being `git clone`, `docker build`, `push to ECR`) that somehow sends the image path somewhere Prefect will know of, right?
There must be some best practice here that I'm missing; I would have had the same problem if I was using Airflow, no?
Kevin Kho:
I think even if it's done automatically, that push to ECR has the destination hardcoded? But yeah, if not, then you can send it to the KV store. You would need your CI/CD environment to be authenticated, and then you can use it.
Tom Klein:
@Kevin Kho I have no idea what the current naming policy our DevOps applies to images is; for us (engineers) it's transparent. We push stuff to Git, eventually it gets built, and later we can deploy it 🙂 The wiring is behind the scenes.
Kevin Kho:
Uhh, I think the best practice might be for you to have more control over the image building and uploading, but I understand that it's another team in your organization. Persisting that location somewhere really seems like the only option here to automate it.
💯 1
Tom Klein:
We (the data team) can always do it ourselves, but I have a hunch that it's going to be a bad idea to split our CI/CD system in half into two separate ones.
Even with control over image building, I'm not sure I fully understand the best practice. How do big companies who use Airflow (or whatever) generally combine their CI/CD with an orchestration system which is distinct from it? Obviously at least one of them needs to know about the other; the question is whether you're supposed to make the CI/CD aware of the orchestration layer, or make the orchestration layer aware of the CI/CD process, or is it sort of like two separate universes, where you have a CI/CD for software and a parallel CI/CD process for data tasks?
Kevin Kho:
I think so too, because a simpler way to state this (ignoring CI/CD) is "there is some team that does some work and I need to know the result", which just feels wrong. For CI/CD, you can check this thread on how people deploy their flows with CI/CD.
🙏 1
Tom Klein:
@Kevin Kho this thread seems to be about a CI/CD process for the flows themselves, e.g. I wrote a new flow, how do I now push it to Prefect as part of an automatic process that maybe includes testing, code reviews, etc.? My question is about a slightly different level of operations: I'm fine with having no CI/CD around flow deployment (for now), but I still need some good way of executing non-Python code as part of DAGs, with the tasks being (for example) a custom Docker image.
I.e., I'm fine with ALL of these steps being manual (for now). It still leaves me with the problem of executing arbitrary images stored on ECR as part of Prefect DAGs (flows). This step (running a known pre-built image) cannot be manual, because I myself don't have access or visibility to ECR, and I'm not going to probe into it (and surely not ask our DevOps) every time I want to use some image as a task. I suppose that for simplicity's sake it could be replaced with an HTTP invocation of the image instead of a direct K8s invocation, but I'm not sure of all the implications of that... for example, it would mean the image would have to act as a service rather than as a stand-alone job.
Anna Geller:
@Tom Klein if the image tag is generated during CI/CD, then you know this information at the time when CI/CD runs. You could then use the KV store, as Kevin suggested, to store this repository URL as a key-value pair that your flow would read. This way, you always get the latest version generated by CI. You would need to ask your DevOps to include this in your CI YAML file:
```yaml
steps:
  - run: pip install prefect
  - run: prefect auth login --key $PREFECT_API_KEY
  - run: export PREFECT__CLOUD__USE_LOCAL_SECRETS=false && prefect kv set YOUR_FLOW_ECR_IMAGE_URL "XXXX.dkr.ecr.us-east-1.amazonaws.com/image_name:tag"
```
🙏 1
The key and value are just examples.
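On the flow side, reading that value back could look something like this sketch (Prefect 1.x KV store; the key name matches the CI step above):

```python
# Sketch: fetch the image URL that the CI step stored in the KV store.
from prefect import Flow, task
from prefect.backend import get_key_value

@task
def get_image_url() -> str:
    # key set by the CI step shown above
    return get_key_value("YOUR_FLOW_ECR_IMAGE_URL")

with Flow("read-image-url") as flow:
    image_url = get_image_url()
    # ...pass image_url to whatever task runs the container
```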
Tom Klein:
@Anna Geller thanks, that's pretty clear. My question is then just whether this is considered "best practice" or not 😁 Based on your experience, is it more common for data orchestration to have its own separate CI/CD or image-building process, distinct from that of the rest of the product? I guess it's kind of unorthodox to have data processes which aren't in Python to begin with, but still..
Prefect itself executes Python tasks as Docker images, no? (Still trying to wrap my head around the architecture.)
Anna Geller:
@Tom Klein absolutely! Your data pipelines are important and they deserve their own CI/CD process! 🙂 Given that you don't know the image tags until the CI step, there is really no other way than injecting this value during CI. And the KV store in Prefect happens to be an extremely convenient mechanism to pass that information along to your flows.
Tom Klein:
@Anna Geller hmm, I'm not sure our DevOps would agree with that statement, since we're still a small startup and they (for example) are in charge of matters like security, resource management, infrastructure budgeting, etc., so splitting it up in two possibly introduces a new array of concerns that would be hard for them to manage if it happens outside the (singular) CI/CD and ECR they work with. Not to mention that we ourselves don't want to add these concerns to our own process; we want to be concerned with *which* things run, *when*, and in *what* order, but not *where* or *how*...
Anna Geller:
Whatever works best for your team, do that. The same thing could be done from a mono-repo, right? There is no right or wrong here.
Tom Klein:
Ya, you're right... it's just that since we're really stepping into a new field (for us, and for the company itself), I'm trying to adhere to industry best practices, so I'm trying to understand what's more sane and what's less sane to do (in our particular case, which is probably similar to a lot of other companies of our size) 😆
👍 1
Alright, I think I know what I'm missing: our deploy dashboard must already have (or rely on) some kind of mapping from `service-name` to its ECR image, so I guess we can talk to our DevOps about letting us utilize the same mapping for Prefect... and then we'll just need to know the name/type of the container we're interested in, and not the actual ECR path.
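If that mapping were exposed to Prefect, resolving an image by service name might look something like this sketch (the key name, JSON shape, and helper function are all hypothetical):

```python
# Hypothetical sketch: resolve a service name to its ECR image via a shared
# mapping, e.g. one DevOps publishes to the Prefect KV store as JSON.
import json

from prefect.backend import get_key_value

def resolve_image(service_name: str) -> str:
    # assumed shape: {"service-name": "XXXX.dkr.ecr.../image:tag", ...}
    mapping = json.loads(get_key_value("SERVICE_IMAGE_MAPPING"))
    return mapping[service_name]
```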
Kevin Kho:
That sounds like a good approach.