
Michael Warnock

07/17/2021, 4:12 PM
Dependency best-practices question: I have an existing repo, `feature-generator`, which contains both worker/orchestration logic and the code for doing the work. I added task and flow definitions to it, but with GitHub storage the flow can't find the other modules in that repo (I've seen https://github.com/PrefectHQ/prefect/discussions/4776 and understand this is intentional). My question is how best to structure things so that my flow can use that repo's code, but also execute a parameterized run from `feature-generator`, on commit, through CI (because that's how we start the job right now). Obviously, I can make `feature-generator` a package and depend on it from a new `flows` repo, but having `feature-generator` start the run would create a circular dependency. Would you split it into three repos, with one of them just being responsible for executing the flow? I don't love that idea, but maybe that's best practice?

Kevin Kho

07/17/2021, 4:36 PM
Hey @Michael Warnock, I think this setup can work, and in fact I've heard that one of the advantages of Prefect is the ability to keep compute logic and orchestration logic together. First is how to make `feature-generator` available to the flow. Because you already have a module, we normally recommend putting it in a Docker container (as a module), copying it over, and using `pip install -e .`. I have a minimal example for this that might help if you haven't done it yet. After the image is hosted somewhere, the flow will grab that image and run on top of that container. I think the CI/CD process can start the run on commit: you would register the flow and then run it using the Prefect CLI (`prefect register ...`, `prefect run ...`). This gives you the registration and the one-time run. A very common complaint is that this approach requires a Docker rebuild. It's not particularly bad with the Docker cache, but some users rely on the image hash being the same every time. If you ever want a setup that doesn't require building every time, you can put all of the dependencies in the Docker container and leave that static; when the flow runs, it will pull that image and run on top of it. In that setup the dependencies are decoupled from the flow. But it seems that if you want them side by side, you need to do the rebuilds.
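A rough sketch of that Docker-storage approach, assuming Prefect 0.x; the registry, image name, and paths below are placeholders, not anything from this thread:
```python
# Rough sketch only: Docker storage that copies the repo into the image and
# pip-installs it (Prefect 0.x assumed; registry/image/paths are placeholders).
from prefect import Flow
from prefect.storage import Docker

with Flow("feature-flow") as flow:
    ...  # task definitions that import feature_generator

flow.storage = Docker(
    registry_url="<your-registry>",
    image_name="feature-generator-flow",
    # Copy the local checkout into the image, then install it as a package.
    files={"/abs/path/to/feature-generator": "/opt/feature-generator"},
    extra_dockerfile_commands=["RUN pip install -e /opt/feature-generator"],
    python_dependencies=["pandas"],  # any extra pip deps your tasks need
)
```
CI would then do something like `prefect register --project <project> -p flows/feature_flow.py` followed by `prefect run --name feature-flow --project <project>` (exact flags vary by Prefect version).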

Michael Warnock

07/17/2021, 4:38 PM
so, the storage type would be Docker?

Kevin Kho

07/17/2021, 4:42 PM
Yes, that's right. Storage would be Docker, and then you can use `DockerRun` as the run configuration (seen here) to specify that image.
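For the run side, a minimal sketch (the image reference and agent label are placeholders):
```python
# Minimal sketch of a DockerRun run config pointing at the image built by
# Docker storage (image reference and agent label are placeholders).
from prefect.run_configs import DockerRun

flow.run_config = DockerRun(
    image="<your-registry>/feature-generator-flow:latest",
    labels=["docker-agent"],  # should match your Docker agent's labels
)
```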

Michael Warnock

07/17/2021, 4:45 PM
Let me make sure I understand... I keep the code the flow depends on, the flow and task definitions, and the code that executes the flow in one repo, but I avoid the circular dependency by having my CI run `prefect register` and `prefect run` as CLI utils?
I'm actually testing with DockerRun right now, but I chose GitHub storage and it couldn't find the local modules. I'll try Docker storage, and Coiled as you suggest.
Err, sorry; the circular dependency wouldn't be there because I'm not splitting the flow code out. Why use the CLI as opposed to Python?

Kevin Kho

07/17/2021, 4:49 PM
I'm not sure you will hit a circular reference, but I may be missing something. I have this example where I have a minimal package. As long as the files inside the package handle their imports correctly, you don't get a circular import.
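For reference, a hypothetical single-repo layout along those lines (names are illustrative, not taken from Kevin's example):
```python
# Hypothetical layout for keeping the package and the flow in one repo:
#
#   feature-generator/
#     setup.py
#     feature_generator/
#       __init__.py
#       features.py        # compute logic
#     flows/
#       feature_flow.py    # imports feature_generator, defines the flow
#
# setup.py -- minimal packaging so `pip install -e .` works inside the image:
from setuptools import find_packages, setup

setup(
    name="feature-generator",
    version="0.1.0",
    packages=find_packages(),
)
```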

Michael Warnock

07/17/2021, 4:50 PM
right- I was confused. Thanks 🙂

Kevin Kho

07/17/2021, 4:50 PM
You can totally have a Python script and use that instead of the CLI. We have a GraphQL API you can hit, and you can also use the `Client` to create a flow run with `client.create_flow_run(flow_id)`. Then run the Python script with `python ____.py` in your CI/CD.
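A minimal sketch of that CI script, assuming Prefect 0.x and a placeholder flow ID:
```python
# Minimal sketch of kicking off a registered flow from CI with the Prefect
# Client (Prefect 0.x assumed; the flow ID and parameter are placeholders).
from prefect import Client

client = Client()  # picks up API credentials from the environment/config

flow_run_id = client.create_flow_run(
    flow_id="<your-flow-id>",
    parameters={"commit_sha": "abc123"},  # illustrative parameter
)
print(f"Started flow run {flow_run_id}")
```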

Michael Warnock

07/17/2021, 4:51 PM
that's what I'm doing; I just went down the wrong road with GitHub storage and ECSRun

Kevin Kho

07/17/2021, 4:52 PM
Coiled has free credits too, so it's free to get started with. They have their own Docker container, so let me look a bit at how to reconcile your Docker storage with their Docker container for the Dask client and workers.

Michael Warnock

07/17/2021, 4:54 PM
ah, didn't foresee that problem; thanks! We'll have to go 'pro' as soon as I have it working, because we need GPU. Looks like the way to go though; I don't fancy maintaining a Dask cluster.

Kevin Kho

07/17/2021, 5:02 PM
Oh, I was just looking at the docs and saw that the Coiled cluster has GPU support. Attaching GPUs to a container is actually not so simple in Prefect for DockerRun, because `docker-py` doesn't expose the flag to add them. I feel like in the GPU case we should maybe rely on Coiled to own the Docker image and dependencies. You could then just continue to use GitHub storage, and the flow would run on top of the Coiled image for the cluster.
The question then becomes how we get your custom module onto the Docker container behind the Dask cluster. They call this a `software_environment`. I'm not seeing anything immediate in their docs. Want me to ask on their Slack for you?

Michael Warnock

07/17/2021, 5:05 PM
Sure- that would be extremely helpful!

Kevin Kho

07/17/2021, 5:05 PM
Ah, I found it; this is how you add a package on GitHub.
So in summary: use the default `RunConfiguration` or `LocalRun`, have all of the Docker stuff handled by Coiled's `software_environment`, and install your custom module with the link above. This makes your library available on all Dask workers along with the other dependencies. Specify that software environment when you choose an executor. This all lets you stay with GitHub storage.
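Roughly, that setup could look like the sketch below, assuming the Coiled API of the time; the org, repo, token, agent label, and environment names are placeholders:
```python
# Sketch of GitHub storage + LocalRun + a Coiled-backed DaskExecutor
# (Prefect 0.x and the 2021-era coiled API assumed; names/URLs are placeholders).
import coiled
from prefect import Flow
from prefect.executors import DaskExecutor
from prefect.run_configs import LocalRun
from prefect.storage import GitHub

# Software environment that bakes the private package into the Dask
# scheduler/worker image, e.g. via a token-authenticated pip URL.
coiled.create_software_environment(
    name="feature-generator-env",
    pip=[
        "prefect",
        "git+https://<token>@github.com/<org>/feature-generator.git",
    ],
)

with Flow("feature-flow") as flow:
    ...  # tasks that import feature_generator inside the task body

flow.storage = GitHub(repo="<org>/feature-generator", path="flows/feature_flow.py")
flow.run_config = LocalRun(labels=["local-agent"])  # placeholder agent label
flow.executor = DaskExecutor(
    cluster_class=coiled.Cluster,
    cluster_kwargs={"software": "<account>/feature-generator-env", "n_workers": 4},
)
```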

Michael Warnock

07/17/2021, 5:11 PM
does it, though? I still have a chicken-egg problem with the task depending on local modules, which you're suggesting be installed from private packages; to do that, I have to break it into two repos, and I'm left with my original question.

Kevin Kho

07/17/2021, 5:12 PM
Yeah I’m thinking about this more. Where would you run your agent in production?

Michael Warnock

07/17/2021, 5:13 PM
Doesn't the temporary dask cluster config you linked to take care of that?

Kevin Kho

07/19/2021, 4:29 PM
I chatted with the Prefect team and left some questions on the Coiled Slack. Here is a summary of everything:
1. Using GPUs is difficult with `docker-py` because of the SDK, which means you need to use LocalRun.
2. The Coiled team said that if you map a task that needs a GPU, it should be able to find the available GPU on the cluster.
3. If the Prefect `map` ever becomes a bottleneck for efficient resource management, you would have to move to Dask code and Dask mechanisms such as annotations that let you specify resources. Prefect would then orchestrate the Dask code.
4. You used GitHub storage and ran into module issues previously. You would need to install the module on the Dask workers/scheduler: package your scripts as a Python module and install it like this, as discussed on the Coiled Slack. (Coiled can help you if you need more advice.)
5. The agent would also need the modules installed. This can be avoided by importing the modules inside your tasks so that the import is deferred (see the sketch below).
Let me know if you have any questions.
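A small sketch of point 5, deferring the import into the task body (the module and function names are illustrative):
```python
# Sketch of deferring an import into the task body (point 5 above);
# feature_generator / build_features_for_chunk are illustrative names.
from prefect import Flow, task

@task
def build_features(chunk):
    # Imported here rather than at module level, so the agent submitting the
    # flow run does not need feature_generator installed; only the Dask
    # workers (via the Coiled software environment) do.
    from feature_generator import build_features_for_chunk

    return build_features_for_chunk(chunk)

with Flow("feature-flow") as flow:
    build_features.map(list(range(10)))
```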

Michael Warnock

07/19/2021, 4:58 PM
Can you clarify? Given #1 and #3, am I to understand that my mapped task would run one at a time unless I rewrite it using Dask for parallelism instead of Prefect mapping?

Kevin Kho

07/19/2021, 5:05 PM
No, the expectation is that it will parallelize and use all the GPUs available 👍. A use case for #3 is if you have heterogeneous clusters, where some machines have only CPUs and some have GPUs. In that case, where you need to direct the job to a machine with a GPU, you would need to use Dask `annotations` and move away from the Prefect map.
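For context, a brief sketch of what directing work to GPU workers with Dask resources could look like; the scheduler address, worker flags, and function are assumptions, and `dask.annotate` is the equivalent mechanism for Dask collections:
```python
# Sketch of Dask worker resources (the "annotations" idea above): workers must
# be started advertising the resource, e.g. `dask-worker <scheduler> --resources "GPU=1"`.
from dask.distributed import Client

client = Client("<scheduler-address>")  # placeholder scheduler address

def train_on_gpu(x):
    return x * 2  # stand-in for the real GPU-bound work

# Only workers advertising a "GPU" resource will be scheduled these tasks.
futures = client.map(train_on_gpu, list(range(10)), resources={"GPU": 1})
results = client.gather(futures)
```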

Michael Warnock

07/19/2021, 5:06 PM
ok, sounds good; thanks