    Michael Warnock
    1 year ago
    Dependency best practices question: I have an existing repo `feature-generator` which contains both worker/orchestration logic and the code for doing the work. I added task and flow definitions to it, but with GitHub storage the flow can't find the other modules in that repo (I've seen https://github.com/PrefectHQ/prefect/discussions/4776 and understand this is intentional). My question is how best to structure things so that my flow can use that repo's code, but also execute a parameterized run from `feature-generator`, on commit, through CI (because that's how we start the job right now). Obviously, I can make `feature-generator` a package and depend on it from a new `flows` repo, but to have `feature-generator` start the run would create a circular dependency. Would you split it into three repos, with one of them just being responsible for executing the flow? I don't love that idea, but maybe that's best practices?

    Kevin Kho
    1 year ago
    Hey @Michael Warnock, I think this setup can work, and in fact I've heard that one of the advantages of Prefect is the ability to keep compute logic and orchestration logic together. The first question is how to make `feature-generator` available to the flow. Because you already have a module, we normally recommend putting it in a Docker container (as a module), copying it over, and installing it with `pip install -e .`. I have a minimal example for this that might help if you haven't done it yet. After the image is hosted somewhere, the flow will grab that image and run on top of that container. The CI/CD process can then start the run on commit: you register the flow and then run it with the Prefect CLI (`prefect register …`, `prefect run ...`). That gives you the registration and the one-time run. A very common complaint is that Docker requires a re-build with this approach. It's not particularly bad with the Docker cache, but some users rely on the image hash being the same every time. If you ever want a setup that doesn't require building every time, you can put all of the dependencies in the Docker container and leave that static. When the flow runs, it will pull that image and run on top of it. In that setup, the dependencies are decoupled from the flow. But if you want them side by side, you need to do the re-builds.
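    A rough sketch of that Docker-storage route, assuming `feature-generator` is pip-installable (the registry URL, paths, module name, and parameter below are placeholders, not your actual setup):

```python
# flow.py -- a minimal sketch; names, paths, and the registry are placeholders
from prefect import Flow, Parameter, task
from prefect.storage import Docker


@task
def generate_features(dataset):
    # Hypothetical import from the feature-generator repo
    from feature_generator import core
    return core.run(dataset)


with Flow("feature-generation") as flow:
    dataset = Parameter("dataset", default="example")
    generate_features(dataset)

# Docker storage builds an image at registration time. Here it starts from a
# Dockerfile you control (not shown), whose job is the "copy the repo over and
# pip install -e ." step described above; Prefect then bakes the flow on top.
flow.storage = Docker(
    registry_url="registry.example.com/my-team",  # placeholder registry
    dockerfile="Dockerfile",                      # your own Dockerfile at the repo root
)
```

    In CI, the two steps would then be something like `prefect register --project my-project -p flow.py` followed by `prefect run --project my-project --name feature-generation` (Prefect 1.x flags; adjust to your CLI version).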

    Michael Warnock
    1 year ago
    so, the storage type would be Docker?

    Kevin Kho
    1 year ago
    Yes that's right. Storage would be Docker, and then you can use `DockerRun` as the Run Configuration seen here to specify that image.
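    For completeness, a tiny sketch of that pairing (the image tag is a placeholder; with Docker storage the agent can usually fall back to the built storage image, so pinning one here is optional):

```python
from prefect.run_configs import DockerRun

# Run the flow inside a specific image (placeholder tag); with Docker storage
# this can also be left unset and the built storage image is used instead.
flow.run_config = DockerRun(image="registry.example.com/my-team/feature-generation:latest")
```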

    Michael Warnock
    1 year ago
    let me make sure I understand... I keep the code the flow depends on, the flow and task definitions, and the code that executes the flow in one repo, but I avoid the circular dependency by having my CI run `prefect register & run` as CLI utils?
    I'm actually testing with `DockerRun` right now, but I chose GitHub storage, and it couldn't find the local modules. I'll try Docker storage, and Coiled as you suggest.
    Err, sorry; the circular dependency wouldn't be there because I'm not splitting the flow code out. Why use the CLI as opposed to Python?

    Kevin Kho
    1 year ago
    I'm not sure you will hit a circular reference, but I may be missing something. I have this example where I have a minimal package. As long as the files inside the package do the imports right, you don't get a circular import.
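    For reference, the minimal-package shape is basically a setup.py at the repo root so the code is importable the same way everywhere (names below are illustrative):

```python
# setup.py at the root of feature-generator (minimal, illustrative)
from setuptools import find_packages, setup

setup(
    name="feature_generator",
    version="0.1.0",
    packages=find_packages(),  # picks up feature_generator/ and its subpackages
    install_requires=[],       # runtime dependencies go here
)
```

    With that in place, `pip install -e .` inside the Docker build makes `import feature_generator` resolve identically in the flow, in tasks, and in CI scripts.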

    Michael Warnock
    1 year ago
    right- I was confused. Thanks 🙂

    Kevin Kho
    1 year ago
    You can totally have a Python script and use that instead of the CLI. We have a GraphQL API you can hit, and you can also use the `Client` to create a flow run with `client.create_flow_run(flow_id)`. Then run the Python script with `python ____.py` in your CI/CD.
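    A minimal sketch of such a script, assuming the flow ID and parameters are supplied by the CI pipeline (the environment variable name and parameter are made up):

```python
# run_flow.py -- invoked from CI with `python run_flow.py`
import os

from prefect import Client

client = Client()  # assumes the Prefect API key/token is already configured in CI

flow_run_id = client.create_flow_run(
    flow_id=os.environ["PREFECT_FLOW_ID"],  # hypothetical pipeline variable
    parameters={"dataset": "example"},      # illustrative parameters
)
print(f"Created flow run {flow_run_id}")
```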

    Michael Warnock
    1 year ago
    that's what I'm doing; just went down the wrong road with github storage and ECSRun

    Kevin Kho
    1 year ago
    Coiled has free credits too, so it's free to get started with. They have their own Docker container, so let me look a bit at how to reconcile your Docker storage with their Docker container for the Dask client and workers.

    Michael Warnock
    1 year ago
    ah- didn't foresee that problem; thanks! We'll have to go 'pro' as soon as I have it working because we need GPUs. Looks like the way to go though; I don't fancy maintaining a Dask cluster.

    Kevin Kho
    1 year ago
    Oh, I was just looking at the docs and saw the Coiled cluster has GPUs. Attaching GPUs to a container is actually not so simple in Prefect for `DockerRun` because `docker-py` doesn't expose the flag to add them. I feel like in the GPU case, maybe we should rely on Coiled to provide the Docker image and dependencies. You can then just continue to use GitHub storage, and that would run on top of the Coiled image for the cluster.
    The question then becomes how we get your custom module onto the Docker container behind the Dask cluster. They call this a `software_environment`. I am not seeing anything immediate in their docs. Want me to ask on their Slack for you?

    Michael Warnock
    1 year ago
    Sure- that would be extremely helpful!

    Kevin Kho
    1 year ago
    Ah, I found it: this is how you add a package on GitHub.
    So in summary: use the default `RunConfiguration` or `LocalRun`, and have all of the Docker stuff handled by Coiled's `software_environment`. Install your custom module with the link above; that makes your library available on all Dask workers along with the other dependencies. Then specify that software environment when you choose an executor. This all lets you stay with GitHub storage.
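    A sketch of how that could look in code, assuming Coiled's `create_software_environment` and a pip-installable GitHub repo (the environment name, repo URL, and worker count are placeholders, and Coiled's API may differ by version):

```python
import coiled
from prefect.executors import DaskExecutor
from prefect.run_configs import LocalRun

# One-time (or CI) step: build a Coiled software environment that
# pip-installs the repo straight from GitHub.
coiled.create_software_environment(
    name="feature-generator-env",                                   # placeholder name
    pip=["git+https://github.com/your-org/feature-generator.git"],  # placeholder URL
)

# The flow keeps GitHub storage and a plain LocalRun; the heavy lifting
# happens on a temporary Coiled cluster built from that environment.
flow.run_config = LocalRun()
flow.executor = DaskExecutor(
    cluster_class="coiled.Cluster",
    cluster_kwargs={"software": "feature-generator-env", "n_workers": 4},
)
```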

    Michael Warnock
    1 year ago
    does it, though? I still have a chicken-egg problem with the task depending on local modules, which you're suggesting be installed from private packages; to do that, I have to break it into two repos, and I'm left with my original question.

    Kevin Kho
    1 year ago
    Yeah I’m thinking about this more. Where would you run your agent in production?

    Michael Warnock
    1 year ago
    Doesn't the temporary dask cluster config you linked to take care of that?

    Kevin Kho
    1 year ago
    I chatted with the Prefect team and left some questions on the Coiled Slack. Here is a summary of everything:
    1. Using GPUs is difficult with `docker-py` because of the SDK. This means you need to use `LocalRun`.
    2. The Coiled team said that if you map a task that needs a GPU, it should be able to find the available GPU on the cluster.
    3. If the Prefect `map` ever becomes a bottleneck to efficient resource management, you would have to move to Dask code and some Dask mechanisms, such as annotations, that let you specify resources. Prefect would then orchestrate the Dask code.
    4. You used GitHub storage and ran into module issues previously. You would need to install the module on the Dask workers/scheduler: package your scripts as a Python module and install it like this on the Coiled Slack. (Coiled can help you with that if you need more advice.)
    5. The agent would also need the modules installed. This can be avoided by importing the modules inside your tasks so that the import is deferred (see the sketch below).
    Let me know if you have any questions.
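    A small sketch of point 5, the deferred import, so the agent process never has to import the package itself (the module path is hypothetical):

```python
from prefect import task


@task
def extract_features(batch):
    # Import inside the task: the agent only needs Prefect installed, while
    # the Coiled/Dask workers (which have the software environment) resolve
    # feature_generator at run time.
    from feature_generator import core  # hypothetical module path
    return core.run(batch)
```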

    Michael Warnock
    1 year ago
    Can you clarify? Given #1 and #3, am I to understand that my mapped task would run one at a time unless I rewrite it using Dask for parallelism instead of Prefect mapping?

    Kevin Kho
    1 year ago
    No, the expectation is that it will parallelize and use all the GPUs available 👍. A use case for 3 is if you have heterogeneous clusters, where some machines have only CPUs and some have GPUs. In the case where you need to direct the job to a machine with a GPU, you would need to use Dask `annotations` and move away from the Prefect map.
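    If it ever comes to that, a rough sketch of the Dask-level routing, assuming workers are started advertising a GPU resource (e.g. `--resources "GPU=1"`); the scheduler address and resource name are placeholders:

```python
import dask
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address


def needs_gpu(x):
    ...  # GPU-bound work goes here
    return x


# dask.annotate tags these tasks so the distributed scheduler only places
# them on workers that advertise a GPU resource.
with dask.annotate(resources={"GPU": 1}):
    delayed_tasks = [dask.delayed(needs_gpu)(i) for i in range(10)]
    results = dask.compute(*delayed_tasks, optimize_graph=False)  # keep annotations intact
```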

    Michael Warnock
    1 year ago
    ok, sounds good; thanks