Mark McDonald

    1 year ago
    Hi - I'm currently upgrading to version 0.14, using ECSRun config and S3 storage. I used to use Docker storage. With Docker storage, I was able to define where my flows were located and executed from within my image. It seems like with S3 storage, I lose control over where the flow files are executed from, because you all take care of downloading them into the image. From what I can tell, with S3 storage, the flows are being executed from inside of "/tmp" (example: /tmp/prefect-b0r890j3). Is my understanding of "/tmp" correct? Is there a way to override this location and have you download the flow files elsewhere?
    When I develop locally, at the root of my project I have a directory called "src", where I store my flow files. Within "src" I also have a sub-directory called "helpers". Inside of "helpers" I store non-flow-definition supporting code. If all my code were located in a single flow file, I wouldn't be concerned with where the flow is being executed from. However, because I'm working with helper code/files (like the example below), it's a challenge to not be able to control where the flow is executed from. Any advice on this?
    path = os.path.join(os.getcwd(), "helpers/query_info.yaml")
    with open(path, "r") as stream:
        data_loaded = yaml.safe_load(stream)
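One workaround for the snippet above is to resolve the helpers directory from an explicit environment variable instead of os.getcwd(), so the lookup no longer depends on where Prefect happens to execute the flow. HELPERS_DIR is an assumed variable name you would set in your Docker image, not anything Prefect provides; this is only a sketch of the idea:

```python
import os

def resolve_helper_path(filename):
    """Resolve a helper file relative to an explicit base directory.

    HELPERS_DIR is a hypothetical environment variable baked into the
    Docker image; it falls back to the current working directory so
    local development keeps working unchanged.
    """
    base = os.environ.get("HELPERS_DIR", os.getcwd())
    return os.path.join(base, "helpers", filename)
```

With this, the flow script can be downloaded anywhere (including /tmp/prefect-xxxxxxxx) and still find helpers installed at a fixed location in the image.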
    Michael Adkins

    1 year ago
    Hi @Mark McDonald — just wondering, why’d you switch to using S3 storage? You should be able to set your ECSRun image to be an image you’ve set up to contain the additional code you require.
    Mark McDonald

    1 year ago
    yea - we may have to revert back to that, if this is a limitation of s3 storage. I don't recall what we thought the advantage of s3 storage was
    Michael Adkins

    1 year ago
    If you version your flows and helpers separately, then you can have your ECSRun image contain the helpers; the flow can still be stored in S3 and executed in the helper image.
    There are a lot of patterns here and no clear winner yet — we’re continuously trying to assess what the best way to package flows/helpers together like this is.
    Mark McDonald

    1 year ago
    is it correct that with each flow run, you download the flow script from s3 to a new location in tmp? In which case it's not like I can copy my helpers into this tmp location
    I guess my main concern is that I want to have control over the flow execution location because I want my local development experience to mimic how it will be executed in prefect cloud. Otherwise, it's going to be confusing for my company's Prefect users
    Michael Adkins

    1 year ago
    I’m not sure off the top of my head what you can override, but you may be able to write a class that inherits from S3Storage and add an implementation of get_flow that pulls down the requirements -- this is a bit of a hack though.
    Ah, it also looks like you could provide a custom task_definition that runs arbitrary commands before executing the flow run.
    My understanding of the process:
    1. Flow registered with ECSRun type
    2. User runs flow in UI
    3. ECSAgent notices a flow is ready and pulls configuration values from the flow run config to determine what the ECS task should look like
    4. An ECS task is created which runs on a docker image and has an entry point of “prefect execute flow-run” and the flow run id to execute in the context
    5. The flow storage metadata is looked up in the ECS task
    6. The flow is downloaded from storage into the docker image
    7. The flow is executed
    It seems like you could:
    • Store your flow in docker storage, which the ECS task will use as its image instead, and install your helpers there
    • Create a base image for all your flows to run on, install your helpers there, and use any other file-based storage for your flows
    • Customize one of the steps to download requirements into the ECS task before the flow run is executed
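The custom task_definition route could look roughly like this: a hand-written ECS task definition whose container command runs a setup step before handing off to Prefect's "prefect execute flow-run" entrypoint. The image name, bucket, and destination path below are made-up placeholders, not real values:

```python
# Sketch of a custom ECS task definition (as a Python dict) that copies
# helper files into place before the normal Prefect entrypoint runs.
# The S3 bucket, paths, and image name are illustrative assumptions.
task_definition = {
    "family": "prefect-flow-run",
    "containerDefinitions": [
        {
            "name": "flow",
            "image": "my-account.dkr.ecr.us-east-1.amazonaws.com/my-flow-image:latest",
            # Run an arbitrary setup command, then delegate to Prefect.
            "command": [
                "/bin/sh",
                "-c",
                "aws s3 sync s3://my-helpers-bucket/helpers /opt/app/helpers "
                "&& prefect execute flow-run",
            ],
        }
    ],
}
```

The trade-off is that the setup logic now lives in infrastructure config rather than in the image build, so it runs on every flow run.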
    Mark McDonald

    1 year ago
    Thanks for the feedback. So yea, basically I do build a custom docker image that contains my helper code along with all my dependencies. I then create an ECS task definition which contains the image's repository location. I then store the flow scripts in s3 using s3 storage. The CI/CD flow kind of looks like this:
    # step 1: docker build/push/tag image
    # step 2: create ecs task definition 
    
    # step 3:
    flow.storage = S3(
            bucket=S3_BUCKET,
            key=s3_key,
            stored_as_script=True,
            local_script_path="/path/to/flow.py",
        )
    # step 4:
    flow.run_config = ECSRun(
            task_definition_arn=task_definition_arn, run_task_kwargs=run_task_kwargs
        )
    # step 5:
    flow_id = flow.register(
            labels=['dev'],
            project_name=PROJECT_NAME,
        )
    Basically, I think the idea of S3 storage is appealing because if only flow code changes (not dependencies or helpers), then I can skip steps 1 and 2 during my CI/CD. I just have to call S3 storage on the single flow script that's changed, register the flow, and you all take care of it from there.
    Subclassing S3Storage seems like it might work, but I agree that it doesn't feel right. I would imagine that other Prefect S3 storage users would want the ability to define the flow's location within their image as well. I think this should be configurable. Docker storage offers this configuration through the prefect_directory argument. Can I propose that this arg be added to S3 storage? https://github.com/PrefectHQ/prefect/blob/c8d9b9b7a6d11b9487901cd795b8f1509f355845/src/prefect/storage/docker.py#L108-L109
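To make the proposal concrete, the behavior being asked for is roughly this. The prefect_directory name mirrors Docker storage's argument; S3 storage has no such option in 0.14, so this is only the shape of the idea, not real Prefect code:

```python
import os
import tempfile

def choose_download_dir(prefect_directory=None):
    """Pick where a flow script gets downloaded to.

    If the (hypothetical) prefect_directory argument is given, use it;
    otherwise fall back to an auto-generated directory like the
    /tmp/prefect-xxxxxxxx paths observed today.
    """
    if prefect_directory:
        os.makedirs(prefect_directory, exist_ok=True)
        return prefect_directory
    return tempfile.mkdtemp(prefix="prefect-")
```

With a configurable download directory, the flow script would land next to the helpers baked into the image, and relative paths would behave the same locally and on ECS.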