# prefect-server
David Charles:
Hi - we have a “mono-repo” layout - we intend to keep all our flows in a package and leverage shared stuff in a common package. We are able to register flows with our AWS-hosted instance, but only a very simple “stand-alone” flow executes; we cannot get ones that reference common code to run. Details follow in the thread… Any help much appreciated!
We have the “mono-repo” layout below; we intend to keep all our flows in the `flows` package and leverage shared stuff in `common`.
```
. # <-- repo root
├── common
│   ├── __init__.py
│   ├── config.py              # Reusable config
│   ├── schedules              # Reusable schedules
│   │   ├── __init__.py
│   │   └── simple_schedule.py
│   └── tasks                  # Reusable tasks
│       └── __init__.py
└── flows
    ├── __init__.py
    └── simple_flow
        ├── __init__.py
        └── src
            ├── __init__.py
            ├── __main__.py
            ├── config.py
            ├── core.py
            ├── flow.py
            └── tasks.py
```
`flow.py` looks like this:
```python
from prefect import Flow, Parameter
from prefect.storage import GitLab

from flows.simple_flow.src import config
from flows.simple_flow.src.tasks import (
    get_source,
    decode_source,
    store_decoded_data
)

# just to prove import works
from common.schedules.simple_schedule import schedule_daily
data_source_url = Parameter("data_source_url", default=config.data_source_url)


storage = GitLab(
    host="<https://private-gitlab-host.com>",
    repo="the-repo",
    path="flows/simple_flow/src/flow.py",
    ref="main",
    access_token_secret="TOKEN_SECRET",
)


with Flow(name="simple_flow", storage=storage) as flow:
    data = get_source(data_source_url=data_source_url)
    decoded_data = decode_source(data=data)
    store_decoded_data(decoded_data=decoded_data)

if __name__ == "__main__":
    flow.register(project_name="default", add_default_labels=False)
```
This all works fine running a Prefect Core server locally with a local agent. We’ve deployed into AWS and have an ECS agent. I have updated my local `~/.prefect/config.toml` as follows:
```toml
backend = "server"
[server]
host = "<https://our-aws-prefect-apollo.domain.com>"
port = "443"
endpoint = "${server.host}:${server.port}"
  [server.ui]
  endpoint = "<https://our-aws-prefect-ui.domain.com>"
I register the flow from a local machine using the Python interpreter:
```python
>>> from flows.simple_flow.src.flow import flow
>>> from prefect.run_configs import UniversalRun
>>> flow.name
'simple_flow'

>>> flow.storage
<Storage: GitLab>

>>> flow.run_config = UniversalRun(labels=["dev"])
>>> flow.register(project_name="default", labels=["dev"])
Flow URL: https://our-aws-prefect-ui.domain.com/main/flow/53ae2776-36d1-4bed-8f9a-87ce95fad866
 └── ID: 182b65b6-5ad1-42c4-98ea-eac767b3b867
 └── Project: default
 └── Labels: ['dev']
'182b65b6-5ad1-42c4-98ea-eac767b3b867'
```
This registers without error, but when I try to execute the flow I see this in the flow's logs:
```
Failed to load and execute Flow's environment: KeyError("'__name__' not in globals")
```
I suppose the main question I have is: when using GitLab storage and we register the flow, does the runner pull in all the modules in the repo that `flow.py` references? (In my example, the `config.py`, `core.py` and `tasks.py` modules in the same package as `flow.py`, as well as stuff in the top-level `common` package.)
Kevin Kho:
Hi @David Charles, only Docker storage keeps the dependencies; GitLab and other script-based storage only keep the flow file. Actually, for Prefect 2.0 it is on the immediate roadmap to make packaging a lot easier. Not saying this will be the exact mechanism, but cloudpickle 2.0 recently added a way to serialize custom modules. For now though, you either need to package the dependencies in a Docker container or have them in the execution environment.
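(For illustration only - a minimal sketch of the Docker-storage approach described above, assuming Prefect 1.x. The registry URL and local paths are placeholders; `files` copies the mono-repo packages into the image and `PYTHONPATH` makes them importable when the flow runs.)
```python
from prefect.storage import Docker

# Sketch only (Prefect 1.x): bake the mono-repo packages into the image so
# `import common` and `import flows` resolve inside the container.
# Registry URL and local paths are placeholders.
storage = Docker(
    registry_url="123456789.dkr.ecr.eu-west-1.amazonaws.com",  # hypothetical ECR registry
    image_name="simple-flow",
    files={
        # absolute local path -> path inside the image
        "/repo/common": "/opt/prefect/common",
        "/repo/flows": "/opt/prefect/flows",
    },
    env_vars={"PYTHONPATH": "/opt/prefect"},
    python_dependencies=["requests"],  # any extra pip deps the tasks need
)
```
This `storage` would take the place of the `GitLab` storage in `flow.py` above.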
David Charles:
Thanks @Kevin Kho for the clarification, that's good to know. Appreciate you getting back to me 😀
Kyle McChesney:
We use S3 storage, but then do an ECSRun, and the container that's used has the module pip-installed on it (technically it also has a duplicate copy of the flow code, but that's not what's used when the flow actually runs). Overall, it works quite well for us. For local development/running, it all behaves like a normal Python package, similar to what you’ve laid out above.
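(A hedged sketch of the S3 + ECSRun pattern described above - the bucket name and image URI are placeholders, and the image is assumed to already have the repo `pip install`ed so `common` and `flows` import cleanly at runtime.)
```python
from prefect.run_configs import ECSRun
from prefect.storage import S3

from flows.simple_flow.src.flow import flow

# Placeholder bucket/image: the flow itself is stored in S3, while the ECS
# task runs a custom image that already has the mono-repo installed.
flow.storage = S3(bucket="my-prefect-flows")
flow.run_config = ECSRun(
    image="123456789.dkr.ecr.eu-west-1.amazonaws.com/monorepo-flows:latest",
    labels=["dev"],
)
flow.register(project_name="default", add_default_labels=False)
```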
David Charles:
Interesting, thanks @Kyle McChesney - might get back to you for deets 🙂
t
@Kyle McChesney - how/when are you getting the module installed on the container - is it part of an external process (e.g. CI pipeline) or something else?
Kevin Kho:
Until Kyle replies: I think he made his image ahead of time, but what I have seen other people do is use the ENTRYPOINT of the container to `git clone` and `pip install -e .`
Kyle McChesney:
We have a CI process that pushes a built container image to a container repo in AWS and then registers the new flow version, so latest flow <> latest image.
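(Roughly what that CI registration step could look like - a sketch under assumed names; the env var, bucket and image URI are made up for illustration.)
```python
import os

from prefect.run_configs import ECSRun
from prefect.storage import S3

from flows.simple_flow.src.flow import flow

# Hypothetical CI step: the pipeline has already built and pushed an image
# tagged with the current commit, so each registered flow version points at
# a matching image.
image_tag = os.environ["CI_COMMIT_SHORT_SHA"]  # assumed CI-provided variable

flow.storage = S3(bucket="my-prefect-flows")
flow.run_config = ECSRun(
    image=f"123456789.dkr.ecr.eu-west-1.amazonaws.com/monorepo-flows:{image_tag}",
    labels=["dev"],
)
flow.register(project_name="default", add_default_labels=False)
```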
d
We used Docker storage in the end. We built a
storages
module that has a
get_storage
callable. It returns an appropriate storage based on environment, so for local flow execution it defaults to
Local()
but we have options to add other storages down the line. When we register a flow (e.g. in a CI pipeline) it will get a
Docker
storage that’s been instanced with
dockerfile
and
registry_url
etc for the docker build and push that ensues. Main issue encountered doing this from a mono-repo was we wanted to have a single “parameterised” Dockerfile (i.e. using
ARG flow_name
) so we can build per-flow deps. However seems there’s a bug in
prefect/storage/docker.py
where
build_kwargs
is incorrectly passed into the Python Docker client. I’ve raised this issue: https://github.com/PrefectHQ/prefect/issues/5630
OK, not a bug - just not obvious (well, to me 😂). I've posted what I did wrong on the issue.
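(A minimal sketch of what a `get_storage` helper like the one described above could look like - assuming Prefect 1.x, with made-up env var names, registry URL and Dockerfile path; the `buildargs` nesting assumes `build_kwargs` is forwarded to docker-py's `build()` call.)
```python
import os

from prefect.storage import Docker, Local, Storage


def get_storage(flow_name: str) -> Storage:
    """Return a Storage appropriate for the current environment (sketch only)."""
    if os.environ.get("CI") != "true":
        # Local development and local flow runs
        return Local()

    # CI registration: build/push a per-flow image from a single
    # parameterised Dockerfile (`ARG flow_name`).
    return Docker(
        registry_url="123456789.dkr.ecr.eu-west-1.amazonaws.com",  # hypothetical registry
        image_name=flow_name,
        dockerfile="Dockerfile",
        build_kwargs={"buildargs": {"flow_name": flow_name}},  # feeds `ARG flow_name`
    )
```
In `flow.py` this would then be used as `flow.storage = get_storage(flow.name)` before registering.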
Kevin Kho:
Oh I see, that's good to know.