# prefect-community
a
Hi friends, I have flows using S3 storage with the local executor and local agent, all wrapped up in a Docker image that is deployed in an ECS service (as per my wider team requirements). It works, but I'm not satisfied with the fact that we're basically reserving a fair bit of CPU and memory on the EC2 instance(s) for a bunch of flows that collectively run across only two hours each day. One option might be to switch to Docker storage with the ECS agent to run the flows as Fargate tasks. The catch is that because of our ECS setup, this may mean some kind of Docker-in-Docker solution, which seems to be a mess. Another option might be to stick with S3 storage, but make use of the Dask executor (and `dask_cloudprovider.aws.FargateCluster` plus adaptive scaling) with the local agent. My thinking is that I could separately build a Docker image that has all the required flow dependencies (they're the same for all the flows) and upload that to ECR during CI/CD, and then the local agent can pull from ECR and initialise the Dask executor with this image, while also initialising it with the flows retrieved from S3 storage. Is this feasible?
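Roughly, I'm picturing something like this sketch (assuming Prefect 0.14; the bucket and image names are placeholders):

```python
from prefect import Flow
from prefect.executors import DaskExecutor
from prefect.storage import S3

# Placeholder ECR image built during CI/CD with all flow dependencies baked in
IMAGE = "123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/flow-base:latest"

with Flow("example-flow") as flow:
    ...  # tasks shared across our flows

flow.storage = S3(bucket="my-flow-bucket")  # placeholder bucket
flow.executor = DaskExecutor(
    cluster_class="dask_cloudprovider.aws.FargateCluster",
    cluster_kwargs={"image": IMAGE},
    adapt_kwargs={"minimum": 1, "maximum": 4},  # adaptive scaling
)
```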
j
Hi Amanda,
The catch is that because of our ECS setup, this may mean some kind of Docker-in-Docker solution, which seems to be a mess.
I don't follow this. Are your tasks using docker as a service (perhaps building images?). If your flow doesn't need docker to run (you only need to run an image), you shouldn't need a Docker-in-Docker solution afaict.
My thinking is that I could separately build a Docker image that has all the required flow dependencies (they're the same for all the flows) and upload that to ECR during CI/CD, and then the local agent can pull from ECR and initialise the Dask executor with this image, while also initialising it with the flows retrieved from S3 storage. Is this feasible?
Yep, that should be fine. When running with dask, the workers need all dependent libraries you use, but the source of the flow itself can live elsewhere. So provided your image has all the required dependencies, the workers themselves wouldn't need access to the flow source stored in S3.
Without knowing your particulars, I'd recommend avoiding the complexities of using a dask cluster for these flows unless they can benefit from the parallelization. Using S3 for flow storage, while sharing a common docker image for dependencies, is something I've found useful for other flows though, and seems independent of your resource usage issues. Not having to build and push a new image every time a flow source changes can be nice from a workflow standpoint, and lets you optimize how (and when) your images are built and distributed.
a
Thank you
I don't follow this. Are your tasks using docker as a service (perhaps building images?). If your flow doesn't need docker to run (you only need to run an image), you shouldn't need a Docker-in-Docker solution afaict.
When the ECS task starts, it re-runs flow registration with the up-to-date flows, and in this process I understand that the flow will be serialised to storage. So, if I use Docker storage, wouldn't the ECS task need its own docker setup so that the flow's image can be built (and pushed to ECR)? Or can I somehow continue to use S3 storage with ECS Agent?
j
Or can I somehow continue to use S3 storage with ECS Agent?
All storage types should work with all agents (with the exception of `Local` storage, which only works with the local agent). S3 storage with an ECS agent is a common pattern, given both are hosted on AWS.
When the ECS task starts, it re-runs flow registration with the up-to-date flows, and in this process I understand that the flow will be serialised to storage. So, if I use Docker storage, wouldn't the ECS task need its own docker setup so that the flow's image can be built (and pushed to ECR)?
Oh wait, is this an ECS task that you wrote, not an ECS task started by prefect for running a single flow run? During a flow run docker shouldn't be needed, it should only be needed during registration. So if you register your flows elsewhere (where docker is available), no docker setup should be needed during runtime.
a
Oh wait, is this an ECS task that you wrote, not an ECS task started by prefect for running a single flow run?
Yes. As I mentioned, my wider team's requirement is for all our services to be deployed as ECS services, hence I'm running Prefect Server, the web UI, and the Prefect agent/executor as three different ECS services. Thus, as the Docker entrypoint for the ECS service with the Prefect agent, I have a shell script that runs `prefect create project` for each project that needs to exist, runs Python scripts that declare the flows and call `register_flow()` for each of them, and finally runs `prefect agent local start`.
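The registration scripts boil down to something like this (a sketch; the module and project names are made up):

```python
# register_flows.py -- run from the container entrypoint at startup
from my_flows import etl_flow, report_flow  # hypothetical flow modules

for flow in (etl_flow, report_flow):
    # With S3 storage this serialises the flow to S3 and registers
    # its metadata with the Prefect API.
    flow.register(project_name="my-project")
```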
j
Hmmm, ok. There might be a way to do docker-in-docker on ECS (not familiar), but my first suggestion would be to build static docker images externally, push them to ECR, then use S3 only for flow storage. So no docker build required in ECS.
a
Ah, that sounds really cool. So it is basically a variation of my idea of building a Docker image with the Python dependencies during my CI/CD process and pushing to ECR, except that I can still use S3 storage for the flows themselves, and when the ECS agent finds that there is a flow to run, it will pull the image from ECR, use it to create the Fargate task for the `LocalExecutor`, and then within the ephemeral Fargate task the executor will download the flow from S3 to run it?
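In other words, each flow would be configured roughly like this (sketch; names are placeholders):

```python
from prefect.run_configs import ECSRun
from prefect.storage import S3

flow.storage = S3(bucket="my-flow-bucket")  # flow source pulled at run time
flow.run_config = ECSRun(
    # Pre-built dependencies image in ECR (placeholder name)
    image="123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/flow-base:latest",
)
```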
j
Yep!
a
Thanks!
Oh, one more thing. When I set up the ECS service for Prefect Server, I configured it as a multi-container task according to the diagram here: https://docs.prefect.io/orchestration/server/architecture.html However, later I noticed that the ECS agent docs state:
In order to use this agent with Prefect Server the server's GraphQL API endpoint must be accessible. This may require changes to your Prefect Server deployment and/or configuring the Prefect API address on the agent.
Is this true, i.e., does it mean that in fact the flows/agents must be able to directly access the graphql endpoint, contrary to the diagram?
j
Ah, I can see how that may be confusing if you're looking at the server docs. What we meant by that is that port 4200 (which goes to apollo) needs to be exposed to Fargate. Most people who deploy server are doing so locally, so `localhost:4200` on the user's computer won't be reachable from the ECS task (since it's a different machine), which is what that comment was warning about. Sounds like you've deployed Prefect Server in a different way, so you're probably fine.
a
Ahhh... yes. We are already exposing that port (or rather a different port on a load balancer that forwards to that port on apollo) for the prefect agent/executor ECS service, so there should be no issue then. Thanks again! Would you like me to try my hand at improving the ECS agent docs on this point?
👍 2
j
Please do!
👌 1
a
Apologies for the flurry of questions, but I realised that there's something else I'm missing: how would the Prefect agent authenticate with ECR so as to pull the image from my team's private repository? I'm guessing that I'll have to specify the image in the form `aws_account_id.dkr.ecr.region.amazonaws.com/flow-base:latest`, but would it, say, suffice to have `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables available, or must I do something more/else?
j
This should be doable via IAM roles. If it's for configuring an image in ECR to use as an ECS task, I believe this is done via an execution role: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html. In the latest Prefect release (0.14.5, literally just released), this should be configurable on an `ECSRun` run config (via `execution_role_arn`), or on the agent itself (via `--execution-role-arn`). Before this, it had to be done through a custom task definition template.
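For the run-config route, that looks something like this (a sketch; the image and role ARN are placeholders):

```python
from prefect.run_configs import ECSRun

flow.run_config = ECSRun(
    image="123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/flow-base:latest",
    # Execution role that grants ECS permission to pull from private ECR
    execution_role_arn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
)
```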
a
Ah, very timely. Thanks again!
All storage types should work with all agents (with the exception of `Local` storage, which only works with the local agent).
Another possible docs improvement, I think. This: https://docs.prefect.io/orchestration/flow_config/storage.html#docker makes the assertion that:
This method of Storage has deployment compatibility with the Docker Agent, Kubernetes Agent, and Fargate Agent.
which to me gives the impression that Docker storage is required for those three agent types, whereas it seems more the case that Docker storage is not required, but makes it easier to set up those three agent types because the Docker image to supply is already available (e.g., via `prefect.context.image`).
Also, we should replace the mention of the deprecated Fargate Agent with ECS Agent. Oh, and another question to confirm: when the flow has finished, does the ECS agent automatically stop the Fargate task?
s
@Amanda Wee Apologies for jumping in between: the ECS agent automatically stops the task after completing it. It appears when the flow starts and then vanishes once completed (that is what I noticed in my ECS Run flows 🙂). My only question here is how one can auto-start an ECS agent based on flows submitted and then, after a timeout offset, remove the ECS agent if no more flows are present... something more like a serverless setup, but for Fargate in ECS. Currently I have to run the ECS agent via my local machine using the CLI.
🙏 1
a
@Jimmy Le
j
Oh, and another question to confirm: when the flow has finished, does the ECS agent automatically stop the Fargate task?
When a flow finishes, the main process will complete and the container will be cleaned up by AWS. No actions required by Prefect.
My only question here is how one can auto-start an ECS agent based on flows submitted and then, after a timeout offset, remove the ECS agent if no more flows are present... something more like a serverless setup, but for Fargate in ECS. Currently I have to run the ECS agent via my local machine using the CLI.
A serverless model would require Prefect to push out requests to run flows; right now flows are run via agents polling Prefect Cloud, meaning that all traffic from inside your infrastructure is outbound and you don't need to expose any ports/auth (and all the security implications that would entail). This is part of our "hybrid model" (https://medium.com/the-prefect-blog/the-prefect-hybrid-model-1b70c7fd296) and core to how Prefect currently works. We've thought about ways to loosen this restriction, but for now an agent must be running somewhere to pick up your flow runs. Note that the agent process can be very small, since it's just IO-bound and low memory: an ECS service with 0.25 cores and 512 MiB may suffice.