# ask-community
m
Hi! Trying to restrict the memory usage of each flow by passing `host_config` to our `DockerRun` config:
Copy code
flow.storage = Docker(
        registry_url=...,
        image_name=...,
        files={...},
        env_vars={...},
        python_dependencies=[...]
        )

client = docker.APIClient()
host_config = client.create_host_config(mem_limit=12345,
                                        mem_reservation=1234
                                        )
flow.run_config = DockerRun(host_config=host_config)

flow.executor = LocalDaskExecutor()
When registered to Cloud, this seems to be ok, since starting a Run shows the following default in Host Config:
Copy code
{
  "Memory": 12345,
  "MemoryReservation": 1234
}
However, this seems to have no effect on the newly created flow containers. `docker stats` shows that MEM USAGE keeps growing and that LIMIT equals the total server memory.
docker inspect <CONTAINER> | grep \"Memory[\"R]
gives
Copy code
"Memory": 0,
            "MemoryReservation": 0,
What are we missing here?
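For reference, here is a quick way to pull the effective limits out of the `docker inspect` output programmatically instead of grepping. This is our own minimal sketch over the JSON that `docker inspect <CONTAINER>` prints (the `effective_limits` helper is not part of any SDK):

```python
import json

def effective_limits(inspect_json: str) -> dict:
    """Extract memory settings from `docker inspect <CONTAINER>` output.

    `docker inspect` prints a JSON array with one object per container;
    the limits live under HostConfig. A value of 0 means "no limit set".
    """
    container = json.loads(inspect_json)[0]
    host_config = container["HostConfig"]
    return {
        "Memory": host_config.get("Memory", 0),
        "MemoryReservation": host_config.get("MemoryReservation", 0),
    }

# Stubbed inspect payload showing unset limits (the symptom above):
sample = '[{"HostConfig": {"Memory": 0, "MemoryReservation": 0}}]'
print(effective_limits(sample))  # {'Memory': 0, 'MemoryReservation': 0}
```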
a
you can leverage the host_config on the run configuration to specify that on a per flow basis, but it’s supposed to be a dictionary:
Copy code
from prefect.run_configs import DockerRun

run_config = DockerRun(host_config=dict(mem_limit=12345))
more info:
• https://docs.prefect.io/api/latest/run_configs.html#dockerrun
• https://docker-py.readthedocs.io/en/stable/api.html#docker.api.container.ContainerApiMixin.create_host_config
m
Yes, I tried exactly that first, but it also doesn't work.
It is unclear from the docs whether `mem_limit` or `MemoryLimit` is the correct key in the host_config.
a
afaik, `mem_limit` should be used
can you share your DockerRun with host_config as dict? which executor do you use - do you use Dask by any chance?
m
Yes
flow.executor = LocalDaskExecutor()
as listed above
a
sorry, my bad, I omitted that. Is this your run_config?
Copy code
run_config = DockerRun(host_config=dict(mem_limit=12345))
I was asking about executor, because you can limit the resources by specifying fewer workers e.g.
Copy code
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=2)
Does it make sense to set limits on the executor rather than flow? Are you setting this because your flow is running out of memory?
m
Yes, I've just tested that run_config again, still no luck. I've also tried `LocalExecutor()` to ensure this is not a Dask issue.
We are starting many different flows (some are created dynamically) and want to avoid OOM on the server while still using as much capacity as possible. Therefore `mem_limit` and `mem_reservation` are suited to our needs. (`num_workers` (for Dask) and/or Cloud concurrency limits for Flows/Tasks won't help us because the number of tasks is not proportional to the workloads of individual runs.)
a
I see, thx for sharing the background on this. Did you try setting `mem_limit` as a string, e.g. `host_config=dict(mem_limit="128m")`?
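(For context: docker-py accepts memory values either as an int of bytes or as a string with a unit suffix. Below is a rough, self-contained sketch of that unit conversion, assuming the usual b/k/m/g binary suffixes — this is our own illustration, not docker-py's actual code:)

```python
def parse_mem(value):
    """Convert a docker-style memory value ("128m", "2g", 12345) to bytes.

    Rough illustration of the conversion docker-py applies to mem_limit;
    suffixes are interpreted as binary units (1k = 1024 bytes).
    """
    if isinstance(value, int):
        return value
    units = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = value[-1].lower()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)

print(parse_mem("128m"))  # 134217728
```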
m
Yes 🙂
Also, I can experiment with the values directly from the UI when submitting runs:
a
thx, I will look into that and try to reproduce
m
Great, appreciate it.
k
I’m working on reproducing. What is your output of `pip show docker`? I just noticed `mem_reservation` was added in 4.3.1. `mem_limit` should work though. But the version will help me be sure.
m
`pip show docker` gives 5.0.3 locally, where I build and deploy... (?)
k
Ok I took a stab at this and I can’t replicate. I have a Flow here and registered and ran it with the Docker Agent. I then did inspect on the running container.
And saw that it was set:
Copy code
"Memory": 1073741824,
            "MemoryReservation": 12345678,
Could you try registering my Flow and running it and running the inspect?
m
After registering the flow and starting it from the Cloud UI, I log in to the server where the agent is running. I use `docker stats` to find the new autogenerated container name (all the other containers have specific names). Then
docker inspect <CONTAINER> | grep
gives:
Copy code
"Memory": 0,
            "MemoryReservation": 0,
Here is the server docker version you mentioned (if relevant):
Copy code
[server]$ docker exec -it prefect-docker-agent-xyz /bin/bash
root@970eb27929c5:/# pip show docker
Name: docker
Version: 4.4.4
and the flow container:
Copy code
[server]$ docker exec -it brave_bardeen /bin/bash
root@5f8a37ed4886:/# pip show docker
Name: docker
Version: 5.0.2
k
What is the OS of your server? And what is the output of
docker version
?
Are you on Cloud or Server? and are you running the Docker agent in a Docker container?
m
We use Prefect Cloud with on-prem CentOS servers running the agent. The agent is running in a docker container (managed by docker swarm).
Copy code
[server]$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

[server]$ docker version
Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun  2 11:58:10 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:56:35 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Copy code
$ docker inspect prefect-docker-agent-for-cloud_prefect-agent.xxx.yyy
...
        "Args": [
            "-g",
            "--",
            "entrypoint.sh",
            "sh",
            "-c",
            "prefect agent docker start --no-docker-interface --log-level DEBUG --token xxx -n ${PREFECT_NAME} --network prefect-docker-agent-for-cloud-network --volume /var/run/docker.sock:/var/run/docker.sock:ro --volume /root/.docker/:/root/.docker/:ro"
        ],
...
a
@Martin T thanks for sharing. I see what is happening.

In general, the Docker agent is supposed to run in a local process (not in a Docker container); this local process is a layer between the Prefect backend and the Docker daemon. The agent polls the API for new flow runs, and if there are any scheduled to run, it deploys them as Docker containers on the same machine as the agent.

What is happening in your environment is that, since the Docker agent is running within a container itself (rather than a local process), your flow runs end up deployed as containers - but not as individual containers whose memory limits you could control; instead they run within the agent container. You effectively have a single agent container spinning up new containers within itself (Docker-in-Docker), which causes the issues with setting memory limits that you encountered.

When it comes to Docker swarm, Prefect currently doesn’t have any agent that would manage containers in a Docker swarm. So what you could try instead is starting your Docker agent in a local process:
Copy code
prefect agent docker start --label YOUR_LABEL --log-level DEBUG --key YOUR_API_KEY -n ${PREFECT_NAME} --network prefect-docker-agent-for-cloud-network --volume /var/run/docker.sock:/var/run/docker.sock:ro --volume /root/.docker/:/root/.docker/:ro
If you want more environment isolation for this process, you can run it within a virtual environment. I also noticed that:
1. You use an API token rather than an API key - API tokens are deprecated, so an API key would be preferable
2. The --no-docker-interface flag is also deprecated (I believe it’s likely because of exactly this problem that you’re facing with Docker-in-Docker)
3. You have no labels assigned to the agent. Usually, you add a label to the agent, and then add the same label to your flow so that you can match the flow with the agent for deployment - I included a label in the above command that you can start in a local process.
The CLI docs provide more information on all of that: https://docs.prefect.io/api/latest/cli/agent.html#docker-start
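As a side note, a quick way to verify where the agent process itself is running is to check for Docker's /.dockerenv marker file. This is a common heuristic, not a guarantee (other container runtimes behave differently), and the helper below is our own sketch:

```python
import os

def likely_in_container() -> bool:
    """Heuristic check: Docker creates /.dockerenv inside its containers.

    Useful for confirming whether the agent is a local process or is
    itself running inside a container (the Docker-in-Docker situation).
    """
    if os.path.exists("/.dockerenv"):
        return True
    # Fallback: cgroup paths often mention docker/containerd in containers
    try:
        with open("/proc/1/cgroup") as f:
            return any("docker" in line or "containerd" in line for line in f)
    except OSError:
        return False

print(likely_in_container())
```

Running this from inside the agent's environment (e.g. via `docker exec`) should print True; from a plain local process on the host it should print False.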
m
@Anna Geller @Kevin Kho thanks for your swift support. Will talk to infra on how we host the agents. Closing for now.
Out of curiosity, does the Kubernetes agent also have to run as a local process, or is it allowed to run containerized? (swarm -> k8s is on our roadmap)
a
Great question! Since the Kubernetes API is different from the Docker API, the Kubernetes agent works a bit differently. It typically runs within an individual pod; from there it polls the Prefect API for new flow runs, and if there are any scheduled, it deploys them as Kubernetes jobs. Usually each Kubernetes job (corresponding to a specific flow run) runs within an individual pod.

In general, since everything in Kubernetes is an object from the API perspective, there is usually no such Docker-in-Docker risk as with the Docker agent. And resource allocation is a bit easier too, e.g. on KubernetesRun, you can directly set your resource limits for a flow:
• cpu_limit, cpu_request
• memory_limit, memory_request
The above applies only if you deploy the Kubernetes agent in-cluster. If you want, you could run the Kubernetes agent as a local process, similarly to the Docker agent: https://docs.prefect.io/orchestration/agents/kubernetes.html#agent-configuration
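Illustratively, those KubernetesRun parameters end up as a standard requests/limits `resources` block on the job's container spec. A minimal sketch of that mapping, assuming standard Kubernetes quantity strings like "512Mi" (the `k8s_resources` helper name is ours):

```python
def k8s_resources(memory_request=None, memory_limit=None,
                  cpu_request=None, cpu_limit=None):
    """Build the `resources` section a Kubernetes job container would get.

    Mirrors how KubernetesRun's cpu/memory parameters map onto the
    standard requests/limits spec; None values are simply omitted.
    """
    requests = {"memory": memory_request, "cpu": cpu_request}
    limits = {"memory": memory_limit, "cpu": cpu_limit}
    return {
        "requests": {k: v for k, v in requests.items() if v is not None},
        "limits": {k: v for k, v in limits.items() if v is not None},
    }

print(k8s_resources(memory_request="512Mi", memory_limit="1Gi"))
# {'requests': {'memory': '512Mi'}, 'limits': {'memory': '1Gi'}}
```

Unlike the Docker host_config case above, Kubernetes enforces these per pod, so each flow run gets its own limit regardless of where the agent itself runs.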