
Daniel Ross

03/05/2022, 8:50 PM
Hello Prefect community, I upgraded from 0.14.22 to 0.15.13 and containers are no longer launching. I'm deployed on ECS and I can see a ConnectTimeOutError that is preventing the tasks from coming up. This is the error in question:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1911c36d90>: Failed to establish a new connection: [Errno 111] Connection refused'))
If I look at the task itself, I can see that the environment variable for PREFECT__CLOUD__API is set to http://127.0.0.1:4200. So this seems like the problem. The host it's trying to connect to is clearly wrong (since the server itself is running on an EC2 instance). So I've adjusted my ~/.prefect/config.toml to look like this:
[server]
host_ip = "my.ip.goes.here"
host_port = "4200"
host = "http://${server.host_ip}"
port = "4200"
endpoint = "${server.host}:${server.port}"

  [server.ui]
  apollo_url = "http://my.ip.goes.here:4200/graphql"

[cloud]
api = "${${backend}.endpoint}"
endpoint = "https://api.prefect.io"
graphql = "${cloud.api}/graphql"
No luck. So I added the PREFECT__CLOUD__API definition to my environment variables in the container definition. Still no luck. However, when I look at the task definition, I can see the correct (or at least intended) PREFECT__CLOUD__API environment variable there. But the variable in the task is still set to http://127.0.0.1:4200, and the problem persists! I am pretty stuck on this, and hoping that someone here has a line of sight to the solution. (This all worked without much configuration previously ... which now seems weird.) All help appreciated!

Kevin Kho

03/05/2022, 8:59 PM
How did you start the server on 0.15.13? prefect server start? I think you might just need the --expose flag. Check the note here on the change in Prefect 0.15.5.

Daniel Ross

03/05/2022, 9:02 PM
Yes. External db is specified and the --expose flag has been added as well.
The start command looks more or less like this:
prefect server start -d --postgres-url "postgresql:goes/here" --expose

Kevin Kho

03/05/2022, 9:09 PM
Ah ok, there is a lot going on in the config.toml. You are saying the Flow can't communicate with the right API, right? You should just need:
[server]
endpoint = "YOUR_MACHINES_PUBLIC_IP:4200/graphql"
Also, just making sure: can you see the UI at http://YOUR_MACHINES_PUBLIC_IP:8080?

Daniel Ross

03/05/2022, 9:12 PM
Yeah, no issue accessing the UI

Kevin Kho

03/05/2022, 9:12 PM
Ok yeah you are right I think it’s just the endpoint. Pretty weird cuz nothing around this changed.

Daniel Ross

03/05/2022, 9:14 PM
Ok. I will adjust the endpoint and simplify the config a bit. Quick question: do I need to re-register the flow once this is done, or will this be picked up after a server restart?
New config looks like this:
backend = "server"
[server]
endpoint = "my.ip.goes.here:4200/graphql"
  [server.ui]
  apollo_url = "http://my.ip.goes.here:4200/graphql"
[cloud]
api = "${${backend}.endpoint}"
endpoint = "https://api.prefect.io"
Still no luck. Restarted the server, but did not re-register the flow.

Kevin Kho

03/05/2022, 9:34 PM
If your flow is already in the database, I think no, but registering the flow is a good test to see if your endpoint is set correctly. The above was not meant for the Server config; it was the local machine config that connects to the server.
Server config can just be
[server]
  [server.ui]
    apollo_url = "http://YOUR_MACHINES_PUBLIC_IP:4200/graphql"
This blog has the info. I think you don’t need the cloud config since you are on Prefect Server anyway?
I think your server config was already fine if you could access the UI remotely, though. I think it's just your agent/flows that can't connect, right? You just need to change the endpoint on the local machine, not the server.

Daniel Ross

03/05/2022, 9:43 PM
I am using ECSRun, so in this case, is the local machine the container/task? That does seem to be a problem, but I am not sure why the endpoint won't propagate from the ECS task definition to the ECS task itself.

Kevin Kho

03/05/2022, 9:47 PM
Well, I guess the first question is whether a local run works; if it does, then you need the env variables in the ECSRun container. How is the endpoint set on the ECS task definition?
A LocalRun test would just confirm for us everything is working well and the server is healthy

Daniel Ross

03/05/2022, 9:57 PM
The task definition that gets submitted to ECSRun looks more or less like this:
{
    "networkMode": "awsvpc",
    "cpu": "1024",
    "memory": "2048",
    "containerDefinitions": [
        {
            "name": "flow",
            "environment": [
                {
                    "name": "PREFECT__CLOUD__API",
                    "value": "http://my.ip.goes.here:4200"
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "my-log-group",
                    "awslogs-region": "my-region",
                    "awslogs-create-group": "true",
                    "awslogs-stream-prefix": "flow_name"
                }
            }
        }
    ]
}
run_task_kwargs = {
    "networkConfiguration": {
        "awsvpcConfiguration": {
            "subnets": [
                subnet_a
            ],
            "securityGroups": [
                sg_a
            ],
            "assignPublicIp": "ENABLED"
        }
    }
}
The environment variable is a new addition introduced to try to resolve the problem. I'll have to cobble together a quick test run and add a local agent to test the local run.

Kevin Kho

03/05/2022, 10:03 PM
Ah, I see what you mean. Maybe you also need PREFECT__BACKEND: "server", though? But yeah, it looks like that is not being set. I think you need to define it in the Flow container. You can try setting:
PREFECT__BACKEND: "server"
PREFECT__SERVER__ENDPOINT: <YOUR-IP>:4200

Daniel Ross

03/05/2022, 10:12 PM
Sure. Just made the changes to the registration script. It will take a couple of minutes to run through.
The PREFECT__SERVER__ENDPOINT environment variable made it in with the correct value, and the PREFECT__BACKEND variable remained set to "server", but I'm still getting the same error.

Kevin Kho

03/05/2022, 10:31 PM
Just making sure: you still see 127.0.0.1 as the host, right? If so, at that point I would just set the env variables in the container to be sure.
But if the server endpoint makes it in, that should be all you need, because it overrides this. I am pretty confused why it's still 127.0.0.1. Just checking: you don't have anything configured on the run config side?

Daniel Ross

03/05/2022, 10:35 PM
Yeah, 127.0.0.1 is still showing. I just added the environment variable to the part of the registration script that handles the flow storage. Re-registering now.
The run config is just a basic call to ECSRun() with the task definition passed in.
Does the config file take precedence over environment variables? I am wondering if there is a config file that is causing the issue here. During the registration process I can see that the following variable is set while building the flow storage image:
PREFECT__USER_CONFIG_PATH='/opt/prefect/config.toml'
This isn't something I set.

Anna Geller

03/05/2022, 11:59 PM
Does the config file take precedence over environment variables?
No, env variables take precedence over config.toml. I think your env variables on the ECS task definition should be as follows (the /graphql suffix was missing, plus the server backend env variable):
"containerDefinitions": [
    {
        "name": "flow",
        "environment": [
            {
                "name": "PREFECT__CLOUD__API",
                "value": "http://some_ip:4200/graphql"
            },
            {
                "name": "PREFECT__BACKEND",
                "value": "server"
            }
        ],
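Anna's point about precedence follows from how Prefect 1.x maps env variable names onto config keys: double underscores mark nesting, so PREFECT__CLOUD__API overrides the cloud.api value from config.toml. The helper below is a stdlib-only sketch of that precedence rule (a simplification for illustration, not Prefect's actual loader):

```python
def apply_env_overrides(config, environ, prefix="PREFECT__"):
    """Overlay PREFECT__-style env vars onto a nested config dict.

    Double underscores separate nesting levels, so PREFECT__CLOUD__API
    maps to config["cloud"]["api"]. Env values overwrite toml values.
    """
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        path = key[len(prefix):].lower().split("__")
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config

# Value loaded from config.toml ...
toml_config = {"cloud": {"api": "http://127.0.0.1:4200/graphql"}}
# ... is overridden by the container's environment:
env = {"PREFECT__CLOUD__API": "http://some_ip:4200/graphql"}
print(apply_env_overrides(toml_config, env)["cloud"]["api"])
# prints http://some_ip:4200/graphql
```

This is why a correctly injected container env variable should win even when a baked-in config.toml says 127.0.0.1.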
I can also see that there is no "image" in your containerDefinitions - if you don't set it explicitly, it will by default take the latest version, which currently is 1.0.0. The problem with this is that your flow runs should use a Prefect version <= the Prefect version of your Server. Otherwise, you may hit API endpoints which don't exist in your server or got changed. Can you try something like this (full task definition example):
{
  "family": "prefectFlow",
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "taskRoleArn": "arn:aws:iam::123456789:role/prefectTaskRole",
  "executionRoleArn": "arn:aws:iam::123456789:role/prefectECSAgentTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "flow",
      "image": "prefecthq/prefect:0.15.3-python3.8",
      "essential": true,
      "environment": [
        {
          "name": "AWS_RETRY_MODE",
          "value": "adaptive"
        },
        {
          "name": "AWS_MAX_ATTEMPTS",
          "value": "10"
        },
        {
          "name": "PREFECT__CLOUD__AGENT__AUTH_TOKEN",
          "value": ""
        },
        {
          "name": "PREFECT__CLOUD__API",
          "value": "http://some_ip:4200/graphql"
        },
        {
          "name": "PREFECT__BACKEND",
          "value": "server"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/prefectFlow",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}
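Since the symptom in this thread was a required env variable silently not reaching the task, a quick sanity check of the task definition before registration can catch it early. This helper is a hypothetical addition (the function name and the required-variable set are illustrative, not part of Prefect):

```python
# Env vars a Server-backed flow container needs, per the discussion above.
REQUIRED_ENV = {"PREFECT__BACKEND", "PREFECT__CLOUD__API"}

def missing_env_vars(task_definition, container_name="flow"):
    """Return required Prefect env vars absent from the named container."""
    for container in task_definition.get("containerDefinitions", []):
        if container.get("name") == container_name:
            present = {e["name"] for e in container.get("environment", [])}
            return sorted(REQUIRED_ENV - present)
    return sorted(REQUIRED_ENV)  # container not found: everything is missing

task_def = {
    "containerDefinitions": [
        {"name": "flow",
         "environment": [{"name": "PREFECT__BACKEND", "value": "server"}]}
    ]
}
print(missing_env_vars(task_def))  # ['PREFECT__CLOUD__API']
```

Running a check like this in the registration script would have flagged the missing PREFECT__CLOUD__API before the ECS task ever launched.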
If the above doesn't work, can you provide a bit more background on your setup? So far I understood:
1. You run your Server instance on some VM, probably on an EC2 instance - correct?
2. You have an ECS Agent - does it run as an ECS service or just as a local process on that same EC2 instance that you use to run your Server?
3. How did you register the flow that failed?
4. Can you share your storage and run_config definition?

Daniel Ross

03/06/2022, 12:18 AM
Thanks Anna. I'll try your suggestion now. In the meantime, some answers to your questions:
1. Yes. The Prefect server is installed on an EC2 instance.
2. The ECS Agent runs as a local process on the same EC2 instance that runs the server.
3. The flow is registered through a Python script that composes the flow. It reads in the script, specifies a run config and storage, and then registers it.
4. This is what the storage looks like:
Docker(registry_url="awsacctid.dkr.ecr.region.amazonaws.com",
       base_image=base_image,
       files=image_files,  # the files that should be copied to the image
       image_name="prefect-flow-storage",  # using this as image name to allow only one ECR repo to be made
       image_tag=f"{slugify(flow_name)}-{idempotency_key}",  # one stored image per flow version
       env_vars={
           # append top level directory to PYTHONPATH
           "PYTHONPATH": "$PYTHONPATH:/",
           "PREFECT__CLOUD__API": "http://172.31.73.239:4200",
           "PREFECT__SERVER__ENDPOINT": "http://172.31.73.239:4200"
       })
The CLOUD__API and SERVER__ENDPOINT variables are new from this troubleshooting. As for the run config, this is what it looks like:
ECSRun(
        task_definition = {
            "networkMode": "awsvpc", 
            "cpu": "1024",
            "memory": "2048",
            "containerDefinitions": [
                {
                    "name": "flow",
                    "environment": [
                        {
                            "name": "PREFECT__BACKEND",
                            "value": "server"
                        },
                        {
                            "name": "PREFECT__CLOUD__API",
                            "value": "http://my-ip-here:4200/graphql"
                        },
                     ],
                    "logConfiguration": {
                        "logDriver": "awslogs",
                        "options": {
                            "awslogs-group": "my log group",
                            "awslogs-region": "my region",
                            "awslogs-create-group": "true",
                            "awslogs-stream-prefix": flow_name
                        }
                    }
                }
            ]
        },
        run_task_kwargs = {
            "networkConfiguration": {
                "awsvpcConfiguration": {
                    "subnets": [
                        subnet_a
                    ],
                    "securityGroups": [
                        security_group_b
                    ],
                    "assignPublicIp": "ENABLED"
                }
            }
        }
    )
Again, the environment variables have been added throughout this troubleshooting.
In the script I am looking at, the task and execution role ARNs are defined on a FargateCluster, which is provided to a DaskExecutor, which in turn is assigned as the executor for the flow, e.g.:
flow.executor = DaskExecutor(
        cluster_class=fargate_cluster,
        adapt_kwargs={"minimum": 1, "maximum": 10})

Anna Geller

03/06/2022, 1:10 AM
The configuration for your FargateCluster is something completely different from what you need to configure for your flow run, because Dask spins up a completely new cluster for your task run execution; for the flow run execution you still need to set the roles either on the agent or on the flow's run config. Some issues are still unaddressed in the code you shared - here is how I would "correct" it:
1. Set the base image explicitly to ensure you use a version compatible with your server (perhaps you set it right but you didn't share what version you set).
2. PREFECT__CLOUD__API needs /graphql at the end.
3. I believe you need to set your image on ECSRun as well.
Docker(registry_url="awsacctid.dkr.ecr.region.amazonaws.com",
       base_image="prefecthq/prefect:0.15.3-python3.8",
       files=image_files,  # the files that should be copied to the image
       image_name="prefect-flow-storage",  # using this as image name to allow only one ECR repo to be made
       image_tag=f"{slugify(flow_name)}-{idempotency_key}",  # one stored image per flow version
       env_vars={
           # append top level directory to PYTHONPATH
           "PYTHONPATH": "$PYTHONPATH:/",
           "PREFECT__CLOUD__API": "http://172.31.73.239:4200/graphql",
           "PREFECT__SERVER__ENDPOINT": "http://172.31.73.239:4200/graphql"
       })
ECSRun:
ECSRun(image=f"{AWS_ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/{your_image_name}:{your_image_tag}",  # from your Docker storage definition
        task_definition = {
            "networkMode": "awsvpc", 
            "cpu": "1024",
            "memory": "2048",
            "containerDefinitions": [
                {
                    "name": "flow",
                    "environment": [
                        {
                            "name": "PREFECT__BACKEND",
                            "value": "server"
                        },
                        {
                            "name": "PREFECT__CLOUD__API",
                            "value": "http://my-ip-here:4200/graphql"
                        },
                     ],
                    "logConfiguration": {
                        "logDriver": "awslogs",
                        "options": {
                            "awslogs-group": "my log group",
                            "awslogs-region": "my region",
                            "awslogs-create-group": "true",
                            "awslogs-stream-prefix": flow_name
                        }
                    }
                }
            ]
        },
        run_task_kwargs = {
            "networkConfiguration": {
                "awsvpcConfiguration": {
                    "subnets": [
                        subnet_a
                    ],
                    "securityGroups": [
                        security_group_b
                    ],
                    "assignPublicIp": "ENABLED"
                }
            }
        }
    )
If you need some examples to help debug it, check out some flows with the name "ecs" in them here. I also had a blog post showing a Docker agent setup as an ECS service - if nothing else works, perhaps you can try deploying a new agent to the same subnet as your Server instance and running it as an ECS Service.

Daniel Ross

03/06/2022, 1:19 AM
Thanks a ton Anna! I'll work through this material.
I just wanted to post a quick follow up in case anyone reading this is interested in the outcome. The problem was fixed. I reverted the config and registration processes back to their original states and, for troubleshooting purposes, restarted Prefect with a temporary Postgres container for the DB. That worked. If I reattached my external Postgres, it failed. When the process was working (with the temporary Postgres container), the PREFECT__CLOUD__API environment variable was properly set on the ECS task. When the process was broken (using the original external Postgres), the PREFECT__CLOUD__API environment variable was not properly set (it assumed a default value, http://127.0.0.1:4200/graphql). In the end the problem was resolved by creating a new Postgres database for the backend. It came at the cost of losing my stored history, schedules, etc., and was far from elegant. But it was effective.
Final note: I am not entirely sure why this worked. I assumed that it was a Prefect version conflict caused by something stuck in the database. Out of curiosity I intentionally built my flow storage image on 0.14.19 and tried running the flow. It worked, and the logs included a warning about the version mismatch. I upgraded the flow storage back to 0.15.13 to match the server, and it continued to work, but without the warning. So it seems a little more complicated than a straightforward version mismatch.

Kevin Kho

03/07/2022, 2:11 PM
Thanks for the detailed write-up. I can't comprehend why this would work, but glad you have a working setup now!

Anna Geller

03/07/2022, 4:08 PM
Thanks for sharing 🙏 - what was missing was probably a database migration! I should have thought about that, sorry. Next time you should be able to run the migration using the prefect server CLI - check the CLI docs or --help for details.

Daniel Ross

03/07/2022, 4:18 PM
Thanks 🙏 to both of you for your assistance on this issue!