David Wang

02/24/2022, 9:28 PM
Hi everyone, I’m having some trouble getting an ECS run to work with a custom Docker image. Whenever I run the flow, it gets stuck in the “Submitted for execution” state. I’ve checked the CloudWatch logs, but they only show that the flow deployment was completed. Does anybody have any suggestions for how to resolve this problem? My suspicion is that the custom image can’t be pulled from the private ECR repo, so does anybody know which permissions I need to access ECR? Below is my RUN_CONFIG for ECSRun:
from prefect.run_configs import ECSRun

RUN_CONFIG = ECSRun(
    labels=["dev"],
    task_role_arn="arn:aws:iam::xxx:role/prefectTaskRole",  # a role with S3 permissions
    # role ECS itself assumes to pull the image and write logs
    execution_role_arn="arn:aws:iam::xxx:role/prefectECSAgentTaskExecutionRole",
    run_task_kwargs=dict(cluster="prefectEcsClusterDev"),
    image="xxx.dkr.ecr.us-east-1.amazonaws.com/prefect-orchestration:docker-test-v1",
)

Kevin Kho

02/24/2022, 9:37 PM
ECS is always a pain to debug. The most common reasons are:
1. image is not architecture compatible
2. not being able to pull the image
3. not having a log group if you configure one
I don’t know exact permissions but of course do check you can pull from ECR. An easy way to test though is if you can pull a public image like the default prefecthq/prefect one
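On the permissions question, a minimal sketch of the IAM policy document an ECS task execution role typically needs to pull a private ECR image. The four action names are the standard ECR pull actions; the function name `ecr_pull_policy` is hypothetical, and whether the `prefectECSAgentTaskExecutionRole` in this thread already has these actions is not known from the conversation.

```python
import json

# The ECR actions involved in authenticating to a registry and pulling an image.
ECR_PULL_ACTIONS = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
]


def ecr_pull_policy() -> dict:
    """Build an IAM policy document that allows pulling images from ECR."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ECR_PULL_ACTIONS,
                # ecr:GetAuthorizationToken does not support resource-level
                # scoping, so "*" is used here to keep the sketch simple.
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON so it can be compared with the role's policies.
    print(json.dumps(ecr_pull_policy(), indent=2))
```

The AWS managed policy AmazonECSTaskExecutionRolePolicy grants these same ECR actions plus the CloudWatch Logs write actions, so attaching it to the execution role is a common alternative to a hand-written policy.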

David Wang

02/25/2022, 4:25 PM
When checking if I am able to pull a public image, I changed the image to the default prefecthq/prefect and the flow was able to succeed. However, if I tried a different tagged image such as prefecthq/prefect:latest-python3.7 it would give this error:
prefect.exceptions.ClientError: [{'path': ['get_or_create_task_run_info'], 'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'locations': [{'line': 2, 'column': 101}], 'path': None}}}]
Is there a reason why I couldn’t pull the 3.7 version?

Kevin Kho

02/25/2022, 4:32 PM
This looks like a registration issue. What storage are you using?
That’s weird it succeeded though on 3.8. What was your registration version?

David Wang

02/25/2022, 4:48 PM
Currently using s3 to store the flows. What do you mean by registration version?

Kevin Kho

02/25/2022, 5:38 PM
Python version when you registered

David Wang

02/25/2022, 6:30 PM
I’m pretty sure it was python version 3.7. Where can I check it?

Kevin Kho

02/25/2022, 6:32 PM
I don’t know. You can type python on the same box and open the interactive interpreter and it displays the version
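A small sketch of the same check done from a script instead of the interactive interpreter. It only uses the standard library; the helper name `python_minor_version` is hypothetical. Running it both on the machine that registers the flow and inside the flow's image shows whether the two Python versions match.

```python
import platform
import sys


def python_minor_version() -> str:
    """Return the running interpreter's version as 'major.minor'."""
    return f"{sys.version_info.major}.{sys.version_info.minor}"


if __name__ == "__main__":
    # Full version string of the interpreter that runs this script.
    print("full version:", platform.python_version())
    # Only major.minor, which is the part to compare between environments.
    print("major.minor:", python_minor_version())
```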

David Wang

02/25/2022, 7:27 PM
oh yes I am registering with python 3.7

Kevin Kho

02/25/2022, 7:30 PM
This seems like it should work to me. Can you try running again to see if the error is deterministic?

David Wang

02/25/2022, 8:01 PM
ah you’re right, it was able to run with prefecthq/prefect:latest-python3.7 this time
so looks like I can pull images from ECR
I do have a log group configured and working in CloudWatch, so I guess the problem has to do with my image.
Question about custom images: if I build FROM prefecthq/prefect:latest and then clone my own private package from git and install from its requirements.txt file, and the requirements file specifies prefect~=0.15.6 to be installed too, would that affect the Docker image in any way?

Kevin Kho

02/25/2022, 8:36 PM
I think it will just overwrite the prefect version with a pip uninstall/install
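To make the effect of the `prefect~=0.15.6` pin concrete, here is a minimal sketch of the PEP 440 "compatible release" rule, restricted to plain numeric versions. The function names are hypothetical and this is not how pip implements the check. Under that rule, `~=0.15.6` means `>= 0.15.6` and `== 0.15.*`, so if the base image ships a Prefect version outside that range, pip would replace it during the install.

```python
def parse_version(version: str) -> tuple:
    """Split a plain numeric version such as '0.15.6' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate: str, spec_version: str) -> bool:
    """Check candidate against '~=spec_version' for plain numeric versions.

    '~=X.Y.Z' is equivalent to '>= X.Y.Z, == X.Y.*'.
    """
    base = parse_version(spec_version)
    if len(base) < 2:
        raise ValueError("~= needs at least two version components")
    cand = parse_version(candidate)
    prefix = base[:-1]  # everything except the last component must match
    return cand >= base and cand[: len(prefix)] == prefix


if __name__ == "__main__":
    # Version strings below are illustrative inputs, not a claim about
    # which Prefect version any particular image contains.
    for installed in ["0.15.5", "0.15.6", "0.15.13", "1.0.0"]:
        ok = satisfies_compatible_release(installed, "0.15.6")
        print(installed, "satisfies ~=0.15.6:", ok)
```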

Vamsi Reddy

02/25/2022, 8:55 PM
@Kevin Kho so the error we are seeing is Rescheduled by a Lazarus process. This is attempt 1…2…3.
when we try to submit the flow
we are trying to submit flows to an ecs agent

Kevin Kho

02/25/2022, 9:04 PM
Yeah if you have the log group configured, I think you can get more info in CloudWatch
Lazarus is Prefect just seeing that the Flow didn’t start and trying to resubmit it, but there looks to be a problem with kicking off the ECS Task job. Are you guys on the same team?

Vamsi Reddy

02/25/2022, 9:10 PM
yup @Kevin Kho 🙂 we are building some cool automation stuff for our product using P

David Wang

02/25/2022, 9:16 PM
CloudWatch doesn’t show much unfortunately

Kevin Kho

02/25/2022, 9:18 PM
I believe this is the agent side log, you can add the log group to the task-definition of ECS and then you will get logs for the flow container as well
You add it in the container definition as well like this
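The snippet referred to here did not survive in the log. As a stand-in, a minimal sketch of a task definition fragment that sends the flow container's output to CloudWatch through the `awslogs` driver, written as a Python dict. The helper name, the log group name, and the stream prefix are hypothetical, and the container name `flow` is an assumption based on the Prefect 1.x ECS agent convention.

```python
def flow_task_definition(log_group: str, region: str,
                         stream_prefix: str = "prefect-flow") -> dict:
    """Build a partial ECS task definition with CloudWatch logging enabled."""
    return {
        "containerDefinitions": [
            {
                # Assumed name of the container that runs the flow.
                "name": "flow",
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        # The log group is expected to exist already.
                        "awslogs-group": log_group,
                        "awslogs-region": region,
                        "awslogs-stream-prefix": stream_prefix,
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    # "/ecs/prefect-flows" is a placeholder log group name.
    definition = flow_task_definition("/ecs/prefect-flows", "us-east-1")
    print(definition)
    # With Prefect 1.x installed, a dict like this could be passed as
    # ECSRun(task_definition=definition, ...) on the flow's run config.
```

The same structure can be written as YAML and handed to the agent as a task definition file. The execution role also needs permission to write to the log group.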

Vamsi Reddy

02/25/2022, 9:25 PM
@Kevin Kho where do we put this yaml file ? or do we add it to the run config ?

Kevin Kho

02/25/2022, 9:27 PM
something like this where the agent can access it.

Vamsi Reddy

02/25/2022, 9:59 PM
@Kevin Kho we tried using this yaml approach but we still cannot get this to run. also no new logs are generated

Kevin Kho

02/25/2022, 10:00 PM
Can the agent pull the YAML?
Does a new task get registered and are the logs configured?

Vamsi Reddy

02/25/2022, 10:03 PM
yes the agent can indeed pull the YAML. we noticed there was an error in the YAML format so we corrected it as well and agent was able to pull it.
we can see the agent logs and it’s stuck at submitted for execution

Kevin Kho

02/25/2022, 10:05 PM
The log group doesn’t reveal anything new? Check that the log group exists and matches the task definition of the flow. You can also do two things on the agent:
1. --show-flow-logs
2. --log-level=DEBUG
And because your agent log group is working, maybe we just use that to figure out what is going wrong

David Wang

02/25/2022, 10:08 PM
so I checked if a new task was run inside the ECS and turns out the task errored out with this message:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post <https://api.ecr>....
Looks like it just can’t pull the private image?
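The stopped reason found in the ECS console can also be read programmatically. Below is a minimal sketch that extracts it from a response shaped like the one returned by boto3's `ecs.describe_tasks`. The helper name `stopped_reasons` and the stub data are hypothetical, and the stub only reuses the opening words of the error quoted above.

```python
def stopped_reasons(describe_tasks_response: dict) -> list:
    """Return (taskArn, stoppedReason) pairs from a describe_tasks response."""
    pairs = []
    for task in describe_tasks_response.get("tasks", []):
        arn = task.get("taskArn", "<unknown task>")
        reason = task.get("stoppedReason", "<no stoppedReason reported>")
        pairs.append((arn, reason))
    return pairs


if __name__ == "__main__":
    # Stub response for illustration only. With boto3 installed, a real one
    # would come from something like:
    #   boto3.client("ecs").describe_tasks(cluster="prefectEcsClusterDev",
    #                                      tasks=[task_arn])
    stub = {
        "tasks": [
            {
                "taskArn": "arn:aws:ecs:us-east-1:xxx:task/example",
                "stoppedReason": "ResourceInitializationError: unable to "
                                 "pull secrets or registry auth",
            }
        ]
    }
    for arn, reason in stopped_reasons(stub):
        print(arn, "->", reason)
```

Because the quoted message ends in a truncated `RequestError: send request failed`, it may point to the task having no network route to the ECR endpoint rather than to missing IAM permissions, so both are worth checking.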

Kevin Kho

02/25/2022, 10:10 PM
Looks like it to me. Did you log in to ECR?
You can see this

Vamsi Reddy

02/25/2022, 10:12 PM
@Kevin Kho Thank you very much….we will try these and see how it goes.🤞
I think you want the credential manager for production but not 100% sure