# prefect-community
d
Hi everyone, I’m having some trouble getting an ECS run to work with a custom docker image. Whenever I run the flow it gets stuck in the “submitted for execution” state. I’ve checked the CloudWatch logs but they only show that the flow deployment was completed. Does anybody have any suggestions for how to resolve this problem? My suspicion is that the custom image can’t be pulled from the private ECR repo, so does anybody know what permissions I need to access ECR? Below is my RUN_CONFIG for ECSRun:
from prefect.run_configs import ECSRun

RUN_CONFIG = ECSRun(
    labels=["dev"],
    task_role_arn="arn:aws:iam::xxx:role/prefectTaskRole",  # a role with S3 permissions
    execution_role_arn="arn:aws:iam::xxx:role/prefectECSAgentTaskExecutionRole",
    run_task_kwargs=dict(cluster="prefectEcsClusterDev"),
    image="xxx.dkr.ecr.us-east-1.amazonaws.com/prefect-orchestration:docker-test-v1",
)
k
ECS is always a pain to debug. The most common reasons are: 1. the image is not architecture compatible, 2. the image can’t be pulled, 3. the log group doesn’t exist if you configure one. I don’t know the exact permissions, but do check that you can pull from ECR. An easy way to test, though, is to see if you can pull a public image like the default
prefecthq/prefect
one
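On the permissions question, here is a minimal sketch of an IAM policy that would let an ECS task execution role pull from a private ECR repository and write to CloudWatch Logs. The action names are the ones in AWS's managed `AmazonECSTaskExecutionRolePolicy`; the variable name and the broad `"Resource": "*"` scope are assumptions for illustration, not something confirmed in this thread.

```python
import json

# Hypothetical policy document for the execution role
# (e.g. prefectECSAgentTaskExecutionRole). Names are illustrative.
ECR_PULL_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                # Needed to authenticate against ECR and pull image layers
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                # Needed to write container logs to CloudWatch
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "*",  # assumption: narrow this for real use
        }
    ],
}

print(json.dumps(ECR_PULL_POLICY, indent=2))
```

The printed JSON could be attached to the role as an inline policy; attaching the managed policy instead should have the same effect.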
d
When checking if I am able to pull a public image, I changed the image to the default
prefecthq/prefect
and the flow was able to succeed. However, if I tried a different tagged image such as
prefecthq/prefect:latest-python3.7
it would give a
prefect.exceptions.ClientError: [{'path': ['get_or_create_task_run_info'], 'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'Expected type UUID!, found ""; Could not parse UUID: ', 'locations': [{'line': 2, 'column': 101}], 'path': None}}}]
error. Is there a reason why I couldn’t pull the 3.7 version?
k
This looks like a registration issue. What storage are you using?
That’s weird it succeeded though on 3.8. What was your registration version?
d
Currently using s3 to store the flows. What do you mean by registration version?
k
Python version when you registered
d
I’m pretty sure it was python version 3.7. Where can I check it?
k
I don’t know. You can type
python
on the same box and open the interactive interpreter and it displays the version
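The same check can be scripted. This is a small sketch that prints the interpreter version used for registration and builds the image tag that would match it; `matching_prefect_image` is a hypothetical helper, and it assumes the `latest-python3.7` tag pattern mentioned above also holds for other Python versions.

```python
import sys


def matching_prefect_image(version_info=sys.version_info):
    """Build a prefecthq/prefect tag for the given Python major.minor."""
    major, minor = version_info[0], version_info[1]
    return f"prefecthq/prefect:latest-python{major}.{minor}"


# Version of the interpreter running this script (the registration box)
print(sys.version.split()[0])
# Image tag whose Python version would match it (assumed tag pattern)
print(matching_prefect_image())
```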
d
oh yes I am registering with python 3.7
k
This seems like it should work to me. Can you try running again to see if the error is deterministic?
d
ah you’re right it was able to run with
prefecthq/prefect:latest-python3.7
this time
so looks like I can pull images from ECR
I do have a log group configured and working in CloudWatch, so I guess the problem has to do with my image.
Question about custom images: if I build
FROM prefecthq/prefect:latest
and then clone my own private package from git and install from its requirements.txt file, the requirements file specifies a
prefect~=0.15.6
to be installed too, would that affect the docker image in any way?
k
I think it will just overwrite the prefect version with a pip uninstall/install
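One way to confirm which version actually ends up in the image is to ask the installed package metadata. A minimal sketch using only the standard library (Python 3.8+); `installed_version` is a hypothetical helper name, and the idea is to run it inside the built container.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Inside the custom image this shows the Prefect version left after the
# requirements.txt install; it prints None where Prefect is not installed.
print("prefect:", installed_version("prefect"))
```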
v
@Kevin Kho so the error we are seeing is Rescheduled by a Lazarus process. This is attempt 1…2…3.
when we try to submit the flow
we are trying to submit flows to an ecs agent
k
Yeah if you have the log group configured, I think you can get more info in CloudWatch
Lazarus is Prefect just seeing that the Flow didn’t start and trying to resubmit it, but there looks to be a problem with kicking off the ECS Task job. Are you guys on the same team?
v
yup @Kevin Kho 🙂 we are building some cool automation stuff for our product using P
d
CloudWatch doesn’t show much unfortunately
k
I believe this is the agent side log, you can add the log group to the task-definition of ECS and then you will get logs for the flow container as well
You add it in the container definition as well like this
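As a rough illustration of a log configuration inside a container definition, here is a sketch built as a Python dict. The `awslogs` driver and its three option keys are standard ECS settings; the log group name, the stream prefix, and the container name `flow` are assumptions and should be checked against your own setup.

```python
import json

# Placeholder values, not taken from this thread.
LOG_GROUP = "/ecs/prefect-flows"  # hypothetical; the group must already exist
REGION = "us-east-1"

task_definition = {
    "containerDefinitions": [
        {
            # Assumed name of the container that runs the flow
            "name": "flow",
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": LOG_GROUP,
                    "awslogs-region": REGION,
                    "awslogs-stream-prefix": "flow",
                },
            },
        }
    ],
}

print(json.dumps(task_definition, indent=2))
```

A dict like this could be passed to `ECSRun(task_definition=...)`, or the same structure could be written as YAML for the agent to load.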
v
@Kevin Kho where do we put this yaml file ? or do we add it to the run config ?
k
something like this where the agent can access it.
v
@Kevin Kho we tried using this yaml approach but we still cannot get this to run. also no new logs are generated
k
Can the agent pull the YAML?
Does a new task get registered and are the logs configured?
v
yes the agent can indeed pull the YAML. we noticed there was an error in the YAML format so we corrected it as well and agent was able to pull it.
we can see the agent logs and its stuck at submitted for execution
k
The log group doesn’t reveal anything new? Check that the log group exists and matches the task definition of the flow. You can also do two things on the agent: 1. --show-flow-logs 2. --log-level=DEBUG
And because your agent log group is working, maybe we just use that to figure out what is going wrong
d
so I checked if a new task was run inside the ECS and turns out the task errored out with this message:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post https://api.ecr....
Looks like it just can’t pull the private image?
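A "send request failed" while retrieving ECR auth often means the task has no network route to ECR (no public IP, NAT gateway, or VPC endpoint), so it is worth ruling that out alongside permissions. The sketch below assumes a Fargate task in `awsvpc` mode running in a public subnet; the subnet and security group IDs are placeholders, not values from this thread.

```python
# Placeholder IDs; replace with subnets and security groups from your own VPC.
run_task_kwargs = {
    "cluster": "prefectEcsClusterDev",
    "networkConfiguration": {
        "awsvpcConfiguration": {
            "subnets": ["subnet-xxx"],
            "securityGroups": ["sg-xxx"],
            # Gives the task a public IP so it can reach ECR from a
            # public subnet without a NAT gateway (assumed setup)
            "assignPublicIp": "ENABLED",
        }
    },
}

print(run_task_kwargs)
```

This dict would take the place of `dict(cluster="prefectEcsClusterDev")` in `ECSRun(run_task_kwargs=...)`.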
k
Looks like it to me. Did you log in to ECR?
You can see this
v
@Kevin Kho Thank you very much….we will try these and see how it goes.🤞
I think you want the credential manager for production but not 100% sure