# prefect-community

Salim Doost

04/14/2022, 2:53 AM
We’re having issues with deployments on one of our Prefect workspaces all of a sudden (without changing anything on our setup to the best of our knowledge). Our flows are stored as Docker images in AWS ECR. Running a newly created flow leads to the following error:
404 Client Error for http+docker://localhost/v1.41/containers/create?name=quantum-squid: Not Found ("No such image: <account-id>.dkr.ecr.ap-northeast-1.amazonaws.com/datascience-prefect:<image-tag-name>")
However, we’re able to confirm that the image with this tag exists in ECR. Updating an existing flow by overwriting an existing image tag leads to the following error:
KeyError: 'Task slug <task-name> is not found in the current Flow. This is usually caused by a mismatch between the flow version stored in the Prefect backend and the flow that was loaded from storage.
- Did you change the flow without re-registering it?
- Did you register the flow without updating it in your storage location (if applicable)?'
Again, we’re able to confirm in AWS ECR that the image got pushed and updated successfully. Our deployment job didn’t throw any error messages either. Any idea what we can do to resolve this issue?

Kevin Kho

04/14/2022, 3:01 AM
I don’t know immediately, but a couple of questions: 1. Does docker pull work for the image? 2. How is the agent authenticated to pull it? 3. Does the KeyError occur during flow execution, and does it end the flow?

Salim Doost

04/14/2022, 3:05 AM
1. I can see in our logs that it was able to pull the image just before the error occurred:
Pulling image <account-id>.dkr.ecr.ap-northeast-1.amazonaws.com/datascience-prefect:<image-tag-name>...

Successfully pulled image <account-id>.dkr.ecr.ap-northeast-1.amazonaws.com/datascience-prefect:<image-tag-name>

docker.errors.ImageNotFound: 404 Client Error for http+docker://localhost/v1.41/containers/create?name=quantum-squid: Not Found ("No such image: <account-id>.dkr.ecr.ap-northeast-1.amazonaws.com/datascience-prefect:<image-tag-name>")
3. Yes, that’s the case.

Kevin Kho

04/14/2022, 3:14 AM
Is the flow simple enough to share through DM?

Salim Doost

04/14/2022, 3:47 AM
The issue is not the flow code itself. The flow could be anything.

Kevin Kho

04/14/2022, 3:50 AM
That was more about the KeyError, and I wanted to see the RunConfig.

Salim Doost

04/14/2022, 4:04 AM
The Run Config is:
{
  "env": null,
  "type": "DockerRun",
  "image": null,
  "labels": [
    "ec2-dockeragent"
  ],
  "__version__": "0.15.4",
  "host_config": null
}
(Same for all of our flows, even those that are still running and haven’t been updated since this error started occurring.) AFAIK the KeyError occurs because it tries to access a task by name that doesn’t exist in the image. Is Prefect maybe doing any caching here?

Kevin Kho

04/14/2022, 4:09 AM
There is no caching exactly. It’s more that the serialized flow, including the task names, is stored in Prefect Cloud/Server, and the error is raised if the flow loaded from storage deviates from that stored serialized version. This can happen if you add tasks dynamically, or if you store the flow as a script and then change the script (more common with GitHub storage), so that it no longer matches the version stored in the backend.
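The mismatch described above can be pictured with a minimal sketch. This is not Prefect's actual implementation; the helper names (`missing_slugs`, `check_flow_matches`) and the example slugs are hypothetical, and the only assumption is that the backend holds one set of task slugs while the flow loaded from storage exposes another.

```python
def missing_slugs(registered_slugs, loaded_slugs):
    """Return slugs known to the backend that the loaded flow lacks (sorted)."""
    return sorted(set(registered_slugs) - set(loaded_slugs))


def check_flow_matches(registered_slugs, loaded_slugs):
    """Raise KeyError if the loaded flow lacks a registered task slug.

    Hypothetical stand-in for the backend/storage consistency check;
    it only mirrors the shape of the error message quoted in this thread.
    """
    missing = missing_slugs(registered_slugs, loaded_slugs)
    if missing:
        raise KeyError(
            f"Task slug {missing[0]} is not found in the current Flow."
        )


# Example with made-up slugs: the backend knows about "load-1", but the
# image that was actually run still contains the older flow without it.
# check_flow_matches(["extract-1", "load-1"], ["extract-1"])  # raises KeyError
```

If the old and new flows happen to share the same slugs, a check like this passes silently, which is why a stale image can go unnoticed.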
I can’t find the link to the GitHub issue anymore, but I think I saw this ImageNotFound error once when there wasn’t enough disk space to download the image, though your download seems to have completed. Anyway, it might be worth checking disk space for new images.
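A quick way to check free space from Python on the agent host is sketched below. The function names and the threshold are hypothetical, and the path is an assumption: `/var/lib/docker` is the usual default Docker data directory on Linux, but the instance may be configured differently.

```python
import shutil


def free_gib(path):
    """Return free space on the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


def has_room_for_image(path, required_gib):
    """True if at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib


# Example (path and threshold are assumptions, adjust for your host):
# if not has_room_for_image("/var/lib/docker", 5.0):
#     print("Low disk space; new images may fail to be stored")
```

Running something like this before each deployment, or alerting on it, would surface the condition even when the pull log reports success.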
👀 1

Salim Doost

04/14/2022, 5:46 AM
Thank you Kevin, the instance is indeed out of disk space. I wonder why we’re not getting any error message about that. Why does it say that pulling the image completed successfully? And even worse (for the update case), why is it running the flow with the old image? This can be pretty dangerous in cases where the changes are compatible.

Kevin Kho

04/14/2022, 1:34 PM
So this is because docker-py basically uses a try-except to start the container, and if it can’t, it raises that NotFound error. I don’t know why it says the pull was successful. The default for Docker should be to re-pull. I think only Kubernetes needs explicit handling, because its default is to pull “IfNotPresent”, but Docker should always pull, I believe.
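Since the pull log alone was not a reliable signal here, one option is to verify that the image really exists in the local image store before relying on it. The sketch below shells out to the `docker image inspect` CLI command, which exits non-zero when the image is absent; the function name `image_present` and the injectable `run` parameter are hypothetical conveniences, not part of Prefect or docker-py.

```python
import subprocess


def image_present(image, run=subprocess.run):
    """Return True if `docker image inspect` finds `image` locally.

    `run` is injectable so the check can be exercised without Docker.
    """
    try:
        result = run(
            ["docker", "image", "inspect", image],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The docker CLI itself is not installed or not on PATH.
        return False
    return result.returncode == 0


# Example (image reference left as the placeholder used in this thread):
# image_present("<account-id>.dkr.ecr.ap-northeast-1.amazonaws.com/datascience-prefect:<image-tag-name>")
```

A False result right after a "Successfully pulled" log line would point at the host (for example, disk space) rather than at the registry or the flow.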