j
I have a strange issue where changes I make to Flows are not being reflected in the execution environment. I am running a docker agent on an AWS EC2 instance, and when I make updates to a Flow, those updates are not reflected at runtime. All of the storage building and registration of the Flow appear to happen successfully, and when i look at the terminal output on the EC2 instance i see
Successfully pulled image XXXXX
and
agent | Completed deployment of flow run XXXXX
, but the flow's content is not being updated. When I change the run labels on Prefect Cloud and execute the same (updated) flow in another environment (a Docker agent running on macOS), the updates I made to the flow are reflected…
k
Hey @Jacob Goldberg, maybe the image tag being pulled is an old one?
j
the docker containers are stored in ECR. The tag being pulled is
XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/cal_val_etl_flows:latest
according to the EC2 terminal output. I don't really have experience with ECR outside of storing Prefect images, but I have not changed any configuration settings, so I am not sure why it would be wrong if it is pulling ‘latest’. It has always worked in the past and it is working in another environment
k
Yeah, just saw the new changes are reflected in the new environment…that’s weird you see the image is pulled though. Would assume it would just use the cache
j
that is an interesting point…not sure how it is related. Even if i do not update the flow, every time i execute a flow from Prefect cloud i see “Pulling Image…” and “Successfully pulled image…” in that EC2 environment
Here is another breadcrumb. I just tried to run a barebones test flow in the environment that has always worked in the past (on macOS) and got this error
Failed to load and execute Flow's environment: StorageError('An error occurred while unpickling the flow:\n  ModuleNotFoundError("No module named \'tasks\'")\nThis may be due to one of the following version mismatches between the flow build and execution environments:\n  - python: (flow built with \'3.8.6\', currently running with \'3.8.10\')\nThis also may be due to a missing Python module in your current environment. Please ensure you have all required flow dependencies installed.')
🤔 definitely something strange going on
never had any issues like this in the past, and am not sure what could have changed causing these types of issue
k
I don’t know if that would cause the ECS issue, but in general the Python version should be the same for flow registration, the agent, and the container.
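To compare the two versions reported in the error, a small helper can show whether they differ at the minor or only at the patch level. This is a minimal sketch; the function name `same_python` and the choice of how many version components to compare are assumptions for illustration, not anything Prefect provides.

```python
import platform


def same_python(built_with: str, running: str, parts: int = 2) -> bool:
    """Return True if the first `parts` components of both versions match.

    parts=2 compares major.minor only; parts=3 also compares the patch level.
    """
    return built_with.split(".")[:parts] == running.split(".")[:parts]


if __name__ == "__main__":
    # "3.8.6" is the build version reported in the error message above.
    built_with = "3.8.6"
    running = platform.python_version()
    print(f"built with {built_with}, running {running}")
    print("major.minor match:", same_python(built_with, running))
    print("exact match:", same_python(built_with, running, parts=3))
```

Running it inside the flow's container, on the agent host, and in the registration environment shows which of the three is the odd one out.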
j
I will make sure those match and see if it changes anything. regarding the ECS issue is there another place i could file a ticket or find support?
k
I think we do have a partnership with a consulting firm and if you need more dedicated debugging on cloud stuff i could forward you to them
Do Prefect logs show anything about what image was downloaded?
j
They do not..
regarding the python versions, i have always used the
prefect.storage.docker.Docker()
function to build the Docker image, and have relied on the default for the base_image arg. The docs say:
"the base image for this when building this image (e.g. python:3.6), defaults to the prefecthq/prefect image matching your python version and prefect core library version used at runtime."
Therefore I am not sure what could be causing the python version mismatch
could it be that an update to the base image is a source of all of my problems?
k
Ah I see, have you tried running with debug level logs for the Flow? Also maybe the agent is the one with a different Python version?
j
I have not run with debug level logs, what would be the best way to enable that, do i need to rewrite the flow?
k
ECSRun(..., env={"PREFECT__LOGGING__LEVEL": "DEBUG"})
and then re-register and re-run
Ah, is it possible to go to the ECS Task Definition in the UI for the one pulling the same container, and then check whether it is the container you are expecting?
j
to clarify, I am not actually using ECSRun(); we had a number of issues configuring ECS to run with Prefect given our custom VPC configuration. I am running DockerRun() on an EC2 instance that is always running
so i cannot go to the ECS task definition
k
Ohhh gotcha. My bad. This should not be as bad then. I think DockerRun with debug logs should give us some insight
I think this might be the issue where the tag latest satisfies the requirement and it doesn’t pull the image
j
Hmm. I updated the DockerRun Env var to
"PREFECT__LOGGING__LEVEL": "DEBUG"
, updated the image, and re-registered the flow and I am still seeing the same message with a basic test flow. i.e. no additional information from debugging
maybe because it is not actually pulling the latest image!
let me check out that SO link
k
Oof…of course haha. That was dumb of me. Looks like as long as there is no error, it will print that log here
But we use pull….no I think it should work
j
I am trying to make all image tags unique with a timestamp to see if that resolves anything
k
That would be best practice 😄
j
The latest image is now being pulled, still seeing the same error about python version mismatch, going to try to patch that up and see if this solves things
k
You can also directly use docker pull to see the versions in that image? I suspect it’s the agent though on that EC2?
j
@Kevin Kho I am starting to think more that some change to the default docker image chosen by
prefect.storage.docker.Docker()
is causing the problem. I am starting outside of the EC2 environment, and working on my Mac OSX environment where everything has always worked fine until the last 48 hours. I have ensured the build environment of the image + registration is running python 3.8.5 and that the agent is running under python 3.8.5 however I still get this message when trying to execute flows:
Failed to load and execute Flow's environment: FlowStorageError('An error occurred while unpickling the flow:\n  ModuleNotFoundError("No module named \'tasks\'")\nThis may be due to one of the following version mismatches between the flow build and execution environments:\n  - python: (flow built with \'3.8.5\', currently running with \'3.8.12\')\nThis also may be due to a missing Python module in your current environment. Please ensure you have all required flow dependencies installed.')
I think it must be
prefect.storage.docker.Docker()
that is choosing a base image with python 3.8.12 causing this error. Am i missing something here? This still may not be root cause of my issues, but i do want to ensure the python versions are aligned
i have also ensured that the prefect version installed in the build environment, within the container (via a requirements.txt file), and on the agent are all the same and up-to-date with the latest version
k
Gotcha will take a look at this and at that image
Could you share your Docker Storage code with me? Just omit sensitive info
j
import os
import time

from prefect.storage import Docker

with open("../requirements.txt") as f:
    requirements = f.read().splitlines()

STORAGE = Docker(
    registry_url="XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/",
    image_name="cal_val_etl_flows",
    image_tag=f"latest-{round(time.time())}",
    python_dependencies=requirements,
    # copy module files to Docker container
    files={
        os.path.dirname(
            os.path.dirname(os.path.realpath(__file__))
        ): "/custom_modules/calval_etl"
    },
    # add  module to python path of Docker container
    env_vars={"PYTHONPATH": "$PYTHONPATH:custom_modules/calval_etl"},
    # added because we run into a healthcheck error where API modules are not authorized to fetch creds from aws secrets
    ignore_healthchecks=True,
)
is that what you mean?
k
yep perfect! thank you!
Can I see your requirements?
j
requirements.txt
k
Thanks!
Could you DM me flow code? I’d like to see the imports
I am wondering if the version mismatch is misleading here. There looks to be a real error
j
ok. I'm an idiot. My test flow was quite old and is not part of my testing suite; the module has been rearranged and my import statement was simply wrong
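A stale import like this can be caught before a run by checking that every module the flow imports actually resolves in the execution environment. A minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and `tasks` is simply the module named in the error above.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found on the current sys.path."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is missing,
            # or when an already-loaded module has no usable spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Run this inside the container the flow executes in.
    print(missing_modules(["tasks"]))
```

An empty list means the imports resolve, so a remaining unpickling error would have to come from somewhere else.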
k
That should have failed in registration I think?
j
Fixed the import statement, going to rerun on my macosx test environment, that should work then going to get back to EC2 and see if i am still having the same issue. I think not because i have ignore_healthchecks set to True?
k
Yeah that sounds good then let’s see!
j
Ok. I have confirmed everything is working fine now on my testing environment in macOSX (i dont think anything was ever really wrong, just me using a dumb old test flow). Anyhow a positive outcome of that red herring is that now the python versions on my dev, test, and prod, environments are identical….now back to EC2, when i try to run a test flow i now get this error.
404 Client Error for http+docker://localhost/v1.41/containers/create?name=stoic-roadrunner: Not Found ("No such image: XXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/cal_val_etl_flows:latest-1633114956")
i know the image exists because i verified via the ECR UI, AND my test execution environment on macOSX has no problem finding and pulling it
seems like original issue was caused by the EC2 instance that is running a docker agent not being able to properly pull images
in addition, if i SSH into the EC2 instance, stop running the prefect agent, and run
aws ecr list-images --repository-name "cal_val_etl_flows"
i can see the image that supposedly cant be found
not sure why the python docker api would suddenly not be able to find the image…?
k
Have been staring at this and thinking and no ideas yet
j
definitely a head scratcher…
k
Do you have
AWS_CREDENTIALS
as a secret on Prefect Cloud? (am hoping not)
j
i do not
k
I was wondering if Prefect was using some other credentials but I guess not
j
they should be the same. I load AWS creds as env vars when i build and register flows via DockerRun() like this:
import os

from prefect.run_configs import DockerRun

RUN_CONFIG = DockerRun(
    labels=["ec2", "prod"],
    env={
        "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        "AWS_DEFAULT_REGION": os.environ["AWS_DEFAULT_REGION"],
        "PREFECT__LOGGING__LEVEL": "DEBUG"
    },
)
k
I think you ideally want secrets for those (not related to the issue). If those are the same, it should be fine, and really i would expect an authentication issue if they were different
j
ya, i know this is a bit hacky and not ideal. I have an interesting auth dilemma preventing me from accessing those creds as secrets…but that is for another day. Doesn’t seem to be the issue here
i guess i can try just nuking the EC2 instance and starting again…hoping that a good old “unplug and plug back in” may do some good? but im not sure what else to do
k
I think you can try SSH-ing in there and using
docker pull
directly and see if that works?
j
Interesting…
i tried a docker pull, it had no trouble finding the image, and started to pull successfully but errored out with
write /var/lib/docker/tmp/GetImageBlob150753823: no space left on device
maybe that is the real issue and the previous error about not finding the image was misleading?
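A quick way to check that hypothesis is to look at how much free space is left on the filesystem that holds Docker's data directory (`/var/lib/docker`, the path in the error above). A minimal standard-library sketch; the helper names and any threshold passed to them are assumptions for illustration.

```python
import shutil


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


def has_room(path: str, needed_gib: float) -> bool:
    """True if at least `needed_gib` GiB are free on the filesystem at `path`."""
    return free_gib(path) >= needed_gib


if __name__ == "__main__":
    # Path taken from the "no space left on device" error message.
    print(f"{free_gib('/var/lib/docker'):.2f} GiB free")
```

If the number is smaller than the size of the image being pulled, the pull cannot complete regardless of whether the image exists in ECR.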
k
Maybe. I bet it was a try-except block that gave that error message
Well there is no try-except on the Prefect side haha. The error is on
dockerpy
side so maybe it really doesn’t find it? It does say 404
Frikin hell it is a red herring https://github.com/docker/docker-py/issues/2503
j
Super annoying! looks like that was the root of all of my issues, and everything is working now. I think at first it was silently failing and using a cached version, and once i switched to unique image tags it failed with this misleading error
thanks for your help down this rabbit hole, Kevin. Super helpful as usual!
k
Yeah what a rabbit hole lol. Glad it’s working now
j
For now, I have increased the storage volume on my EC2 instance, but i figure this may happen again in the future. The only thing that i think may take up considerable space on the instance over time is the cache of previous images that prefect has pulled. Is there any way to control this? Can i limit the size of the cache or the number of images that are stored?
k
Not seeing an immediate way on the Prefect front. I would explore whether Docker has a setting for this, but if not, you could have a Prefect Flow that runs
docker image prune
through the ShellTask lol
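A rough sketch of what that cleanup could look like, written with plain `subprocess` so it stands alone. The helper names and the 168-hour default are assumptions for illustration. Note that `--all` removes every unused image, not only dangling ones, and `--filter until=...` limits the prune to images older than the given age.

```python
import subprocess
from typing import List


def prune_command(max_age_hours: int = 168) -> List[str]:
    """Build a `docker image prune` command for images older than `max_age_hours`."""
    return [
        "docker", "image", "prune",
        "--all",      # include unused images, not only dangling ones
        "--force",    # skip the interactive confirmation prompt
        "--filter", f"until={max_age_hours}h",
    ]


def prune_old_images(max_age_hours: int = 168) -> str:
    """Run the prune on the local Docker daemon and return its output."""
    result = subprocess.run(
        prune_command(max_age_hours),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if docker exits non-zero
    )
    return result.stdout
```

The same command, joined into a single string, could be given to a ShellTask in a scheduled flow that runs on the EC2 agent's labels.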
j
hackity hack hack 😄
Well at least i know this can be an issue now. Thanks again for the help!