Jacob Goldberg

10/01/2021, 4:33 PM

I have a strange issue where changes I make to Flows are not being reflected in the execution environment. I am running a Docker agent on an AWS EC2 instance, and when I make updates to a Flow, those updates are not reflected at runtime. All of the storage building and registration of the Flow appear to happen successfully, and when I look at the terminal output on the EC2 instance I see

Successfully pulled image XXXXX

and

agent | Completed deployment of flow run XXXXX

, but the flow's content is not being updated. When I change the runtime labels on Prefect Cloud and execute the same (updated) flow in another environment (a Docker agent running on macOS), the updates I made to the flow are reflected…

Kevin Kho

10/01/2021, 4:35 PM

Hey @Jacob Goldberg, maybe the image tag being pulled is an old one?

Jacob Goldberg

10/01/2021, 4:38 PM

the docker containers are stored in ECR. The tag being pulled is

XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/cal_val_etl_flows:latest

according to the EC2 terminal output. I don't really have experience with ECR outside of storing Prefect images, but I have not changed any configuration settings, so I am not sure why it would be wrong if it is pulling ‘latest’. It has always worked in the past and it is working in another environment

Kevin Kho

10/01/2021, 4:39 PM

Yeah, just saw the new changes are reflected in the new environment…that’s weird you see the image is pulled though. Would assume it would just use the cache

Jacob Goldberg

10/01/2021, 4:42 PM

that is an interesting point…not sure how it is related. Even if i do not update the flow, every time i execute a flow from Prefect cloud i see “Pulling Image…” and “Successfully pulled image…” in that EC2 environment

Jacob Goldberg

10/01/2021, 4:47 PM

Here is another breadcrumb. I just tried to run a barebones test flow in the environment that has always worked in the past (on macOS) and got this error

Failed to load and execute Flow's environment: StorageError('An error occurred while unpickling the flow:\n  ModuleNotFoundError("No module named \'tasks\'")\nThis may be due to one of the following version mismatches between the flow build and execution environments:\n  - python: (flow built with \'3.8.6\', currently running with \'3.8.10\')\nThis also may be due to a missing Python module in your current environment. Please ensure you have all required flow dependencies installed.')

Jacob Goldberg

10/01/2021, 4:50 PM

🤔 definitely something strange going on

Jacob Goldberg

10/01/2021, 4:50 PM

never had any issues like this in the past, and am not sure what could have changed to cause these types of issues

Kevin Kho

10/01/2021, 5:02 PM

I don’t know if that would cause the ECS issue, but in general the Python version should be the same for flow registration, the agent, and that container.
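
A minimal sketch of the version check suggested here, assuming you record the interpreter version by hand in each of the three places (registration machine, agent, flow container). The helper name `version_mismatches` and the dictionary keys are hypothetical; the example values are the ones from the error message above.

```python
import platform


def version_mismatches(versions):
    """Return the entries whose version differs from the first entry's.

    `versions` maps a label (for example "registration") to a version
    string such as "3.8.6". An empty result means everything matches.
    """
    labels = list(versions)
    if not labels:
        return {}
    reference = versions[labels[0]]
    return {label: v for label, v in versions.items() if v != reference}


# Run this line in each environment to see which interpreter it uses.
print(platform.python_version())

# Versions copied from the error message: built with 3.8.6, running 3.8.10.
print(version_mismatches({"registration": "3.8.6", "container": "3.8.10"}))
```

Any label that shows up in the result is running a different interpreter than the first entry, which is the kind of mismatch the unpickling warning reports.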

Jacob Goldberg

10/01/2021, 5:08 PM

I will make sure those match and see if it changes anything. regarding the ECS issue is there another place i could file a ticket or find support?

Kevin Kho

10/01/2021, 5:23 PM

I think we do have a partnership with a consulting firm and if you need more dedicated debugging on cloud stuff i could forward you to them

Kevin Kho

10/01/2021, 5:23 PM

Do Prefect logs show anything about what image was downloaded?

Jacob Goldberg

10/01/2021, 5:24 PM

They do not..

Jacob Goldberg

10/01/2021, 5:24 PM

regarding the python versions, i have always used the

prefect.storage.docker.Docker()

function to build the Docker image, and have relied on the default for the base_image arg. The docs describe it as

"the base image for this when building this image (e.g. python:3.6), defaults to the prefecthq/prefect image matching your python version and prefect core library version used at runtime."

Therefore I am not sure what could be causing the python version mismatch

Jacob Goldberg

10/01/2021, 5:25 PM

could it be that an update to the base image is a source of all of my problems?

Kevin Kho

10/01/2021, 5:26 PM

Ah I see, have you tried running with debug level logs for the Flow? Also maybe the agent is the one with a different Python version?

Jacob Goldberg

10/01/2021, 5:26 PM

I have not run with debug level logs, what would be the best way to enable that, do i need to rewrite the flow?

Kevin Kho

10/01/2021, 5:27 PM

ECSRun(..., env={"PREFECT__LOGGING__LEVEL": "DEBUG"})

👍 1

Kevin Kho

10/01/2021, 5:28 PM

and then re-register and re-run

Kevin Kho

10/01/2021, 5:34 PM

Ah, is it possible to go to the ECS Task Definition in the UI of the one pulling the same container, and then check if it is the container you are expecting?

Jacob Goldberg

10/01/2021, 5:42 PM

to clarify, I am not actually using ECSRun(), we had a number of issues configuring ECS to run with Prefect given our custom VPC configuration. I am running DockerRun() on an EC2 instance that is always running

Jacob Goldberg

10/01/2021, 5:42 PM

so i cannot go to the ECS task definition

Kevin Kho

10/01/2021, 5:44 PM

Ohhh gotcha. My bad. This should not be as bad then. I think DockerRun with debug logs should give us some insight

Kevin Kho

10/01/2021, 5:46 PM

I think this might be the issue where the tag latest satisfies the requirement and it doesn’t pull the image

Jacob Goldberg

10/01/2021, 5:47 PM

Hmm. I updated the DockerRun Env var to

"PREFECT__LOGGING__LEVEL": "DEBUG"

, updated the image, and re-registered the flow and I am still seeing the same message with a basic test flow. i.e. no additional information from debugging

Jacob Goldberg

10/01/2021, 5:47 PM

maybe because it is not actually pulling the latest image!

Jacob Goldberg

10/01/2021, 5:47 PM

let me check out that SO link

Kevin Kho

10/01/2021, 5:50 PM

Oof…of course haha. That was dumb of me. Looks like as long as there is no error, it will print that log here

Kevin Kho

10/01/2021, 5:51 PM

But we use pull….no I think it should work

Jacob Goldberg

10/01/2021, 5:52 PM

I am trying to make all image tags unique with a timestamp to see if that resolves anything

Kevin Kho

10/01/2021, 5:56 PM

That would be best practice 😄

😬 1

Jacob Goldberg

10/01/2021, 6:05 PM

The latest image is now being pulled, still seeing the same error about python version mismatch, going to try to patch that up and see if this solves things

Kevin Kho

10/01/2021, 6:33 PM

You can also directly use docker pull to see the versions in that image? I suspect it’s the agent though on that EC2?

Jacob Goldberg

10/01/2021, 7:22 PM

@Kevin Kho I am starting to think more that some change to the default docker image chosen by

prefect.storage.docker.Docker()

is causing the problem. I am starting outside of the EC2 environment, and working on my Mac OSX environment where everything has always worked fine until the last 48 hours. I have ensured the build environment of the image + registration is running python 3.8.5 and that the agent is running under python 3.8.5 however I still get this message when trying to execute flows:

Failed to load and execute Flow's environment: FlowStorageError('An error occurred while unpickling the flow:\n  ModuleNotFoundError("No module named \'tasks\'")\nThis may be due to one of the following version mismatches between the flow build and execution environments:\n  - python: (flow built with \'3.8.5\', currently running with \'3.8.12\')\nThis also may be due to a missing Python module in your current environment. Please ensure you have all required flow dependencies installed.')

Jacob Goldberg

10/01/2021, 7:23 PM

I think it must be

prefect.storage.docker.Docker()

that is choosing a base image with python 3.8.12 causing this error. Am i missing something here? This still may not be root cause of my issues, but i do want to ensure the python versions are aligned

Jacob Goldberg

10/01/2021, 7:24 PM

I have also ensured that the Prefect version installed in the build environment, within the container (via a requirements.txt file), and on the agent are all the same and up-to-date with the latest version

Kevin Kho

10/01/2021, 7:25 PM

Gotcha will take a look at this and at that image

Kevin Kho

10/01/2021, 7:26 PM

Could you share your Docker Storage code with me? Just omit sensitive info

Jacob Goldberg

10/01/2021, 7:27 PM

import os
import time

from prefect.storage import Docker

with open("../requirements.txt") as f:
    requirements = f.read().splitlines()

STORAGE = Docker(
    registry_url="XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/",
    image_name="cal_val_etl_flows",
    # unique tag per build so the agent cannot reuse a stale cached image
    image_tag=f"latest-{round(time.time())}",
    python_dependencies=requirements,
    # copy module files to Docker container
    files={
        os.path.dirname(
            os.path.dirname(os.path.realpath(__file__))
        ): "/custom_modules/calval_etl"
    },
    # add module to python path of Docker container
    env_vars={"PYTHONPATH": "$PYTHONPATH:custom_modules/calval_etl"},
    # adding this because we run into a healthcheck error where API modules are not authorized to fetch creds from AWS secrets
    ignore_healthchecks=True,
)

Jacob Goldberg

10/01/2021, 7:27 PM

is that what you mean?

Kevin Kho

10/01/2021, 7:27 PM

yep perfect! thank you!

Kevin Kho

10/01/2021, 7:28 PM

Can I see your requirements?

Jacob Goldberg

10/01/2021, 7:29 PM

requirements.txt

Kevin Kho

10/01/2021, 7:29 PM

Thanks!

Kevin Kho

10/01/2021, 7:31 PM

Could you DM me flow code? I’d like to see the imports

Kevin Kho

10/01/2021, 7:33 PM

I am wondering if the version mismatch is misleading here. There looks to be a real error

Jacob Goldberg

10/01/2021, 7:33 PM

ok. I'm an idiot. My test flow was quite old, it is not part of my testing suite, the module has been rearranged, and my import statement was simply wrong

Kevin Kho

10/01/2021, 7:34 PM

That should have failed in registration I think?

Jacob Goldberg

10/01/2021, 7:34 PM

Fixed the import statement, going to rerun on my macOS test environment, that should work, then going to get back to EC2 and see if I am still having the same issue. I think not because I have ignore_healthchecks set to True?

Kevin Kho

10/01/2021, 7:34 PM

Yeah that sounds good then let’s see!

Jacob Goldberg

10/01/2021, 7:43 PM

Ok. I have confirmed everything is working fine now on my testing environment on macOS (I don't think anything was ever really wrong, just me using a dumb old test flow). Anyhow, a positive outcome of that red herring is that now the Python versions on my dev, test, and prod environments are identical….now back to EC2, when I try to run a test flow I now get this error.

404 Client Error for http+docker://localhost/v1.41/containers/create?name=stoic-roadrunner: Not Found ("No such image: XXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/cal_val_etl_flows:latest-1633114956")

Jacob Goldberg

10/01/2021, 7:43 PM

I know the image exists because I verified via the ECR UI, AND my test execution environment on macOS has no problem finding and pulling it

Jacob Goldberg

10/01/2021, 7:43 PM

seems like original issue was caused by the EC2 instance that is running a docker agent not being able to properly pull images

Jacob Goldberg

10/01/2021, 7:53 PM

in addition, if i SSH into the EC2 instance, stop running the prefect agent, and run

aws ecr list-images --repository-name "cal_val_etl_flows"

I can see the image that supposedly can't be found

Jacob Goldberg

10/01/2021, 7:54 PM

not sure why the python docker api would suddenly not be able to find the image…?

Kevin Kho

10/01/2021, 7:54 PM

Have been staring at this and thinking and no ideas yet

Jacob Goldberg

10/01/2021, 7:55 PM

definitely a head scratcher…

Kevin Kho

10/01/2021, 7:55 PM

Do you have

AWS_CREDENTIALS

as a secret on Prefect Cloud? (am hoping not)

Jacob Goldberg

10/01/2021, 7:55 PM

i do not

Kevin Kho

10/01/2021, 7:55 PM

I was wondering if Prefect was using some other credentials but I guess not

Jacob Goldberg

10/01/2021, 7:56 PM

they should be the same. I load AWS creds as env vars when i build and register flows via DockerRun() like this:

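import os

from prefect.run_configs import DockerRun

RUN_CONFIG = DockerRun(
    labels=["ec2", "prod"],
    # values are read from the registering machine's environment at registration time
    env={
        "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        "AWS_DEFAULT_REGION": os.environ["AWS_DEFAULT_REGION"],
        "PREFECT__LOGGING__LEVEL": "DEBUG",
    },
)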

Kevin Kho

10/01/2021, 7:58 PM

I think you ideally want secrets for those (not related to the issue). If those are the same, it should be fine, and really i would expect an authentication issue if they were different

Jacob Goldberg

10/01/2021, 8:00 PM

ya, I know this is a bit hacky and not ideal. I have an interesting auth dilemma preventing me from accessing those creds as secrets…but that is for another day. Doesn’t seem to be the issue here

Jacob Goldberg

10/01/2021, 8:01 PM

I guess I can try just nuking the EC2 instance and starting again…hoping that a good old “unplug and plug back in” may do some good? But I'm not sure what else to do

Kevin Kho

10/01/2021, 8:02 PM

I think you can try SSH-ing in there and using

docker pull

directly and see if that works?

✅ 1

Jacob Goldberg

10/01/2021, 8:03 PM

Interesting…

Jacob Goldberg

10/01/2021, 8:04 PM

I tried a docker pull, it had no trouble finding the image, and started to pull successfully but errored out with

write /var/lib/docker/tmp/GetImageBlob150753823: no space left on device

Jacob Goldberg

10/01/2021, 8:04 PM

maybe that is the real issue and the previous error about not finding the image was misleading?
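
A minimal sketch of a free-space check that could be run on the agent host before suspecting the registry, assuming Docker stores its data under `/var/lib/docker` (the path in the error above). The function names and the 5 GiB threshold are hypothetical choices for illustration.

```python
import shutil


def free_gib(path="/"):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_headroom(path="/", min_free_gib=5.0):
    """True if at least `min_free_gib` GiB are free at `path`."""
    return free_gib(path) >= min_free_gib


# On the agent host you would point this at Docker's data directory,
# for example has_headroom("/var/lib/docker"); "/" is used here so the
# snippet also runs on machines without Docker installed.
if not has_headroom("/"):
    print(f"Low disk space: only {free_gib('/'):.1f} GiB free")
```

Running this before a flow run (or on a schedule) would surface a full disk directly instead of through a failed image pull.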

Kevin Kho

10/01/2021, 8:04 PM

Maybe. I bet it was a try-except block that gave that error message

Kevin Kho

10/01/2021, 8:07 PM

Well there is no try-except on the Prefect side haha. The error is on

dockerpy

side so maybe it really doesn’t find it? It does say 404

Kevin Kho

10/01/2021, 8:16 PM

Frikin hell it is a red herring https://github.com/docker/docker-py/issues/2503

🙌 1

Jacob Goldberg

10/01/2021, 8:26 PM

Super annoying! looks like that was the root of all of my issues, and everything is working now. I think at first it was silently failing and using a cached version, and once i switched to unique image tags it failed with this misleading error

Jacob Goldberg

10/01/2021, 8:27 PM

thanks for your help down this rabbit hole Kevin. Super helpful as usual!

Kevin Kho

10/01/2021, 8:27 PM

Yeah what a rabbit hole lol. Glad it’s working now

Jacob Goldberg

10/01/2021, 8:28 PM

For now, I have increased the storage volume on my EC2 instance, but I figure this may happen again in the future. The only thing that I think may be adding considerable space on the instance in the future is caches of previous images that Prefect has pulled. Is there any way to control this? Can I limit the size of the cache or the number of images that are stored?

Kevin Kho

10/01/2021, 8:30 PM

Not seeing an immediate way on the Prefect front. I would explore whether Docker has one, but if not, you could have a Prefect Flow that runs

docker image prune

through the ShellTask lol

✅ 1
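
A minimal sketch of building such a prune command, assuming a one-week retention window (the 168-hour default is a hypothetical choice). By default `docker image prune` only removes dangling images, so with uniquely tagged images the `--all` flag is needed to remove old tagged images that no container uses. The helper names are hypothetical; nothing here actually invokes Docker.

```python
import shlex


def build_prune_command(older_than_hours=168, all_unused=True):
    """Build an argument list for `docker image prune`.

    --force skips the interactive confirmation, --all also removes
    tagged images not used by any container, and the `until` filter
    restricts pruning to images created before the given age.
    """
    if older_than_hours <= 0:
        raise ValueError("older_than_hours must be positive")
    cmd = ["docker", "image", "prune", "--force"]
    if all_unused:
        cmd.append("--all")
    cmd += ["--filter", f"until={older_than_hours}h"]
    return cmd


def as_shell_string(cmd):
    """Render the argument list as a single shell-safe string."""
    return shlex.join(cmd)


print(as_shell_string(build_prune_command()))
```

The resulting string could be handed to a ShellTask as its command in a scheduled flow, or run from cron on the EC2 instance; either way the flow or job has to run on the host whose Docker cache you want to trim.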

Jacob Goldberg

10/01/2021, 8:31 PM

hackity hack hack 😄

Jacob Goldberg

10/01/2021, 8:32 PM

Well at least i know this can be an issue now. Thanks again for the help!

👍 1
