We’re experiencing really strange behavior when creating a s... | Prefect Community #ask-community

Ilya Galperin

08/26/2022, 8:30 PM

We’re experiencing really strange behavior when creating a storage block and deployment using the Python Deployment object. When we create the deployment the first time, the flow is picked up and runs on our Kubernetes-hosted agent, but subsequent flow runs will seemingly randomly go into a failed state almost instantly with no error message and never get picked up by the agent. I notice when our flow does successfully run, we get this warning message in the container’s log:

/usr/local/lib/python3.10/site-packages/prefect/deployments.py:48: UserWarning: Block document has schema checksum sha256:0ec43f8010cee4adbf73aebcc58f1e45986d765c2a224dfc9cd5428f98c516f8 which does not match the schema checksum for class 'S3'. This indicates the schema has changed and this block may not load.
  storage_block = Block._from_block_document(storage_document)

Deleting and re-creating the block and deployment will sometimes cause it to work again but again, only on an intermittent basis. The flow code itself does not touch or interact with Prefect block storage. Has anyone experienced this or have any idea what might be causing the issue?
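
For readers unfamiliar with the warning quoted above, here is a minimal sketch of how a schema checksum comparison can flag a mismatch between a stored block document and the block class installed at runtime. It illustrates the general idea only and is not Prefect's actual implementation: the helper `schema_checksum`, both example schemas, and the field name `extra_field` are hypothetical.

```python
import hashlib
import json


def schema_checksum(schema: dict) -> str:
    """Return a sha256 checksum over a canonical JSON encoding of a schema.

    Hypothetical helper: sorting keys makes the result independent of
    dictionary ordering, so only real schema changes alter the checksum.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical schema saved alongside the block document.
stored_schema = {
    "title": "S3",
    "properties": {"bucket_path": {"type": "string"}},
}

# Hypothetical schema derived from the class in the runtime environment,
# which has one additional field.
runtime_schema = {
    "title": "S3",
    "properties": {
        "bucket_path": {"type": "string"},
        "extra_field": {"type": "string"},
    },
}

stored_checksum = schema_checksum(stored_schema)
runtime_checksum = schema_checksum(runtime_schema)

if stored_checksum != runtime_checksum:
    print("Schema checksum mismatch: the block may not load.")
```

Under this model, a mismatch would point to the environment that saved the block and the environment that loads it seeing different definitions of the same block class. Whether that applies to the failures described in this thread is not established here.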

Kevin Grismore

08/26/2022, 8:34 PM

This is actually the exact same issue that led me to the one I posted about above.

Kevin Grismore

08/26/2022, 8:35 PM

I deleted and recreated my GCS block, but it doesn't cooperate on certain docker images.

Ilya Galperin

08/26/2022, 8:36 PM

That is really weird. We’re having the same behavior on EKS using s3fs package for storage. These flows aren’t even getting to the point where the image will load into the container or any pod/container will be created by the agent.

Ilya Galperin

08/26/2022, 8:36 PM

They’ll just fail silently with no error logging it seems like when starting them from the Cloud UI.

Kevin Grismore

08/26/2022, 8:40 PM

for your deployment, what does the storage block say in the cloud UI? should be in the top right corner

Kevin Grismore

08/26/2022, 8:40 PM

mine is anonymous-**. I'm not sure if that's the expected behavior or not.

Ilya Galperin

08/26/2022, 8:40 PM

That might be a problem with your GCS authentication

Ilya Galperin

08/26/2022, 8:41 PM

Ours does have the actual storage block we created listed on the deployment in the cloud UI

Ilya Galperin

08/26/2022, 8:41 PM

I ran into the anonymous block issue before which was caused by our AWS credentials not being sent when running block.save

Kevin Grismore

08/26/2022, 8:42 PM

I created the block in the UI with the GCS service account json just pasted in

Ilya Galperin

08/26/2022, 8:42 PM

Ah I see

Kevin Grismore

08/26/2022, 8:42 PM

may be unrelated then.

Ryan Peden

08/26/2022, 8:49 PM

Ilya, which Prefect image are you using to run your k8s agent (assuming you're running a prefecthq image directly or using one as your base image)?

Ilya Galperin

08/26/2022, 8:53 PM

Hi Ryan — we are running

prefecthq/prefect:2.2.0-python3.10

on the agent.

Ilya Galperin

08/26/2022, 9:23 PM

Some more documentation: all of these flow runs were started within 5 seconds of each other from the Cloud UI. 2 failed instantly with no error logging and 2 succeeded.

Ryan Peden

08/26/2022, 9:38 PM

Thanks for the extra information, Ilya - I'm going to reach out to some of my colleagues about this to try to figure this out

Ilya Galperin

08/26/2022, 9:39 PM

Thanks Ryan — just a heads up that I have opened a ticket through email support as well on this.

Anna Geller

08/27/2022, 1:13 AM

Ilya, given that you're using a base image, it's likely a dependency issue that s3fs is not installed within the pod. We have a Discourse topic about it

Anna Geller

08/27/2022, 1:13 AM

https://discourse.prefect.io/t/i-m-getting-an-error-file-system-could-not-be-created-you-are-likely-missing-a-python-module-required-to-use-the-given-storage-protocol-how-to-solve-that/1459

Ilya Galperin

08/27/2022, 6:18 AM

Hi Anna - we are using a custom image defined on the pod manifest that has s3fs installed, so I do not think this is the case. The agent runs the base image, but our flow runs use custom images. I don't think this can be a dependency issue, since the flows do start and succeed intermittently. Around 50% of the time, though, they never start or stay in a pending/scheduled state long enough for the agent to pick them up; they just fail immediately.

Ilya Galperin

08/27/2022, 6:30 AM

Based on the warning messages we've been seeing, it seems like an issue with storage blocks or the Orion API. Flow runs are failing intermittently, which makes me think it might be an issue with the API. Unfortunately, no messages are surfaced on our side for the failed flows, but I'd be happy to share the flow run details if you have any other methods to see what might be causing them to fail silently and immediately.

Anna Geller

08/27/2022, 11:01 AM

What I would do is upgrade to the latest version and recreate the blocks and deployments - it doesn't hurt to check if it's working with a clean slate. When a pod doesn't start, there can be many different root causes. It's worth checking the pod logs and forwarding them along with your support ticket if recreating the blocks and deployments didn't help. If that doesn't work either, it would be helpful if you could create an MRE so that we can replicate the issue on our end and figure out how to solve it.

Ilya Galperin

08/27/2022, 5:05 PM

Hi Anna — thanks for the responsiveness, it’s much appreciated. I believe we are running the latest version of Prefect on the environment creating the deployments.

Version:             2.2.0
API version:         0.8.0
Python version:      3.8.9
Git commit:          e3651362
Built:               Tue, Aug 23, 2022 2:18 PM
OS/Arch:             darwin/arm64
Profile:             default
Server type:         hosted

I’ve just now tried creating an entirely new flow, deployment and storage block and we continue to see the same behavior. I kick off 5 flows from the Cloud UI, 3 run successfully (pods are created for these) and 2 fail immediately. Unfortunately there are no logs that I’m able to capture because no pods are ever created for the failed flows. They are not recognized by the agent, nor do they seem to ever exist in a “scheduled” state; they just go into a “failed” state immediately, so the agent does not have an opportunity to grab them. I am not sure what an MRE is, if you can please explain, but I have forwarded all the documentation in the support ticket and created a GitHub issue here. Please let me know if there are any other details, e.g. flow run IDs, that I can provide that could help in the investigation. https://github.com/PrefectHQ/prefect/issues/6586#issuecomment-1229039900

Ilya Galperin

08/27/2022, 7:58 PM

I’ve added a comment to the GitHub issue but also wanted to note here that I’ve done additional testing by building a brand new block using the Prefect Cloud UI and a new deployment and flow using only the Prefect CLI, and we’re still seeing the same behavior on this new deployment. I think this problem is independent of storage blocks or Python deployment objects entirely. Perhaps there is an issue with our workspace, the Orion API, or maybe the agent?

Anna Geller

08/27/2022, 8:52 PM

MRE = minimal reproducible example. Thanks for the GitHub issue. Similarly to Michael, I'm also a little behind on the state of blocks checksum, so I'll let Alex respond on the issue. I would recommend deploying it with the

prefect deployment build

CLI, which is way easier to troubleshoot than doing it from Python. Also, given that you have both a support ticket and an open GitHub issue, let's continue the discussion there and close this Slack discussion to avoid repeating ourselves across channels. I'm marking this thread as solved. If you have any new findings, please add them to the GitHub issue; if the person handling your support ticket has access to this issue, they should have everything they need there. Thanks a lot for the detailed write-up and for providing a thorough explanation. Someone should get back to you after the weekend.
