# prefect-community
We’re experiencing really strange behavior when creating a storage block and deployment using the Python Deployment object. When we create the deployment the first time, the flow is picked up and runs on our Kubernetes-hosted agent, but subsequent flow runs will seemingly randomly go into a failed state almost instantly with no error message and never get picked up by the agent. I notice when our flow does successfully run, we get this warning message in the container’s log:
/usr/local/lib/python3.10/site-packages/prefect/deployments.py:48: UserWarning: Block document has schema checksum sha256:0ec43f8010cee4adbf73aebcc58f1e45986d765c2a224dfc9cd5428f98c516f8 which does not match the schema checksum for class 'S3'. This indicates the schema has changed and this block may not load.
  storage_block = Block._from_block_document(storage_document)
Deleting and re-creating the block and deployment will sometimes cause it to work again but again, only on an intermittent basis. The flow code itself does not touch or interact with Prefect block storage. Has anyone experienced this or have any idea what might be causing the issue?
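For context on the warning above, here is a minimal sketch of how a schema checksum comparison of this kind can work. It is an illustration under stated assumptions, not Prefect's actual implementation: the function names, the canonical JSON encoding, and both example schemas are hypothetical. The idea is that a digest of the block's schema is stored when the block is saved, and a later load recomputes the digest from whatever class definition is installed at that point.

```python
import hashlib
import json


def schema_checksum(schema: dict) -> str:
    """Return a 'sha256:<hex>' digest of a canonical JSON encoding of the schema.

    Hypothetical helper: sorting keys makes the digest independent of key order.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def checksums_match(stored_checksum: str, current_schema: dict) -> bool:
    """Compare the checksum stored with a block against the installed schema."""
    return stored_checksum == schema_checksum(current_schema)


# Hypothetical schemas: the second adds a field, as a different library
# version might. Neither is the real schema of the S3 block.
schema_when_saved = {
    "title": "S3",
    "properties": {"bucket_path": {"type": "string"}},
}
schema_in_runtime = {
    "title": "S3",
    "properties": {
        "bucket_path": {"type": "string"},
        "extra_field": {"type": "string"},
    },
}

stored = schema_checksum(schema_when_saved)
print(checksums_match(stored, schema_when_saved))  # schema unchanged
print(checksums_match(stored, schema_in_runtime))  # schema changed
```

If the real check behaves similarly, one plausible (unconfirmed) cause of the warning is that the environment that saved the block and the environment that loads it have different library versions installed.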
This is actually the exact same issue that led me to the one I posted about above.
I deleted and recreated my GCS block, but it doesn't cooperate on certain docker images.
That is really weird. We’re having the same behavior on EKS using s3fs package for storage. These flows aren’t even getting to the point where the image will load into the container or any pod/container will be created by the agent.
They seem to just fail silently, with no error logging, when we start them from the Cloud UI.
for your deployment, what does the storage block say in the cloud UI? should be in the top right corner
mine is anonymous-**. I'm not sure if that's the expected behavior or not.
That might be a problem with your GCS authentication
Ours does have the actual storage block we created listed on the deployment in the cloud UI
I ran into the anonymous block issue before which was caused by our AWS credentials not being sent when running block.save
I created the block in the UI with the GCS service account json just pasted in
Ah I see
may be unrelated then.
Ilya, which Prefect image are you using to run your k8s agent (assuming you're running a prefecthq image directly or using one as your base image)?
Hi Ryan — we are running
on the agent.
Some more documentation, all these flow runs were started within 5 seconds of each other from the cloud UI - 2 failed instantly with no error logging and 2 succeeded.
Thanks for the extra information, Ilya - I'm going to reach out to some of my colleagues about this to try to figure this out
Thanks Ryan — just a heads up that I have opened a ticket through email support as well on this.
Ilya, given that you're using a base image, it's likely a dependency issue that s3fs is not installed within the pod. We have a Discourse topic about it
Hi Anna - we are using a custom image defined on the pod manifest that has s3fs installed, so I do not think this is the case. The agent runs the base image but our flow runs use custom images. I don't think this can be a dependency issue, as the flows do start and succeed intermittently. Around 50% of the time, though, they do not even start or stay in a pending/scheduled state long enough for the agent to pick them up; they just fail immediately.
Based on the warning messages we've been seeing, it seems like an issue with storage blocks or the Orion API. Flow runs are failing intermittently, which makes me think it might be an issue with the API. Unfortunately there are no messages on the failed flows that are surfaced on our side, but I'd be happy to share the flow run details if you have any other methods to see what might be causing them to fail silently and immediately.
What I would do is upgrade to the latest version and recreate the blocks and deployments - doesn't hurt to check if it's working with a clean slate. When a pod doesn't start, there can be many different root causes. It's worth checking the pod logs and forwarding them along with your support ticket if recreating the blocks and deployments doesn't help. If that doesn't work either, it would be helpful if you could create an MRE so that we can replicate the issue on our end and figure out how to solve it.
Hi Anna — thanks for the responsiveness, it’s much appreciated. I believe we are running the latest version of Prefect on the environment creating the deployments.
Version:             2.2.0
API version:         0.8.0
Python version:      3.8.9
Git commit:          e3651362
Built:               Tue, Aug 23, 2022 2:18 PM
OS/Arch:             darwin/arm64
Profile:             default
Server type:         hosted
I’ve just now tried creating an entirely new flow, deployment and storage block and we continue to see the same behavior. I kick off 5 flows from the Cloud UI, 3 run successfully (pods are created for these) and 2 fail immediately. Unfortunately there are no logs that I’m able to capture because no pods are even created for the failed flows: they are not picked up by the agent, nor do they ever seem to exist in a “scheduled” state. They go straight into a “failed” state, so the agent does not have an opportunity to grab them. I am not sure what an MRE is, if you can please explain, but I have forwarded all the documentation in the support ticket and created a GitHub issue here. Please let me know if there are any other details, e.g. flow run IDs, that I can provide that could help in the investigation. https://github.com/PrefectHQ/prefect/issues/6586#issuecomment-1229039900
I’ve added a comment to the GitHub issue but also wanted to note here that I’ve done additional testing by building a brand new block using the Prefect Cloud UI and a new deployment and flow using only the Prefect CLI, and we’re still seeing the same behavior on this new deployment. I think this problem is independent of storage blocks or Python deployment objects entirely. Perhaps there is an issue with our workspace, the Orion API, or maybe the agent?
MRE = minimal reproducible example
Thanks for the GitHub issue. Similarly to Michael, I'm also a little behind on the state of blocks checksum, so I'll let Alex respond on the issue. I would recommend deploying it with the `prefect deployment build` CLI, which is way easier to troubleshoot than doing it from Python. Also, given that you have both a support ticket and an open GitHub issue, let's continue the discussion there and close this Slack discussion to avoid repeating ourselves across channels. I'm marking this thread as solved. If you have any new findings, please add them to the GitHub issue; if the person handling your support ticket has access to that issue, they should have everything they need there. Thanks a lot for the detailed write-up and for providing a thorough explanation. Someone should get back to you after the weekend.