Description of problem:
After upgrading to 4.7.24, new builds fail due to a mismatch in the layer digests. The same image works when pulled from an external repository, but not from the internal registry.

Version-Release number of selected component (if applicable):
OCP 4.7.24, bare metal on VMware

How reproducible:
Try to build an image from the internal repository.

Steps to Reproduce:
1. podman pull image-registry.openshift-image-registry.svc:5000/project/image

Actual results:
Error: Error writing blob: error storing blob to file "/var/tmp/storage059258202/1": error happened during read: Digest did not match, expected sha256:2a096c51921790689d99fd69901a9572ea89cdabb80f97fa005e748091b02afa, got sha256:b5a8d656896dbad35ef0e618cd323c8de2cf4190f82febda9ed6f8eb3c0d0ae6

Expected results:
The pod is created correctly.
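To make the error message concrete: registry blobs are content-addressed, so the client recomputes the SHA-256 of the bytes it received and compares it to the digest in the manifest. The sketch below (stand-in data, not a real image layer) reproduces the shape of the failure when the stored bytes have been corrupted:

```shell
# A registry blob is addressed by the SHA-256 of its content, so any
# corruption on disk or in transit changes the computed digest.
blob="example layer content"
expected="$(printf '%s' "$blob" | sha256sum | awk '{print $1}')"

# Simulate the corruption seen in this bug: the stored bytes differ
# from what the manifest promised.
corrupted="example layer content (bit-flipped)"
actual="$(printf '%s' "$corrupted" | sha256sum | awk '{print $1}')"

if [ "$expected" != "$actual" ]; then
  echo "Digest did not match, expected sha256:$expected, got sha256:$actual"
fi
```

This is why the same image pulls fine from an external repository: the external copy of the blob is intact, while the internal registry's copy on CephFS is not.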
Hi @jsafrane, thanks for your help. Let me confirm your question with the customer (CU).
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minutes of disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Setting NEEDINFO for the impact statement in comment 13.
> Who is impacted?

Not only the registry on CephFS is affected. Basically any data on CephFS may be corrupted. It could be a harmless log, but it could be a critical database too.

> How involved is remediation

For randomly corrupted data, restore it from backup. In addition, the cluster does not report any error, so users may find out pretty late that their data is corrupted (and the corrupted data may even have been backed up).
(In reply to Oleg Bulatov from comment #17)
> Who is impacted? If we have to block upgrade edges based on this issue,
> which edges would need blocking?
>
> Customers who use 4.7.24 and use a PV with CephFS for the image registry.

Is this 4.7.24-only? Is 4.7.28 not vulnerable? Are there no vulnerable 4.8 releases?
AFAIK 4.7.24+ and 4.8.0+ are vulnerable.
Edits upon comment 17 from Oleg, since I happen to know this affects all of 4.8 as well.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
Customers who use 4.7.24+ or 4.8.2+ and use a PV with CephFS for the image registry. At the time of writing, this is not fixed in a later 4.7.z or 4.8.z.

What is the impact? Is it serious enough to warrant blocking edges?
The registry storage irreversibly corrupts container images. Corrupted layers cannot be pulled or re-pushed; manual intervention is required.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
An admin must rsh into the registry container and delete the corrupted blobs and layer links. Corrupted images can only be re-pushed or re-built.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, the regression was introduced in 4.7.24 and 4.8.2. At the time of writing, this is not fixed in a later 4.7.z or 4.8.z.
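For reference, a minimal sketch of the manual cleanup mentioned above, assuming the image registry uses the upstream distribution filesystem storage layout rooted at /registry (the default for the OpenShift image registry). The digest is the one from the bug description; the repository name and the `oc rsh` target are illustrative assumptions, not taken from this cluster:

```shell
# Build the storage paths for a corrupted blob and its layer link,
# following the distribution registry layout:
#   blobs/sha256/<first two hex chars>/<digest>/data
#   repositories/<name>/_layers/sha256/<digest>/link
digest="2a096c51921790689d99fd69901a9572ea89cdabb80f97fa005e748091b02afa"
repo="project/image"          # hypothetical repository name
root="/registry/docker/registry/v2"

prefix="$(printf '%s' "$digest" | cut -c1-2)"
blob_path="$root/blobs/sha256/$prefix/$digest/data"
link_path="$root/repositories/$repo/_layers/sha256/$digest/link"

echo "$blob_path"
echo "$link_path"

# Then, with cluster access (not run here), delete them inside the pod:
#   oc -n openshift-image-registry rsh deploy/image-registry \
#     rm -f "$blob_path" "$link_path"
```

After removing the blob and link, the image still has to be re-pushed or re-built, as noted above; deleting the corrupted files only clears the way for a clean upload.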
This is believed fixed with kernel-4.18.0-305.19.1.el8_4 in 4.7.30.
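A quick way for admins to check whether a node already runs the fixed kernel, assuming `sort -V` ordering is adequate for these RPM-style version strings. On a live cluster the running version would come from `oc debug node/<node> -- chroot /host uname -r`; the value below is a made-up example of a vulnerable kernel:

```shell
# Compare the running kernel against the version that carries the fix.
fixed="4.18.0-305.19.1.el8_4"
running="4.18.0-305.10.2.el8_4"   # example value; substitute `uname -r` output

# If the smallest of the two (version-sorted) is the fixed version,
# the running kernel is at least as new as the fix.
if [ "$(printf '%s\n%s\n' "$fixed" "$running" | sort -V | head -n1)" = "$fixed" ]; then
  echo "kernel $running includes the fix"
else
  echo "kernel $running predates the fix"
fi
```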
The same kernel is in use in 4.8.30 and the latest 4.9 nightlies; it would be wise to test there as well, but I'll direct this bug at 4.7 (not sure a clone is needed for other releases).
(In reply to Luke Meyer from comment #31)
> The same kernel is in use in 4.8.30

:facepalm: I meant 4.8.11.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.30 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3422
Catching up here: two weeks ago we blocked edges into 4.7.29 and 4.8.10 (on top of some impacted edges that had already been blocked for other reasons) in [1] and [2], based on the impact statement from comment 23.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/1033
[2]: https://github.com/openshift/cincinnati-graph-data/pull/1034