Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1999591

Summary: internal registry is rejecting the container creation due to sha256 layer mismatch
Product: OpenShift Container Platform
Reporter: Pamela Escorza <pescorza>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: XiuJuan Wang <xiuwang>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.7
CC: aos-bugs, david.karlsen, dfuller, hchiramm, jack.ottofaro, jlayton, jsafrane, kgordeev, luaparicio, luca.mercuri, mhackett, mnunes, obulatov, pdonnell, sdodson, sostapov, vcojot, wking
Target Milestone: ---
Keywords: UpgradeBlocker
Target Release: 4.7.z
Hardware: All
OS: All
Whiteboard: UpdateRecommendationsBlocked
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-15 09:16:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Pamela Escorza 2021-08-31 11:42:59 UTC
Description of problem:
After upgrading to 4.7.24, new builds fail because of a digest mismatch in the image layers.

The same image works when pulled from an external repository, but not from the internal registry.

Version-Release number of selected component (if applicable):
OCP 4.7.24 Baremetal on VMWare

How reproducible:
Try to build an image from the internal registry.

Steps to Reproduce:
1. podman pull image-registry.openshift-image-registry.svc:5000/project/image

Actual results:
Error: Error writing blob: error storing blob to file "/var/tmp/storage059258202/1": error happened during read: Digest did not match, expected sha256:2a096c51921790689d99fd69901a9572ea89cdabb80f97fa005e748091b02afa, got sha256:b5a8d656896dbad35ef0e618cd323c8de2cf4190f82febda9ed6f8eb3c0d0ae6
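The error above comes from the client-side content check: the client hashes the bytes it receives and compares the result with the digest it asked for. The following is a minimal Python sketch of that kind of check, not podman's actual implementation; the function name `verify_blob` and the chunk size are assumptions made for illustration.

```python
import hashlib
import io


def verify_blob(stream, expected_digest, chunk_size=1 << 20):
    """Hash a binary stream and compare it with an expected 'sha256:<hex>' digest.

    Returns (ok, actual_digest). Hypothetical helper, for illustration only.
    """
    algorithm = expected_digest.partition(":")[0]
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    hasher = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)
    actual = "sha256:" + hasher.hexdigest()
    return actual == expected_digest, actual


# Usage: a blob whose stored bytes changed no longer matches its digest.
original = b"layer contents"
digest = "sha256:" + hashlib.sha256(original).hexdigest()
print(verify_blob(io.BytesIO(original), digest)[0])
print(verify_blob(io.BytesIO(b"layer c0ntents"), digest)[0])
```

A mismatch therefore means the bytes served by the registry differ from the bytes that were originally pushed under that digest.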

Expected results:
The pod is created correctly.

Additional info:

Comment 8 Pamela Escorza 2021-09-01 12:39:50 UTC
Hi @jsafrane, thanks for your help. Let me confirm your question with the customer.

Comment 13 Jack Ottofaro 2021-09-01 14:51:38 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 14 W. Trevor King 2021-09-01 21:07:44 UTC
Setting NEEDINFO for the impact statement in comment 13.

Comment 18 Jan Safranek 2021-09-02 14:19:25 UTC
> Who is impacted?  

It is not only the registry on cephfs that is affected; basically any data on cephfs may be corrupted. It could be a harmless log, but it could be a critical database too.

> How involved is remediation

Randomly corrupted data must be restored from backup. In addition, the cluster does not report any error, so users may find out quite late that their data is corrupted (and the corrupted data may even have been backed up by then).

Comment 19 W. Trevor King 2021-09-02 16:48:41 UTC
(In reply to Oleg Bulatov from comment #17)
> Who is impacted?  If we have to block upgrade edges based on this issue,
> which edges would need blocking?
> 
>   Customers, who use 4.7.24 and use PV with cephfs for the image registry.

Is this 4.7.24-only?  4.7.28 is not vulnerable?  No vulnerable 4.8 releases?

Comment 20 Oleg Bulatov 2021-09-02 16:53:14 UTC
AFAIK 4.7.24+ and 4.8.0+ are vulnerable.

Comment 23 Scott Dodson 2021-09-02 19:47:34 UTC
Edits upon comment 17 from Oleg, since I happen to know this affects all 4.8 as well.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?

  Customers who use 4.7.24+ or 4.8.2+ and use a PV with cephfs for the image registry. At time of writing, this is not fixed in a later 4.7.z or 4.8.z.

What is the impact?  Is it serious enough to warrant blocking edges?

  The registry storage irreversibly corrupts container images. Corrupted layers cannot be pulled or re-pushed; manual intervention is required.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  Admin must rsh into the registry container and delete corrupted blobs and layer links.
  Corrupted images can only be re-pushed/re-built.
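As a rough illustration of how corrupted blobs could be located before deleting them, the Python sketch below walks a registry storage tree and reports every blob whose content hash differs from the digest in its directory name. It assumes the filesystem layout used by the upstream distribution storage driver (`<root>/docker/registry/v2/blobs/sha256/<first two hex chars>/<digest>/data`); the storage root path and the function name `find_corrupted_blobs` are assumptions, and the sketch only reports. Deleting blobs and layer links remains a manual step.

```python
import hashlib
from pathlib import Path


def find_corrupted_blobs(storage_root):
    """Yield (path, expected_hex, actual_hex) for blobs whose content hash
    does not match the digest encoded in their directory name.

    Assumes the layout <root>/docker/registry/v2/blobs/sha256/<xx>/<digest>/data.
    """
    blobs_dir = Path(storage_root) / "docker" / "registry" / "v2" / "blobs" / "sha256"
    for data_file in sorted(blobs_dir.glob("*/*/data")):
        expected_hex = data_file.parent.name
        hasher = hashlib.sha256()
        with data_file.open("rb") as f:
            # Read in 1 MiB chunks so large layers are not loaded into memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                hasher.update(chunk)
        actual_hex = hasher.hexdigest()
        if actual_hex != expected_hex:
            yield data_file, expected_hex, actual_hex


if __name__ == "__main__":
    # "/registry" is an assumed mount point; adjust to the actual storage root.
    for path, expected, actual in find_corrupted_blobs("/registry"):
        print(f"{path}: expected sha256:{expected}, got sha256:{actual}")
```

Because the underlying filesystem was the source of the corruption, a blob that verifies cleanly once is not guaranteed to stay intact until the kernel fix is in place.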

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

  Yes, the regression was introduced in 4.7.24 and 4.8.2. At time of writing, this is not fixed in a later 4.7.z or 4.8.z.

Comment 30 Luke Meyer 2021-09-09 20:37:49 UTC
This is believed fixed with kernel-4.18.0-305.19.1.el8_4 in 4.7.30.

Comment 31 Luke Meyer 2021-09-09 20:41:24 UTC
The same kernel is in use in 4.8.30 and latest 4.9 nightlies - would be wise to test there as well, but I'll direct this bug at 4.7 (not sure a clone is needed for other releases).

Comment 34 Luke Meyer 2021-09-09 20:45:09 UTC
(In reply to Luke Meyer from comment #31)
> The same kernel is in use in 4.8.30

:facepalm: I meant 4.8.11
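Putting the version statements from this thread together, a hedged Python sketch of an "is this release in the affected window?" check might look like the following. The lower bounds come from comment 23 (4.7.24 and 4.8.2; comment 20 suggests 4.8.0+ instead), and the upper bounds from the releases believed to carry the fixed kernel (4.7.30 and 4.8.11). The table and function names are hypothetical, and releases outside 4.7 and 4.8 are simply not covered by this thread.

```python
# (major, minor) -> (first affected z, first z believed fixed), per this bug's comments.
AFFECTED_RANGES = {
    (4, 7): (24, 30),
    (4, 8): (2, 11),
}


def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_affected(version):
    """Return True if an X.Y.Z release falls inside the window described in this bug.

    Returns False for minors this thread makes no statement about.
    """
    major, minor, patch = parse_version(version)
    bounds = AFFECTED_RANGES.get((major, minor))
    if bounds is None:
        return False
    first_bad, first_fixed = bounds
    return first_bad <= patch < first_fixed


for release in ("4.7.23", "4.7.24", "4.7.30", "4.8.10", "4.8.11"):
    print(release, is_affected(release))
```

This only encodes what the comments claim; the fix itself is "believed" rather than confirmed in comment 30.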

Comment 37 errata-xmlrpc 2021-09-15 09:16:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.30 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3422

Comment 38 W. Trevor King 2021-09-15 16:13:33 UTC
Catching up here, two weeks ago we blocked edges into 4.7.29 and 4.8.10 (on top of some impacted edges that had already been blocked for other reasons) in [1,2], based on the impact statement from comment 23.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/1033
[2]: https://github.com/openshift/cincinnati-graph-data/pull/1034