Bug 1999591 - internal registry is rejecting the container creation due to sha256 layer mismatch
Summary: internal registry is rejecting the container creation due to sha256 layer mismatch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.7
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.z
Assignee: Oleg Bulatov
QA Contact: XiuJuan Wang
URL:
Whiteboard: UpdateRecommendationsBlocked
Depends On:
Blocks:
 
Reported: 2021-08-31 11:42 UTC by Pamela Escorza
Modified: 2022-05-11 19:02 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-15 09:16:49 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:3422 0 None None None 2021-09-15 09:17:06 UTC

Description Pamela Escorza 2021-08-31 11:42:59 UTC
Description of problem:
After upgrading to 4.7.24, new builds no longer work due to a mismatch in the image layers.

The same image works when pulled from an external repository, but not from the internal registry.

Version-Release number of selected component (if applicable):
OCP 4.7.24, bare metal on VMware

How reproducible:
Try to build an image from the internal registry.

Steps to Reproduce:
1. podman pull image-registry.openshift-image-registry.svc:5000/project/image

Actual results:
Error: Error writing blob: error storing blob to file "/var/tmp/storage059258202/1": error happened during read: Digest did not match, expected sha256:2a096c51921790689d99fd69901a9572ea89cdabb80f97fa005e748091b02afa, got sha256:b5a8d656896dbad35ef0e618cd323c8de2cf4190f82febda9ed6f8eb3c0d0ae6

Expected results:
The image is pulled successfully and the pod is created.

Additional info:
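For reference, a rough way to confirm the mismatch directly against the internal registry (a sketch only; the project/image path is the placeholder from the reproduce step, and it assumes the caller's token has pull rights on that project):

  TOKEN=$(oc whoami -t)
  curl -sk -H "Authorization: Bearer $TOKEN" \
    https://image-registry.openshift-image-registry.svc:5000/v2/project/image/blobs/sha256:2a096c51921790689d99fd69901a9572ea89cdabb80f97fa005e748091b02afa \
    | sha256sum
  # If the printed digest differs from the digest in the URL, the blob stored in the
  # registry backend is the corrupted copy, not the client-side download.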

Comment 8 Pamela Escorza 2021-09-01 12:39:50 UTC
Hi @jsafrane, thanks for your help; let me confirm your question with the customer.

Comment 13 Jack Ottofaro 2021-09-01 14:51:38 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 14 W. Trevor King 2021-09-01 21:07:44 UTC
Setting NEEDINFO for the impact statement in comment 13.

Comment 18 Jan Safranek 2021-09-02 14:19:25 UTC
> Who is impacted?  

Not only the registry on CephFS is affected; basically any data on CephFS may be corrupted. It could be a harmless log file, but it could also be a critical database.

> How involved is remediation

For randomly corrupted data, restore it from backup. In addition, the cluster does not report any error, so users may find out quite late that their data is corrupted (and the corrupted data may even have been backed up already).

Comment 19 W. Trevor King 2021-09-02 16:48:41 UTC
(In reply to Oleg Bulatov from comment #17)
> Who is impacted?  If we have to block upgrade edges based on this issue,
> which edges would need blocking?
> 
>   Customers, who use 4.7.24 and use PV with cephfs for the image registry.

Is this 4.7.24-only?  4.7.28 is not vulnerable?  No vulnerable 4.8 releases?

Comment 20 Oleg Bulatov 2021-09-02 16:53:14 UTC
AFAIK 4.7.24+ and 4.8.0+ are vulnerable.

Comment 23 Scott Dodson 2021-09-02 19:47:34 UTC
Edits on top of comment 17 from Oleg, since I happen to know this affects all of 4.8 as well.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?

  Customers who use 4.7.24+ or 4.8.2+ and use a PV with CephFS for the image registry. At the time of writing, this is not fixed in a later 4.7.z or 4.8.z.

What is the impact?  Is it serious enough to warrant blocking edges?

  The registry storage irreversibly corrupts container images. Corrupted layers cannot be pulled or re-pushed; manual intervention is required.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  Admin must rsh into the registry container and delete the corrupted blobs and layer links (see the sketch below).
  Corrupted images can only be re-pushed/re-built.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

  Yes, the regression was introduced in 4.7.24 and 4.8.2. At the time of writing, this is not fixed in a later 4.7.z or 4.8.z.
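
A rough sketch of the manual cleanup mentioned above, assuming the default docker/distribution layout under the registry PV mount at /registry; the digest, namespace, and image names are placeholders:

  oc -n openshift-image-registry rsh deployment/image-registry
  # inside the registry pod: remove the corrupted blob data ...
  rm /registry/docker/registry/v2/blobs/sha256/<first-two-hex-chars>/<digest>/data
  # ... and the layer link pointing at it in the affected repository
  rm /registry/docker/registry/v2/repositories/<namespace>/<image>/_layers/sha256/<digest>/link
  # afterwards, re-push or re-build the affected image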

Comment 30 Luke Meyer 2021-09-09 20:37:49 UTC
This is believed to be fixed with kernel-4.18.0-305.19.1.el8_4 in 4.7.30.
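
One way to confirm after updating that the nodes are running the fixed kernel (cluster-admin access assumed):

  oc get nodes -o wide
  # the KERNEL-VERSION column should show 4.18.0-305.19.1.el8_4 or later on every node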

Comment 31 Luke Meyer 2021-09-09 20:41:24 UTC
The same kernel is in use in 4.8.30 and the latest 4.9 nightlies; it would be wise to test there as well, but I'll direct this bug at 4.7 (not sure a clone is needed for the other releases).

Comment 34 Luke Meyer 2021-09-09 20:45:09 UTC
(In reply to Luke Meyer from comment #31)
> The same kernel is in use in 4.8.30

:facepalm: I meant 4.8.11

Comment 37 errata-xmlrpc 2021-09-15 09:16:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.30 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3422

Comment 38 W. Trevor King 2021-09-15 16:13:33 UTC
Catching up here: two weeks ago we blocked edges into 4.7.29 and 4.8.10 (on top of some impacted edges that had already been blocked for other reasons) in [1,2], based on the impact statement from comment 23.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/1033
[2]: https://github.com/openshift/cincinnati-graph-data/pull/1034

