Bug 1949040 - image-registry operator is Degraded when upgrade from 4.6.24 to 4.6.0-0.nightly-2021-04-09-145812
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.6.z
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
Duplicates: 1949086
Depends On: 1897520
Reported: 2021-04-13 10:08 UTC by Wenjing Zheng
Modified: 2022-10-12 03:25 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-04-20 19:27:50 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github openshift cluster-image-registry-operator pull 678 0 None open [release-4.6] Bug 1949040: Use mkdir -p to create ca-trust directories 2021-04-13 12:11:32 UTC
Red Hat Product Errata RHBA-2021:1153 0 None None None 2021-04-20 19:27:58 UTC

Description Wenjing Zheng 2021-04-13 10:08:41 UTC
Description of problem:
The image registry is degraded after upgrading, with the following errors in the pod logs:
$ oc logs pods/image-registry-6f4c7b9569-mgtgs
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/edk2': File exists
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/java': File exists
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/openssl': File exists
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/pem': File exists
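The errors above come from plain `mkdir`, which fails when the target directory already exists; the linked PR ("Use mkdir -p to create ca-trust directories") switches to `mkdir -p`, which is idempotent. A minimal sketch of the difference, using a stand-in path (`/tmp/catrust-demo` is hypothetical, not the operator's actual entrypoint or directory):

```shell
#!/bin/sh
# /tmp/catrust-demo is a stand-in for /etc/pki/ca-trust/extracted.
dir=/tmp/catrust-demo/pem

# First creation: -p also creates the missing parent directories.
mkdir -p "$dir"

# On a restart the directory is already there: plain mkdir errors out,
# which is what sends the registry container into CrashLoopBackOff.
mkdir "$dir" 2>/dev/null \
  && echo "plain mkdir: ok" \
  || echo "plain mkdir: fails when the directory already exists"

# mkdir -p treats an existing directory as success, so restarts are safe.
mkdir -p "$dir" \
  && echo "mkdir -p: succeeds even though the directory exists"
```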

$ oc get co | grep image-registry
image-registry                             4.6.0-0.nightly-2021-04-09-145812   False       True          False      55m
$ oc get pods
NAME                                               READY   STATUS             RESTARTS   AGE
cluster-image-registry-operator-65968cd5c9-n4cph   1/1     Running            1          57m
image-registry-6f4c7b9569-jphzm                    0/1     CrashLoopBackOff   15         59m
image-registry-6f4c7b9569-mgtgs                    0/1     CrashLoopBackOff   15         59m
node-ca-228kt                                      1/1     Running            0          79m
node-ca-4r4sv                                      1/1     Running            0          79m
node-ca-785bv                                      1/1     Running            0          80m
node-ca-hbj6x                                      1/1     Running            0          80m
node-ca-scnrg                                      1/1     Running            0          79m

Version-Release number of selected component (if applicable):
4.6.24 to 4.6.0-0.nightly-2021-04-09-145812

How reproducible:

Steps to Reproduce:
1. Upgrade from 4.6.24 to 4.6.0-0.nightly-2021-04-09-145812

Actual results:
The image registry is degraded; the image-registry pods are in CrashLoopBackOff.

Expected results:
The image registry should be available after the upgrade.

Additional info:

Comment 2 Oleg Bulatov 2021-04-13 12:09:07 UTC
*** Bug 1949086 has been marked as a duplicate of this bug. ***

Comment 3 W. Trevor King 2021-04-13 15:15:50 UTC
I'm a bit confused.  This bug is now a child of bug 1897520, and is backporting a fix that landed in 4.7 in November.  How is it only impacting 4.6 now?  Has this been an issue with all 4.6->4.6 updates, and we only noticed now?  Or is this a corner case that only impacts some fraction of 4.6->4.6 updates?  Or...?

Comment 4 Oleg Bulatov 2021-04-13 17:00:33 UTC
Who is impacted?

Anyone running 4.6.24 or a later 4.6 release without the fix, if the registry processes crash or restart for any reason after the pod is created.

What is the impact?

The registry does not survive restarts: once the process is restarted, it enters a crash loop. Manual intervention is needed.

How involved is remediation?

Deleting the image-registry pods should bring the registry back to a normal state. Updating to a fixed release will also recover the registry.

Is this a regression?

Yes, we regressed in 4.6.24 while fixing bug 1936984.

Comment 9 W. Trevor King 2021-04-14 04:12:46 UTC
I am not clear on why QE has been able to consistently reproduce this, since comment 4 claims that some kind of initial crash inside the pod is needed to reach the broken state.  And [1] shows the cluster-bot update from 4.6.24 to 4.6.0-0.nightly-2021-04-09-145812 that I launched today, which succeeded without hitting this issue [2].  I don't know what could be different between QE's updates and the cluster-bot update run...

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.6.24#upgrades-to
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1382007608419815424

Comment 10 Wenjing Zheng 2021-04-14 05:28:57 UTC
This bug can also be reproduced intermittently when upgrading from 4.5.37-x86_64 to 4.6.24-x86_64: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/13262/console

Comment 11 Wenjing Zheng 2021-04-15 12:18:09 UTC
Verified with several upgrade paths from/to 4.6.0-0.nightly-2021-04-14-161003.

Comment 12 Wenjing Zheng 2021-04-15 13:20:52 UTC
(In reply to Wenjing Zheng from comment #10)
> This bug can also be reproduced sometime from 4.5.37-x86_64 to
> 4.6.24-x86_64:
> https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/
> upgrade_CI/13262/console

Sorry, should be this job: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/13164/consoleFull

Comment 14 errata-xmlrpc 2021-04-20 19:27:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.25 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 15 W. Trevor King 2021-05-04 20:02:26 UTC
Removing UpgradeBlocker, because I don't think we blocked any update recommendations based on this bug.
