Bug 1949040

Summary: image-registry operator is Degraded when upgrading from 4.6.24 to 4.6.0-0.nightly-2021-04-09-145812
Product: OpenShift Container Platform
Reporter: Wenjing Zheng <wzheng>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: Wenjing Zheng <wzheng>
Severity: high
Priority: unspecified
Version: 4.6.z
CC: aivaras.laimikis, aos-bugs, jima, mfuruta, obulatov, openshift-bugzilla-robot, rsandu, wduan, wewang, wking, xiuwang
Target Milestone: ---
Keywords: Regression
Target Release: 4.6.z
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-04-20 19:27:50 UTC
Bug Depends On: 1897520

Description Wenjing Zheng 2021-04-13 10:08:41 UTC
Description of problem:
The image registry is degraded after upgrading, with the following errors in the pod:
$ oc logs pods/image-registry-6f4c7b9569-mgtgs
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/edk2': File exists
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/java': File exists
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/openssl': File exists
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/pem': File exists
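The "File exists" failures above are the classic symptom of a non-idempotent `mkdir` in a startup script: the directories are created on the first start, so any restart that re-runs the same script fails. A minimal sketch of the failure mode (the paths below are a throwaway demo, not the actual registry entrypoint):

```shell
#!/bin/sh
# Demo of the failure mode: plain `mkdir` fails if the directory already
# exists, so a container restart that re-runs the entrypoint crash-loops.
dir=/tmp/ca-trust-demo/extracted/pem

mkdir -p "$(dirname "$dir")"
mkdir "$dir"                            # first start: succeeds
mkdir "$dir" 2>&1 | grep 'File exists'  # "restart": fails with File exists

# Idempotent variant: `mkdir -p` succeeds whether or not the path exists,
# so re-running the script is safe.
mkdir -p "$dir" && echo "safe on restart"

rm -r /tmp/ca-trust-demo
```

This is only an illustration of why the pod crash-loops on restart; the actual fix backported here is the one from bug 1897520.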

$ oc get co | grep image-registry
image-registry                             4.6.0-0.nightly-2021-04-09-145812   False       True          False      55m
$ oc get pods
NAME                                               READY   STATUS             RESTARTS   AGE
cluster-image-registry-operator-65968cd5c9-n4cph   1/1     Running            1          57m
image-registry-6f4c7b9569-jphzm                    0/1     CrashLoopBackOff   15         59m
image-registry-6f4c7b9569-mgtgs                    0/1     CrashLoopBackOff   15         59m
node-ca-228kt                                      1/1     Running            0          79m
node-ca-4r4sv                                      1/1     Running            0          79m
node-ca-785bv                                      1/1     Running            0          80m
node-ca-hbj6x                                      1/1     Running            0          80m
node-ca-scnrg                                      1/1     Running            0          79m



Version-Release number of selected component (if applicable):
4.6.24 to 4.6.0-0.nightly-2021-04-09-145812

How reproducible:
always

Steps to Reproduce:
1. Upgrade from 4.6.24 to 4.6.0-0.nightly-2021-04-09-145812

Actual results:
The image registry is degraded.

Expected results:
The image registry should be available after the upgrade.

Additional info:

Comment 2 Oleg Bulatov 2021-04-13 12:09:07 UTC
*** Bug 1949086 has been marked as a duplicate of this bug. ***

Comment 3 W. Trevor King 2021-04-13 15:15:50 UTC
I'm a bit confused.  This bug is now a child of bug 1897520, and is backporting a fix that landed in 4.7 in November.  How is it only impacting 4.6 now?  Has this been an issue with all 4.6->4.6 updates, and we only noticed now?  Or is this a corner case that only impacts some fraction of 4.6->4.6 updates?  Or...?

Comment 4 Oleg Bulatov 2021-04-13 17:00:33 UTC
Who is impacted?

Anyone running 4.6.24 or a later 4.6 release without the fix, if the registry process crashes or restarts for any reason after the pod is created.

What is the impact?

The registry does not survive restarts: once the process is restarted, it enters a crash loop. Manual intervention is needed.

How involved is remediation?

Deleting image-registry pods should bring the registry back to the normal state. Updating to a fixed release will also recover the registry.
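The manual workaround above can be sketched as follows. This is a hedged sketch: the `openshift-image-registry` namespace and the `docker-registry=default` label selector are assumptions based on typical OpenShift image-registry deployments, not details taken from this bug, so verify them on your cluster first.

```shell
# Workaround sketch (assumed namespace and selector; verify before running):
# delete the crash-looping registry pods so the deployment recreates them.
oc -n openshift-image-registry delete pods -l docker-registry=default

# Then confirm the image-registry cluster operator recovers.
oc get co image-registry
```

Updating to a release that contains the fix avoids the need for this step entirely.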

Is this a regression?

Yes, we regressed in 4.6.24 while fixing bug 1936984.

Comment 9 W. Trevor King 2021-04-14 04:12:46 UTC
I am not clear on why QE has been able to consistently reproduce this, since comment 4 claims the need for some kind of initial crash inside the pod to get to the broken state.  And [1] shows the cluster-bot update from 4.6.24 to 4.6.0-0.nightly-2021-04-09-145812 that I launched today, which succeeded without hitting this issue [2].  I dunno what could be different between QE's updates and the cluster-bot update run...

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.6.24#upgrades-to
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1382007608419815424

Comment 10 Wenjing Zheng 2021-04-14 05:28:57 UTC
This bug can also sometimes be reproduced when upgrading from 4.5.37-x86_64 to 4.6.24-x86_64: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/13262/console

Comment 11 Wenjing Zheng 2021-04-15 12:18:09 UTC
Verified with several upgrade paths from/to 4.6.0-0.nightly-2021-04-14-161003:
https://docs.google.com/spreadsheets/d/1T-tmF1tjNmuNTgMvve9ZkeiUvFLXl1Y3-t55Kfj8egQ/edit#gid=0

Comment 12 Wenjing Zheng 2021-04-15 13:20:52 UTC
(In reply to Wenjing Zheng from comment #10)
> This bug can also be reproduced sometime from 4.5.37-x86_64 to
> 4.6.24-x86_64:
> https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/
> upgrade_CI/13262/console

Sorry, should be this job: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/13164/consoleFull

Comment 14 errata-xmlrpc 2021-04-20 19:27:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.25 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1153

Comment 15 W. Trevor King 2021-05-04 20:02:26 UTC
Removing UpgradeBlocker, because I don't think we blocked any update recommendations based on this bug.