Description of problem:
Recent 4.7 payload installations always fail at:

  level=fatal msg=failed to initialize the cluster: Cluster operator image-registry is still updating

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-11-03-002310

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.7 env.

Actual results:
1. Installation fails as above. Checked "oc get node", "oc get pod -A", and "oc get co"; all are well except image-registry:

$ oc get co image-registry
image-registry   False   True   False   56m

$ oc get po -n openshift-image-registry
NAME                                              READY   STATUS             RESTARTS   AGE
cluster-image-registry-operator-74c6ff47f-fm4gx   1/1     Running            1          61m
image-registry-7b456759cb-kqqct                   0/1     CrashLoopBackOff   12         44m
image-registry-7b456759cb-twrhd                   0/1     CrashLoopBackOff   12         44m
image-registry-7c46b94c59-dr4wj                   0/1     CrashLoopBackOff   12         44m
...

$ oc describe po image-registry-7c46b94c59-dr4wj -n openshift-image-registry
...
  Normal   Scheduled        51m                  default-scheduler  Successfully assigned openshift-image-registry/image-registry-7c46b94c59-dr4wj to ip-10-0-212-18.ap-northeast-2.compute.internal
  Normal   AddedInterface   51m                  multus             Add eth0 [10.131.0.13/23]
  Normal   Pulled           49m (x5 over 51m)    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c54f78026566c7fe18df411ee0d9b230c1ff8f2c696e52882909951a7d9efca2" already present on machine
  Normal   Created          49m (x5 over 51m)    kubelet            Created container registry
  Normal   Started          49m (x5 over 51m)    kubelet            Started container registry
  Warning  BackOff          90s (x236 over 51m)  kubelet            Back-off restarting failed container

$ oc logs image-registry-7c46b94c59-dr4wj -n openshift-image-registry
# nothing returns

Expected results:
1. Installation succeeds.

Additional info:
CI jobs hit the same "failed to initialize the cluster: Cluster operator image-registry is still updating" error:

4.7.0-0.nightly-2020-11-03-002310: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1323475127433695232
4.7.0-0.nightly-2020-11-03-040426: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.7/1323476719478247424

This bug appears to be why no nightly payloads have been accepted recently. Adding TestBlocker to Keywords.
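Since "oc logs" on the crashing pod returned nothing, the previous (crashed) container instance may still hold the output. A minimal sketch of what one would try on a live cluster; oc is stubbed here so the snippet runs standalone, so drop the stub function when running against the real cluster:

```shell
# Stub oc so this sketch runs without a cluster; remove this line on a real cluster.
oc() { echo "stub: oc $*"; }

# --previous asks the kubelet for the log of the prior (crashed) container instance,
# which often captures a startup error that the current (not yet started) instance lacks.
oc logs image-registry-7c46b94c59-dr4wj -n openshift-image-registry --previous
```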
# oc -n openshift-image-registry get pod image-registry-6fccd7bf5f-5g9l2 -oyaml
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://326af805108456268c97896caa847e596f220bb9c6565ddfa35f6f2783ad6d31
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c54f78026566c7fe18df411ee0d9b230c1ff8f2c696e52882909951a7d9efca2
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c54f78026566c7fe18df411ee0d9b230c1ff8f2c696e52882909951a7d9efca2
    lastState:
      terminated:
        containerID: cri-o://326af805108456268c97896caa847e596f220bb9c6565ddfa35f6f2783ad6d31
        exitCode: 1
        finishedAt: "2020-11-04T08:17:17Z"
        reason: Error
        startedAt: "2020-11-04T08:17:17Z"
    name: registry
    ready: false
    restartCount: 16
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=registry pod=image-registry-6fccd7bf5f-5g9l2_openshift-image-registry(6a2040ca-cdab-4658-a386-9bb5c16525af)
        reason: CrashLoopBackOff
  hostIP: 10.0.32.3
  phase: Running
  podIP: 10.128.2.8
  podIPs:
  - ip: 10.128.2.8
  qosClass: Burstable
  startTime: "2020-11-04T07:20:07Z"
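Rather than scanning the full YAML dump above, the crash details can be pulled out directly with a jsonpath query. A hedged sketch, using a trimmed sample of the status from this report so it runs offline; the commented oc command is what one would run on the live cluster:

```shell
# Trimmed sample of .status from the pod YAML above (not the full object).
status_json='{"containerStatuses":[{"name":"registry","restartCount":16,"lastState":{"terminated":{"exitCode":1,"reason":"Error"}}}]}'

# Live-cluster equivalent (jsonpath is a standard oc/kubectl output format):
#   oc -n openshift-image-registry get pod image-registry-6fccd7bf5f-5g9l2 \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

echo "$status_json" | grep -o '"exitCode":[0-9]*'   # prints: "exitCode":1
```

An exit code of 1 with no log output is what makes this crash loop hard to triage from the pod alone.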
Continuing from Comment 2: there are 3 image-registry pods, but the deployment only requires 2 replicas, and image-registry-7bc845d666-fbvhc and image-registry-6fccd7bf5f-5g9l2 are on the same node, "zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal":

# oc -n openshift-image-registry get pod -o wide | grep -Ev "Running|Completed"
NAME                              READY   STATUS             RESTARTS   AGE   IP            NODE                                                       NOMINATED NODE   READINESS GATES
image-registry-6fccd7bf5f-5g9l2   0/1     CrashLoopBackOff   20         78m   10.128.2.8    zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal   <none>           <none>
image-registry-7bc845d666-5stcl   0/1     CrashLoopBackOff   20         78m   10.131.0.17   zhsun114gcp-tk9lc-worker-b-n9l4h.c.openshift-qe.internal   <none>           <none>
image-registry-7bc845d666-fbvhc   0/1     CrashLoopBackOff   20         78m   10.128.2.7    zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal   <none>           <none>

# oc -n openshift-image-registry get deploy image-registry
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
image-registry   0/2     1            0           81m

The replicas counts in the spec and status sections conflict:

# oc -n openshift-image-registry get deploy image-registry -oyaml
...
spec:
  progressDeadlineSeconds: 600
  replicas: 2
...
status:
  conditions:
  - lastTransitionTime: "2020-11-04T07:20:06Z"
    lastUpdateTime: "2020-11-04T07:20:06Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2020-11-04T07:30:08Z"
    lastUpdateTime: "2020-11-04T07:30:08Z"
    message: ReplicaSet "image-registry-6fccd7bf5f" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 2
  replicas: 3
  unavailableReplicas: 3
  updatedReplicas: 1
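For context on the spec/status mismatch above: a Deployment's status.replicas counts pods across all of its ReplicaSets, old and new, so during a stuck rolling update it can legitimately exceed spec.replicas. A minimal sketch of the arithmetic, with pod counts taken from the output in this comment:

```shell
# status.replicas sums pods owned by every ReplicaSet of the deployment.
# Counts from the report: 2 pods on the old RS (image-registry-7bc845d666),
# 1 pod on the new RS (image-registry-6fccd7bf5f).
old_rs_pods=2
new_rs_pods=1
spec_replicas=2

total=$((old_rs_pods + new_rs_pods))
echo "spec.replicas=$spec_replicas status.replicas=$total"
# prints: spec.replicas=2 status.replicas=3
# With none of the 3 pods ready, unavailableReplicas is also 3, and the rollout
# eventually hits progressDeadlineSeconds (ProgressDeadlineExceeded).
```

So the "conflict" is expected bookkeeping for a rollout in flight; the real defect is that the registry container itself crashes on every node.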
Created attachment 1726494 [details] image-registry deployment file
Verified in 4.7.0-0.nightly-2020-11-05-010603
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
*** Bug 1936006 has been marked as a duplicate of this bug. ***