Bug 1893956

Summary: Installation always fails at "failed to initialize the cluster: Cluster operator image-registry is still updating"
Product: OpenShift Container Platform
Reporter: Xingxing Xia <xxia>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: Wenjing Zheng <wzheng>
Severity: urgent
Priority: urgent
Version: 4.7
CC: aos-bugs, juzhao, lisowski
Target Milestone: ---
Keywords: TestBlocker
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2021-02-24 15:29:22 UTC
Type: Bug
Bug Blocks: 1936984, 1940877
Attachments: image-registry deployment file (no flags)

Description Xingxing Xia 2020-11-03 08:09:00 UTC
Description of problem:
Installation of recent 4.7 payload environments always fails with:
level=fatal msg=failed to initialize the cluster: Cluster operator image-registry is still updating

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-11-03-002310

How reproducible:
Always

Steps to Reproduce:
1. Install 4.7 env

Actual results:
1. Installation fails as above. Checked oc get node, oc get pod -A, and oc get co; everything is healthy except image-registry:

$ oc get co image-registry
image-registry             False       True          False      56m
$ oc get po -n openshift-image-registry
NAME                                              READY   STATUS             RESTARTS   AGE
cluster-image-registry-operator-74c6ff47f-fm4gx   1/1     Running            1          61m
image-registry-7b456759cb-kqqct                   0/1     CrashLoopBackOff   12         44m
image-registry-7b456759cb-twrhd                   0/1     CrashLoopBackOff   12         44m
image-registry-7c46b94c59-dr4wj                   0/1     CrashLoopBackOff   12         44m
...
$ oc describe po image-registry-7c46b94c59-dr4wj -n openshift-image-registry
...
  Normal   Scheduled         51m                  default-scheduler  Successfully assigned openshift-image-registry/image-registry-7c46b94c59-dr4wj to ip-10-0-212-18.ap-northeast-2.compute.internal
  Normal   AddedInterface    51m                  multus             Add eth0 [10.131.0.13/23]
  Normal   Pulled            49m (x5 over 51m)    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c54f78026566c7fe18df411ee0d9b230c1ff8f2c696e52882909951a7d9efca2" already present on machine
  Normal   Created           49m (x5 over 51m)    kubelet            Created container registry
  Normal   Started           49m (x5 over 51m)    kubelet            Started container registry
  Warning  BackOff           90s (x236 over 51m)  kubelet            Back-off restarting failed container
$ oc logs image-registry-7c46b94c59-dr4wj -n openshift-image-registry # returns nothing

Expected results:
1. Installation succeeds.

Additional info:
Checked CI jobs; they hit the same "failed to initialize the cluster: Cluster operator image-registry is still updating" error:
4.7.0-0.nightly-2020-11-03-002310: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1323475127433695232
4.7.0-0.nightly-2020-11-03-040426: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.7/1323476719478247424

This bug appears to be the reason no nightly payloads have been accepted recently. Adding TestBlocker to Keywords.

Comment 1 Junqi Zhao 2020-11-04 08:25:32 UTC
# oc -n openshift-image-registry get pod image-registry-6fccd7bf5f-5g9l2 -oyaml
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://326af805108456268c97896caa847e596f220bb9c6565ddfa35f6f2783ad6d31
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c54f78026566c7fe18df411ee0d9b230c1ff8f2c696e52882909951a7d9efca2
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c54f78026566c7fe18df411ee0d9b230c1ff8f2c696e52882909951a7d9efca2
    lastState:
      terminated:
        containerID: cri-o://326af805108456268c97896caa847e596f220bb9c6565ddfa35f6f2783ad6d31
        exitCode: 1
        finishedAt: "2020-11-04T08:17:17Z"
        reason: Error
        startedAt: "2020-11-04T08:17:17Z"
    name: registry
    ready: false
    restartCount: 16
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=registry pod=image-registry-6fccd7bf5f-5g9l2_openshift-image-registry(6a2040ca-cdab-4658-a386-9bb5c16525af)
        reason: CrashLoopBackOff
  hostIP: 10.0.32.3
  phase: Running
  podIP: 10.128.2.8
  podIPs:
  - ip: 10.128.2.8
  qosClass: Burstable
  startTime: "2020-11-04T07:20:07Z"
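The status above can be summarized programmatically. A minimal sketch (assuming a pod status dict shaped like the YAML above, e.g. parsed from `oc get pod -o json`; the helper name is hypothetical):

```python
# Summarize why a pod is unready, given a status dict shaped like the
# YAML above (hypothetical helper, not part of any tooling in this bug).
def explain_unready(status):
    reasons = []
    for cs in status.get("containerStatuses", []):
        if cs.get("ready"):
            continue
        waiting = cs.get("state", {}).get("waiting", {})
        last = cs.get("lastState", {}).get("terminated", {})
        reasons.append(
            f"container {cs['name']}: {waiting.get('reason', 'unknown')}, "
            f"restarts={cs.get('restartCount', 0)}, "
            f"last exit code={last.get('exitCode')}"
        )
    return reasons

# Values taken from the pod status in this comment.
status = {
    "containerStatuses": [{
        "name": "registry",
        "ready": False,
        "restartCount": 16,
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"exitCode": 1, "reason": "Error"}},
    }]
}
print(explain_unready(status))
# → ['container registry: CrashLoopBackOff, restarts=16, last exit code=1']
```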

Comment 2 Junqi Zhao 2020-11-04 08:46:59 UTC
Continuing from Comment 1:
There are 3 image-registry pods, although the deployment only requires 2 replicas, and image-registry-7bc845d666-fbvhc and image-registry-6fccd7bf5f-5g9l2 are on the same node, "zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal":
# oc -n openshift-image-registry get pod -o wide | grep -Ev "Running|Completed"
NAME                                               READY   STATUS             RESTARTS   AGE   IP            NODE                                                       NOMINATED NODE   READINESS GATES
image-registry-6fccd7bf5f-5g9l2                    0/1     CrashLoopBackOff   20         78m   10.128.2.8    zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal   <none>           <none>
image-registry-7bc845d666-5stcl                    0/1     CrashLoopBackOff   20         78m   10.131.0.17   zhsun114gcp-tk9lc-worker-b-n9l4h.c.openshift-qe.internal   <none>           <none>
image-registry-7bc845d666-fbvhc                    0/1     CrashLoopBackOff   20         78m   10.128.2.7    zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal   <none>           <none>
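The co-scheduling is easy to spot by grouping pod names by node; a small sketch using the (pod, node) pairs from the output above:

```python
from collections import defaultdict

# (pod, node) pairs copied from the `oc get pod -o wide` output above.
pods = [
    ("image-registry-6fccd7bf5f-5g9l2", "zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal"),
    ("image-registry-7bc845d666-5stcl", "zhsun114gcp-tk9lc-worker-b-n9l4h.c.openshift-qe.internal"),
    ("image-registry-7bc845d666-fbvhc", "zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal"),
]

by_node = defaultdict(list)
for pod, node in pods:
    by_node[node].append(pod)

# Nodes running more than one registry replica; anti-affinity would
# normally be expected to spread the replicas across nodes.
colocated = {n: ps for n, ps in by_node.items() if len(ps) > 1}
print(colocated)
```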

# oc -n openshift-image-registry get deploy image-registry
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
image-registry   0/2     1            0           81m


The replica counts in the spec and status sections conflict:
# oc -n openshift-image-registry get deploy image-registry -oyaml
...
spec:
  progressDeadlineSeconds: 600
  replicas: 2
...
status:
  conditions:
  - lastTransitionTime: "2020-11-04T07:20:06Z"
    lastUpdateTime: "2020-11-04T07:20:06Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2020-11-04T07:30:08Z"
    lastUpdateTime: "2020-11-04T07:30:08Z"
    message: ReplicaSet "image-registry-6fccd7bf5f" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 2
  replicas: 3
  unavailableReplicas: 3
  updatedReplicas: 1
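The counts above fit a stuck rolling update: status.replicas counts pods from both the old and the new ReplicaSet (3 = 2 old + 1 updated), while spec.replicas only describes the desired state (2). A minimal sketch (assuming a deployment spec/status shaped like the YAML above; the helper name is hypothetical) that flags this situation:

```python
# Flag a deployment whose rollout is stuck, given spec/status dicts
# shaped like the YAML above (hypothetical helper for illustration).
def rollout_stuck(spec, status):
    desired = spec.get("replicas", 1)
    # status.replicas counts pods from both the old and the new
    # ReplicaSet during a rolling update, so it can exceed spec.replicas.
    old = status.get("replicas", 0) - status.get("updatedReplicas", 0)
    progressing = next(
        (c for c in status.get("conditions", []) if c["type"] == "Progressing"),
        {},
    )
    return {
        "desired": desired,
        "old_replicaset_pods": old,
        "unavailable": status.get("unavailableReplicas", 0),
        "stuck": progressing.get("reason") == "ProgressDeadlineExceeded",
    }

# Values taken from the deployment spec/status in this comment.
spec = {"replicas": 2}
status = {
    "replicas": 3,
    "unavailableReplicas": 3,
    "updatedReplicas": 1,
    "conditions": [
        {"type": "Available", "status": "False", "reason": "MinimumReplicasUnavailable"},
        {"type": "Progressing", "status": "False", "reason": "ProgressDeadlineExceeded"},
    ],
}
print(rollout_stuck(spec, status))
# → {'desired': 2, 'old_replicaset_pods': 2, 'unavailable': 3, 'stuck': True}
```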

Comment 3 Junqi Zhao 2020-11-04 08:52:13 UTC
Created attachment 1726494 [details]
image-registry deployment file

Comment 5 Xingxing Xia 2020-11-05 03:19:36 UTC
Verified in 4.7.0-0.nightly-2020-11-05-010603

Comment 8 errata-xmlrpc 2021-02-24 15:29:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 9 Oleg Bulatov 2021-03-08 12:27:28 UTC
*** Bug 1936006 has been marked as a duplicate of this bug. ***