Description of problem:
Recent 4.7 payload installations always fail at:

  level=fatal msg=failed to initialize the cluster: Cluster operator image-registry is still updating

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-11-03-002310

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.7 env.

Actual results:
1. Installation fails as above. Checked "oc get node", "oc get pod -A", and "oc get co"; all are well except image-registry:

$ oc get co image-registry
image-registry   False   True   False   56m

$ oc get po -n openshift-image-registry
NAME                                              READY   STATUS             RESTARTS   AGE
cluster-image-registry-operator-74c6ff47f-fm4gx   1/1     Running            1          61m
image-registry-7b456759cb-kqqct                   0/1     CrashLoopBackOff   12         44m
image-registry-7b456759cb-twrhd                   0/1     CrashLoopBackOff   12         44m
image-registry-7c46b94c59-dr4wj                   0/1     CrashLoopBackOff   12         44m
...

$ oc describe po image-registry-7c46b94c59-dr4wj -n openshift-image-registry
...
  Normal   Scheduled        51m                  default-scheduler  Successfully assigned openshift-image-registry/image-registry-7c46b94c59-dr4wj to ip-10-0-212-18.ap-northeast-2.compute.internal
  Normal   AddedInterface   51m                  multus             Add eth0 [10.131.0.13/23]
  Normal   Pulled           49m (x5 over 51m)    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c54f78026566c7fe18df411ee0d9b230c1ff8f2c696e52882909951a7d9efca2" already present on machine
  Normal   Created          49m (x5 over 51m)    kubelet            Created container registry
  Normal   Started          49m (x5 over 51m)    kubelet            Started container registry
  Warning  BackOff          90s (x236 over 51m)  kubelet            Back-off restarting failed container

$ oc logs image-registry-7c46b94c59-dr4wj -n openshift-image-registry
# nothing returns

Expected results:
1. Installation succeeds.

Additional info:
CI jobs hit the same "failed to initialize the cluster: Cluster operator image-registry is still updating" error:

4.7.0-0.nightly-2020-11-03-002310: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1323475127433695232
4.7.0-0.nightly-2020-11-03-040426: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.7/1323476719478247424

This bug appears to be why no nightly payloads have been accepted recently. Adding TestBlocker to Keywords.
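Since "oc logs" on the crashing pod returned nothing, the previous (crashed) container instance may still hold the output. A minimal sketch of what one would try on a live cluster; oc is stubbed here so the snippet runs standalone, so drop the stub function when running against the real cluster:

```shell
# Stub oc so this sketch runs without a cluster; remove this line on a real cluster.
oc() { echo "stub: oc $*"; }

# --previous asks the kubelet for the log of the prior (crashed) container instance,
# which often captures a startup error that the current (not yet started) instance lacks.
oc logs image-registry-7c46b94c59-dr4wj -n openshift-image-registry --previous
```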
# oc -n openshift-image-registry get pod image-registry-6fccd7bf5f-5g9l2 -oyaml
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-11-04T07:20:07Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://326af805108456268c97896caa847e596f220bb9c6565ddfa35f6f2783ad6d31
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c54f78026566c7fe18df411ee0d9b230c1ff8f2c696e52882909951a7d9efca2
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c54f78026566c7fe18df411ee0d9b230c1ff8f2c696e52882909951a7d9efca2
    lastState:
      terminated:
        containerID: cri-o://326af805108456268c97896caa847e596f220bb9c6565ddfa35f6f2783ad6d31
        exitCode: 1
        finishedAt: "2020-11-04T08:17:17Z"
        reason: Error
        startedAt: "2020-11-04T08:17:17Z"
    name: registry
    ready: false
    restartCount: 16
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=registry pod=image-registry-6fccd7bf5f-5g9l2_openshift-image-registry(6a2040ca-cdab-4658-a386-9bb5c16525af)
        reason: CrashLoopBackOff
  hostIP: 10.0.32.3
  phase: Running
  podIP: 10.128.2.8
  podIPs:
  - ip: 10.128.2.8
  qosClass: Burstable
  startTime: "2020-11-04T07:20:07Z"
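Rather than scanning the full YAML dump above, the crash details can be pulled out directly with a jsonpath query. A hedged sketch, using a trimmed sample of the status from this report so it runs offline; the commented oc command is what one would run on the live cluster:

```shell
# Trimmed sample of .status from the pod YAML above (not the full object).
status_json='{"containerStatuses":[{"name":"registry","restartCount":16,"lastState":{"terminated":{"exitCode":1,"reason":"Error"}}}]}'

# Live-cluster equivalent (jsonpath is a standard oc/kubectl output format):
#   oc -n openshift-image-registry get pod image-registry-6fccd7bf5f-5g9l2 \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

echo "$status_json" | grep -o '"exitCode":[0-9]*'   # prints: "exitCode":1
```

An exit code of 1 with no log output is what makes this crash loop hard to triage from the pod alone.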
Continuing from Comment 2: there are 3 image-registry pods, but the deployment only requires 2 replicas, and image-registry-7bc845d666-fbvhc and image-registry-6fccd7bf5f-5g9l2 are on the same node, "zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal":

# oc -n openshift-image-registry get pod -o wide | grep -Ev "Running|Completed"
NAME                              READY   STATUS             RESTARTS   AGE   IP            NODE                                                       NOMINATED NODE   READINESS GATES
image-registry-6fccd7bf5f-5g9l2   0/1     CrashLoopBackOff   20         78m   10.128.2.8    zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal   <none>           <none>
image-registry-7bc845d666-5stcl   0/1     CrashLoopBackOff   20         78m   10.131.0.17   zhsun114gcp-tk9lc-worker-b-n9l4h.c.openshift-qe.internal   <none>           <none>
image-registry-7bc845d666-fbvhc   0/1     CrashLoopBackOff   20         78m   10.128.2.7    zhsun114gcp-tk9lc-worker-c-rw8cg.c.openshift-qe.internal   <none>           <none>

# oc -n openshift-image-registry get deploy image-registry
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
image-registry   0/2     1            0           81m

The replicas counts in the spec and status sections conflict:

# oc -n openshift-image-registry get deploy image-registry -oyaml
...
spec:
  progressDeadlineSeconds: 600
  replicas: 2
...
status:
  conditions:
  - lastTransitionTime: "2020-11-04T07:20:06Z"
    lastUpdateTime: "2020-11-04T07:20:06Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2020-11-04T07:30:08Z"
    lastUpdateTime: "2020-11-04T07:30:08Z"
    message: ReplicaSet "image-registry-6fccd7bf5f" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 2
  replicas: 3
  unavailableReplicas: 3
  updatedReplicas: 1
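For context on the spec/status mismatch above: a Deployment's status.replicas counts pods across all of its ReplicaSets, old and new, so during a stuck rolling update it can legitimately exceed spec.replicas. A minimal sketch of the arithmetic, with pod counts taken from the output in this comment:

```shell
# status.replicas sums pods owned by every ReplicaSet of the deployment.
# Counts from the report: 2 pods on the old RS (image-registry-7bc845d666),
# 1 pod on the new RS (image-registry-6fccd7bf5f).
old_rs_pods=2
new_rs_pods=1
spec_replicas=2

total=$((old_rs_pods + new_rs_pods))
echo "spec.replicas=$spec_replicas status.replicas=$total"
# prints: spec.replicas=2 status.replicas=3
# With none of the 3 pods ready, unavailableReplicas is also 3, and the rollout
# eventually hits progressDeadlineSeconds (ProgressDeadlineExceeded).
```

So the "conflict" is expected bookkeeping for a rollout in flight; the real defect is that the registry container itself crashes on every node.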
Created attachment 1726494 [details] image-registry deployment file
Verified in 4.7.0-0.nightly-2020-11-05-010603
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
*** Bug 1936006 has been marked as a duplicate of this bug. ***