Description of problem:

OCP: 4.6.8 - ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf

OSD/ROSA installs ClusterMonitoringOperator immediately after the installation completes (install from openshift-installer). Clusters often fire alerts because OCP is still updating various operators after install. We (OSD/ROSA) assume a cluster is ready when the Operators under ClusterOperator stop transitioning phases.

During OSD/ROSA scale testing, it was observed that at least 2 clusters took over 3 hours to become ready. At the time we had at least 28 out of 500 clusters not completely ready, but we weren't able to triage them all (assuming it was the same issue).

Initial investigation of the two clusters mentioned showed that both have 1 master with the NoSchedule taint; further investigation shows cluster-version-operator pods in the Terminating phase on that particular master node.

Terminating CVO pods, e.g.:

```
oc get node
NAME                           STATUS                     ROLES          AGE     VERSION
ip-10-0-135-201.ec2.internal   Ready                      worker         3h30m   v1.19.0+7070803
ip-10-0-164-83.ec2.internal    Ready                      infra,worker   3h10m   v1.19.0+7070803
ip-10-0-174-3.ec2.internal     Ready                      worker         3h30m   v1.19.0+7070803
ip-10-0-185-42.ec2.internal    Ready                      infra,worker   3h10m   v1.19.0+7070803
ip-10-0-190-36.ec2.internal    Ready,SchedulingDisabled   master         3h36m   v1.19.0+7070803
ip-10-0-191-203.ec2.internal   Ready                      master         3h37m   v1.19.0+7070803
ip-10-0-199-94.ec2.internal    Ready                      master         3h35m   v1.19.0+7070803
```

```
oc get pods --all-namespaces -o wide | grep -vE '(Completed|Running)'
NAMESPACE                   NAME                                       READY   STATUS        RESTARTS   AGE     IP            NODE                          NOMINATED NODE   READINESS GATES
openshift-apiserver         apiserver-5689669645-8jc85                 0/2     Pending       0          3h2m    <none>        <none>                        <none>           <none>
openshift-cluster-version   cluster-version-operator-968d9fd48-rlpsr   1/1     Terminating   0          2m27s   10.0.190.36   ip-10-0-190-36.ec2.internal   <none>           <none>
openshift-etcd              etcd-quorum-guard-d6f58885d-fkdzk          0/1     Pending       0          3h3m    <none>        <none>                        <none>           <none>
openshift-oauth-apiserver   apiserver-87f4565f-zjlxs                   0/1     Pending       0          3h2m    <none>        <none>                        <none>           <none>
```

Sample output from the cluster-version-operator logs:

```
I0126 19:29:37.402060       1 leaderelection.go:342] lock is held by ip-10-0-190-36_5801b652-82ee-4ec1-8479-95bbe2340438 and has not yet expired
I0126 19:29:37.402090       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
I0126 19:29:51.737693       1 leaderelection.go:342] lock is held by ip-10-0-190-36_5801b652-82ee-4ec1-8479-95bbe2340438 and has not yet expired
I0126 19:29:51.737720       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
```

I have must-gathers from both clusters exhibiting these issues. Looking at host_service_logs/masters/kubelet_service.log in the must-gather, I see a taint being applied to all masters, but that's it; I assume it gets removed at some point, but I couldn't see any indication of that. It seems like this pod was holding the node back from rebooting or clearing the taint.

Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1909875
Poking at 4414881080314309899-osde2e-qa18o:

ClusterVersion:

```
$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
2021-01-26T15:45:42Z 2021-01-26T16:16:03Z Completed 4.6.8
$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2021-01-26T16:16:03Z Available=True -: Done applying 4.6.8
2021-01-26T16:16:03Z Failing=False -: -
2021-01-26T16:16:03Z Progressing=False -: Cluster version is 4.6.8
2021-01-26T15:45:42Z RetrievedUpdates=True -: -
```

ClusterOperators:

```
$ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators.yaml | jq -r '.items[] | .metadata as $o | .status.conditions[] | select((.type == "Available" and .status == "False") or (.type == "Degraded" and .status == "True")) | .lastTransitionTime + " " + $o.name + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2021-01-26T16:19:02Z openshift-apiserver Degraded=True APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
2021-01-26T16:19:45Z authentication Degraded=True APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
2021-01-26T16:37:49Z machine-config Available=False -: Cluster not available for 4.6.8
2021-01-26T16:37:49Z machine-config Degraded=True RequiredPoolsFailed: Failed to resync 4.6.8 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)
```

One node entry contained corrupt YAML (which I've seen before, but whose bug I don't have handy at the moment):

```
$ tail -n2 cluster-scoped-resources/core/nodes/ip-10-0-135-201.ec2.internal.yaml
systemUUID: ec239dba-92a9-b5f1-ffde-50f242157a91
a91
```

After removing the dangling, stuttered a91:

```
$ for NODE in cluster-scoped-resources/core/nodes/*; do yaml2json < "${NODE}" | jq -r '.metadata as $m | .status.conditions[] | select(.type == "Ready" or .type == "SchedulingDisabled") | .lastTransitionTime + " " + $m.name + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-") + " " + ([$m.labels | to_entries[].key | select(startswith("node-role.kubernetes.io/"))] | sort | tostring)'; done | sort
2021-01-26T15:52:23Z ip-10-0-190-36.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/master"]
2021-01-26T15:53:34Z ip-10-0-199-94.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/master"]
2021-01-26T16:21:50Z ip-10-0-191-203.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/master"]
2021-01-26T16:22:04Z ip-10-0-174-3.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/worker"]
2021-01-26T16:24:41Z ip-10-0-185-42.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/infra","node-role.kubernetes.io/worker"]
2021-01-26T16:26:27Z ip-10-0-135-201.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/worker"]
2021-01-26T16:30:00Z ip-10-0-164-83.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/infra","node-role.kubernetes.io/worker"]
```

Hmm. No SchedulingDisabled.
Even in the whole must-gather:

```
$ grep -r SchedulingDisabled | wc -l
0
```

The CVO is still failing to become live:

```
$ yaml2json <namespaces/openshift-cluster-version/core/events.yaml | jq -r '.items[] | .firstTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort
...
2021-01-26T19:00:37Z 1 cluster-version-operator-968d9fd48-7fv7b Started: Started container cluster-version-operator
2021-01-26T19:01:43Z 1 version LeaderElection: ip-10-0-190-36_5fa26af6-e82c-461d-8f87-9c9eae497ceb became leader
2021-01-26T19:02:47Z 1 cluster-version-operator-968d9fd48-7fv7b Killing: Stopping container cluster-version-operator
2021-01-26T19:02:48Z 1 cluster-version-operator-968d9fd48-25zkz Created: Created container cluster-version-operator
2021-01-26T19:02:48Z 1 cluster-version-operator-968d9fd48-25zkz Pulled: Container image "quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf" already present on machine
2021-01-26T19:02:48Z 1 cluster-version-operator-968d9fd48-25zkz Started: Started container cluster-version-operator
2021-01-26T19:03:19Z 1 version LeaderElection: ip-10-0-190-36_64269ab6-497c-44ed-8337-5a7170750c85 became leader
2021-01-26T19:05:30Z 1 version LeaderElection: ip-10-0-190-36_9f5be022-0ad9-453d-8cfd-895a5009ba03 became leader
2021-01-26T19:05:39Z 1 cluster-version-operator-968d9fd48-25zkz Killing: Stopping container cluster-version-operator
2021-01-26T19:05:39Z 1 cluster-version-operator-968d9fd48-8vhw6 Pulled: Container image "quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf" already present on machine
2021-01-26T19:05:40Z 1 cluster-version-operator-968d9fd48-8vhw6 Created: Created container cluster-version-operator
2021-01-26T19:05:40Z 1 cluster-version-operator-968d9fd48-8vhw6 Started: Started container cluster-version-operator
```

We have those last two in the gather:

```
$ ls namespaces/openshift-cluster-version/pods/cluster-version-operator-*/*.yaml
namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator-968d9fd48-25zkz.yaml
namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-8vhw6/cluster-version-operator-968d9fd48-8vhw6.yaml
```

One is terminating:

```
$ yaml2json <namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator-968d9fd48-25zkz.yaml | jq -r '.metadata | .deletionTimestamp + " " + (.deletionGracePeriodSeconds | tostring)'
2021-01-26T19:07:49Z 130
```

The 130s for deletion is the overloaded terminationGracePeriodSeconds [1,2]. Looking at that CVO:

```
$ grep 'start.go.*ClusterVersionOperator\|successfully acquired lease\|Shutting down' namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator/cluster-version-operator/logs/current.log
2021-01-26T19:02:48.678276895Z I0126 19:02:48.678120       1 start.go:21] ClusterVersionOperator 4.6.0-202012050130.p0-39a4256
2021-01-26T19:05:30.840612657Z I0126 19:05:30.840561       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version
2021-01-26T19:05:39.264746935Z I0126 19:05:39.264690       1 start.go:190] Shutting down due to terminated
$ tail -n1 namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator/cluster-version-operator/logs/current.log
2021-01-26T19:06:30.907417207Z I0126 19:06:30.907343       1 leaderelection.go:273] successfully renewed lease openshift-cluster-version/version
```

So:

* 15:45:42Z install begins.
* 16:16:03Z CVO is satisfied; install completes.
* 16:19 API-server and auth go Degraded=True on an unavailable control-plane node.
* 16:21:50Z ip-10-0-191-203.ec2.internal recovers from being Ready=False. But in comment 0, it was ip-10-0-190-36.ec2.internal that was SchedulingDisabled.
* 16:37 Machine-config goes Available=False and Degraded=True on the unavailable control-plane node, but... it was available again? I'm a bit fuzzy on this condition entry.
* Something we don't have logs to answer kills a CVO without gracefully releasing the leader lease.
* Replacement CVOs spend so long waiting for the lease to expire that they are left with only seconds before they are killed for not going live fast enough. This appears to be an endless loop.
* ClusterVersion is still happy at must-gather time.

So it would be good to understand the initial, destabilizing trigger. But a possible mitigation is to either lower the lease expiration time or raise the liveness/readiness limit so the CVO has enough time to declare itself happy, even if it's recovering from an orphaned, unreleased lease.

[1]: https://github.com/openshift/cluster-ingress-operator/pull/534#issue-557819145
[2]: https://github.com/kubernetes/kubernetes/issues/64715#issuecomment-756244044
It seems like this bug is not easy to reproduce. Can anyone help me figure out how to verify this bug from a QE perspective?
The PR topic post has a bit more discussion of the circumstances seen in this bug. The easiest verification would be regression testing, and then having another OSD scale test once we have a nightly/candidate that includes the fix. More difficult, but possible without a scale test, would be launching something on each control-plane node to listen on 9099, so the incoming CVO pods will fail to launch their metrics servers, and then checking for "Stopped leading; shutting down." in the terminated CVO logs. Without the bug fix, that line will not be in the logs; with the bug fix, it will be.
I have been seeing this as well during my ROSA testing. The CLI returns complete and `rosa list clusters` states the cluster is ready, but some nodes are flagged as SchedulingDisabled or NotReady.

Single AZ, 27-node cluster. OpenShift version 4.6.13. OCM ID 1ik3banmks45dc6j400hm4i3selo1ha4

```
NAME                                         STATUS                        ROLES          AGE     VERSION
ip-10-0-131-123.us-west-2.compute.internal   Ready                         infra,worker   3m26s   v1.19.0+3b01205
ip-10-0-134-65.us-west-2.compute.internal    NotReady,SchedulingDisabled   worker         16m     v1.19.0+3b01205
ip-10-0-136-128.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-142-204.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-145-99.us-west-2.compute.internal    Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-147-60.us-west-2.compute.internal    Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-155-219.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-167-52.us-west-2.compute.internal    Ready                         worker         19m     v1.19.0+3b01205
ip-10-0-179-142.us-west-2.compute.internal   Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-181-6.us-west-2.compute.internal     Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-182-195.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-185-132.us-west-2.compute.internal   Ready                         master         25m     v1.19.0+3b01205
ip-10-0-186-250.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-209-226.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-210-103.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-215-204.us-west-2.compute.internal   Ready                         worker         16m     v1.19.0+3b01205
ip-10-0-216-16.us-west-2.compute.internal    Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-217-85.us-west-2.compute.internal    Ready                         worker         16m     v1.19.0+3b01205
ip-10-0-220-212.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-221-70.us-west-2.compute.internal    Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-223-134.us-west-2.compute.internal   Ready                         infra,worker   3m34s   v1.19.0+3b01205
ip-10-0-224-229.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-226-1.us-west-2.compute.internal     Ready                         master         26m     v1.19.0+3b01205
ip-10-0-226-42.us-west-2.compute.internal    Ready                         worker         19m     v1.19.0+3b01205
ip-10-0-228-70.us-west-2.compute.internal    Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-230-240.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-235-88.us-west-2.compute.internal    Ready                         worker         16m     v1.19.0+3b01205
ip-10-0-251-11.us-west-2.compute.internal    Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-251-120.us-west-2.compute.internal   Ready,SchedulingDisabled      master         26m     v1.19.0+3b01205
ip-10-0-253-7.us-west-2.compute.internal     Ready                         worker         16m     v1.19.0+3b01205
ip-10-0-254-133.us-west-2.compute.internal   Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-254-205.us-west-2.compute.internal   Ready                         worker         16m     v1.19.0+3b01205
```
Are we going to try for 4.7 with this? If not, let's set blocker-.
I've set blocker-. I expect this is a regression from 4.5->4.6. It was reported in 4.6.8, so we know it's not a regression from 4.6->4.7. That said, we will work hard to get the open PR landed this week, which will set us up for a 4.6.z backport in the future. And *that* said, the PR is narrowly fixing the CVO leader-vs.-probe death spiral. We don't have a lead on what the initial trigger was that launched us into that pathological case.
Moving both severity and priority to high so that we move toward the top of the 4.7.z queue if that becomes the scenario.
Verified this bug with 4.7.0-0.nightly-2021-02-09-003138, and it passed.

Run an install using 4.7.0-0.nightly-2021-02-09-003138. After installation, log onto each master and run the following command:

```
# while :; do if netstat -lnp | grep -q :9099; then echo "9099 Address already in use"; else nc -l 9099 -k; fi; sleep 10; done
```

This command starts a process listening on 9099, so the incoming CVO pods will fail to launch their metrics servers.

Trigger an upgrade, and check the CVO log:

```
[root@preserve-jialiu-ansible ~]# oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-05-074735 --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-05-074735
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS             RESTARTS   AGE
cluster-version-operator-6d86fd5b79-67vgx   0/1     CrashLoopBackOff   4          2m48s
version--vxrlt-q4727                        0/1     Completed          0          3m2s
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-6d86fd5b79-67vgx | grep -i shutting
I0209 11:40:44.770697       1 cvo.go:407] Shutting down ClusterVersionOperator
I0209 11:40:44.776759       1 start.go:254] Stopped leading; shutting down.
```

The "Stopped leading; shutting down." log is seen.
Created attachment 1756463 [details] cluster-version-operator-79c57f956f-4zjpt.log Moving back to ASSIGNED on a new failure mode. Same metrics bind conflict. This time we didn't get a leader lease at all, so Options.run hung waiting for 'leader controller' [1]. Leader context is postMainContext [2], which we only cancel after collecting the main operator goroutine [3], which we only launch after we acquire the leader lease [4]. Still thinking through the fix... [1]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L259 [2]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L229 [3]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L279 [4]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L236-L240
Actually, with 4.7 forked off of master already, it's probably better if I open a fresh 4.8 bug for comment 11. Then this bug can become its 4.7 backport, and we don't have to mess with swapping the errata association.
Aaaand moving this back to ASSIGNED again, so it can hold anything we backport from bug 1927944.
Workaround, confirmed in test-platform's 4.7.0-rc.1 build01 cluster, is to scale the CVO down to zero:

```
$ oc -n openshift-cluster-version scale --replicas 0 deployments/cluster-version-operator
```

Let things cool off for a few minutes, and then scale it back up to 1:

```
$ oc -n openshift-cluster-version scale --replicas 1 deployments/cluster-version-operator
```

This also helps outside of the CVO-death-spiral scenario, when the CVO pod is being killed because the machine-config daemon is trying to drain the node, but the scheduler keeps dropping the NoSchedule-tolerating CVO pod back onto the NoSchedule-tainted node being drained. Having the CVO scaled down leaves you exposed to CVO-managed cluster resources drifting out of line with what the CVO would prefer, with no CVO around to stomp the divergence. But the CVO already ignores some manifests while focusing on others, so a few minutes of ignoring all manifests is not much exposure on top of what we already accept.
@wking, do you have an idea of how to verify this new PR from a QE perspective?
Hrm. We obviously missed something with the verification plan used in comment 10. Maybe scaling the CVO Deployment up to 'replicas: 2' and looking at the logs of the pod that is not holding the leader lock? We should see it fail to listen on the already-bound metrics port (if you have some synthetic listener on the control-plane nodes without the leader CVO) and shut down gracefully, instead of hanging waiting for the main goroutine which never launched.
Verified this bug with 4.7.0-0.nightly-2021-03-01-085007, and it passed.

Run an install using 4.7.0-0.nightly-2021-03-01-085007. After installation, log onto each master and run the following command:

```
# while :; do if netstat -lnp | grep -q :9099; then echo "9099 Address already in use"; else nc -l 9099 -k; fi; sleep 10; done
```

This command starts a process listening on 9099, so the incoming CVO pods will fail to launch their metrics servers.

```
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-01-085007   True        False         136m    Cluster version is 4.7.0-0.nightly-2021-03-01-085007
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 0 deployments/cluster-version-operator
deployment.apps/cluster-version-operator scaled
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
No resources found in openshift-cluster-version namespace.
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 2 deployments/cluster-version-operator
deployment.apps/cluster-version-operator scaled
```

Check the CVO logs:

```
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS             RESTARTS   AGE
cluster-version-operator-579d978bf9-j268x   0/1     CrashLoopBackOff   2          37s
cluster-version-operator-579d978bf9-pbsfz   0/1     CrashLoopBackOff   2          37s
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-579d978bf9-j268x | grep -i shutting
I0302 09:33:24.000395       1 start.go:256] Stopped leading; shutting down.
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-579d978bf9-pbsfz | grep -i shutting
I0302 09:33:22.264642       1 start.go:256] Stopped leading; shutting down.
```
```
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-579d978bf9-j268x
I0302 09:33:23.543726       1 start.go:21] ClusterVersionOperator 4.7.0-202103010125.p0-cf0e3d1
I0302 09:33:23.543900       1 merged_client_builder.go:121] Using in-cluster configuration
I0302 09:33:23.548365       1 payload.go:260] Loading updatepayload from "/"
I0302 09:33:23.999267       1 cvo.go:246] Verifying release authenticity: All release image digests must have GPG signatures from verifier-public-key-redhat (567E347AD0044ADE55BA8A5F199E2F91FD431D51: Red Hat, Inc. (release key 2) <security>, B08B659EE86AF623BC90E8DB938A80CAF21541EB: Red Hat, Inc. (beta key 2) <security>) - will check for signatures in containers/image format at serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release
I0302 09:33:23.999683       1 start.go:266] Waiting on 2 outstanding goroutines.
I0302 09:33:23.999765       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
I0302 09:33:23.999907       1 reflector.go:219] Starting reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
I0302 09:33:23.999973       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134
E0302 09:33:24.000021       1 start.go:280] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
I0302 09:33:24.000099       1 start.go:266] Waiting on 1 outstanding goroutines.
I0302 09:33:24.000136       1 start.go:270] Run context completed; beginning two-minute graceful shutdown period.
I0302 09:33:24.000192       1 start.go:266] Waiting on 1 outstanding goroutines.
I0302 09:33:24.000240       1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
E0302 09:33:24.000312       1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.jialiu47.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context canceled
I0302 09:33:24.000353       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
I0302 09:33:24.000395       1 start.go:256] Stopped leading; shutting down.
I0302 09:33:24.000430       1 start.go:296] Collected leader controller goroutine.
I0302 09:33:24.000461       1 start.go:306] Finished collecting operator goroutines.
I0302 09:33:24.000514       1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.7.0-202103010125.p0-cf0e3d1.
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-579d978bf9-pbsfz
I0302 09:33:21.800805       1 start.go:21] ClusterVersionOperator 4.7.0-202103010125.p0-cf0e3d1
I0302 09:33:21.801112       1 merged_client_builder.go:121] Using in-cluster configuration
I0302 09:33:21.806884       1 payload.go:260] Loading updatepayload from "/"
I0302 09:33:22.263375       1 cvo.go:246] Verifying release authenticity: All release image digests must have GPG signatures from verifier-public-key-redhat (567E347AD0044ADE55BA8A5F199E2F91FD431D51: Red Hat, Inc. (release key 2) <security>, B08B659EE86AF623BC90E8DB938A80CAF21541EB: Red Hat, Inc. (beta key 2) <security>) - will check for signatures in containers/image format at serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release
I0302 09:33:22.263757       1 start.go:266] Waiting on 2 outstanding goroutines.
I0302 09:33:22.263792       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
I0302 09:33:22.264013       1 reflector.go:219] Starting reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264092       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264113       1 reflector.go:219] Starting reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264140       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264052       1 reflector.go:219] Starting reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264363       1 reflector.go:255] Listing and watching *v1.ClusterOperator from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
E0302 09:33:22.264442       1 start.go:280] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
I0302 09:33:22.264464       1 start.go:266] Waiting on 1 outstanding goroutines.
I0302 09:33:22.264473       1 start.go:270] Run context completed; beginning two-minute graceful shutdown period.
I0302 09:33:22.264485       1 start.go:266] Waiting on 1 outstanding goroutines.
I0302 09:33:22.264370       1 reflector.go:219] Starting reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264499       1 reflector.go:225] Stopping reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264516       1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264550       1 reflector.go:225] Stopping reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264570       1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134
E0302 09:33:22.264610       1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.jialiu47.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context canceled
I0302 09:33:22.264633       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
I0302 09:33:22.264642       1 start.go:256] Stopped leading; shutting down.
I0302 09:33:22.264652       1 start.go:296] Collected leader controller goroutine.
I0302 09:33:22.264670       1 start.go:306] Finished collecting operator goroutines.
I0302 09:33:22.264672       1 reflector.go:219] Starting reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264684       1 reflector.go:225] Stopping reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264801       1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.7.0-202103010125.p0-cf0e3d1.
```

After stopping the nc process listening on 9099, the CVO pods get back to Running.
```
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS             RESTARTS   AGE
cluster-version-operator-579d978bf9-j268x   0/1     CrashLoopBackOff   5          5m45s
cluster-version-operator-579d978bf9-pbsfz   0/1     CrashLoopBackOff   5          5m45s
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-579d978bf9-j268x   1/1     Running   6          6m10s
cluster-version-operator-579d978bf9-pbsfz   1/1     Running   6          6m10s
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-01-085007   True        False         144m    Cluster version is 4.7.0-0.nightly-2021-03-01-085007
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 1 deployments/cluster-version-operator
deployment.apps/cluster-version-operator scaled
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-01-085007   True        False         144m    Cluster version is 4.7.0-0.nightly-2021-03-01-085007
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS        RESTARTS   AGE
cluster-version-operator-579d978bf9-j268x   0/1     Terminating   6          7m27s
cluster-version-operator-579d978bf9-pbsfz   1/1     Running       6          7m27s
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-579d978bf9-pbsfz   1/1     Running   6          7m31s
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.1 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0678