Bug 1921413 - cluster version operator cycles terminating state waiting for leader election
Summary: cluster version operator cycles terminating state waiting for leader election
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.7.z
Assignee: W. Trevor King
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On: 1927944
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-01-27 23:56 UTC by James Harrington
Modified: 2021-03-10 11:24 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1927944 (view as bug list)
Environment:
Last Closed: 2021-03-10 11:24:01 UTC
Target Upstream Version:


Attachments (Terms of Use)
cluster-version-operator-79c57f956f-4zjpt.log (8.44 KB, text/plain)
2021-02-11 22:03 UTC, W. Trevor King
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 508 0 None closed Bug 1921413: pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels 2021-02-17 15:29:28 UTC
Github openshift cluster-version-operator pull 521 0 None closed Bug 1921413: pkg/start: Fix shutdown deadlock when die before getting a leader lock 2021-02-27 21:45:26 UTC
Red Hat Product Errata RHBA-2021:0678 0 None None None 2021-03-10 11:24:36 UTC

Description James Harrington 2021-01-27 23:56:30 UTC
Description of problem:

OCP: 4.6.8 - ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf 

OSD/ROSA installs ClusterMonitoringOperator immediately after the installation completes (install from openshift-installer). Clusters often fire alerts because OCP is still updating different operators after install. We (OSD/ROSA) assume a cluster is ready when Operators under ClusterOperator stop transitioning phases.

During OSD/ROSA scale testing it was observed that at least 2 clusters took over 3 hours to become ready. We had at the time at least 28 out of 500 clusters not completely ready but we weren't able to triage them all (assuming it was the same issue)

Initial investigations into the two clusters mentioned showed they both have 1 master with the NoSchedule taint, further investigation shows cluster-version-operator pods in the terminating phase on that particular master node.

Terminating CVO pods eg.

```
oc get node
NAME                           STATUS                     ROLES          AGE     VERSION
ip-10-0-135-201.ec2.internal   Ready                      worker         3h30m   v1.19.0+7070803
ip-10-0-164-83.ec2.internal    Ready                      infra,worker   3h10m   v1.19.0+7070803
ip-10-0-174-3.ec2.internal     Ready                      worker         3h30m   v1.19.0+7070803
ip-10-0-185-42.ec2.internal    Ready                      infra,worker   3h10m   v1.19.0+7070803
ip-10-0-190-36.ec2.internal    Ready,SchedulingDisabled   master         3h36m   v1.19.0+7070803
ip-10-0-191-203.ec2.internal   Ready                      master         3h37m   v1.19.0+7070803
ip-10-0-199-94.ec2.internal    Ready                      master         3h35m   v1.19.0+7070803
```

```
oc get pods --all-namespaces -o wide| grep -vE '(Completed|Running)'
NAMESPACE                                          NAME                                                     READY   STATUS        RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
openshift-apiserver                                apiserver-5689669645-8jc85                               0/2     Pending       0          3h2m    <none>         <none>                         <none>           <none>
openshift-cluster-version                          cluster-version-operator-968d9fd48-rlpsr                 1/1     Terminating   0          2m27s   10.0.190.36    ip-10-0-190-36.ec2.internal    <none>           <none>
openshift-etcd                                     etcd-quorum-guard-d6f58885d-fkdzk                        0/1     Pending       0          3h3m    <none>         <none>                         <none>           <none>
openshift-oauth-apiserver                          apiserver-87f4565f-zjlxs                                 0/1     Pending       0          3h2m    <none>         <none>                         <none>           <none>
```

Sample output from the cluster-version-operator logs 

```
I0126 19:29:37.402060       1 leaderelection.go:342] lock is held by ip-10-0-190-36_5801b652-82ee-4ec1-8479-95bbe2340438 and has not yet expired
I0126 19:29:37.402090       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
I0126 19:29:51.737693       1 leaderelection.go:342] lock is held by ip-10-0-190-36_5801b652-82ee-4ec1-8479-95bbe2340438 and has not yet expired
I0126 19:29:51.737720       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
```

I have must gathers from both clusters exhibiting these issues.

Looking at host_service_logs/masters/kubelet_service.log in the must gather I see a taint being applied to all masters but that's it. I assume it gets removed at some point but I couldn't see any indication of that.

Seems like this pod was holding the node from rebooting or clearing the taint. Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1909875

Comment 2 W. Trevor King 2021-01-28 01:02:09 UTC
Poking at 4414881080314309899-osde2e-qa18o:

ClusterVersion:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
2021-01-26T15:45:42Z 2021-01-26T16:16:03Z Completed 4.6.8
$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2021-01-26T16:16:03Z Available=True -: Done applying 4.6.8
2021-01-26T16:16:03Z Failing=False -: -
2021-01-26T16:16:03Z Progressing=False -: Cluster version is 4.6.8
2021-01-26T15:45:42Z RetrievedUpdates=True -: -

ClusterOperators:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators.yaml | jq -r '.items[] | .metadata as $o | .status.conditions[] | select((.type == "Available" and .status == "False") or (.type == "Degraded" and .status == "True")) | .lastTransitionTime + " " + $o.name + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2021-01-26T16:19:02Z openshift-apiserver Degraded=True APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
2021-01-26T16:19:45Z authentication Degraded=True APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
2021-01-26T16:37:49Z machine-config Available=False -: Cluster not available for 4.6.8
2021-01-26T16:37:49Z machine-config Degraded=True RequiredPoolsFailed: Failed to resync 4.6.8 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)

One node entry contained corrupt YAML (which I've seen before, but whose bug I don't have handy at the moment):

$ tail -n2 cluster-scoped-resources/core/nodes/ip-10-0-135-201.ec2.internal.yaml
    systemUUID: ec239dba-92a9-b5f1-ffde-50f242157a91
a91

After removing the dangling, stuttered a91:

$ for NODE in cluster-scoped-resources/core/nodes/*; do yaml2json < "${NODE}" | jq -r '.metadata as $m | .status.conditions[] | select(.type == "Ready" or .type == "SchedulingDisabled") | .lastTransitionTime + " " + $m.name + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-") + " " + ([$m.labels | to_entries[].key | select(startswith("node-role.kubernetes.io/"))] | sort | tostring)'; done | sort
2021-01-26T15:52:23Z ip-10-0-190-36.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/master"]
2021-01-26T15:53:34Z ip-10-0-199-94.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/master"]
2021-01-26T16:21:50Z ip-10-0-191-203.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/master"]
2021-01-26T16:22:04Z ip-10-0-174-3.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/worker"]
2021-01-26T16:24:41Z ip-10-0-185-42.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/infra","node-role.kubernetes.io/worker"]
2021-01-26T16:26:27Z ip-10-0-135-201.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/worker"]
2021-01-26T16:30:00Z ip-10-0-164-83.ec2.internal Ready=True KubeletReady: kubelet is posting ready status ["node-role.kubernetes.io/infra","node-role.kubernetes.io/worker"]

Hmm.  No SchedulingDisabled.  Even in the whole must-gather:

$ grep -r SchedulingDisabled | wc -l
0

CVO is still failing to become live:

$ yaml2json <namespaces/openshift-cluster-version/core/events.yaml | jq -r '.items[] | .firstTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort
...
2021-01-26T19:00:37Z 1 cluster-version-operator-968d9fd48-7fv7b Started: Started container cluster-version-operator
2021-01-26T19:01:43Z 1 version LeaderElection: ip-10-0-190-36_5fa26af6-e82c-461d-8f87-9c9eae497ceb became leader
2021-01-26T19:02:47Z 1 cluster-version-operator-968d9fd48-7fv7b Killing: Stopping container cluster-version-operator
2021-01-26T19:02:48Z 1 cluster-version-operator-968d9fd48-25zkz Created: Created container cluster-version-operator
2021-01-26T19:02:48Z 1 cluster-version-operator-968d9fd48-25zkz Pulled: Container image "quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf" already present on machine
2021-01-26T19:02:48Z 1 cluster-version-operator-968d9fd48-25zkz Started: Started container cluster-version-operator
2021-01-26T19:03:19Z 1 version LeaderElection: ip-10-0-190-36_64269ab6-497c-44ed-8337-5a7170750c85 became leader
2021-01-26T19:05:30Z 1 version LeaderElection: ip-10-0-190-36_9f5be022-0ad9-453d-8cfd-895a5009ba03 became leader
2021-01-26T19:05:39Z 1 cluster-version-operator-968d9fd48-25zkz Killing: Stopping container cluster-version-operator
2021-01-26T19:05:39Z 1 cluster-version-operator-968d9fd48-8vhw6 Pulled: Container image "quay.io/openshift-release-dev/ocp-release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf" already present on machine
2021-01-26T19:05:40Z 1 cluster-version-operator-968d9fd48-8vhw6 Created: Created container cluster-version-operator
2021-01-26T19:05:40Z 1 cluster-version-operator-968d9fd48-8vhw6 Started: Started container cluster-version-operator

We have those last two in the gather:

$ ls namespaces/openshift-cluster-version/pods/cluster-version-operator-*/*.yaml
namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator-968d9fd48-25zkz.yaml
namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-8vhw6/cluster-version-operator-968d9fd48-8vhw6.yaml

One is terminating:

$ yaml2json <namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator-968d9fd48-25zkz.yaml | jq -r '.metadata | .deletionTimestamp + " " + (.deletionGracePeriodSeconds | tostring)'
2021-01-26T19:07:49Z 130

130s for deletion is the overloaded terminationGracePeriodSeconds [1,2].  Looking at that CVO:

$ grep 'start.go.*ClusterVersionOperator\|successfully acquired lease\|Shutting down' namespaces/openshift-cluster-version/pods/cluster-version-operator-9
68d9fd48-25zkz/cluster-version-operator/cluster-version-operator/logs/current.log
2021-01-26T19:02:48.678276895Z I0126 19:02:48.678120       1 start.go:21] ClusterVersionOperator 4.6.0-202012050130.p0-39a4256
2021-01-26T19:05:30.840612657Z I0126 19:05:30.840561       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version
2021-01-26T19:05:39.264746935Z I0126 19:05:39.264690       1 start.go:190] Shutting down due to terminated
$ tail -n1 namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator/cluster-version-operator/logs/current.log
2021-01-26T19:06:30.907417207Z I0126 19:06:30.907343       1 leaderelection.go:273] successfully renewed lease openshift-cluster-version/version

So:

* 15:45:42Z install begins
* 16:16:03Z CVO is satisfied, install completes
* 16:19 API-server and auth go Degraded=True on an unavailable control-plane node
* 16:21:50Z ip-10-0-191-203.ec2.internal recovers from being Ready=False.  But in comment 0, it was ip-10-0-190-36.ec2.internal tat was SchedulingDisabled.
* 16:37 Machine-config goes Available=False and Degraded=True on the unavailable control-plane node, but... it was available again?  I'm a bit fuzzy on this condition entry.
* Something we don't have logs to answer kills a CVO without gracefully releasing the leader lease.
* Replacement CVOs spend so long waiting for the lease to expire, that they are left with only seconds before they are killed for not going live fast enough.  This seems to be an endless loop.
* ClusterVersion still happy at must-gather time.

So, would be good to understand the initial, destabilizing trigger.  But a possible mitigation is to either lower the lease expiration time or raise the liveness/readiness limit so the CVO has enough time to declare itself happy, even if it's recovering from an orphaned, unreleased lease.

[1]: https://github.com/openshift/cluster-ingress-operator/pull/534#issue-557819145
[2]: https://github.com/kubernetes/kubernetes/issues/64715#issuecomment-756244044

Comment 3 Johnny Liu 2021-01-29 04:51:49 UTC
Seem like this bug is not easy to be produced, anyone can help me figure how to verify this bug from QE perspective?

Comment 4 W. Trevor King 2021-01-29 22:56:16 UTC
PR topic post has a bit more discussion on the circumstances seen in the bug.  Easiest verification would be regression testing, and then having another OSD scale test once we have a nightly/candidate that includes the fix.  More difficult, but possible to perform without a scale test, would be launching something on each control-plane node to listen on 9099, so the incoming CVO pods will fail to launch their metrics servers, and then checking for "Stopped leading; shutting down." in the terminated CVO logs.  Without the bug fix, that line will not be in the logs.  With the bug fix, the line will be in the logs.

Comment 5 Russ Zaleski 2021-02-03 11:50:56 UTC
I have been seeing this as well during my rosa testing. The cli will return complete and rosa list clusters states its ready but some nodes will be flagged as scheduling disabled or NotReady. Single AZ 27 node cluster. OpenShift version 4.6.13. OCM ID 1ik3banmks45dc6j400hm4i3selo1ha4

NAME                                         STATUS                        ROLES          AGE     VERSION
ip-10-0-131-123.us-west-2.compute.internal   Ready                         infra,worker   3m26s   v1.19.0+3b01205
ip-10-0-134-65.us-west-2.compute.internal    NotReady,SchedulingDisabled   worker         16m     v1.19.0+3b01205
ip-10-0-136-128.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-142-204.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-145-99.us-west-2.compute.internal    Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-147-60.us-west-2.compute.internal    Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-155-219.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-167-52.us-west-2.compute.internal    Ready                         worker         19m     v1.19.0+3b01205
ip-10-0-179-142.us-west-2.compute.internal   Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-181-6.us-west-2.compute.internal     Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-182-195.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-185-132.us-west-2.compute.internal   Ready                         master         25m     v1.19.0+3b01205
ip-10-0-186-250.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-209-226.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-210-103.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-215-204.us-west-2.compute.internal   Ready                         worker         16m     v1.19.0+3b01205
ip-10-0-216-16.us-west-2.compute.internal    Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-217-85.us-west-2.compute.internal    Ready                         worker         16m     v1.19.0+3b01205
ip-10-0-220-212.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-221-70.us-west-2.compute.internal    Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-223-134.us-west-2.compute.internal   Ready                         infra,worker   3m34s   v1.19.0+3b01205
ip-10-0-224-229.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-226-1.us-west-2.compute.internal     Ready                         master         26m     v1.19.0+3b01205
ip-10-0-226-42.us-west-2.compute.internal    Ready                         worker         19m     v1.19.0+3b01205
ip-10-0-228-70.us-west-2.compute.internal    Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-230-240.us-west-2.compute.internal   Ready                         worker         18m     v1.19.0+3b01205
ip-10-0-235-88.us-west-2.compute.internal    Ready                         worker         16m     v1.19.0+3b01205
ip-10-0-251-11.us-west-2.compute.internal    Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-251-120.us-west-2.compute.internal   Ready,SchedulingDisabled      master         26m     v1.19.0+3b01205
ip-10-0-253-7.us-west-2.compute.internal     Ready                         worker         16m     v1.19.0+3b01205
ip-10-0-254-133.us-west-2.compute.internal   Ready                         worker         17m     v1.19.0+3b01205
ip-10-0-254-205.us-west-2.compute.internal   Ready                         worker         16m     v1.19.0+3b01205

Comment 6 Nick Stielau 2021-02-03 19:10:38 UTC
Are we going to try for 4.7 with this?  If not, lets set blocker-

Comment 7 W. Trevor King 2021-02-03 19:24:59 UTC
I've set blocker-.  I expect this is a regression from 4.5->4.6.  It was reported in 4.6.8, so we know it's not a regression from 4.6->4.7.  That said, we will work hard to get the open PR landed this week, which will set us up for a 4.6.z backport in the future.  And *that* said, the PR is narrowly fixing the CVO leader-vs.-probe death spiral.  We don't have a lead on what the initial trigger was that launched us into that pathological case.

Comment 8 Scott Dodson 2021-02-04 18:29:35 UTC
Moving both severity and priority to high so that we move toward the top of the 4.7.z queue if that becomes the scenario.

Comment 10 Johnny Liu 2021-02-09 11:46:04 UTC
Verified this bug with 4.7.0-0.nightly-2021-02-09-003138, and passed.

Run an install using 4.7.0-0.nightly-2021-02-09-003138, after installation, log onto each masters to run the following command:
# while :; do if netstat -lnp|grep -q  :9099; then echo  "9099 Address already in use"; else nc -l 9099 -k; fi; sleep 10; done

This command will start a process listenning on 9099, so the incoming CVO pods will fail to launch their metrics servers.

Trigger upgrade, and check CVO log.

[root@preserve-jialiu-ansible ~]# oc adm upgrade --to-image registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-05-074735 --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-05-074735

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS             RESTARTS   AGE
cluster-version-operator-6d86fd5b79-67vgx   0/1     CrashLoopBackOff   4          2m48s
version--vxrlt-q4727                        0/1     Completed          0          3m2s

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-6d86fd5b79-67vgx|grep -i shutting
I0209 11:40:44.770697       1 cvo.go:407] Shutting down ClusterVersionOperator
I0209 11:40:44.776759       1 start.go:254] Stopped leading; shutting down.


The "Stopped leading; shutting down" log is seen.

Comment 11 W. Trevor King 2021-02-11 22:03:42 UTC
Created attachment 1756463 [details]
cluster-version-operator-79c57f956f-4zjpt.log

Moving back to ASSIGNED on a new failure mode.  Same metrics bind conflict.  This time we didn't get a leader lease at all, so Options.run hung waiting for 'leader controller' [1].  Leader context is postMainContext [2], which we only cancel after collecting the main operator goroutine [3], which we only launch after we acquire the leader lease [4].  Still thinking through the fix...

[1]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L259
[2]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L229
[3]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L279
[4]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L236-L240

Comment 12 W. Trevor King 2021-02-11 22:14:11 UTC
Actually, with 4.7 forked off of master already, it's probably better if I open a fresh 4.8 bug for comment 11.  Then this bug can become its 4.7 backport, and we don't have to mess with swapping the errata association.

Comment 13 W. Trevor King 2021-02-11 23:09:27 UTC
Aaaand moving this back to ASSIGNED again, so it can hold anything we backport from bug 1927944.

Comment 14 W. Trevor King 2021-02-11 23:15:51 UTC
Workaround, confirmed in test-platform's 4.7.0-rc.1 build01 cluster, is to scale the CVO down to zero:

  $ oc -n openshift-cluster-version scale --replicas 0 deployments/cluster-version-operator

Let things cool off for a few minutes, and then scale it back up to 1:

  $ oc -n openshift-cluster-version scale --replicas 1 deployments/cluster-version-operator

This also helps outside of the CVO-death-spiral scenario, when the CVO pod is being killed because the machine-config daemon is trying to drain the node, but the scheduler keeps dropping the NoSchedule-allowing CVO pod back onto the NoSchedule-tainted node being drained.

Having the CVO scaled off leaves you exposed to "CVO-managed cluster resources drifting out of line with what the CVO would prefer without the CVO around to stomp the divergence", but the CVO will ignore some manifests while focusing on others already, so a few minutes of ignoring all manifests is not that much exposure on top of what we already accept.

Comment 15 Johnny Liu 2021-02-18 11:11:55 UTC
@wking, do you have some idea how to verify this new PR from QE perspective?

Comment 16 W. Trevor King 2021-02-19 05:51:18 UTC
Hrm.  We obviously missed something with the verification plan used in comment 10.  Maybe scaling the CVO Deployment up to 'replicas: 2' and looking at the logs of the pod that is not holding the leader lock?  We should see it fail to listen to the already-bound metrics port (if you have some synthetic listener on the control plane nodes without the leader CVO) and shut down gracefully, instead of hanging waiting for the main goroutine which never launched?

Comment 19 Johnny Liu 2021-03-02 09:40:48 UTC
Verified this bug with 4.7.0-0.nightly-2021-03-01-085007, and PASS.

Run an install using 4.7.0-0.nightly-2021-03-01-085007, after installation, log onto each masters to run the following command:
# while :; do if netstat -lnp|grep -q  :9099; then echo  "9099 Address already in use"; else nc -l 9099 -k; fi; sleep 10; done

This command will start a process listenning on 9099, so the incoming CVO pods will fail to launch their metrics servers.

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-01-085007   True        False         136m    Cluster version is 4.7.0-0.nightly-2021-03-01-085007
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 0 deployments/cluster-version-operator
deployment.apps/cluster-version-operator scaled
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
No resources found in openshift-cluster-version namespace.
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 2 deployments/cluster-version-operator
deployment.apps/cluster-version-operator scaled

Check CVO log:
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS             RESTARTS   AGE
cluster-version-operator-579d978bf9-j268x   0/1     CrashLoopBackOff   2          37s
cluster-version-operator-579d978bf9-pbsfz   0/1     CrashLoopBackOff   2          37s

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-579d978bf9-j268x |grep -i shutting
I0302 09:33:24.000395       1 start.go:256] Stopped leading; shutting down.
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-579d978bf9-pbsfz |grep -i shutting
I0302 09:33:22.264642       1 start.go:256] Stopped leading; shutting down.

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-579d978bf9-j268x
I0302 09:33:23.543726       1 start.go:21] ClusterVersionOperator 4.7.0-202103010125.p0-cf0e3d1
I0302 09:33:23.543900       1 merged_client_builder.go:121] Using in-cluster configuration
I0302 09:33:23.548365       1 payload.go:260] Loading updatepayload from "/"
I0302 09:33:23.999267       1 cvo.go:246] Verifying release authenticity: All release image digests must have GPG signatures from verifier-public-key-redhat (567E347AD0044ADE55BA8A5F199E2F91FD431D51: Red Hat, Inc. (release key 2) <security@redhat.com>, B08B659EE86AF623BC90E8DB938A80CAF21541EB: Red Hat, Inc. (beta key 2) <security@redhat.com>) - will check for signatures in containers/image format at serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release
I0302 09:33:23.999683       1 start.go:266] Waiting on 2 outstanding goroutines.
I0302 09:33:23.999765       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
I0302 09:33:23.999907       1 reflector.go:219] Starting reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
I0302 09:33:23.999973       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134
E0302 09:33:24.000021       1 start.go:280] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
I0302 09:33:24.000099       1 start.go:266] Waiting on 1 outstanding goroutines.
I0302 09:33:24.000136       1 start.go:270] Run context completed; beginning two-minute graceful shutdown period.
I0302 09:33:24.000192       1 start.go:266] Waiting on 1 outstanding goroutines.
I0302 09:33:24.000240       1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
E0302 09:33:24.000312       1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.jialiu47.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context canceled
I0302 09:33:24.000353       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
I0302 09:33:24.000395       1 start.go:256] Stopped leading; shutting down.
I0302 09:33:24.000430       1 start.go:296] Collected leader controller goroutine.
I0302 09:33:24.000461       1 start.go:306] Finished collecting operator goroutines.
I0302 09:33:24.000514       1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.7.0-202103010125.p0-cf0e3d1.

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-579d978bf9-pbsfz
I0302 09:33:21.800805       1 start.go:21] ClusterVersionOperator 4.7.0-202103010125.p0-cf0e3d1
I0302 09:33:21.801112       1 merged_client_builder.go:121] Using in-cluster configuration
I0302 09:33:21.806884       1 payload.go:260] Loading updatepayload from "/"
I0302 09:33:22.263375       1 cvo.go:246] Verifying release authenticity: All release image digests must have GPG signatures from verifier-public-key-redhat (567E347AD0044ADE55BA8A5F199E2F91FD431D51: Red Hat, Inc. (release key 2) <security@redhat.com>, B08B659EE86AF623BC90E8DB938A80CAF21541EB: Red Hat, Inc. (beta key 2) <security@redhat.com>) - will check for signatures in containers/image format at serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release
I0302 09:33:22.263757       1 start.go:266] Waiting on 2 outstanding goroutines.
I0302 09:33:22.263792       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
I0302 09:33:22.264013       1 reflector.go:219] Starting reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264092       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264113       1 reflector.go:219] Starting reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264140       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264052       1 reflector.go:219] Starting reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264363       1 reflector.go:255] Listing and watching *v1.ClusterOperator from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
E0302 09:33:22.264442       1 start.go:280] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
I0302 09:33:22.264464       1 start.go:266] Waiting on 1 outstanding goroutines.
I0302 09:33:22.264473       1 start.go:270] Run context completed; beginning two-minute graceful shutdown period.
I0302 09:33:22.264485       1 start.go:266] Waiting on 1 outstanding goroutines.
I0302 09:33:22.264370       1 reflector.go:219] Starting reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264499       1 reflector.go:225] Stopping reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264516       1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
I0302 09:33:22.264550       1 reflector.go:225] Stopping reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264570       1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134
E0302 09:33:22.264610       1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.jialiu47.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context canceled
I0302 09:33:22.264633       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
I0302 09:33:22.264642       1 start.go:256] Stopped leading; shutting down.
I0302 09:33:22.264652       1 start.go:296] Collected leader controller goroutine.
I0302 09:33:22.264670       1 start.go:306] Finished collecting operator goroutines.
I0302 09:33:22.264672       1 reflector.go:219] Starting reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264684       1 reflector.go:225] Stopping reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0302 09:33:22.264801       1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.7.0-202103010125.p0-cf0e3d1.

After stop the nc process which is listening 9099, cvo pods get running.
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS             RESTARTS   AGE
cluster-version-operator-579d978bf9-j268x   0/1     CrashLoopBackOff   5          5m45s
cluster-version-operator-579d978bf9-pbsfz   0/1     CrashLoopBackOff   5          5m45s
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-579d978bf9-j268x   1/1     Running   6          6m10s
cluster-version-operator-579d978bf9-pbsfz   1/1     Running   6          6m10s
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-01-085007   True        False         144m    Cluster version is 4.7.0-0.nightly-2021-03-01-085007
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 1 deployments/cluster-version-operator
deployment.apps/cluster-version-operator scaled
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-01-085007   True        False         144m    Cluster version is 4.7.0-0.nightly-2021-03-01-085007
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS        RESTARTS   AGE
cluster-version-operator-579d978bf9-j268x   0/1     Terminating   6          7m27s
cluster-version-operator-579d978bf9-pbsfz   1/1     Running       6          7m27s
[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-579d978bf9-pbsfz   1/1     Running   6          7m31s

Comment 21 errata-xmlrpc 2021-03-10 11:24:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.1 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0678


Note You need to log in before you can comment on or make changes to this bug.