1927944 – cluster version operator cycles terminating state waiting for leader election

Bug 1927944 - cluster version operator cycles terminating state waiting for leader election

Summary: cluster version operator cycles terminating state waiting for leader election

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.6.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	W. Trevor King
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1921413
TreeView+	depends on / blocked

Reported:	2021-02-11 22:15 UTC by W. Trevor King
Modified:	2021-07-27 22:44 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:	1921413
Environment:
Last Closed:	2021-07-27 22:44:18 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 519	0	None	closed	Bug 1927944: pkg/start: Fix shutdown deadlock when die before getting a leader lock	2021-02-16 05:53:46 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:44:35 UTC

Description W. Trevor King 2021-02-11 22:15:33 UTC

+++ This bug was initially created as a clone of Bug #1921413 +++

--- Additional comment from W. Trevor King on 2021-02-11 22:03:42 UTC ---

Moving back to ASSIGNED on a new failure mode.  Same metrics bind conflict.  This time we didn't get a leader lease at all, so Options.run hung waiting for 'leader controller' [1].  Leader context is postMainContext [2], which we only cancel after collecting the main operator goroutine [3], which we only launch after we acquire the leader lease [4].  Still thinking through the fix...

[1]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L259
[2]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L229
[3]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L279
[4]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L236-L240

--- Additional comment from W. Trevor King on 2021-02-11 22:14:11 UTC ---

Actually, with 4.7 forked off of master already, it's probably better if I open a fresh 4.8 bug for comment 11.  Then this bug can become its 4.7 backport, and we don't have to mess with swapping the errata association.

Comment 1 W. Trevor King 2021-02-12 22:25:26 UTC

4.5->4.6 regression, so blocker-, same as [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1921413#c7

Comment 3 Johnny Liu 2021-02-20 04:56:34 UTC

Verified this bug with 4.8.0-0.nightly-2021-02-18-175130, and passed.

Run an install using 4.8.0-0.nightly-2021-02-18-175130, after installation, log onto each masters to run the following command:
# while :; do if netstat -lnp|grep -q  :9099; then echo  "9099 Address already in use"; else nc -l 9099 -k; fi; sleep 10; done

This command will start a process listenning on 9099, so the incoming CVO pods will fail to launch their metrics servers.

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 0 deployments/cluster-version-operator
deployment.apps/cluster-version-operator scaled

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
No resources found in openshift-cluster-version namespace.

Scale CVO to replicas to 2
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 2 deployments/cluster-version-operator
deployment.apps/cluster-version-operator scaled

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS             RESTARTS   AGE
cluster-version-operator-5bbf4ff56b-7gvdt   0/1     CrashLoopBackOff   1          5s
cluster-version-operator-5bbf4ff56b-dlhx4   0/1     CrashLoopBackOff   1          5s


check CVO log.
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-5bbf4ff56b-7gvdt |grep -i shutting
I0220 04:48:56.568008       1 start.go:256] Stopped leading; shutting down.
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-5bbf4ff56b-dlhx4 |grep -i shutting
I0220 04:48:56.862711       1 start.go:256] Stopped leading; shutting down.

The "Stopped leading; shutting down" log is seen.

[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-5bbf4ff56b-7gvdt
I0220 04:49:42.041878       1 start.go:21] ClusterVersionOperator 4.8.0-202102170021.p0-bac69e0
I0220 04:49:42.042127       1 merged_client_builder.go:121] Using in-cluster configuration
I0220 04:49:42.046689       1 payload.go:260] Loading updatepayload from "/"
I0220 04:49:42.523309       1 cvo.go:246] Verifying release authenticity: All release image digests must have GPG signatures from verifier-public-key-redhat (567E347AD0044ADE55BA8A5F199E2F91FD431D51: Red Hat, Inc. (release key 2) <security>, B08B659EE86AF623BC90E8DB938A80CAF21541EB: Red Hat, Inc. (beta key 2) <security>) - will check for signatures in containers/image format at serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release
I0220 04:49:42.523708       1 start.go:266] Waiting on 2 outstanding goroutines.
I0220 04:49:42.523735       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
I0220 04:49:42.523959       1 reflector.go:219] Starting reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134
I0220 04:49:42.524035       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134
I0220 04:49:42.524124       1 reflector.go:219] Starting reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
I0220 04:49:42.524543       1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134
I0220 04:49:42.524327       1 reflector.go:219] Starting reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0220 04:49:42.524698       1 reflector.go:255] Listing and watching *v1.ClusterOperator from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0220 04:49:42.524485       1 reflector.go:219] Starting reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0220 04:49:42.524802       1 reflector.go:255] Listing and watching *v1.ClusterVersion from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
E0220 04:49:42.524492       1 start.go:280] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
I0220 04:49:42.524935       1 start.go:266] Waiting on 1 outstanding goroutines.
I0220 04:49:42.524945       1 start.go:270] Run context completed; beginning two-minute graceful shutdown period.
I0220 04:49:42.524957       1 start.go:266] Waiting on 1 outstanding goroutines.
I0220 04:49:42.523992       1 reflector.go:219] Starting reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0220 04:49:42.524971       1 reflector.go:225] Stopping reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0220 04:49:42.524996       1 reflector.go:225] Stopping reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0220 04:49:42.525003       1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134
I0220 04:49:42.525015       1 reflector.go:225] Stopping reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0220 04:49:42.525030       1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134
E0220 04:49:42.525049       1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.jialiu48.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context canceled
I0220 04:49:42.525072       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
I0220 04:49:42.525080       1 start.go:256] Stopped leading; shutting down.
I0220 04:49:42.525092       1 start.go:296] Collected leader controller goroutine.
I0220 04:49:42.525101       1 start.go:306] Finished collecting operator goroutines.
I0220 04:49:42.525162       1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.8.0-202102170021.p0-bac69e0.
[root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-5bbf4ff56b-dlhx4
I0220 04:49:37.276636       1 start.go:21] ClusterVersionOperator 4.8.0-202102170021.p0-bac69e0
I0220 04:49:37.276828       1 merged_client_builder.go:121] Using in-cluster configuration
I0220 04:49:37.281339       1 payload.go:260] Loading updatepayload from "/"
I0220 04:49:37.784101       1 cvo.go:246] Verifying release authenticity: All release image digests must have GPG signatures from verifier-public-key-redhat (567E347AD0044ADE55BA8A5F199E2F91FD431D51: Red Hat, Inc. (release key 2) <security>, B08B659EE86AF623BC90E8DB938A80CAF21541EB: Red Hat, Inc. (beta key 2) <security>) - will check for signatures in containers/image format at serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release
I0220 04:49:37.784740       1 start.go:266] Waiting on 2 outstanding goroutines.
I0220 04:49:37.784825       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
E0220 04:49:37.785783       1 start.go:280] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
I0220 04:49:37.785850       1 start.go:266] Waiting on 1 outstanding goroutines.
I0220 04:49:37.785887       1 start.go:270] Run context completed; beginning two-minute graceful shutdown period.
I0220 04:49:37.785935       1 start.go:266] Waiting on 1 outstanding goroutines.
I0220 04:49:37.786154       1 reflector.go:219] Starting reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
I0220 04:49:37.786199       1 reflector.go:225] Stopping reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101
E0220 04:49:37.786293       1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.jialiu48.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context canceled
I0220 04:49:37.786333       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
I0220 04:49:37.786361       1 start.go:256] Stopped leading; shutting down.
I0220 04:49:37.786390       1 start.go:296] Collected leader controller goroutine.
I0220 04:49:37.786419       1 start.go:306] Finished collecting operator goroutines.
I0220 04:49:37.786467       1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.8.0-202102170021.p0-bac69e0.

If anything get wrong, pls correct me.

Comment 4 W. Trevor King 2021-02-20 05:23:51 UTC

Looks great to me, thanks :)

Comment 7 errata-xmlrpc 2021-07-27 22:44:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Note You need to log in before you can comment on or make changes to this bug.