+++ This bug was initially created as a clone of Bug #1921413 +++ --- Additional comment from W. Trevor King on 2021-02-11 22:03:42 UTC --- Moving back to ASSIGNED on a new failure mode. Same metrics bind conflict. This time we didn't get a leader lease at all, so Options.run hung waiting for 'leader controller' [1]. Leader context is postMainContext [2], which we only cancel after collecting the main operator goroutine [3], which we only launch after we acquire the leader lease [4]. Still thinking through the fix... [1]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L259 [2]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L229 [3]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L279 [4]: https://github.com/openshift/cluster-version-operator/blob/9e25ac8815dbf0d7b97dc640ad78071e49391a54/pkg/start/start.go#L236-L240 --- Additional comment from W. Trevor King on 2021-02-11 22:14:11 UTC --- Actually, with 4.7 forked off of master already, it's probably better if I open a fresh 4.8 bug for comment 11. Then this bug can become its 4.7 backport, and we don't have to mess with swapping the errata association.
4.5->4.6 regression, so blocker-, same as [1]. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1921413#c7
Verified this bug with 4.8.0-0.nightly-2021-02-18-175130, and passed. Run an install using 4.8.0-0.nightly-2021-02-18-175130, after installation, log onto each masters to run the following command: # while :; do if netstat -lnp|grep -q :9099; then echo "9099 Address already in use"; else nc -l 9099 -k; fi; sleep 10; done This command will start a process listenning on 9099, so the incoming CVO pods will fail to launch their metrics servers. [root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 0 deployments/cluster-version-operator deployment.apps/cluster-version-operator scaled [root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version No resources found in openshift-cluster-version namespace. Scale CVO to replicas to 2 [root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version scale --replicas 2 deployments/cluster-version-operator deployment.apps/cluster-version-operator scaled [root@preserve-jialiu-ansible ~]# oc get po -n openshift-cluster-version NAME READY STATUS RESTARTS AGE cluster-version-operator-5bbf4ff56b-7gvdt 0/1 CrashLoopBackOff 1 5s cluster-version-operator-5bbf4ff56b-dlhx4 0/1 CrashLoopBackOff 1 5s check CVO log. [root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-5bbf4ff56b-7gvdt |grep -i shutting I0220 04:48:56.568008 1 start.go:256] Stopped leading; shutting down. [root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-5bbf4ff56b-dlhx4 |grep -i shutting I0220 04:48:56.862711 1 start.go:256] Stopped leading; shutting down. The "Stopped leading; shutting down" log is seen. [root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-5bbf4ff56b-7gvdt I0220 04:49:42.041878 1 start.go:21] ClusterVersionOperator 4.8.0-202102170021.p0-bac69e0 I0220 04:49:42.042127 1 merged_client_builder.go:121] Using in-cluster configuration I0220 04:49:42.046689 1 payload.go:260] Loading updatepayload from "/" I0220 04:49:42.523309 1 cvo.go:246] Verifying release authenticity: All release image digests must have GPG signatures from verifier-public-key-redhat (567E347AD0044ADE55BA8A5F199E2F91FD431D51: Red Hat, Inc. (release key 2) <security>, B08B659EE86AF623BC90E8DB938A80CAF21541EB: Red Hat, Inc. (beta key 2) <security>) - will check for signatures in containers/image format at serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release I0220 04:49:42.523708 1 start.go:266] Waiting on 2 outstanding goroutines. I0220 04:49:42.523735 1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version... I0220 04:49:42.523959 1 reflector.go:219] Starting reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134 I0220 04:49:42.524035 1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134 I0220 04:49:42.524124 1 reflector.go:219] Starting reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134 I0220 04:49:42.524543 1 reflector.go:255] Listing and watching *v1.ConfigMap from k8s.io/client-go/informers/factory.go:134 I0220 04:49:42.524327 1 reflector.go:219] Starting reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 I0220 04:49:42.524698 1 reflector.go:255] Listing and watching *v1.ClusterOperator from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 I0220 04:49:42.524485 1 reflector.go:219] Starting reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 I0220 04:49:42.524802 1 reflector.go:255] Listing and watching *v1.ClusterVersion from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 E0220 04:49:42.524492 1 start.go:280] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use I0220 04:49:42.524935 1 start.go:266] Waiting on 1 outstanding goroutines. I0220 04:49:42.524945 1 start.go:270] Run context completed; beginning two-minute graceful shutdown period. I0220 04:49:42.524957 1 start.go:266] Waiting on 1 outstanding goroutines. I0220 04:49:42.523992 1 reflector.go:219] Starting reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 I0220 04:49:42.524971 1 reflector.go:225] Stopping reflector *v1.Proxy (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 I0220 04:49:42.524996 1 reflector.go:225] Stopping reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 I0220 04:49:42.525003 1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m52.861090565s) from k8s.io/client-go/informers/factory.go:134 I0220 04:49:42.525015 1 reflector.go:225] Stopping reflector *v1.ClusterOperator (2m52.525702462s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 I0220 04:49:42.525030 1 reflector.go:225] Stopping reflector *v1.ConfigMap (3m19.747206386s) from k8s.io/client-go/informers/factory.go:134 E0220 04:49:42.525049 1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.jialiu48.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context canceled I0220 04:49:42.525072 1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version I0220 04:49:42.525080 1 start.go:256] Stopped leading; shutting down. I0220 04:49:42.525092 1 start.go:296] Collected leader controller goroutine. I0220 04:49:42.525101 1 start.go:306] Finished collecting operator goroutines. I0220 04:49:42.525162 1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.8.0-202102170021.p0-bac69e0. [root@preserve-jialiu-ansible ~]# oc -n openshift-cluster-version logs cluster-version-operator-5bbf4ff56b-dlhx4 I0220 04:49:37.276636 1 start.go:21] ClusterVersionOperator 4.8.0-202102170021.p0-bac69e0 I0220 04:49:37.276828 1 merged_client_builder.go:121] Using in-cluster configuration I0220 04:49:37.281339 1 payload.go:260] Loading updatepayload from "/" I0220 04:49:37.784101 1 cvo.go:246] Verifying release authenticity: All release image digests must have GPG signatures from verifier-public-key-redhat (567E347AD0044ADE55BA8A5F199E2F91FD431D51: Red Hat, Inc. (release key 2) <security>, B08B659EE86AF623BC90E8DB938A80CAF21541EB: Red Hat, Inc. (beta key 2) <security>) - will check for signatures in containers/image format at serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release I0220 04:49:37.784740 1 start.go:266] Waiting on 2 outstanding goroutines. I0220 04:49:37.784825 1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version... E0220 04:49:37.785783 1 start.go:280] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use I0220 04:49:37.785850 1 start.go:266] Waiting on 1 outstanding goroutines. I0220 04:49:37.785887 1 start.go:270] Run context completed; beginning two-minute graceful shutdown period. I0220 04:49:37.785935 1 start.go:266] Waiting on 1 outstanding goroutines. I0220 04:49:37.786154 1 reflector.go:219] Starting reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 I0220 04:49:37.786199 1 reflector.go:225] Stopping reflector *v1.ClusterVersion (3m12.559234557s) from github.com/openshift/client-go/config/informers/externalversions/factory.go:101 E0220 04:49:37.786293 1 leaderelection.go:325] error retrieving resource lock openshift-cluster-version/version: Get "https://api-int.jialiu48.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cluster-version/configmaps/version": context canceled I0220 04:49:37.786333 1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version I0220 04:49:37.786361 1 start.go:256] Stopped leading; shutting down. I0220 04:49:37.786390 1 start.go:296] Collected leader controller goroutine. I0220 04:49:37.786419 1 start.go:306] Finished collecting operator goroutines. I0220 04:49:37.786467 1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.8.0-202102170021.p0-bac69e0. If anything get wrong, pls correct me.
Looks great to me, thanks :)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438