Bug 1880086 - kube-controller-manager degraded state after upgrade to 4.4.17
Summary: kube-controller-manager degraded state after upgrade to 4.4.17
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On: 1876484
Blocks:
Reported: 2020-09-17 16:05 UTC by mchebbi@redhat.com
Modified: 2022-11-21 12:59 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-02 18:21:34 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-controller-manager-operator pull 457 0 None closed [release-4.4] Bug 1876484: static-pod-installer: remove deadlock by recreating inst… 2021-02-20 16:50:03 UTC
Github openshift library-go pull 881 0 None closed [release-4.4] Bug 1876484: static-pod-installer: recreate installers that disappeared 2021-02-20 16:50:03 UTC
Red Hat Product Errata RHBA-2020:5122 0 None None None 2020-12-02 18:21:49 UTC

Description mchebbi@redhat.com 2020-09-17 16:05:23 UTC
Please find the attachment here: https://bit.ly/33Hefk2

The customer upgraded their OCP cluster from 4.4.16 to 4.4.17. All operators were upgraded, but kube-controller-manager has remained in a degraded state since the upgrade.

After checking the must-gather, I found the following issues:

message: |
  "atl20er8ocpwk02.ocp-lab.ocp.prgx.com"
  E0914 12:05:01.797427       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
  I0914 12:05:04.791041       1 leaderelection.go:288] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
  F0914 12:05:04.791193       1 controllermanager.go:291] leaderelection lost
  StaticPods Degraded: etcdserver: request timed out" to "NodeInstallerDegraded: 1 nodes are failing on revision 32:
  NodeInstallerDegraded: pods "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" not found"
E0914 12:05:26.485459       1 targetconfigcontroller.go:666] key failed with : synthetic requeue request
I0914 12:05:26.495107       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 32:\nNodeInstallerDegraded: pods \"installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com\" not found" to "NodeInstallerDegraded: 1 nodes are failing on revision 32:\nNodeInstallerDegraded: pods \"installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com\" not found\nTargetConfigControllerDegraded: \"v4.1.0/kube-controller-manager/leader-election-kube-controller-manager-role-kube-system.yaml\" (string): etcdserver: request timed out\nTargetConfigControllerDegraded: \"v4.1.0/kube-controller-manager/leader-election-kube-controller-manager-rolebinding-kube-system.yaml\" (string): etcdserver: request timed out"
I0914 12:05:30.381136       1 leaderelection.go:288] failed to renew lease openshift-kube-controller-manager-operator/kube-controller-manager-operator-lock: timed out waiting for the condition
F0914 12:05:30.381194       1 leaderelection.go:67] leaderelection lost
reason: Error

========================================================================

2020-09-14T12:06:43.376310558Z I0914 12:06:43.375916       1 base_controller.go:45] Starting #1 worker of StaticPodStateController controller ...
2020-09-14T12:06:43.376310558Z I0914 12:06:43.373130       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32

2020-09-14T12:06:43.382691866Z E0914 12:06:43.382648       1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32]
2020-09-14T12:06:43.385457627Z I0914 12:06:43.385423       1 prune_controller.go:280] No excluded revisions to prune, skipping
2020-09-14T12:06:43.388137316Z E0914 12:06:43.388108       1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32]
2020-09-14T12:06:43.388572367Z E0914 12:06:43.388545       1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32]

2020-09-14T12:06:43.388889295Z I0914 12:06:43.388860       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32
2020-09-14T12:06:43.388916737Z I0914 12:06:43.388907       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32

====================

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.17    True        False         7d4h    Error while reconciling 4.4.17: the cluster operator kube-controller-manager is degraded
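
For completeness, the operator's own degraded condition messages can be read straight from the ClusterOperator; this is just standard oc/jsonpath usage, shown here as one way to pull them out:

```
$ oc get clusteroperator kube-controller-manager \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'
```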

Thanks.

Comment 2 Jan Chaloupka 2020-09-21 13:24:00 UTC
Checking must-gather.local.15138691491249454/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2acc62c38f40bcebc003a6ce8a30ee58f5c1ed6dc0d8514811cc70528d93a65d/namespaces/openshift-kube-controller-manager-operator/pods/kube-controller-manager-operator-7ccbd8c597-76cs6/kube-controller-manager-operator/kube-controller-manager-operator/logs/previous.log, almost every log line is complaining about etcdserver requests timing out:

```
2020-09-14T10:41:38.259591571Z E0914 10:41:38.259508       1 leaderelection.go:331] error retrieving resource lock openshift-kube-controller-manager-operator/kube-controller-manager-operator-lock: etcdserver: request timed out
...
2020-09-14T12:05:19.387791348Z E0914 12:05:19.387738       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
```

Checking the status of etcd itself, everything looks to be in order.

Also, `NodeInstallerDegraded: pods "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" not found` is appearing in all operator status changes.
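
Roughly the kind of search that shows this across the extracted must-gather (the path prefix here is illustrative):

```
$ grep -r 'installer-32-atl20er8ocpma01' must-gather.local.15138691491249454/ --include='*.log' | head
```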

From namespaces/openshift-machine-config-operator/pods/machine-config-daemon-c6mr5/machine-config-daemon/machine-config-daemon/logs/current.log:
```
2020-09-12T00:40:42.126837462Z I0912 00:40:42.126789    2625 update.go:103] ignoring DaemonSet-managed pods: tuned-6qd85, controller-manager-qcxzs, dns-default-hvf9p, node-ca-ctf9d, machine-config-daemon-c6mr5, machine-config-server-64c7t, node-exporter-blt8z, multus-admission-controller-dprfv, multus-r4zhw, istio-node-9k842, ovs-4ft5h, sdn-24jnp, sdn-controller-prrrj; deleting pods with local storage: insights-operator-76d9f9865-jh6wb; deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com
2020-09-12T00:40:57.234395907Z I0912 00:40:57.234344    2625 update.go:99] pod "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" removed (evicted)
```

Comparing the timestamps, the kube-controller-manager-operator-7ccbd8c597-76cs6 previous.log starts at 2020-09-13T12:55:04.345074703Z. However, the missing "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod was evicted before that (roughly 36 hours earlier, assuming clocks are in sync), at 2020-09-12T00:40:57.234395907Z.


Two questions:
1) Why is the machine-config-daemon deleting the "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod?
2) Why is the "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod not recreated?

Comment 3 Jan Chaloupka 2020-09-21 13:35:38 UTC
Moez, would it be possible to increase the log level of the kube-controller-manager-operator, e.g. to at least 4?

Built from release-4.4 branch:
```
$ ./cluster-kube-controller-manager-operator operator --help
Start the Cluster kube-controller-manager Operator

Usage:
  cluster-kube-controller-manager-operator operator [flags]

Flags:
      --config string                    Location of the master configuration file to run from.
  -h, --help                             help for operator
      --kubeconfig string                Location of the master configuration file to run from.
      --listen string                    The ip:port to serve on.
      --namespace string                 Namespace where the controller is running. Auto-detected if run in cluster.
      --terminate-on-files stringArray   A list of files. If one of them changes, the process will terminate.

Global Flags:
      --add-dir-header                   If true, adds the file directory to the header
      --alsologtostderr                  log to standard error as well as files
      --log-backtrace-at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log-dir string                   If non-empty, write log files in this directory
      --log-file string                  If non-empty, use this log file
      --log-file-max-size uint           Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --log-flush-frequency duration     Maximum number of seconds between log flushes (default 5s)
      --logtostderr                      log to standard error instead of files (default true)
      --skip-headers                     If true, avoid header prefixes in the log messages
      --skip-log-headers                 If true, avoid headers when opening log files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          number for the log level verbosity
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging
```

Overriding the arguments and setting -v=4 should hopefully be enough; a rough sketch of how to do that follows.
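
One possible way to do it (a sketch, not a supported procedure: the container/args index is an assumption, and the CVO has to be paused so it does not revert the Deployment while logs are being gathered):

```
# Pause the CVO so it does not immediately revert the Deployment change
# (assumption: acceptable for a short debug window).
$ oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0

# Append -v=4 to the operator container arguments (container index 0 assumed).
$ oc -n openshift-kube-controller-manager-operator patch deployment kube-controller-manager-operator \
    --type=json -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"-v=4"}]'

# Scale the CVO back up once the higher-verbosity logs have been captured.
$ oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=1
```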

Comment 4 Jan Chaloupka 2020-09-21 13:49:29 UTC
The installer pod not getting created might be caused by a deadlock. Potential fix: https://github.com/openshift/library-go/pull/881
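
The gist of the change, as a rough illustrative sketch only (not the code from the PR; the helper name and the recent client-go signatures here are my own):

```go
package staticpodinstaller

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureInstallerPod is a hypothetical helper sketching the idea: if the installer
// pod the controller is waiting on no longer exists (for example it was evicted
// during a node drain), recreate it so the revision rollout can finish instead of
// deadlocking on a pod that will never report a result.
func ensureInstallerPod(ctx context.Context, client kubernetes.Interface, namespace string, desired *corev1.Pod) error {
	_, err := client.CoreV1().Pods(namespace).Get(ctx, desired.Name, metav1.GetOptions{})
	if err == nil {
		// The installer pod still exists; keep waiting for it to report a result.
		return nil
	}
	if !apierrors.IsNotFound(err) {
		return fmt.Errorf("checking installer pod %s/%s: %w", namespace, desired.Name, err)
	}
	// The pod disappeared: recreate it so the node can still reach the target revision.
	if _, err := client.CoreV1().Pods(namespace).Create(ctx, desired, metav1.CreateOptions{}); err != nil && !apierrors.IsAlreadyExists(err) {
		return fmt.Errorf("recreating installer pod %s/%s: %w", namespace, desired.Name, err)
	}
	return nil
}
```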

Comment 5 Jan Chaloupka 2020-09-21 13:50:57 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1876484 for porting the fix into relevant operators

Comment 6 Jan Chaloupka 2020-10-01 15:10:29 UTC
Adding UpcomingSprint just in case the fix does not get merged by EOW.

Comment 10 Scott Dodson 2020-10-19 14:13:19 UTC
I'm moving this to MODIFIED and dropping it from the errata. This means that when the next set of 4.4 nightlies starts, it will move to ON_QA, which will hopefully coincide with the fix for Bug 1876484 merging.

Comment 13 RamaKasturi 2020-10-21 12:15:59 UTC
Moving the bug back to POST state as I still see Bug 1876484 in POST state. Please feel free to move this back whenever the other bug is ON_QA, thanks!

Comment 14 Jan Chaloupka 2020-10-23 11:07:48 UTC
Just waiting until https://bugzilla.redhat.com/show_bug.cgi?id=1876484 gets moved to MODIFIED

Comment 15 Jan Chaloupka 2020-11-08 12:56:57 UTC
Since https://bugzilla.redhat.com/show_bug.cgi?id=1876484 is in VERIFIED state, this issue can be moved to VERIFIED as well. Moving to ON_QA, leaving the VERIFIED state for QE.

Comment 16 RamaKasturi 2020-11-09 09:54:39 UTC
Since the underlying bug https://bugzilla.redhat.com/show_bug.cgi?id=1876484 is already in VERIFIED state, moving this bug to VERIFIED as well.

Comment 19 errata-xmlrpc 2020-12-02 18:21:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.4.31 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5122

Comment 20 RamaKasturi 2022-11-21 12:59:20 UTC
qe_test_coverage has been set to '-' because the issue is caused by the underlying bug; test coverage added for that bug would be sufficient to cover this case as well.

