Bug 1880086 - kube-controller-manager degraded state after upgrade to 4.4.17
Summary: kube-controller-manager degraded state after upgrade to 4.4.17
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On: 1876484
Blocks:
Reported: 2020-09-17 16:05 UTC by mchebbi@redhat.com
Modified: 2022-11-21 12:59 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-02 18:21:34 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-controller-manager-operator pull 457 0 None closed [release-4.4] Bug 1876484: static-pod-installer: remove deadlock by recreating inst… 2021-02-20 16:50:03 UTC
Github openshift library-go pull 881 0 None closed [release-4.4] Bug 1876484: static-pod-installer: recreate installers that disappeared 2021-02-20 16:50:03 UTC
Red Hat Product Errata RHBA-2020:5122 0 None None None 2020-12-02 18:21:49 UTC

Description mchebbi@redhat.com 2020-09-17 16:05:23 UTC
Please find the attachment here: https://bit.ly/33Hefk2

The customer upgraded their OCP cluster from 4.4.16 to 4.4.17. All operators were upgraded, but kube-controller-manager has remained in a degraded state since the upgrade.

After checking the must-gather, I found the following issues:

message: |
  "atl20er8ocpwk02.ocp-lab.ocp.prgx.com"
  E0914 12:05:01.797427       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
  I0914 12:05:04.791041       1 leaderelection.go:288] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
  F0914 12:05:04.791193       1 controllermanager.go:291] leaderelection lost
  StaticPods Degraded: etcdserver: request timed out" to "NodeInstallerDegraded: 1 nodes are failing on revision 32:
  NodeInstallerDegraded: pods "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" not found"
E0914 12:05:26.485459       1 targetconfigcontroller.go:666] key failed with : synthetic requeue request
I0914 12:05:26.495107       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 32:\nNodeInstallerDegraded: pods \"installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com\" not found" to "NodeInstallerDegraded: 1 nodes are failing on revision 32:\nNodeInstallerDegraded: pods \"installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com\" not found\nTargetConfigControllerDegraded: \"v4.1.0/kube-controller-manager/leader-election-kube-controller-manager-role-kube-system.yaml\" (string): etcdserver: request timed out\nTargetConfigControllerDegraded: \"v4.1.0/kube-controller-manager/leader-election-kube-controller-manager-rolebinding-kube-system.yaml\" (string): etcdserver: request timed out"
I0914 12:05:30.381136       1 leaderelection.go:288] failed to renew lease openshift-kube-controller-manager-operator/kube-controller-manager-operator-lock: timed out waiting for the condition
F0914 12:05:30.381194       1 leaderelection.go:67] leaderelection lost
reason: Error

========================================================================

2020-09-14T12:06:43.376310558Z I0914 12:06:43.375916       1 base_controller.go:45] Starting #1 worker of StaticPodStateController controller ...
2020-09-14T12:06:43.376310558Z I0914 12:06:43.373130       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32

2020-09-14T12:06:43.382691866Z E0914 12:06:43.382648       1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32]
2020-09-14T12:06:43.385457627Z I0914 12:06:43.385423       1 prune_controller.go:280] No excluded revisions to prune, skipping
2020-09-14T12:06:43.388137316Z E0914 12:06:43.388108       1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32]
2020-09-14T12:06:43.388572367Z E0914 12:06:43.388545       1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32]

2020-09-14T12:06:43.388889295Z I0914 12:06:43.388860       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32
2020-09-14T12:06:43.388916737Z I0914 12:06:43.388907       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32

====================

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.17    True        False         7d4h    Error while reconciling 4.4.17: the cluster operator kube-controller-manager is degraded
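
For completeness, the operator's own degraded condition messages can be read straight from the ClusterOperator; this is just standard oc/jsonpath usage, shown here as one way to pull them out:

```
$ oc get clusteroperator kube-controller-manager \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'
```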

Thanks.

Comment 2 Jan Chaloupka 2020-09-21 13:24:00 UTC
Checking must-gather.local.15138691491249454/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2acc62c38f40bcebc003a6ce8a30ee58f5c1ed6dc0d8514811cc70528d93a65d/namespaces/openshift-kube-controller-manager-operator/pods/kube-controller-manager-operator-7ccbd8c597-76cs6/kube-controller-manager-operator/kube-controller-manager-operator/logs/previous.log, almost every log line is complaining about etcdserver requests timing out:

```
2020-09-14T10:41:38.259591571Z E0914 10:41:38.259508       1 leaderelection.go:331] error retrieving resource lock openshift-kube-controller-manager-operator/kube-controller-manager-operator-lock: etcdserver: request timed out
...
2020-09-14T12:05:19.387791348Z E0914 12:05:19.387738       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
```

Checking the status of etcd itself, everything looks to be in order.

Also, `NodeInstallerDegraded: pods "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" not found` is appearing in all operator status changes.
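
Roughly the kind of search that shows this across the extracted must-gather (the path prefix here is illustrative):

```
$ grep -r 'installer-32-atl20er8ocpma01' must-gather.local.15138691491249454/ --include='*.log' | head
```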

From namespaces/openshift-machine-config-operator/pods/machine-config-daemon-c6mr5/machine-config-daemon/machine-config-daemon/logs/current.log:
```
2020-09-12T00:40:42.126837462Z I0912 00:40:42.126789    2625 update.go:103] ignoring DaemonSet-managed pods: tuned-6qd85, controller-manager-qcxzs, dns-default-hvf9p, node-ca-ctf9d, machine-config-daemon-c6mr5, machine-config-server-64c7t, node-exporter-blt8z, multus-admission-controller-dprfv, multus-r4zhw, istio-node-9k842, ovs-4ft5h, sdn-24jnp, sdn-controller-prrrj; deleting pods with local storage: insights-operator-76d9f9865-jh6wb; deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com
2020-09-12T00:40:57.234395907Z I0912 00:40:57.234344    2625 update.go:99] pod "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" removed (evicted)
```

Comparing the timestamps, the kube-controller-manager-operator-7ccbd8c597-76cs6 previous.log starts at 2020-09-13T12:55:04.345074703Z. However, the missing "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod was evicted before that (roughly 36 hours earlier, assuming clocks are in sync), at 2020-09-12T00:40:57.234395907Z.


Two questions:
1) Why is the machine-config-daemon deleting the "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod?
2) Why is the "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod not recreated?

Comment 3 Jan Chaloupka 2020-09-21 13:35:38 UTC
Moez, would it be possible to increase the log level of the kube-controller-manager-operator, e.g. to at least 4?

Built from release-4.4 branch:
```
$ ./cluster-kube-controller-manager-operator operator --help
Start the Cluster kube-controller-manager Operator

Usage:
  cluster-kube-controller-manager-operator operator [flags]

Flags:
      --config string                    Location of the master configuration file to run from.
  -h, --help                             help for operator
      --kubeconfig string                Location of the master configuration file to run from.
      --listen string                    The ip:port to serve on.
      --namespace string                 Namespace where the controller is running. Auto-detected if run in cluster.
      --terminate-on-files stringArray   A list of files. If one of them changes, the process will terminate.

Global Flags:
      --add-dir-header                   If true, adds the file directory to the header
      --alsologtostderr                  log to standard error as well as files
      --log-backtrace-at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log-dir string                   If non-empty, write log files in this directory
      --log-file string                  If non-empty, use this log file
      --log-file-max-size uint           Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --log-flush-frequency duration     Maximum number of seconds between log flushes (default 5s)
      --logtostderr                      log to standard error instead of files (default true)
      --skip-headers                     If true, avoid header prefixes in the log messages
      --skip-log-headers                 If true, avoid headers when opening log files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          number for the log level verbosity
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging
```

Overriding the arguments and setting -v=4 should hopefully be enough; a rough sketch of how to do that follows.
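
One possible way to do it (a sketch, not a supported procedure: the container/args index is an assumption, and the CVO has to be paused so it does not revert the Deployment while logs are being gathered):

```
# Pause the CVO so it does not immediately revert the Deployment change
# (assumption: acceptable for a short debug window).
$ oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0

# Append -v=4 to the operator container arguments (container index 0 assumed).
$ oc -n openshift-kube-controller-manager-operator patch deployment kube-controller-manager-operator \
    --type=json -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"-v=4"}]'

# Scale the CVO back up once the higher-verbosity logs have been captured.
$ oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=1
```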

Comment 4 Jan Chaloupka 2020-09-21 13:49:29 UTC
The installer pod not getting created might be caused by a deadlock. Potential fix: https://github.com/openshift/library-go/pull/881
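
The gist of the change, as a rough illustrative sketch only (not the code from the PR; the helper name and the recent client-go signatures here are my own):

```go
package staticpodinstaller

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureInstallerPod is a hypothetical helper sketching the idea: if the installer
// pod the controller is waiting on no longer exists (for example it was evicted
// during a node drain), recreate it so the revision rollout can finish instead of
// deadlocking on a pod that will never report a result.
func ensureInstallerPod(ctx context.Context, client kubernetes.Interface, namespace string, desired *corev1.Pod) error {
	_, err := client.CoreV1().Pods(namespace).Get(ctx, desired.Name, metav1.GetOptions{})
	if err == nil {
		// The installer pod still exists; keep waiting for it to report a result.
		return nil
	}
	if !apierrors.IsNotFound(err) {
		return fmt.Errorf("checking installer pod %s/%s: %w", namespace, desired.Name, err)
	}
	// The pod disappeared: recreate it so the node can still reach the target revision.
	if _, err := client.CoreV1().Pods(namespace).Create(ctx, desired, metav1.CreateOptions{}); err != nil && !apierrors.IsAlreadyExists(err) {
		return fmt.Errorf("recreating installer pod %s/%s: %w", namespace, desired.Name, err)
	}
	return nil
}
```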

Comment 5 Jan Chaloupka 2020-09-21 13:50:57 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1876484 for porting the fix into relevant operators

Comment 6 Jan Chaloupka 2020-10-01 15:10:29 UTC
Adding UpcomingSprint just in case the fix does not get merged by EOW.

Comment 10 Scott Dodson 2020-10-19 14:13:19 UTC
I'm moving this to MODIFIED and dropping it from the errata. This means that when the next set of 4.4 nightlies starts, it will move to ON_QA, which will hopefully coincide with the fix for Bug 1876484 merging.

Comment 13 RamaKasturi 2020-10-21 12:15:59 UTC
Moving the bug back to POST state as I still see Bug 1876484 in POST state. Please feel free to move this back whenever the other bug is ON_QA, thanks!

Comment 14 Jan Chaloupka 2020-10-23 11:07:48 UTC
Just waiting until https://bugzilla.redhat.com/show_bug.cgi?id=1876484 gets moved to MODIFIED

Comment 15 Jan Chaloupka 2020-11-08 12:56:57 UTC
Since https://bugzilla.redhat.com/show_bug.cgi?id=1876484 is in VERIFIED state, this issue can be moved to VERIFIED as well. Moving to ON_QA, leaving the VERIFIED state for QE.

Comment 16 RamaKasturi 2020-11-09 09:54:39 UTC
Since the underlying bug https://bugzilla.redhat.com/show_bug.cgi?id=1876484 is already in VERIFIED state, moving this bug to VERIFIED as well.

Comment 19 errata-xmlrpc 2020-12-02 18:21:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.4.31 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5122

Comment 20 RamaKasturi 2022-11-21 12:59:20 UTC
qe_test_coverage has been set to '-' because the issue is caused by the underlying bug; test coverage added for that bug would be sufficient to cover this case as well.

