Please find the attachment here: https://bit.ly/33Hefk2

The customer upgraded their OCP cluster from 4.4.16 to 4.4.17. All operators were upgraded, but the kube-controller-manager operator has remained in a degraded state since the upgrade. After checking the must-gather, I found the following issues:

```
message: | \"atl20er8ocpwk02.ocp-lab.ocp.prgx.com\" \nE0914 12:05:01.797427 1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out\nI0914 12:05:04.791041 1 leaderelection.go:288] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition\nF0914 12:05:04.791193 1 controllermanager.go:291] leaderelection lost\n"
StaticPods Degraded: etcdserver: request timed out" to "NodeInstallerDegraded: 1 nodes are failing on revision 32:\nNodeInstallerDegraded: pods \"installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com\" not found"
E0914 12:05:26.485459 1 targetconfigcontroller.go:666] key failed with : synthetic requeue request
I0914 12:05:26.495107 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 32:\nNodeInstallerDegraded: pods \"installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com\" not found" to "NodeInstallerDegraded: 1 nodes are failing on revision 32:\nNodeInstallerDegraded: pods \"installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com\" not found\nTargetConfigControllerDegraded: \"v4.1.0/kube-controller-manager/leader-election-kube-controller-manager-role-kube-system.yaml\" (string): etcdserver: request timed out\nTargetConfigControllerDegraded: \"v4.1.0/kube-controller-manager/leader-election-kube-controller-manager-rolebinding-kube-system.yaml\" (string): etcdserver: request timed out"
I0914 12:05:30.381136 1 leaderelection.go:288] failed to renew lease openshift-kube-controller-manager-operator/kube-controller-manager-operator-lock: timed out waiting for the condition
F0914 12:05:30.381194 1 leaderelection.go:67] leaderelection lost
reason: Error
```

========================================================================

```
2020-09-14T12:06:43.376310558Z I0914 12:06:43.375916 1 base_controller.go:45] Starting #1 worker of StaticPodStateController controller ...
2020-09-14T12:06:43.376310558Z I0914 12:06:43.373130 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32
2020-09-14T12:06:43.382691866Z E0914 12:06:43.382648 1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32]
2020-09-14T12:06:43.385457627Z I0914 12:06:43.385423 1 prune_controller.go:280] No excluded revisions to prune, skipping
2020-09-14T12:06:43.388137316Z E0914 12:06:43.388108 1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32]
2020-09-14T12:06:43.388572367Z E0914 12:06:43.388545 1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32]
2020-09-14T12:06:43.388889295Z I0914 12:06:43.388860 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32
2020-09-14T12:06:43.388916737Z I0914 12:06:43.388907 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"57342c9c-9db9-40b4-8c5c-fafd5d2e63e5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' configmaps: aggregator-client-ca,client-ca, configmaps: cluster-policy-controller-config-32,config-32,controller-manager-kubeconfig-32,kube-controller-cert-syncer-kubeconfig-32,kube-controller-manager-pod-32,service-ca-32,serviceaccount-ca-32, secrets: csr-signer,kube-controller-manager-client-cert-key, secrets: localhost-recovery-client-token-32,service-account-private-key-32
```

====================

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.17    True        False         7d4h    Error while reconciling 4.4.17: the cluster operator kube-controller-manager is degraded
```

Thanks.
Checking must-gather.local.15138691491249454/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2acc62c38f40bcebc003a6ce8a30ee58f5c1ed6dc0d8514811cc70528d93a65d/namespaces/openshift-kube-controller-manager-operator/pods/kube-controller-manager-operator-7ccbd8c597-76cs6/kube-controller-manager-operator/kube-controller-manager-operator/logs/previous.log, almost every log line is complaining about the etcdserver timing out:

```
2020-09-14T10:41:38.259591571Z E0914 10:41:38.259508 1 leaderelection.go:331] error retrieving resource lock openshift-kube-controller-manager-operator/kube-controller-manager-operator-lock: etcdserver: request timed out
...
2020-09-14T12:05:19.387791348Z E0914 12:05:19.387738 1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
```

Checking the status of etcd itself, everything looks in order.

Also, `NodeInstallerDegraded: pods "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" not found` appears in all of the operator status changes.

From namespaces/openshift-machine-config-operator/pods/machine-config-daemon-c6mr5/machine-config-daemon/machine-config-daemon/logs/current.log:

```
2020-09-12T00:40:42.126837462Z I0912 00:40:42.126789 2625 update.go:103] ignoring DaemonSet-managed pods: tuned-6qd85, controller-manager-qcxzs, dns-default-hvf9p, node-ca-ctf9d, machine-config-daemon-c6mr5, machine-config-server-64c7t, node-exporter-blt8z, multus-admission-controller-dprfv, multus-r4zhw, istio-node-9k842, ovs-4ft5h, sdn-24jnp, sdn-controller-prrrj; deleting pods with local storage: insights-operator-76d9f9865-jh6wb; deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com
2020-09-12T00:40:57.234395907Z I0912 00:40:57.234344 2625 update.go:99] pod "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" removed (evicted)
```

Comparing the timestamps: the kube-controller-manager-operator-7ccbd8c597-76cs6 previous.log starts at 2020-09-13T12:55:04.345074703Z. However, the missing "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod was evicted well before that (roughly 36 hours earlier, assuming clocks are in sync), at 2020-09-12T00:40:57.234395907Z.

Two questions:
1) Why is the machine-config-daemon deleting the "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod?
2) Why is the "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod not recreated?
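To narrow down question 2, it may also help to see which revision each master node is actually on and whether any installer pod exists for the pending revision. A rough sketch, assuming live cluster access (field names are taken from the static-pod operator status):

```
# Per-node current/target revisions tracked by the operator.
oc get kubecontrollermanager cluster \
  -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{"  current="}{.currentRevision}{"  target="}{.targetRevision}{"\n"}{end}'

# Installer pods (if any) for the pending revision.
oc -n openshift-kube-controller-manager get pods | grep installer
```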
Moez, would it be possible to increase the log level of kube-controller-manager-operator, e.g. to at least 4?

Built from the release-4.4 branch:

```
$ ./cluster-kube-controller-manager-operator operator --help
Start the Cluster kube-controller-manager Operator

Usage:
  cluster-kube-controller-manager-operator operator [flags]

Flags:
      --config string                    Location of the master configuration file to run from.
  -h, --help                             help for operator
      --kubeconfig string                Location of the master configuration file to run from.
      --listen string                    The ip:port to serve on.
      --namespace string                 Namespace where the controller is running. Auto-detected if run in cluster.
      --terminate-on-files stringArray   A list of files. If one of them changes, the process will terminate.

Global Flags:
      --add-dir-header                   If true, adds the file directory to the header
      --alsologtostderr                  log to standard error as well as files
      --log-backtrace-at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log-dir string                   If non-empty, write log files in this directory
      --log-file string                  If non-empty, use this log file
      --log-file-max-size uint           Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --log-flush-frequency duration     Maximum number of seconds between log flushes (default 5s)
      --logtostderr                      log to standard error instead of files (default true)
      --skip-headers                     If true, avoid header prefixes in the log messages
      --skip-log-headers                 If true, avoid headers when opening log files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          number for the log level verbosity
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging
```

Overriding the arguments and setting -v=4 should hopefully be enough.
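One possible way to apply that override, sketched below under the assumption that it is acceptable to temporarily scale down the cluster-version-operator so it does not revert the manual change (and that the operator deployment has a single container at index 0):

```
# Keep the CVO from reverting the manual change while debugging.
oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0

# Append -v=4 to the operator container arguments.
oc -n openshift-kube-controller-manager-operator patch deployment/kube-controller-manager-operator \
  --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"-v=4"}]'

# Once the verbose logs have been collected, restore the CVO.
oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=1
```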
The installer pod not getting created might be caused by a deadlock. Potential fix: https://github.com/openshift/library-go/pull/881
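Until that fix is available, a possible mitigation to try (unverified against this particular deadlock, and assuming the kube-controller-manager operator honors forceRedeploymentReason like the other static-pod operators) is to force a new revision so that fresh installer pods are created:

```
# Ask the operator to roll out a new revision, which should spawn new installer pods.
oc patch kubecontrollermanager cluster --type=merge \
  -p '{"spec":{"forceRedeploymentReason":"recover-degraded-'"$(date +%s)"'"}}'

# Confirm installer pods are being created again.
oc -n openshift-kube-controller-manager get pods | grep installer
```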
Filed https://bugzilla.redhat.com/show_bug.cgi?id=1876484 for porting the fix into the relevant operators.
Adding UpcomingSprint just in case the fix does not get merged by EOW
I'm moving this to MODIFIED and dropping it from the errata. This means that when the next set of nightlies starts for 4.4, it will move to ON_QA, which will hopefully coincide with the fix for Bug 1876484 merging.
Moving the bug back to the POST state, as I still see Bug 1876484 in POST. Please feel free to move this back whenever the other bug is ON_QA. Thanks!
Just waiting until https://bugzilla.redhat.com/show_bug.cgi?id=1876484 gets moved to MODIFIED
Since https://bugzilla.redhat.com/show_bug.cgi?id=1876484 is in the VERIFIED state, this issue can be moved to VERIFIED as well. Moving to ON_QA and leaving the move to VERIFIED for QE.
Since the underlying bug https://bugzilla.redhat.com/show_bug.cgi?id=1876484 is already in the VERIFIED state, moving this bug to VERIFIED as well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.4.31 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5122
qe_test_coverage has been set to '-' because the issue is caused by the underlying bug; any test coverage added for that bug should be sufficient to cover this case as well.