Bug 1880086
| Summary: | kube-controller-manager degraded state after upgrade to 4.4.17 | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | mchebbi <mchebbi> |
| Component: | kube-controller-manager | Assignee: | Jan Chaloupka <jchaloup> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | high | Priority: | medium |
| Version: | 4.4 | CC: | aos-bugs, dlamotta, knarra, mfojtik |
| Target Milestone: | --- | Target Release: | 4.4.z |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | No Doc Update | Last Closed: | 2020-12-02 18:21:34 UTC |
| Type: | Bug | | |
| Bug Depends On: | 1876484 | Bug Blocks: | |
Description
mchebbi@redhat.com
2020-09-17 16:05:23 UTC
Checking must-gather.local.15138691491249454/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2acc62c38f40bcebc003a6ce8a30ee58f5c1ed6dc0d8514811cc70528d93a65d/namespaces/openshift-kube-controller-manager-operator/pods/kube-controller-manager-operator-7ccbd8c597-76cs6/kube-controller-manager-operator/kube-controller-manager-operator/logs/previous.log, almost every log line is complaining about the etcdserver timing out:

```
2020-09-14T10:41:38.259591571Z E0914 10:41:38.259508 1 leaderelection.go:331] error retrieving resource lock openshift-kube-controller-manager-operator/kube-controller-manager-operator-lock: etcdserver: request timed out
...
2020-09-14T12:05:19.387791348Z E0914 12:05:19.387738 1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
```

Checking the status of etcd itself, all looks in order. Also, `NodeInstallerDegraded: pods "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" not found` is appearing in all operator status changes.

From namespaces/openshift-machine-config-operator/pods/machine-config-daemon-c6mr5/machine-config-daemon/machine-config-daemon/logs/current.log:

```
2020-09-12T00:40:42.126837462Z I0912 00:40:42.126789 2625 update.go:103] ignoring DaemonSet-managed pods: tuned-6qd85, controller-manager-qcxzs, dns-default-hvf9p, node-ca-ctf9d, machine-config-daemon-c6mr5, machine-config-server-64c7t, node-exporter-blt8z, multus-admission-controller-dprfv, multus-r4zhw, istio-node-9k842, ovs-4ft5h, sdn-24jnp, sdn-controller-prrrj; deleting pods with local storage: insights-operator-76d9f9865-jh6wb; deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com
2020-09-12T00:40:57.234395907Z I0912 00:40:57.234344 2625 update.go:99] pod "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" removed (evicted)
```

Comparing the timestamps: the kube-controller-manager-operator-7ccbd8c597-76cs6 previous.log starts at 2020-09-13T12:55:04.345074703Z. However, the missing "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod was evicted before that (~12 hours earlier, assuming clocks are in sync), at 2020-09-12T00:40:57.234395907Z.

Two questions:
1) Why is machine-config-daemon deleting the "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod?
2) Why is the "installer-32-atl20er8ocpma01.ocp-lab.ocp.prgx.com" pod not recreated?

Moez, would it be possible to increase the log level of kube-controller-manager-operator, e.g. to at least 4? Built from the release-4.4 branch:

```
$ ./cluster-kube-controller-manager-operator operator --help
Start the Cluster kube-controller-manager Operator

Usage:
  cluster-kube-controller-manager-operator operator [flags]

Flags:
      --config string                    Location of the master configuration file to run from.
  -h, --help                             help for operator
      --kubeconfig string                Location of the master configuration file to run from.
      --listen string                    The ip:port to serve on.
      --namespace string                 Namespace where the controller is running. Auto-detected if run in cluster.
      --terminate-on-files stringArray   A list of files. If one of them changes, the process will terminate.

Global Flags:
      --add-dir-header                   If true, adds the file directory to the header
      --alsologtostderr                  log to standard error as well as files
      --log-backtrace-at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log-dir string                   If non-empty, write log files in this directory
      --log-file string                  If non-empty, use this log file
      --log-file-max-size uint           Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --log-flush-frequency duration     Maximum number of seconds between log flushes (default 5s)
      --logtostderr                      log to standard error instead of files (default true)
      --skip-headers                     If true, avoid header prefixes in the log messages
      --skip-log-headers                 If true, avoid headers when opening log files
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          number for the log level verbosity
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging
```

Overriding the arguments and setting -v=4 should hopefully be enough.
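As a minimal sketch of one way to raise the verbosity without hand-editing the CVO-managed operator Deployment (direct edits would be stomped), assuming the cluster's operator.openshift.io/v1 KubeControllerManager type already carries the operatorLogLevel field (newer releases do; on builds without it, the Deployment args would have to be overridden with the CVO paused):

```
# Raise the operator's own verbosity; in library-go's level mapping,
# "Debug" corresponds to -v=4 (an assumption worth double-checking
# against the release in use).
oc patch kubecontrollermanager cluster --type=merge \
  -p '{"spec":{"operatorLogLevel":"Debug"}}'

# Confirm the intent was recorded on the cluster object.
oc get kubecontrollermanager cluster -o jsonpath='{.spec.operatorLogLevel}'
```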
The installer pod not getting created might be caused by a deadlock. Potential fix: https://github.com/openshift/library-go/pull/881. See https://bugzilla.redhat.com/show_bug.cgi?id=1876484 for porting the fix into the relevant operators.

Adding UpcomingSprint just in case the fix does not get merged by EOW.

I'm moving this to MODIFIED and dropping it from the errata. This means that when the next set of nightlies starts for 4.4, it will move to ON_QA, which will hopefully coincide with the fix for Bug 1876484 merging.

Moving the bug back to POST state as I still see Bug 1876484 in POST state. Please feel free to move this back whenever the other bug is ON_QA, thanks!

Just waiting until https://bugzilla.redhat.com/show_bug.cgi?id=1876484 gets moved to MODIFIED.

Since https://bugzilla.redhat.com/show_bug.cgi?id=1876484 is in VERIFIED state, this issue can be moved to VERIFIED as well. Moving to ON_QA, leaving the VERIFIED state for QE.

Since the underlying bug https://bugzilla.redhat.com/show_bug.cgi?id=1876484 is already in VERIFIED state, moving this bug to VERIFIED as well.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.4.31 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5122

qe_test_coverage has been set to '-' because the issue is caused by the underlying bug; test coverage added for that bug is sufficient to cover this case as well.
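For anyone verifying the fix on an updated cluster, a quick check that the originally reported degraded state has cleared might look like the following (generic oc commands, not specific to this advisory):

```
# The operator should report Available=True, Degraded=False.
oc get clusteroperator kube-controller-manager

# Or inspect the Degraded condition directly; expect "False" once the
# NodeInstallerDegraded condition has gone away.
oc get clusteroperator kube-controller-manager \
  -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'
```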