1852964 – Nodes going notReady because of unknown Reason

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	openshift-controller-manager
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Gabe Montero
QA Contact:	wewang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1860397

Reported:	2020-07-01 16:37 UTC by Jaspreet Kaur
Modified:	2023-12-15 18:21 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Intermittent availability issues with the API server could cause the OpenShift controller manager operator to intermittently fail to retrieve the deployments it manages. Consequence: A failed retrieval at the wrong moment led the operator to dereference a nil object, which caused a panic in the operator. Fix: Nil reference checks are now in place; the operator reports the error condition and retries, per expected controller operation. Result: The OpenShift controller manager operator properly handles intermittent failures when retrieving deployments from the API server.
Clone Of:
Clones:	1860397
Environment:
Last Closed:	2020-10-27 16:11:46 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-openshift-controller-manager-operator pull 163	0	None	closed	Bug 1852964: account for nil DaemonSet returned from library-go	2020-10-20 12:35:32 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:12:05 UTC
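
The fix summarized in the Doc Text (check for a nil object, report the error, retry) can be illustrated with a minimal Go sketch. This is not the code from the linked pull request: the Workload type, the getter signature, syncOnce, and syncWithRetry are hypothetical names chosen for illustration, and a real operator would rely on its controller's requeue mechanism rather than a simple loop.

```go
package main

import (
	"errors"
	"fmt"
)

// Workload is a hypothetical stand-in for the object the operator manages.
type Workload struct {
	Name              string
	AvailableReplicas int32
}

// getter is a hypothetical lookup. During an API server outage it may
// return a nil object, with or without an accompanying error.
type getter func() (*Workload, error)

// syncOnce performs one reconcile pass. It never dereferences the
// object before checking both the error and the pointer.
func syncOnce(get getter) (string, error) {
	w, err := get()
	if err != nil {
		return "", fmt.Errorf("retrieving workload: %w", err)
	}
	if w == nil {
		// Without this check, w.Name below would panic.
		return "", errors.New("retrieving workload: got nil object")
	}
	return fmt.Sprintf("%s: %d available", w.Name, w.AvailableReplicas), nil
}

// syncWithRetry calls syncOnce up to attempts times and returns the
// first successful status, or the last error if every attempt fails.
func syncWithRetry(get getter, attempts int) (string, error) {
	if attempts < 1 {
		return "", errors.New("attempts must be at least 1")
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		status, err := syncOnce(get)
		if err == nil {
			return status, nil
		}
		lastErr = err
	}
	return "", lastErr
}

func main() {
	calls := 0
	// flaky simulates an API server that fails twice, then recovers.
	flaky := func() (*Workload, error) {
		calls++
		if calls < 3 {
			return nil, errors.New("apiserver unavailable")
		}
		return &Workload{Name: "controller-manager", AvailableReplicas: 3}, nil
	}
	status, err := syncWithRetry(flaky, 5)
	fmt.Println(status, err)
}
```

The point of the sketch is that a failed retrieval becomes an ordinary returned error that the caller can report and retry, instead of a panic.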

Description Jaspreet Kaur 2020-07-01 16:37:01 UTC

Description of problem: Nodes were observed going into the NotReady state for an unknown reason. Several different issues were seen at the same time.

Processes are being OOM-killed, as seen in dmesg.

./openshift-apiserver/pods/apiserver-vcgm8/openshift-apiserver/openshift-apiserver/logs/previous.log:2020-06-20T13:11:46.585965523Z 	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
./openshift-apiserver/pods/apiserver-vcgm8/openshift-apiserver/openshift-apiserver/logs/previous.log:2020-06-20T13:11:46.585965523Z E0620 13:11:46.585778       1 wrap.go:39] apiserver panic'd on GET /apis/oauth.openshift.io/v1/oauthclients?limit=500&resourceVersion=0
./openshift-etcd/core/pods.yaml:            http: panic serving 10.123.1.92:55230: runtime error: invalid memory address
./openshift-etcd/core/pods.yaml:            +0x139\npanic(0xe5db60, 0x194e150)\n\t/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522


  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 30 Jun 2020 21:00:03 +0400   Thu, 25 Jun 2020 08:44:01 +0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 30 Jun 2020 21:00:03 +0400   Thu, 25 Jun 2020 08:44:01 +0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 30 Jun 2020 21:00:03 +0400   Thu, 25 Jun 2020 08:44:01 +0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Tue, 30 Jun 2020 21:00:03 +0400   Tue, 30 Jun 2020 17:03:49 +0400   KubeletNotReady              [container runtime is down, PLEG is not healthy: pleg was last seen active 3h58m19.579610535s ago; threshold is 3m0s]
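
The Ready condition above reports that PLEG was last seen active far longer ago than the stated threshold. As a rough illustration of reading that message, the following Go sketch extracts the two durations and computes how far past the threshold the node is. The function name plegOverdue and the regular expression are assumptions made for this example; they match only the message wording quoted in this report and are not part of any kubelet API.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// plegRe captures the two durations in the kubelet message quoted in
// this report. The pattern is an assumption based on that wording only.
var plegRe = regexp.MustCompile(`pleg was last seen active (\S+) ago; threshold is ([^\]\s]+)`)

// plegOverdue returns how far past the threshold the last PLEG activity
// is. It returns an error when the message lacks the expected phrase.
func plegOverdue(msg string) (time.Duration, error) {
	m := plegRe.FindStringSubmatch(msg)
	if m == nil {
		return 0, fmt.Errorf("no PLEG status found in message")
	}
	lastSeen, err := time.ParseDuration(m[1])
	if err != nil {
		return 0, fmt.Errorf("parsing last-seen duration: %w", err)
	}
	threshold, err := time.ParseDuration(m[2])
	if err != nil {
		return 0, fmt.Errorf("parsing threshold: %w", err)
	}
	return lastSeen - threshold, nil
}

func main() {
	msg := "[container runtime is down, PLEG is not healthy: " +
		"pleg was last seen active 3h58m19.579610535s ago; threshold is 3m0s]"
	over, err := plegOverdue(msg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("PLEG overdue by", over)
}
```

A positive result means the last PLEG activity is older than the threshold, which is consistent with the KubeletNotReady reason shown in the table.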


Version-Release number of selected component (if applicable):

4.3.22

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: Nodes go into the NotReady state because of a panic with an unknown cause.


Expected results: Nodes should not hit this recurring issue.


Additional info:

Comment 11 wewang 2020-07-20 09:23:53 UTC

@gabe I checked with the version below and did not find any node in NotReady. I also checked the openshift-apiserver, openshift-controller-manager-operator, and kubelet service logs on the node and did not hit the related issue again, so I consider it verified. If I need to check anything else, feel free to set the bug to ON_QA.
Version:
4.6.0-0.nightly-2020-07-19-093912

Comment 13 errata-xmlrpc 2020-10-27 16:11:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
