Bug 1852964

Summary: Nodes going notReady because of unknown Reason
Product: OpenShift Container Platform
Reporter: Jaspreet Kaur <jkaur>
Component: openshift-controller-manager
Assignee: Gabe Montero <gmontero>
Status: CLOSED ERRATA
QA Contact: wewang <wewang>
Severity: high
Docs Contact:
Priority: high
Version: 4.3.0
CC: aos-bugs, gmontero, jokerman, mfojtik
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Intermittent availability issues with the API server could prevent the OpenShift Controller Manager Operator from retrieving the deployments it manages.
Consequence: A failure to retrieve a deployment at the wrong moment caused the operator to dereference a nil reference, which resulted in a panic in the operator.
Fix: Nil reference checks are now in place to handle this error condition, report it, and retry per expected controller operations.
Result: The OpenShift Controller Manager Operator properly handles intermittent issues retrieving deployments from the API server.
Story Points: ---
Clone Of:
: 1860397
Environment:
Last Closed: 2020-10-27 16:11:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1860397

Description Jaspreet Kaur 2020-07-01 16:37:01 UTC
Description of problem: Nodes were observed going into the NotReady state for an unknown reason. Several different issues were seen at the same point in time.

Processes are being OOM-killed, as seen in dmesg.

./openshift-apiserver/pods/apiserver-vcgm8/openshift-apiserver/openshift-apiserver/logs/previous.log:2020-06-20T13:11:46.585965523Z 	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
./openshift-apiserver/pods/apiserver-vcgm8/openshift-apiserver/openshift-apiserver/logs/previous.log:2020-06-20T13:11:46.585965523Z E0620 13:11:46.585778       1 wrap.go:39] apiserver panic'd on GET /apis/oauth.openshift.io/v1/oauthclients?limit=500&resourceVersion=0
./openshift-etcd/core/pods.yaml:            http: panic serving 10.123.1.92:55230: runtime error: invalid memory address
./openshift-etcd/core/pods.yaml:            +0x139\npanic(0xe5db60, 0x194e150)\n\t/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522


  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 30 Jun 2020 21:00:03 +0400   Thu, 25 Jun 2020 08:44:01 +0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 30 Jun 2020 21:00:03 +0400   Thu, 25 Jun 2020 08:44:01 +0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 30 Jun 2020 21:00:03 +0400   Thu, 25 Jun 2020 08:44:01 +0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Tue, 30 Jun 2020 21:00:03 +0400   Tue, 30 Jun 2020 17:03:49 +0400   KubeletNotReady              [container runtime is down, PLEG is not healthy: pleg was last seen active 3h58m19.579610535s ago; threshold is 3m0s]


Version-Release number of selected component (if applicable):

4.3.22

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: Nodes go into the NotReady state because of a panic with an unknown cause.


Expected results: Nodes should not hit this recurring issue.


Additional info:

Comment 11 wewang 2020-07-20 09:23:53 UTC
@gabe Checked with the version below. I did not find any node in the NotReady state, and the openshift-apiserver, openshift-controller-manager-operator, and kubelet service logs on the node did not show the related issue again. Marking this as verified; if I need to check anything else, feel free to set the bug to ON_QA.
Version:
4.6.0-0.nightly-2020-07-19-093912

Comment 13 errata-xmlrpc 2020-10-27 16:11:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196