Bug 1669311

Summary:	An OVS process gets killed when oom-killer is invoked, leaving the pod in a bad state.
Product:	OpenShift Container Platform	Reporter:	Ryan Howe <rhowe>
Component:	Networking	Assignee:	Casey Callendrello <cdc>
Networking sub component:	openshift-sdn	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	medium	CC:	anusaxen, aos-bugs, cdc, danw, glamb, hpolava, jolee, openshift-bugs-escalate
Version:	3.10.0
Target Milestone:	---
Target Release:	3.10.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	1671820 1671822		Environment:
Last Closed:	2019-03-14 02:15:34 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1671820, 1671822

Description Ryan Howe 2019-01-24 22:59:22 UTC

Description of problem:

With the change from 3.9 to 3.10, OVS now runs in a pod. By default, its oom-score is 992 because the pod is in qosClass Burstable.

This leaves it open to being killed by oom-killer.

If ovs-vswitchd gets killed, the container keeps running but stays in a bad state, because no health check is configured.
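
For context, the kubelet derives a Burstable container's oom_score_adj from its memory request relative to the node's memory. The following is a minimal sketch of that calculation, assuming the upstream kubelet policy of this era (Guaranteed pinned at -998, BestEffort at 1000, Burstable clamped to the range 2..999). The function name and the sample sizes are hypothetical and are not taken from the affected cluster.

```shell
#!/bin/sh
# burstable_oom_score_adj REQUEST_BYTES NODE_CAPACITY_BYTES
# Sketch of the kubelet's Burstable rule (assumed from upstream policy):
#   adj = 1000 - (1000 * request) / capacity, clamped to [2, 999]
burstable_oom_score_adj() {
    req=$1
    cap=$2
    adj=$(( 1000 - (1000 * req) / cap ))
    # Lower bound: 1000 + (-998), so Burstable never reaches Guaranteed.
    if [ "$adj" -lt 2 ]; then adj=2; fi
    # Upper bound: stay below the BestEffort value of 1000.
    if [ "$adj" -gt 999 ]; then adj=999; fi
    echo "$adj"
}

# Hypothetical example: a 256Mi request on a 32Gi node.
burstable_oom_score_adj $((256 * 1024 * 1024)) $((32 * 1024 * 1024 * 1024))
```

Under these assumptions, a memory request that is small relative to node memory yields an adjustment close to 1000, which is consistent with the score of 992 reported here.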

Version-Release number of selected component (if applicable):
3.10 

How reproducible:
100% 

Steps to Reproduce:

Invoke oom-killer  

kernel: Out of memory: Kill process 6779 (ovs-vswitchd) score 992 or sacrifice child
kernel: Killed process 6779 (ovs-vswitchd) total-vm:443008kB, anon-rss:46600kB, file-rss:13548kB, shmem-rss:0kB


1. kill `pgrep ovs-vswitchd` 


Actual results:

OVS pod continues to run but is in bad state




Expected results:
oom-killer does not kill the process because its score is -999, and a health check is configured to check the health of the pod.
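
To see which values the kernel is actually using for a process, the OOM score and its adjustment can be read from /proc. This is a small sketch; the helper name `show_oom` is hypothetical, and it is demonstrated on the current shell rather than on ovs-vswitchd.

```shell
#!/bin/sh
# show_oom PID: print the kernel's current OOM score and adjustment.
show_oom() {
    pid=$1
    printf 'pid=%s oom_score=%s oom_score_adj=%s\n' \
        "$pid" \
        "$(cat "/proc/$pid/oom_score")" \
        "$(cat "/proc/$pid/oom_score_adj")"
}

# Demonstrated on the current shell. On a node, pass the ovs-vswitchd
# PID instead (for example the output of `pgrep ovs-vswitchd`).
show_oom "$$"
```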


Additional info:

Making the following changes will set qosClass Guaranteed with oom-score set to 0. This is not a complete fix, but it will reduce how often this process is killed.
 

# oc project openshift-sdn 

# oc edit ds ovs 

Change the limits to equal the requests.

Setting the following will change the qosClass to Guaranteed:
        resources:
          limits:
            cpu: 200m
            memory: 400Mi
          requests:
            cpu: 200m
            memory: 400Mi
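
The reason equal requests and limits matter can be sketched with a simplified, single-container version of the Kubernetes QoS rules. The function below is hypothetical and compares quantities as plain strings, so it only illustrates the rule and does not replace what the API server computes.

```shell
#!/bin/sh
# qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM
# Simplified single-container version of the Kubernetes QoS rules.
# Pass an empty string for any value that is unset.
# Quantities are compared as strings, so "0.2" and "200m" would not match.
qos_class() {
    cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4
    if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
        echo BestEffort
    elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
         [ "${cpu_req:-$cpu_lim}" = "$cpu_lim" ] &&
         [ "${mem_req:-$mem_lim}" = "$mem_lim" ]; then
        # Limits are set for both resources and equal the requests
        # (an omitted request defaults to the limit).
        echo Guaranteed
    else
        echo Burstable
    fi
}

qos_class 200m 200m 400Mi 400Mi   # the values proposed above
qos_class 100m ""   300Mi ""      # hypothetical requests-only spec
```

After the daemonset is edited, the class actually assigned to a pod is reported in its status under `qosClass`.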

Comment 1 Casey Callendrello 2019-01-25 16:12:02 UTC

We should also configure `ovs-ctl status` as a liveness probe.
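
As a rough illustration of what such a probe command has to establish, the sketch below checks that the PID recorded in a daemon's pidfile still refers to a live process. The helper name and the pidfile path are assumptions (the path is the usual default OVS run directory and may differ in the image); an actual probe could simply call `ovs-ctl status` and use its exit status.

```shell
#!/bin/sh
# pidfile_alive PIDFILE: succeed only if PIDFILE names a running process.
pidfile_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    [ -n "$pid" ] || return 1
    # Signal 0 delivers nothing; it only tests whether the PID exists
    # (and that the caller is allowed to signal it).
    kill -0 "$pid" 2>/dev/null
}

# Assumed default location; adjust to the image's OVS run directory.
if pidfile_alive /var/run/openvswitch/ovs-vswitchd.pid; then
    echo "ovs-vswitchd: running"
else
    echo "ovs-vswitchd: not running"
fi
```

Used as an exec probe, the script would exit with the function's status instead of printing a message, so that a dead ovs-vswitchd causes the container to be restarted.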

Assigning to phil.

Comment 2 Phil Cameron 2019-01-28 18:40:15 UTC

In speaking with our OVS contact, this is operating as designed. Adding a liveness probe, while possible, is unlikely to produce the desired result: when it restarts vswitchd, the same resource pressure will exist and the OOM killer will likely be invoked again. Ultimately, either more resources are needed or the load must be reduced.

Comment 3 Phil Cameron 2019-01-28 18:54:59 UTC

Relaxed resource limits and added a liveness probe. How useful this is will become apparent when it is tried on the problem cluster.

https://github.com/openshift/cluster-network-operator/pull/80

Comment 4 Casey Callendrello 2019-01-31 13:35:09 UTC

You'll need to fix this in 3.10 and 3.11, too.

Comment 5 Dan Winship 2019-01-31 22:02:48 UTC

(In reply to Ryan Howe from comment #0)
> Steps to Reproduce:
> 
> Invoke oom-killer  
> 
> kernel: Out of memory: Kill process 6779 (ovs-vswitchd) score 992 or
> sacrifice child
> kernel: Killed process 6779 (ovs-vswitchd) total-vm:443008kB,
> anon-rss:46600kB, file-rss:13548kB, shmem-rss:0kB

In what context did you encounter this exactly? OVS normally runs a monitor process that should restart ovs-vswitchd if it dies or is killed.

There was a bug at one point where ovs-vswitchd was being OOM-killed *at startup*, but that should be fixed with current openshift-ansible.

Comment 6 Phil Cameron 2019-02-01 20:29:18 UTC

Bug 1671820 is for the fix in 3.11; that fix will be cherry-picked to fix the bug here.

Comment 7 Phil Cameron 2019-02-08 20:26:37 UTC

https://github.com/openshift/openshift-ansible/pull/11150 MERGED

Comment 15 errata-xmlrpc 2019-03-14 02:15:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0405