Bug 1671820 - An OVS process gets killed when oom-killer is invoked, leaving it in a bad state.
Summary: An OVS process gets killed when oom-killer is invoked, leaving it in a bad state.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Phil Cameron
QA Contact: Anurag Saxena
URL:
Whiteboard:
Depends On: 1669311 1671822
Blocks:
Reported: 2019-02-01 18:50 UTC by Phil Cameron
Modified: 2019-03-14 02:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1669311
Environment:
Last Closed: 2019-03-14 02:17:59 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0407 0 None None None 2019-03-14 02:18:07 UTC

Description Phil Cameron 2019-02-01 18:50:01 UTC
+++ This bug was initially created as a clone of Bug #1669311 +++
This is for the 3.11 fix.

Description of problem:

Starting with the move from 3.9 to 3.10, OVS runs in a pod. By default its oom-score is 992 because the pod is in qosClass Burstable. 

This leaves it open to being killed by the oom-killer. 

If ovs-vswitchd gets killed, the container keeps running but stays in a bad state, since no health check is configured. 
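
For anyone checking a node, the score can be read directly from /proc. This is a minimal sketch, not part of the original report: `oom_info` is a hypothetical helper name, and it assumes ovs-vswitchd is visible in the PID namespace where the commands are run.

```shell
# Print the kernel OOM score and its adjustment for a given PID.
# oom_info is a hypothetical helper; the /proc files are standard on Linux.
oom_info() {
    local pid="$1"
    if [ ! -r "/proc/$pid/oom_score" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    echo "oom_score=$(cat "/proc/$pid/oom_score") oom_score_adj=$(cat "/proc/$pid/oom_score_adj")"
}

# Usage on a node (assumes ovs-vswitchd is running and visible here):
#   oom_info "$(pgrep -o -x ovs-vswitchd)"
```

A non-zero exit status here means the PID does not exist, which is itself a hint that the daemon has been killed.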

Version-Release number of selected component (if applicable):
3.10 

How reproducible:
100% 

Steps to Reproduce:

Invoke oom-killer  

kernel: Out of memory: Kill process 6779 (ovs-vswitchd) score 992 or sacrifice child
kernel: Killed process 6779 (ovs-vswitchd) total-vm:443008kB, anon-rss:46600kB, file-rss:13548kB, shmem-rss:0kB


1. kill `pgrep ovs-vswitchd` 
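
To confirm which process the oom-killer picked and with what score, the kernel messages can be filtered. This is a sketch that assumes the message format shown in the log lines above (other kernel versions may word it differently); `parse_oom_kill` is a hypothetical helper name.

```shell
# Read kernel log lines on stdin and print "PID NAME SCORE" for each
# oom-killer "Kill process" message. Lines that do not match are dropped.
parse_oom_kill() {
    sed -n -E 's/.*Out of memory: Kill process ([0-9]+) \(([^)]+)\) score ([0-9]+).*/\1 \2 \3/p'
}

# Usage on a node:
#   journalctl -k | parse_oom_kill
```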


Actual results:

OVS pod continues to run but is in bad state




Expected results:
The oom-killer does not kill the process because its score is -999, and a health check is configured to check the health of the pod. 


Additional info:

Making the following changes will set qosClass Guaranteed with oom-score set to 0. This is not a complete fix, but it will reduce how often this process is killed. 
 

# oc project openshift-sdn 

# oc edit ds ovs 

Change the limits to equal the requests.

Setting the following will change the qosClass to Guaranteed:
        resources:
          limits:
            cpu: 200m
            memory: 400Mi
          requests:
            cpu: 200m
            memory: 400Mi
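
The reason equal requests and limits matter can be sketched as a small classifier. This is a simplified illustration of the Kubernetes QoS rules for a single-container pod, not the kubelet's code: `qos_class` is a hypothetical helper, and it compares quantities as plain strings, so equivalent spellings such as `0.2` and `200m` would not match.

```shell
# Classify a single-container pod from its cpu/memory requests and limits.
# Arguments: cpu_request cpu_limit mem_request mem_limit (empty string = unset).
qos_class() {
    local cpu_req="$1" cpu_lim="$2" mem_req="$3" mem_lim="$4"
    if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
        # Nothing set at all.
        echo BestEffort
    elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
        && [ "${cpu_req:-$cpu_lim}" = "$cpu_lim" ] \
        && [ "${mem_req:-$mem_lim}" = "$mem_lim" ]; then
        # Both limits set, and requests equal limits (unset requests default to limits).
        echo Guaranteed
    else
        echo Burstable
    fi
}

# Values from the edit above:
qos_class 200m 200m 400Mi 400Mi
```

On a live cluster the class actually assigned can be read from the pod status, for example with `oc get pod <pod-name> -n openshift-sdn -o jsonpath='{.status.qosClass}'`.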

--- Additional comment from Casey Callendrello on 2019-01-25 16:12:02 UTC ---

We should also configure `ovs-ctl status` as a liveness probe.

Assigning to phil.
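
A rough sketch of what an `ovs-ctl status` based check could look like as a probe command. This is an assumption-laden illustration and not necessarily what was merged: `ovs_liveness` is a hypothetical name, and the default path is an assumption about where the image installs `ovs-ctl` (override it with `OVS_CTL`).

```shell
# Liveness check sketch: succeed only if `ovs-ctl status` exits with 0,
# i.e. it reports the OVS daemons as running.
# The default path below is an assumption; set OVS_CTL to override it.
ovs_liveness() {
    local ovs_ctl="${OVS_CTL:-/usr/share/openvswitch/scripts/ovs-ctl}"
    "$ovs_ctl" status >/dev/null 2>&1
}

# Usage inside the OVS container:
#   ovs_liveness || echo "OVS daemons are not healthy"
```

A check like this only detects the dead daemon; it does not remove the memory pressure that caused the kill in the first place.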

--- Additional comment from Phil Cameron on 2019-01-28 18:40:15 UTC ---

In speaking with our OVS contact, this is operating as designed. Adding a liveness probe, while possible, is unlikely to get the desired result: when it restarts vswitchd the same resource pressure will still exist, and the oom-killer will likely be invoked again. Ultimately, either more resources are needed or the load must be reduced.

--- Additional comment from Phil Cameron on 2019-01-28 18:54:59 UTC ---

Relaxed resource limits and added a liveness probe. The extent to which this is useful will become apparent when it is tried on the problem cluster.

https://github.com/openshift/cluster-network-operator/pull/80

--- Additional comment from Casey Callendrello on 2019-01-31 13:35:09 UTC ---

You'll need to fix this in 3.10 and 3.11, too.

--- Additional comment from Dan Winship on 2019-01-31 22:02:48 UTC ---

(In reply to Ryan Howe from comment #0)
> Steps to Reproduce:
> 
> Invoke oom-killer  
> 
> kernel: Out of memory: Kill process 6779 (ovs-vswitchd) score 992 or
> sacrifice child
> kernel: Killed process 6779 (ovs-vswitchd) total-vm:443008kB,
> anon-rss:46600kB, file-rss:13548kB, shmem-rss:0kB

In what context did you encounter this exactly? OVS normally runs a monitor process that should restart ovs-vswitchd if it dies or is killed.

There was a bug at one point where ovs-vswitchd was being OOM-killed *at startup*, but that should be fixed with current openshift-ansible.

Comment 1 Phil Cameron 2019-02-01 20:24:24 UTC
https://github.com/openshift/openshift-ansible/pull/11116
out for review

Comment 2 Phil Cameron 2019-02-07 21:55:45 UTC
https://github.com/openshift/openshift-ansible/pull/11116 - MERGED

Comment 6 errata-xmlrpc 2019-03-14 02:17:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0407

