Bug 1669311 - An OVS process gets killed when the oom-killer is invoked, leaving it in a bad state.
Summary: An OVS process gets killed when the oom-killer is invoked, leaving it in a bad state.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.10.z
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1671820 1671822
 
Reported: 2019-01-24 22:59 UTC by Ryan Howe
Modified: 2019-10-23 18:21 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1671820 1671822
Environment:
Last Closed: 2019-03-14 02:15:34 UTC
Target Upstream Version:
Embargoed:


Links:
Red Hat Product Errata RHBA-2019:0405 (last updated 2019-03-14 02:15:41 UTC)

Description Ryan Howe 2019-01-24 22:59:22 UTC
Description of problem:

The change from 3.9 to 3.10 means OVS now runs in a pod. By default the oom score is 992 because the pod is in the Burstable qosClass.

This leaves it open to being killed by the oom-killer.

If ovs-vswitchd gets killed, the container will still be running but will stay in a bad state, as no health check is configured.
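
For reference, the effect can be inspected directly on a node. A minimal sketch, assuming the ovs pod is scheduled there (<ovs-pod> is a placeholder for the real pod name):

(illustrative only; if pgrep returns more than one PID, check each)
# oc get pod <ovs-pod> -n openshift-sdn -o jsonpath='{.status.qosClass}'
# cat /proc/$(pgrep ovs-vswitchd)/oom_score

The first command shows the pod's qosClass (Burstable by default here); the second shows the kernel's badness score for the process, which is where the 992 in the kernel log below comes from.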

Version-Release number of selected component (if applicable):
3.10 

How reproducible:
100% 

Steps to Reproduce:

1. Invoke the oom-killer, or simulate it directly with: kill `pgrep ovs-vswitchd`

When the oom-killer fires, the kernel logs:

kernel: Out of memory: Kill process 6779 (ovs-vswitchd) score 992 or sacrifice child
kernel: Killed process 6779 (ovs-vswitchd) total-vm:443008kB, anon-rss:46600kB, file-rss:13548kB, shmem-rss:0kB


Actual results:

The OVS pod continues to run but is in a bad state.
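
A sketch of how the bad state shows up (the pod name and the ovs-ctl path are assumptions):

# oc get pods -n openshift-sdn
  (the ovs pod still reports Running)
# oc exec <ovs-pod> -n openshift-sdn -- /usr/share/openvswitch/scripts/ovs-ctl status
  (reports that ovs-vswitchd is not running, yet nothing restarts it or fails the pod)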




Expected results:
The oom-killer does not kill the process (its score should be -999), and a health check is configured to check the health of the pod.


Additional info:

Making the following changes will set the qosClass to Guaranteed with the oom score set to 0. This is not a complete fix, but it will reduce kills of this process.
 

# oc project openshift-sdn 

# oc edit ds ovs 

Change the limits to equal the requests. Setting the following will change the qosClass to Guaranteed:
        resources:
          limits:
            cpu: 200m
            memory: 400Mi
          requests:
            cpu: 200m
            memory: 400Mi
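
Once the daemonset rolls the pods, the change can be verified the same way as above (sketch; <ovs-pod> is a placeholder):

# oc get pod <ovs-pod> -n openshift-sdn -o jsonpath='{.status.qosClass}'
  (expect Guaranteed)
# cat /proc/$(pgrep ovs-vswitchd)/oom_score
  (expect the score to drop to roughly 0, per the note above)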

Comment 1 Casey Callendrello 2019-01-25 16:12:02 UTC
We should also configure `ovs-ctl status` as a liveness probe.
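
For illustration, a minimal sketch of what such a probe could look like on the ovs container; the ovs-ctl path and the timing values are assumptions, not taken from any merged fix:

        livenessProbe:
          exec:
            command:
            - /usr/share/openvswitch/scripts/ovs-ctl   # assumed install path inside the ovs image
            - status
          initialDelaySeconds: 15   # illustrative values only
          periodSeconds: 10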

Assigning to phil.

Comment 2 Phil Cameron 2019-01-28 18:40:15 UTC
From speaking with our OVS contact, this is operating as designed. Adding a liveness probe, while possible, is likely not going to get the desired results, since when vswitchd is restarted the same resource pressure will exist and the OOM killer will likely be invoked again. Ultimately, either more resources are needed or the load must be reduced.

Comment 3 Phil Cameron 2019-01-28 18:54:59 UTC
Relaxed resource limits and added a liveness probe. The extent to which this is useful will become apparent when it is tried on the problem cluster.

https://github.com/openshift/cluster-network-operator/pull/80

Comment 4 Casey Callendrello 2019-01-31 13:35:09 UTC
You'll need to fix this in 3.10 and 3.11, too.

Comment 5 Dan Winship 2019-01-31 22:02:48 UTC
(In reply to Ryan Howe from comment #0)
> Steps to Reproduce:
> 
> Invoke oom-killer  
> 
> kernel: Out of memory: Kill process 6779 (ovs-vswitchd) score 992 or
> sacrifice child
> kernel: Killed process 6779 (ovs-vswitchd) total-vm:443008kB,
> anon-rss:46600kB, file-rss:13548kB, shmem-rss:0kB

In what context did you encounter this exactly? OVS normally runs a monitor process that should restart ovs-vswitchd if it dies or is killed.

There was a bug at one point where ovs-vswitchd was being OOM-killed *at startup*, but that should be fixed with current openshift-ansible.

Comment 6 Phil Cameron 2019-02-01 20:29:18 UTC
1671820 is for the fix in 3.11; that fix will be cherry-picked to fix the bug here.

Comment 7 Phil Cameron 2019-02-08 20:26:37 UTC
https://github.com/openshift/openshift-ansible/pull/11150 MERGED

Comment 15 errata-xmlrpc 2019-03-14 02:15:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0405

