Description of problem:
Changes from 3.9 to 3.10 now have OVS running in a pod. By default the oom-score is 992 because the pod is in qosClass Burstable. This leaves it open to being killed by the oom-killer. If ovs-vswitchd gets killed, the container will still be running but will stay in a bad state, since no health check is configured.

Version-Release number of selected component (if applicable):
3.10

How reproducible:
100%

Steps to Reproduce:
1. Invoke the oom-killer, e.g. kill `pgrep ovs-vswitchd`:

kernel: Out of memory: Kill process 6779 (ovs-vswitchd) score 992 or sacrifice child
kernel: Killed process 6779 (ovs-vswitchd) total-vm:443008kB, anon-rss:46600kB, file-rss:13548kB, shmem-rss:0kB

Actual results:
OVS pod continues to run but is in a bad state.

Expected results:
oom-killer does not kill the process (score -999), and a health check is configured to check the health of the pod.

Additional info:
Making the following change will set qosClass Guaranteed with the oom-score set to 0. This is not a complete fix, but it will reduce the kills of this process.

# oc project openshift-sdn
# oc edit ds ovs

Change the limits to equal the requests. Setting the following will change the resources:

  limits:
    cpu: 200m
    memory: 400Mi
  requests:
    cpu: 200m
    memory: 400Mi
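For context, the 992 score above lines up with the kubelet's scoring for Burstable pods: as I recall its QoS policy, the adjustment is roughly 1000 minus the container's share of node memory (in thousandths), clamped to [2, 999], while Guaranteed pods get a fixed negative adjustment instead. A minimal sketch, with the 48 GiB node size being an assumed example value, not taken from the cluster in this report:

```shell
# Approximate kubelet oom_score_adj formula for a Burstable pod:
#   adj = 1000 - (1000 * memory_request) / node_memory, clamped to [2, 999]
req=$((400 * 1024 * 1024))        # 400Mi memory request (from the ds above)
cap=$((48 * 1024 * 1024 * 1024))  # assumed ~48 GiB node
adj=$(( 1000 - (1000 * req) / cap ))
echo "$adj"   # 992, matching the oom-killer score in the log above
```

The takeaway is that any Burstable pod with a small memory request relative to node size scores close to 999 and is among the first victims under memory pressure, which is why forcing qosClass Guaranteed helps.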
We should also configure `ovs-ctl status` as a liveness probe. Assigning to phil.
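A probe along these lines could be added to the ovs DaemonSet container spec; this is only a sketch, and the `ovs-ctl` path and the timing values are assumptions, not taken from the actual manifest:

```yaml
livenessProbe:
  exec:
    command:
    - /usr/share/openvswitch/scripts/ovs-ctl   # assumed install path
    - status
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```

An exec probe is used because `ovs-ctl status` checks both ovsdb-server and ovs-vswitchd, and there is no HTTP endpoint to probe.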
In speaking with our ovs contact, this is operating as designed. Adding a liveness probe, while possible, is unlikely to get the desired results: when it restarts vswitchd, the same resource pressure will exist and the OOM killer will likely be invoked again. Ultimately, either more resources are needed or the load must be reduced.
Relaxed resource limits and added a liveness probe. The extent that this is useful will become apparent when it is tried on the problem cluster. https://github.com/openshift/cluster-network-operator/pull/80
You'll need to fix this in 3.10 and 3.11, too.
(In reply to Ryan Howe from comment #0)
> Steps to Reproduce:
>
> Invoke oom-killer
>
> kernel: Out of memory: Kill process 6779 (ovs-vswitchd) score 992 or
> sacrifice child
> kernel: Killed process 6779 (ovs-vswitchd) total-vm:443008kB,
> anon-rss:46600kB, file-rss:13548kB, shmem-rss:0kB

In what context did you encounter this exactly? OVS normally runs a monitor process that should restart ovs-vswitchd if it dies or is killed. There was a bug at one point where ovs-vswitchd was being OOM-killed *at startup*, but that should be fixed with current openshift-ansible.
1671820 is for the fix in 3.11; the fix will be cherry-picked to fix the bug here.
https://github.com/openshift/openshift-ansible/pull/11150 MERGED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0405