Bug 1571379 - Infra nodes NotReady due to ovs pods getting oomkilled
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.10.0
Assignee: Ben Bennett
QA Contact: Meng Bo
Whiteboard: aos-scalability-310
Depends On:
Reported: 2018-04-24 16:01 UTC by jmencak
Modified: 2018-07-30 19:14 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2018-07-30 19:13:48 UTC
Target Upstream Version:

Attachments
Docker/atomic-openshift-node + journalctl -xe logs (92.89 KB, application/x-xz)
2018-04-24 16:01 UTC, jmencak

System                   ID               Priority   Status   Summary   Last Updated
Red Hat Product Errata   RHBA-2018:1816   None       None     None      2018-07-30 19:14:10 UTC

Description jmencak 2018-04-24 16:01:49 UTC
Created attachment 1426152
Docker/atomic-openshift-node + journalctl -xe logs

Description of problem:
When installing OCP on OpenStack, the infra nodes are very likely to end up in the NotReady state.

Version-Release number of selected component (if applicable):

[openshift@master-0 ~]$ oc version
oc v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master-2.scale-ci.example.com:8443
openshift v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8

[cloud-user@ansible-host openshift-ansible]$ git describe

How reproducible:
5 out of 6 installs so far

Steps to Reproduce:
1. Try a multi-master install of OCP on OpenStack

Actual results:
Infra nodes in NotReady state

[openshift@master-0 ~]$ oc get nodes
NAME                                STATUS     ROLES           AGE       VERSION
app-node-0.scale-ci.example.com     Ready      compute         2h        v1.10.0+b81c8f8
app-node-1.scale-ci.example.com     Ready      compute         2h        v1.10.0+b81c8f8
cns-0.scale-ci.example.com          Ready      compute         2h        v1.10.0+b81c8f8
cns-1.scale-ci.example.com          Ready      compute         2h        v1.10.0+b81c8f8
cns-2.scale-ci.example.com          Ready      compute         2h        v1.10.0+b81c8f8
infra-node-0.scale-ci.example.com   NotReady   compute,infra   2h        v1.10.0+b81c8f8
infra-node-1.scale-ci.example.com   NotReady   compute,infra   2h        v1.10.0+b81c8f8
infra-node-2.scale-ci.example.com   NotReady   compute,infra   2h        v1.10.0+b81c8f8
master-0.scale-ci.example.com       Ready      master          2h        v1.10.0+b81c8f8
master-1.scale-ci.example.com       Ready      master          2h        v1.10.0+b81c8f8
master-2.scale-ci.example.com       Ready      master          2h        v1.10.0+b81c8f8

root@infra-node-2: ~ # journalctl -u atomic-openshift-node|grep cni|tail -n1
Apr 24 11:55:27 infra-node-2.scale-ci.example.com atomic-openshift-node[9400]: E0424 11:55:27.752505    9400 kubelet.go:2125] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Expected results:
All nodes in Ready state, no errors in "journalctl -u atomic-openshift-node"

Additional info:
However, this bug seems to be different, as there are no lstat issues there.

Comment 1 Scott Dodson 2018-04-24 19:58:53 UTC
Only thing that sticks out to me from the logs is that this environment is still pulling an image /openshift3/node:v3.10.0 for the node sync pod.

Which is https://bugzilla.redhat.com/show_bug.cgi?id=1570257

We need to get that fix in and try again. The old image hasn't been updated in a while, and pulling it also doubles the number of images that have to be downloaded, which slows installation unless all images are pre-cached.

Comment 2 jmencak 2018-04-25 04:28:23 UTC
Thanks for looking into this Scott.

I've used openshift-ansible-3.10.0-0.28.0-45-g818ef02 with https://github.com/openshift/openshift-ansible/pull/8122 manually applied.

/openshift3/node:v3.10.0 is no longer pulled, but infra nodes are still NotReady.

root@master-0: ~/openshift-ansible # docker images |grep openshift3|sed 's|^[^/]*/||'
openshift3/ose-haproxy-router                   v3.10.0-0.28.0      87e58469f582        14 hours ago        1.64 GB
openshift3/ose-node                             v3.10.0             89f735b7eb1f        14 hours ago        1.83 GB
openshift3/ose-node                             v3.10.0-0.28.0      89f735b7eb1f        14 hours ago        1.83 GB
openshift3/ose-docker-builder                   v3.10.0-0.28.0      0d4065236422        14 hours ago        1.62 GB
openshift3/ose-sti-builder                      v3.10.0-0.28.0      c730e08aed33        14 hours ago        1.62 GB
openshift3/ose-deployer                         v3.10.0-0.28.0      99f82e8d33e5        14 hours ago        1.62 GB
openshift3/ose-control-plane                    v3.10.0             46d30bb81ad9        16 hours ago        1.62 GB
openshift3/ose-docker-registry                  v3.10.0-0.28.0      ed628f3a20b2        16 hours ago        436 MB
openshift3/ose-keepalived-ipfailover            v3.10.0-0.28.0      e6d2e9b399e4        16 hours ago        392 MB
openshift3/ose-web-console                      v3.10.0             fddfd2420628        16 hours ago        470 MB
openshift3/ose-web-console                      v3.10.0-0.28.0      fddfd2420628        16 hours ago        470 MB
openshift3/ose-service-catalog                  v3.10.0             9d848896a8b9        16 hours ago        461 MB
openshift3/prometheus-node-exporter             v3.10.0             685a2be8025c        24 hours ago        223 MB
openshift3/registry-console                     v3.10               066b245fbf1a        26 hours ago        231 MB
openshift3/logging-fluentd                      v3.10               8a98f9c0384e        26 hours ago        286 MB
openshift3/ose-template-service-broker          v3.10               1c9e686124e9        27 hours ago        284 MB
openshift3/ose-pod                              v3.10.0             2f4c3754b74f        28 hours ago        214 MB
openshift3/ose-pod                              v3.10.0-0.28.0      2f4c3754b74f        28 hours ago        214 MB
openshift3/ruby-20-rhel7                                latest              e5833aa7cf85        15 months ago       443 MB
openshift3/python-33-rhel7                              latest              e18350a7786c        15 months ago       521 MB
openshift3/php-55-rhel7                                 latest              2d6fbdfafa33        15 months ago       569 MB
openshift3/nodejs-010-rhel7                             latest              226d0b1b7987        15 months ago       430 MB
openshift3/perl-516-rhel7                               latest              45c996c39407        15 months ago       475 MB

Comment 7 Scott Dodson 2018-04-25 15:03:13 UTC
I'm out of ideas. Here's a diff of the SDN pod logs between a working and a non-working node, with the working node on the left.


Comment 8 Dan Williams 2018-04-26 16:28:49 UTC
Analysis: ovs-vswitchd is not running in the OVS container on an affected node.  This causes openshift-sdn to block in ovs-vsctl since that process waits for the change to happen, and openshift-sdn does not pass "--timeout=X" to ovs-vsctl.

The vswitchd is being told to quit by something; it is not exiting abnormally.  I currently suspect some error in the openshift-ansible ovs container babysitting script from https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_sdn/files/sdn-ovs.yaml or a race condition where kubelet overlaps execution of the app containers within the pod, causing an old instance which runs "ovs-ctl stop" to actually stop a newly started instance.
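
The hang on ovs-vsctl described above can be avoided by giving each call a deadline. The following is a minimal Python sketch of that idea, not the openshift-sdn code: the helper names build_ovs_vsctl_cmd and run_ovs_vsctl and the 5-second default are hypothetical, chosen for illustration. The only thing taken from ovs-vsctl itself is its --timeout option.

```python
import subprocess


def build_ovs_vsctl_cmd(args, timeout_secs=5):
    """Return an ovs-vsctl argv that stops waiting after timeout_secs seconds."""
    if timeout_secs <= 0:
        raise ValueError("timeout_secs must be positive")
    # --timeout=SECS limits how long ovs-vsctl waits instead of blocking forever
    return ["ovs-vsctl", f"--timeout={timeout_secs}", *args]


def run_ovs_vsctl(args, timeout_secs=5):
    """Run ovs-vsctl with a deadline; raises CalledProcessError if it fails or times out."""
    cmd = build_ovs_vsctl_cmd(args, timeout_secs)
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

With a deadline in place, a missing ovs-vswitchd surfaces as an error the caller can log, rather than as a silent hang.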

Comment 9 Dan Williams 2018-04-26 16:59:55 UTC
vswitchd on the infra nodes appears to be getting OOM-killed:

Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8651]  1001  8651      691      123       6        0          -998 pod
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8756]     0  8756     2920      227      11        0           999 bash
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8863]     0  8863    13142      195      28        0           999 monitor
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8864]     0  8864    13238      586      29        0           999 ovsdb-server
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8951]     0  8951    14493      753      30        0           999 monitor
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8952]     0  8952   643321    75418     214        0           999 ovs-vswitchd
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 9310]     0  9310     1090       87       7        0           999 sleep
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: Memory cgroup out of memory: Kill process 9557 (ovs-vswitchd) score 1943 or sacrifice child
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: Killed process 8952 (ovs-vswitchd) total-vm:2573284kB, anon-rss:289176kB, file-rss:12496kB, shmem-rss:0kB
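
Kernel messages like the ones above can be picked out of a journal dump mechanically. A small Python sketch, assuming the "Killed process" line format shown in this log; the function name find_oom_kills is hypothetical, and other kernel versions may format the line differently.

```python
import re

# Matches "Killed process <pid> (<name>) ... anon-rss:<n>kB, file-rss:<n>kB"
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<anon>\d+)kB, file-rss:(?P<file>\d+)kB"
)


def find_oom_kills(journal_text):
    """Return one dict per OOM kill found in the given journal text."""
    kills = []
    for line in journal_text.splitlines():
        m = KILLED_RE.search(line)
        if m:
            kills.append({
                "pid": int(m.group("pid")),
                "name": m.group("name"),
                # resident memory at kill time: anonymous plus file-backed pages
                "rss_kb": int(m.group("anon")) + int(m.group("file")),
            })
    return kills


# The "Killed process" line from the log above
sample = (
    "Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: Killed process 8952 "
    "(ovs-vswitchd) total-vm:2573284kB, anon-rss:289176kB, file-rss:12496kB, shmem-rss:0kB"
)
print(find_oom_kills(sample))
```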

Comment 12 Dan Williams 2018-04-26 17:37:00 UTC
You'd think an OOM kill would show up in 'oc describe pod XXXXX' in the events section, but I guess not...

Comment 13 Dan Williams 2018-04-26 18:07:47 UTC
Related: https://github.com/openshift/origin/pull/19531

Comment 14 jmencak 2018-04-27 10:40:08 UTC
(In reply to Dan Williams from comment #11)
> https://github.com/openshift/openshift-ansible/pull/8166

I've successfully installed v3.10.0-0.29.0 with openshift-ansible v3.10.0-0.30.0 + the PR.  Infra nodes are Ready now.  Thanks.

Comment 15 Dan Williams 2018-04-27 21:46:59 UTC
Final analysis...

ovs-vswitchd allocates some internal pthreads based on the number of CPU cores available on the system.  The default stack size on Fedora/RHEL appears to be 8192K so right off the bat the vswitch has #cores * 8192K allocated.  So far OK.

But for some reason vswitchd actually dirties all 8192K of each of those stacks; why, I have no idea.  So they all end up as RSS on vswitch start.  I'll have to pass that over to the OVS team to figure out.  Filed https://bugzilla.redhat.com/show_bug.cgi?id=1572797 to track that.

Anyway, we can work around the memory usage by restricting the number of threads that ovs-vswitchd creates:

ovs-vsctl set Open_vSwitch . other_config:n-revalidator-threads=X
ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=Y

Where X and Y are suitable numbers such as X=2 and Y=5 (the default ratio is ~1:2.5 revalidator:handler), which would result in 64MB of RSS devoted to pthreads.  That is still better than the 320MB that 40 cores would use by default.
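
The arithmetic behind those figures can be checked with a short Python sketch. It assumes, as the analysis above does, that every thread dirties its full 8192K stack; the function name stack_rss_mib is hypothetical. Note that 2 + 5 threads alone account for 56 MiB, so the 64M figure corresponds to eight stacks, presumably counting one thread beyond the handlers and revalidators (that reading is an assumption).

```python
STACK_KIB = 8192  # default thread stack size quoted in the analysis above


def stack_rss_mib(n_threads, stack_kib=STACK_KIB):
    """Worst-case RSS in MiB if every thread dirties its whole stack."""
    return n_threads * stack_kib / 1024


print(stack_rss_mib(40))     # 320.0, one stack per core on a 40-core node
print(stack_rss_mib(2 + 5))  # 56.0, revalidator plus handler threads only
print(stack_rss_mib(8))      # 64.0, the figure quoted above
```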

Comment 17 Meng Bo 2018-05-18 08:54:12 UTC
Checked with v3.10.0-0.47.0 for both openshift-ansible and atomic-openshift.

The environment was set up successfully without any OOM kills.

The ovs pod resources have also been increased:
      "resources": {
                    "limits": {
                        "cpu": "200m",
                        "memory": "400Mi"
                    },
                    "requests": {
                        "cpu": "100m",
                        "memory": "300Mi"
                    }
      }

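As a rough sanity check, the new memory limit can be compared with the ovs-vswitchd RSS reported by the kernel at kill time in comment 9. This is a Python sketch with hypothetical helper names (parse_memory, fits); it handles only the binary suffixes Ki, Mi and Gi, and it ignores the other processes in the container, which count against the same cgroup limit.

```python
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity):
    """Convert a quantity such as '400Mi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain byte count


def fits(rss_kb, limit):
    """True if an RSS given in kB (units of 1024 bytes) is below the limit."""
    return rss_kb * 1024 < parse_memory(limit)


rss_at_kill_kb = 289176 + 12496  # anon-rss + file-rss from the kernel log
print(fits(rss_at_kill_kb, "400Mi"))
```
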
Comment 19 errata-xmlrpc 2018-07-30 19:13:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

