Bug 1571379 - Infra nodes NotReady due to ovs pods getting oomkilled

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Networking |
| Version | 3.10.0 |
| Target Release | 3.10.0 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | aos-scalability-310 |
| Reporter | Jiří Mencák <jmencak> |
| Assignee | Ben Bennett <bbennett> |
| QA Contact | Meng Bo <bmeng> |
| CC | aos-bugs, dcbw, dma, jeder, mbruzek, mifiedle, wmeng |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2018-07-30 19:13:48 UTC |
Description: Jiří Mencák, 2018-04-24 16:01:49 UTC
The only thing that sticks out to me from the logs is that this environment is still pulling the /openshift3/node:v3.10.0 image for the node sync pod, which is https://bugzilla.redhat.com/show_bug.cgi?id=1570257. We need to get that fix in and try again. The old image hasn't been updated in a while, and it also doubles the number of images that have to be downloaded, which slows installation unless all images are pre-cached.

Thanks for looking into this, Scott. I've used openshift-ansible-3.10.0-0.28.0-45-g818ef02 with https://github.com/openshift/openshift-ansible/pull/8122 manually applied. /openshift3/node:v3.10.0 is no longer pulled, but the infra nodes are still NotReady.

```
root@master-0: ~/openshift-ansible # docker images | grep openshift3 | sed 's|^[^/]*/||'
openshift3/ose-haproxy-router            v3.10.0-0.28.0   87e58469f582   14 hours ago    1.64 GB
openshift3/ose-node                      v3.10.0          89f735b7eb1f   14 hours ago    1.83 GB
openshift3/ose-node                      v3.10.0-0.28.0   89f735b7eb1f   14 hours ago    1.83 GB
openshift3/ose-docker-builder            v3.10.0-0.28.0   0d4065236422   14 hours ago    1.62 GB
openshift3/ose-sti-builder               v3.10.0-0.28.0   c730e08aed33   14 hours ago    1.62 GB
openshift3/ose-deployer                  v3.10.0-0.28.0   99f82e8d33e5   14 hours ago    1.62 GB
openshift3/ose-control-plane             v3.10.0          46d30bb81ad9   16 hours ago    1.62 GB
openshift3/ose-docker-registry           v3.10.0-0.28.0   ed628f3a20b2   16 hours ago    436 MB
openshift3/ose-keepalived-ipfailover     v3.10.0-0.28.0   e6d2e9b399e4   16 hours ago    392 MB
openshift3/ose-web-console               v3.10.0          fddfd2420628   16 hours ago    470 MB
openshift3/ose-web-console               v3.10.0-0.28.0   fddfd2420628   16 hours ago    470 MB
openshift3/ose-service-catalog           v3.10.0          9d848896a8b9   16 hours ago    461 MB
openshift3/prometheus-node-exporter      v3.10.0          685a2be8025c   24 hours ago    223 MB
openshift3/registry-console              v3.10            066b245fbf1a   26 hours ago    231 MB
openshift3/logging-fluentd               v3.10            8a98f9c0384e   26 hours ago    286 MB
openshift3/ose-template-service-broker   v3.10            1c9e686124e9   27 hours ago    284 MB
openshift3/ose-pod                       v3.10.0          2f4c3754b74f   28 hours ago    214 MB
openshift3/ose-pod                       v3.10.0-0.28.0   2f4c3754b74f   28 hours ago    214 MB
openshift3/ruby-20-rhel7                 latest           e5833aa7cf85   15 months ago   443 MB
openshift3/python-33-rhel7               latest           e18350a7786c   15 months ago   521 MB
openshift3/php-55-rhel7                  latest           2d6fbdfafa33   15 months ago   569 MB
openshift3/nodejs-010-rhel7              latest           226d0b1b7987   15 months ago   430 MB
openshift3/perl-516-rhel7                latest           45c996c39407   15 months ago   475 MB
```

I'm out of ideas. Here's a diff of the SDN pod logs between a working and a non-working node, with the working node on the left: https://www.diffchecker.com/O22hVjQG

Analysis: ovs-vswitchd is not running in the OVS container on an affected node. This causes openshift-sdn to block in ovs-vsctl, since that process waits for the change to take effect and openshift-sdn does not pass "--timeout=X" to ovs-vsctl. The vswitchd is being told to quit by something; it is not exiting abnormally. I currently suspect either an error in the openshift-ansible OVS container babysitting script from https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_sdn/files/sdn-ovs.yaml, or a race condition where the kubelet overlaps execution of the app containers within the pod, causing an old instance that runs "ovs-ctl stop" to actually stop a newly started instance.
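A minimal shell sketch of the blocking behaviour just described, assuming the br0 bridge that openshift-sdn manages; the port name vethXYZ is a placeholder, and this illustrates ovs-vsctl's default wait semantics rather than the exact openshift-sdn code path:

```sh
# Illustration only: ovs-vsctl commits its change to ovsdb-server and then, by
# default, waits for ovs-vswitchd to pick it up. If vswitchd has died, the call
# never returns, which is how openshift-sdn ends up blocked.
ovs-vsctl add-port br0 vethXYZ            # hangs indefinitely when vswitchd is gone

# Bounding the wait turns the silent hang into a visible, retryable failure:
ovs-vsctl --timeout=30 add-port br0 vethXYZ || echo "OVS did not respond within 30s"
```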
vswitchd on the infra nodes appears to be getting OOM-killed:

```
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ pid ]   uid  tgid total_vm    rss nr_ptes swapents oom_score_adj name
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8651]  1001  8651      691    123       6        0          -998 pod
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8756]     0  8756     2920    227      11        0           999 bash
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8863]     0  8863    13142    195      28        0           999 monitor
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8864]     0  8864    13238    586      29        0           999 ovsdb-server
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8951]     0  8951    14493    753      30        0           999 monitor
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 8952]     0  8952   643321  75418     214        0           999 ovs-vswitchd
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: [ 9310]     0  9310     1090     87       7        0           999 sleep
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: Memory cgroup out of memory: Kill process 9557 (ovs-vswitchd) score 1943 or sacrifice child
Apr 26 10:53:18 infra-node-0.scale-ci.example.com kernel: Killed process 8952 (ovs-vswitchd) total-vm:2573284kB, anon-rss:289176kB, file-rss:12496kB, shmem-rss:0kB
```

You'd think an OOM kill would show up in 'oc describe pod XXXXX' in the events section, but I guess not...

(In reply to Dan Williams from comment #11)
> https://github.com/openshift/openshift-ansible/pull/8166

I've successfully installed v3.10.0-0.29.0 with openshift-ansible v3.10.0-0.30.0 plus the PR. The infra nodes are Ready now. Thanks.

Final analysis: ovs-vswitchd allocates some internal pthreads based on the number of CPU cores available on the system. The default stack size on Fedora/RHEL appears to be 8192K, so right off the bat the vswitch has #cores * 8192K allocated. So far OK. But for some reason vswitchd actually dirties all 8192K of those stacks; why, I have no idea. So they all end up as RSS on vswitch start. I'll have to pass that over to the OVS team to figure out. Filed https://bugzilla.redhat.com/show_bug.cgi?id=1572797 to track that.

Anyway, we can work around the memory usage by restricting the number of threads that ovs-vswitchd creates (a short verification sketch follows at the end of this report):

```
ovs-vsctl set Open_vSwitch . other_config:n-revalidator-threads=X
ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=Y
```

Where X and Y are some suitable numbers like X=2 and Y=5 (the default is roughly a 1:2.5 revalidator:handler ratio), which would result in 64M of RSS devoted to pthreads. Still better than the 320MB that 40 cores would use by default.

Checked with v3.10.0-0.47.0 for both openshift-ansible and atomic-openshift. The environment set up successfully without an OOM kill, and the ovs pod resources have been increased:

```
"resources": {
    "limits": {
        "cpu": "200m",
        "memory": "400Mi"
    },
    "requests": {
        "cpu": "100m",
        "memory": "300Mi"
    }
},
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
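For readers hitting the same symptom on a running node, a hedged sketch of applying and sanity-checking the thread-count workaround from the final analysis above; the X=2/Y=5 values are the examples suggested there, not required settings, and the arithmetic in the comments assumes the 8192K default pthread stack size mentioned earlier:

```sh
# Cap the vswitchd worker threads (example values from the analysis above).
ovs-vsctl --timeout=10 set Open_vSwitch . other_config:n-revalidator-threads=2
ovs-vsctl --timeout=10 set Open_vSwitch . other_config:n-handler-threads=5

# Rough stack arithmetic behind the numbers quoted above:
#   40 cores, default threads: ~40 * 8 MiB = ~320 MiB of stack RSS at startup
#   capped to 2 + 5 threads:   well under the pod's 400Mi memory limit
# Confirm the settings and inspect the live thread count / RSS of vswitchd:
ovs-vsctl --timeout=10 get Open_vSwitch . other_config
ps -o nlwp=,rss=,cmd= -p "$(pidof ovs-vswitchd)"
```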