Bug 1620556
Summary: | [3.10.14] ovs Pods OOMKilled on baremetal nodes
---|---
Product: | OpenShift Container Platform
Component: | Networking
Version: | 3.10.0
Target Release: | 3.10.z
Status: | CLOSED ERRATA
Severity: | urgent
Priority: | urgent
Reporter: | Jaspreet Kaur <jkaur>
Assignee: | Dan Williams <dcbw>
QA Contact: | Meng Bo <bmeng>
CC: | aos-bugs, bbennett, dcbw, jkaur, jokerman, jtudelag, mmccomas, scuppett, shiywang, zzhao
Hardware: | Unspecified
OS: | Unspecified
Type: | Bug
Last Closed: | 2019-06-11 09:30:48 UTC
Description

Jaspreet Kaur 2018-08-23 07:32:52 UTC

```
> rpm -q openshift-ansible
openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch

> rpm -q ansible
ansible-2.4.4.0-1.el7ae.noarch

> ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/jorget/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May 31 2018, 09:41:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

> rpm -qa | grep openshift
openshift-ansible-playbooks-3.10.21-1.git.0.6446011.el7.noarch
atomic-openshift-hyperkube-3.10.14-1.git.0.ba8ae6d.el7.x86_64
atomic-openshift-excluder-3.10.14-1.git.0.ba8ae6d.el7.noarch
openshift-ansible-roles-3.10.21-1.git.0.6446011.el7.noarch
atomic-openshift-docker-excluder-3.10.14-1.git.0.ba8ae6d.el7.noarch
atomic-openshift-3.10.14-1.git.0.ba8ae6d.el7.x86_64
atomic-openshift-clients-3.10.14-1.git.0.ba8ae6d.el7.x86_64
openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch
openshift-ansible-docs-3.10.21-1.git.0.6446011.el7.noarch
atomic-openshift-node-3.10.14-1.git.0.ba8ae6d.el7.x86_64

> oc version
oc v3.10.14
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://XXXXXXXXX:8443
openshift v3.10.14
kubernetes v1.10.0+b81c8f8
```

---

We had a similar issue in https://bugzilla.redhat.com/show_bug.cgi?id=1571379 -- there we fixed a bug where ovs-vswitchd was using 8 MiB per core. The fix was merged in https://github.com/openshift/openshift-ansible/pull/8166/commits/6d9ad9d1ac4c95ea38a8b1aa7d94ac698724c755

How many cores and how much RAM does the node have? Ideally we can actually clamp the memory usage, but if we're not able to, we can add an override to openshift_ansible and give some guidelines.
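The per-core scaling mentioned above is enough to explain the OOM kills on large machines. As a back-of-the-envelope illustration (the numbers below are taken from this thread, not from any new measurement):

```shell
# Rough estimate of ovs-vswitchd's footprint on the affected hardware:
# bug 1571379 found roughly 8 MiB of usage per core, and the nodes in
# this report have 64 logical CPUs (see the lscpu output below).
cores=64          # "CPU(s): 64" from the reporter's lscpu output
per_core_mib=8    # approximate per-core usage observed in bug 1571379
echo "estimated ovs-vswitchd footprint: $(( cores * per_core_mib )) MiB"
# prints: estimated ovs-vswitchd footprint: 512 MiB
```

On a small VM the same arithmetic stays well under a modest pod limit, which would explain why only big baremetal nodes were hit.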
---

https://bugzilla.redhat.com/show_bug.cgi?id=1620556#c2, here you are:

```
> free -h
              total        used        free      shared  buff/cache   available
Mem:           377G        8.9G        360G         27M        7.8G        367G
Swap:            0B          0B          0B

> lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
Stepping:              4
CPU MHz:               2600.000
BogoMIPS:              5200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              22528K
NUMA node0 CPU(s):     0-7,32-39
NUMA node1 CPU(s):     8-15,40-47
NUMA node2 CPU(s):     16-23,48-55
NUMA node3 CPU(s):     24-31,56-63
```

---

Could I get:

```
rpm -qv openvswitch
```

?

---

Do the OVS pods get killed immediately, or does the OOM take some time? If you are able, could you grab:

```
/proc/`pidof ovs-vswitchd`/maps
```

---

(In reply to Dan Williams from comment #4)
> Could I get:
>
> rpm -qv openvswitch
>
> ?

As I already said, we are using OCP 3.10.14:

```
$ oc -n openshift-sdn rsh ovs-drtgw
sh-4.2# rpm -qv openvswitch
openvswitch-2.9.0-47.el7fdp.2.x86_64
```

---

Created attachment 1478888 [details]
ovs pid maps (cat /proc/`pidof ovs-vswitchd`/maps)
(In reply to Dan Williams from comment #5)
> Do the OVS pods get killed immediately, or does the OOM take some time? If
> you are able, could you grab:

I would say it takes some time. On a cluster with 8 nodes (all exactly the same HW, big baremetal), after the installer finished OK, some of the OVS pods started fine, while others were killed by the OOM killer repeatedly. I first tried to delete manually, and in order, the OVS and then the SDN pod of each of the impacted nodes, with no luck. After that I increased both the memory and CPU limits to 1 CPU and 1Gi, then deleted all the OVS and SDN pods manually, and they all started fine.

> /proc/`pidof ovs-vswitchd`/maps

I just attached the logs you requested, gathered inside one of the OVS pods. Don't know if it is related, but keep in mind this log is from an OVS pod running with the 1 CPU and 1Gi limits.

---

Could I get /proc/<pidof ovs-vswitchd>/smaps from the system?

---

Ping on this issue? smaps from ovs-vswitchd will help debug the issue further.

---

Created attachment 1485488 [details]
ovs pid smaps (cat /proc/<pidof ovs-vswitchd>/smaps)
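The smaps data requested above can be summarized directly on the node: each mapping contributes an `Rss:` line in kB, so summing them gives the process's resident set. A minimal sketch (to be run wherever ovs-vswitchd is actually up, e.g. inside the OVS pod):

```shell
# Sum the resident set of ovs-vswitchd from /proc/<pid>/smaps.
# Every mapping in smaps has an "Rss:" line in kB; adding them up
# approximates what the OOM killer is reacting to.
pid=$(pidof ovs-vswitchd 2>/dev/null)
if [ -n "$pid" ]; then
    awk '/^Rss:/ { kb += $2 } END { printf "ovs-vswitchd RSS: %d kB\n", kb }' "/proc/$pid/smaps"
else
    echo "ovs-vswitchd is not running on this host"
fi
```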
---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0786

---

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
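For clusters still on an affected build, the stopgap that worked in this report (raising the OVS pod limits to 1 CPU / 1Gi) can be sketched as a DaemonSet patch. This is an assumption-laden sketch, not a supported procedure: the DaemonSet name `ovs` is inferred from the pod names seen above (e.g. `ovs-drtgw`), and the container index in the patch path is likewise assumed; verify both with `oc -n openshift-sdn get ds` before applying anything.

```shell
# Hypothetical stopgap mirroring the manual limit raise described in the
# comments; the DaemonSet name "ovs" and container index 0 are assumptions.
patch='[{"op": "replace",
         "path": "/spec/template/spec/containers/0/resources/limits",
         "value": {"cpu": "1", "memory": "1Gi"}}]'

# Sanity-check that the patch is valid JSON before handing it to oc.
echo "$patch" | python3 -c 'import json, sys; json.load(sys.stdin)' && echo "patch is valid JSON"

# Apply only when oc is available and logged in to the affected cluster.
if command -v oc >/dev/null 2>&1; then
    oc -n openshift-sdn patch daemonset ovs --type=json -p "$patch"
fi
```

As described in the thread, the OVS and SDN pods on each affected node still need to be deleted afterwards so they restart with the new limits.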