Bug 1551682

Summary: ovs-vswitchd segfaults after any instance is spawned on numa 1 socket
Product: Red Hat OpenStack
Reporter: David Hill <dhill>
Component: openvswitch
Assignee: Kevin Traynor <ktraynor>
Status: CLOSED ERRATA
QA Contact: Ofer Blaut <oblaut>
Severity: high
Docs Contact:
Priority: high
Version: 10.0 (Newton)
CC: amuller, apevec, astupnik, atelang, atragler, chrisw, colleen.o.malley, fbaudin, fleitner, jraju, ksundara, ktraynor, majopela, mfuruta, mlinden, rhos-maint, rkhan, skramaja, srevivo, tredaelli, vchundur, yrachman
Target Milestone: async
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openvswitch-2.6.1-30.git20180130.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-27 23:33:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Hill 2018-03-05 17:12:18 UTC
Description of problem:

ovs-vswitchd segfaults and has to be restarted often:

(gdb) bt
#0  rte_mempool_get_priv (mp=0x9e094c001003) at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1653
#1  rte_pktmbuf_priv_size (mp=0x9e094c001003) at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:987
#2  rte_pktmbuf_detach (m=0x7fdb53508d00) at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:1180
#3  0x000055cd7a90fc18 in __rte_pktmbuf_prefree_seg (m=0x7fdb53508d00) at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:1204
#4  ixgbe_tx_free_bufs (txq=0x7fe19294d140) at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h:146
#5  ixgbe_xmit_pkts_vec (tx_queue=0x7fe19294d140, tx_pkts=0x7fe845ffa3b0, nb_pkts=<optimized out>) at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c:555
#6  0x000055cd7aa2fedf in rte_eth_tx_burst (nb_pkts=<optimized out>, tx_pkts=0x7fe845ffa3b0, queue_id=8, port_id=<optimized out>)
    at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2819
#7  netdev_dpdk_eth_tx_burst (cnt=1, pkts=0x7fe845ffa3b0, qid=8, dev=0x7fe192a26700) at lib/netdev-dpdk.c:1215
#8  netdev_dpdk_send__ (concurrent_txq=<optimized out>, may_steal=<optimized out>, batch=<optimized out>, qid=8, dev=<optimized out>) at lib/netdev-dpdk.c:1701
#9  netdev_dpdk_eth_send (netdev=<optimized out>, qid=<optimized out>, batch=<optimized out>, may_steal=<optimized out>, concurrent_txq=false) at lib/netdev-dpdk.c:1725
#10 0x000055cd7a995cd2 in netdev_send (netdev=<optimized out>, qid=qid@entry=8, batch=batch@entry=0x7fe845ffa3a8, may_steal=may_steal@entry=true, concurrent_txq=concurrent_txq@entry=false)
    at lib/netdev.c:718
#11 0x000055cd7a975a3f in dp_execute_cb (aux_=aux_@entry=0x7fe845ffa750, packets_=packets_@entry=0x7fe845ffa3a8, a=a@entry=0x7fe840006f6c, may_steal=<optimized out>)
    at lib/dpif-netdev.c:4470
#12 0x000055cd7a99d68e in odp_execute_actions (dp=dp@entry=0x7fe845ffa750, batch=batch@entry=0x7fe845ffa3a8, steal=steal@entry=true, actions=<optimized out>, actions_len=<optimized out>, 
    dp_execute_action=dp_execute_action@entry=0x55cd7a9757f0 <dp_execute_cb>) at lib/odp-execute.c:538
#13 0x000055cd7a9745f2 in dp_netdev_execute_actions (now=23068681, actions_len=<optimized out>, actions=<optimized out>, flow=0x7fe840018e80, may_steal=true, packets=<optimized out>, 
    pmd=0x55cd7cab0be0) at lib/dpif-netdev.c:4671
#14 packet_batch_per_flow_execute (now=23068681, pmd=<optimized out>, batch=0x7fe845ffa398) at lib/dpif-netdev.c:3981
#15 dp_netdev_input__ (pmd=pmd@entry=0x55cd7cab0be0, packets=packets@entry=0x7fe845ffa7b0, md_is_valid=md_is_valid@entry=false, port_no=<optimized out>) at lib/dpif-netdev.c:4274
#16 0x000055cd7a9749fd in dp_netdev_input (port_no=<optimized out>, packets=0x7fe845ffa7b0, pmd=0x55cd7cab0be0) at lib/dpif-netdev.c:4283
#17 dp_netdev_process_rxq_port (pmd=pmd@entry=0x55cd7cab0be0, rxq=<optimized out>, port=<optimized out>, port=<optimized out>) at lib/dpif-netdev.c:2871
#18 0x000055cd7a974c9f in pmd_thread_main (f_=0x55cd7cab0be0) at lib/dpif-netdev.c:3133
#19 0x000055cd7a9e1d56 in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:342
#20 0x00007fe8e021be25 in start_thread (arg=0x7fe845ffb700) at pthread_create.c:308
#21 0x00007fe8df61d34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Version-Release number of selected component (if applicable):


How reproducible:
Often

Steps to Reproduce:
1. In this environment, it happens by itself
2.
3.

Actual results:
ovs-vswitchd segfaults

Expected results:
Shouldn't segfault

Additional info:
[dhill@collab-shell overcloud-compute-0]$ cat installed-rpms  | grep openvs
openstack-neutron-openvswitch-9.4.1-5.el7ost.noarch         Wed Nov  8 18:14:35 2017
openvswitch-2.6.1-16.git20161206.el7ost.x86_64              Wed Nov  8 17:56:35 2017
openvswitch-ovn-central-2.6.1-16.git20161206.el7ost.x86_64  Wed Nov  8 18:14:35 2017
openvswitch-ovn-common-2.6.1-16.git20161206.el7ost.x86_64   Wed Nov  8 17:56:50 2017
openvswitch-ovn-host-2.6.1-16.git20161206.el7ost.x86_64     Wed Nov  8 18:14:34 2017
python-openvswitch-2.6.1-16.git20161206.el7ost.noarch       Wed Nov  8 17:50:40 2017
[dhill@collab-shell overcloud-compute-0]$ cat installed-rpms  | grep -i kernel
erlang-kernel-18.3.4.5-3.el7ost.1.x86_64                    Wed Nov  8 18:02:08 2017
kernel-3.10.0-693.5.2.el7.x86_64                            Wed Nov  8 17:16:53 2017
kernel-devel-3.10.0-693.5.2.el7.x86_64                      Wed Nov  8 18:21:11 2017
kernel-headers-3.10.0-693.5.2.el7.x86_64                    Wed Nov  8 18:20:56 2017
kernel-tools-3.10.0-693.5.2.el7.x86_64                      Wed Nov  8 17:17:03 2017
kernel-tools-libs-3.10.0-693.5.2.el7.x86_64                 Wed Nov  8 17:15:25 2017
[dhill@collab-shell overcloud-compute-0]$ cat installed-rpms  | grep -i dpdk
dpdk-16.11.2-4.el7.x86_64                                   Wed Nov  8 18:14:36 2017
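
To check whether an installed package predates the build named in "Fixed In Version", the name-version-release strings above can be compared. The following is a minimal Python sketch, not the real RPM comparison algorithm: the helper names parse_nvr and is_older_than_fix are hypothetical, and only the dotted version and the leading integer of the release are compared, which is an assumption that happens to be sufficient for the strings in this report.

```python
import re

# Taken from the "Fixed In Version" field of this bug.
FIXED_IN = "openvswitch-2.6.1-30.git20180130.el7ost"


def parse_nvr(nvr):
    """Split 'name-version-release[.arch]' into (name, version tuple, release number).

    Simplification: this is not rpm's full version comparison. Only the
    numeric dotted version and the leading integer of the release are kept.
    """
    # Drop a trailing architecture suffix if present.
    nvr = re.sub(r"\.(x86_64|noarch)$", "", nvr)
    name, version, release = nvr.rsplit("-", 2)
    version_key = tuple(int(part) for part in version.split("."))
    release_key = int(release.split(".", 1)[0])
    return name, version_key, release_key


def is_older_than_fix(installed, fixed=FIXED_IN):
    """Return True if the installed package sorts before the fixed build."""
    name_i, ver_i, rel_i = parse_nvr(installed)
    name_f, ver_f, rel_f = parse_nvr(fixed)
    if name_i != name_f:
        raise ValueError("package names differ: %s vs %s" % (name_i, name_f))
    return (ver_i, rel_i) < (ver_f, rel_f)


# Installed package from the listing above.
print(is_older_than_fix("openvswitch-2.6.1-16.git20161206.el7ost.x86_64"))
```

With the installed openvswitch-2.6.1-16 package this reports that the host is running a build older than the fixed one.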

Comment 3 colleen.o.malley 2018-03-07 08:11:32 UTC
The issue seems to occur when VMs start on the second NUMA node. It doesn't happen every time, but it happens enough so that the second NUMA node is unusable in our environment.

Comment 6 David Hill 2018-03-07 14:26:18 UTC
It's not always crashing at the same place:

#0  rte_mempool_default_cache (mp=<optimized out>, mp=<optimized out>, lcore_id=<optimized out>)
    at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1017
#1  rte_mempool_put_bulk (n=1, obj_table=0x7f8c3dff96e8, mp=0x0)
    at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1174
#2  rte_mempool_put (obj=0x7f7f54be7f00, mp=0x0)
    at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1227
#3  ixgbe_tx_free_bufs (txq=0x7f859294d140)
    at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h:148
#4  ixgbe_xmit_pkts_vec (tx_queue=0x7f859294d140, tx_pkts=0x7f8c3dffa3b0, nb_pkts=<optimized out>)
    at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c:555
#5  0x000055b69a8b2edf in rte_eth_tx_burst (nb_pkts=<optimized out>, tx_pkts=0x7f8c3dffa3b0, 
    queue_id=8, port_id=<optimized out>)
    at /usr/src/debug/openvswitch-2.6.1/dpdk-16.11/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2819
#6  netdev_dpdk_eth_tx_burst (cnt=1, pkts=0x7f8c3dffa3b0, qid=8, dev=0x7f8592a26700)
    at lib/netdev-dpdk.c:1215
#7  netdev_dpdk_send__ (concurrent_txq=<optimized out>, may_steal=<optimized out>, 
    batch=<optimized out>, qid=8, dev=<optimized out>) at lib/netdev-dpdk.c:1701
#8  netdev_dpdk_eth_send (netdev=<optimized out>, qid=<optimized out>, batch=<optimized out>, 
    may_steal=<optimized out>, concurrent_txq=false) at lib/netdev-dpdk.c:1725
#9  0x000055b69a818cd2 in netdev_send (netdev=<optimized out>, qid=qid@entry=8, 
    batch=batch@entry=0x7f8c3dffa3a8, may_steal=may_steal@entry=true, 
    concurrent_txq=concurrent_txq@entry=false) at lib/netdev.c:718
#10 0x000055b69a7f8a3f in dp_execute_cb (aux_=aux_@entry=0x7f8c3dffa750, 
    packets_=packets_@entry=0x7f8c3dffa3a8, a=a@entry=0x7f8c28004c0c, may_steal=<optimized out>)
    at lib/dpif-netdev.c:4470
#11 0x000055b69a82068e in odp_execute_actions (dp=dp@entry=0x7f8c3dffa750, 
    batch=batch@entry=0x7f8c3dffa3a8, steal=steal@entry=true, actions=<optimized out>, 
    actions_len=<optimized out>, 
    dp_execute_action=dp_execute_action@entry=0x55b69a7f87f0 <dp_execute_cb>) at lib/odp-execute.c:538
#12 0x000055b69a7f75f2 in dp_netdev_execute_actions (now=52435103, actions_len=<optimized out>, 
    actions=<optimized out>, flow=0x7f8c28003800, may_steal=true, packets=<optimized out>, 
    pmd=0x55b69d463a50) at lib/dpif-netdev.c:4671
#13 packet_batch_per_flow_execute (now=52435103, pmd=<optimized out>, batch=0x7f8c3dffa398)
    at lib/dpif-netdev.c:3981
#14 dp_netdev_input__ (pmd=pmd@entry=0x55b69d463a50, packets=packets@entry=0x7f8c3dffa7b0, 
    md_is_valid=md_is_valid@entry=false, port_no=<optimized out>) at lib/dpif-netdev.c:4274
#15 0x000055b69a7f79fd in dp_netdev_input (port_no=<optimized out>, packets=0x7f8c3dffa7b0, 
    pmd=0x55b69d463a50) at lib/dpif-netdev.c:4283
#16 dp_netdev_process_rxq_port (pmd=pmd@entry=0x55b69d463a50, rxq=<optimized out>, 
    port=<optimized out>, port=<optimized out>) at lib/dpif-netdev.c:2871
#17 0x000055b69a7f7c9f in pmd_thread_main (f_=0x55b69d463a50) at lib/dpif-netdev.c:3133
#18 0x000055b69a864d56 in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:342
#19 0x00007f8cd7cfae25 in start_thread (arg=0x7f8c3dffb700) at pthread_create.c:308
#20 0x00007f8cd70fc34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
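
Although the faulting frame differs between this trace and the one in the description, the two can be compared mechanically to see where they converge. The following is a rough Python sketch under stated assumptions: the helper names frame_functions and innermost_common_frame are hypothetical, and the embedded traces are abbreviated excerpts of the ones in this report (arguments and source paths trimmed).

```python
import re

# Matches frame lines such as "#4  ixgbe_tx_free_bufs (txq=...)" and
# "#3  0x000055cd7a90fc18 in __rte_pktmbuf_prefree_seg (m=...)".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(")


def frame_functions(backtrace):
    """Return the function names of a gdb backtrace, innermost frame first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def innermost_common_frame(trace_a, trace_b):
    """Return the first function in trace_a that also appears in trace_b."""
    seen = set(frame_functions(trace_b))
    for name in frame_functions(trace_a):
        if name in seen:
            return name
    return None


# Abbreviated excerpts of the two traces in this report.
TRACE_DESCRIPTION = """
#0  rte_mempool_get_priv (mp=0x9e094c001003)
#1  rte_pktmbuf_priv_size (mp=0x9e094c001003)
#2  rte_pktmbuf_detach (m=0x7fdb53508d00)
#3  0x000055cd7a90fc18 in __rte_pktmbuf_prefree_seg (m=0x7fdb53508d00)
#4  ixgbe_tx_free_bufs (txq=0x7fe19294d140)
#5  ixgbe_xmit_pkts_vec (tx_queue=0x7fe19294d140)
"""

TRACE_COMMENT_6 = """
#0  rte_mempool_default_cache (mp=<optimized out>)
#1  rte_mempool_put_bulk (n=1, obj_table=0x7f8c3dff96e8, mp=0x0)
#2  rte_mempool_put (obj=0x7f7f54be7f00, mp=0x0)
#3  ixgbe_tx_free_bufs (txq=0x7f859294d140)
#4  ixgbe_xmit_pkts_vec (tx_queue=0x7f859294d140)
"""

print(innermost_common_frame(TRACE_DESCRIPTION, TRACE_COMMENT_6))
```

For these two excerpts the innermost shared frame is ixgbe_tx_free_bufs; in both crashes the mempool pointer handed to the frames below it is invalid (0x9e094c001003 in the description, 0x0 here).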

Comment 36 Yariv 2018-06-24 08:33:16 UTC
Verified on puddle 2018-06-18.1 with the packages listed below.

Note: be aware of the TripleO nova CPU quota when verifying.
[root@compute-0 ~]# rpm -qa | grep openvswitch-2.9
openvswitch-2.9.0-19.el7fdp.1.x86_64
python-openvswitch-2.9.0-19.el7fdp.1.noarch
[root@compute-0 ~]#

Comment 40 errata-xmlrpc 2018-06-27 23:33:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2102

Comment 47 Red Hat Bugzilla 2023-09-15 01:26:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days