
Bug 1578981

Summary: [OVS] [qede] OVS daemon crashed in qede_rss_hash_update while configuring OVS-dpdk with qede dpdk interface
Product: Red Hat Enterprise Linux 7
Reporter: Jean-Tsung Hsiao <jhsiao>
Component: openvswitch
Assignee: Timothy Redaelli <tredaelli>
Status: CLOSED ERRATA
QA Contact: Jean-Tsung Hsiao <jhsiao>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.5
CC: abeausol, arahman, atragler, ctrautma, jhsiao, kfida, kzhang, qding, rasesh.mody, rkhan, tredaelli
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: openvswitch-2.9.0-44.el7fdn
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-21 13:36:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (description / flags):
  Potential fix (none)
  test build (none)
  upstream fix (none)
  test build (none)
  Console log for qede stress testing (none)
  fix dma allocation failure (none)

Description Jean-Tsung Hsiao 2018-05-16 17:51:45 UTC
Description of problem: [qede] OVS daemon crashed in qede_rss_hash_update while configuring OVS-dpdk with qede dpdk interface

(gdb) bt
#0  0x00005588d13dfab2 in qede_rss_hash_update (eth_dev=eth_dev@entry=0x5588d1c08340 <rte_eth_devices+16512>, 
    rss_conf=rss_conf@entry=0x7ffe5370add0)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_ethdev.c:1953
#1  0x00005588d13e2d94 in qede_config_rss (eth_dev=eth_dev@entry=0x5588d1c08340 <rte_eth_devices+16512>)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_ethdev.c:1095
#2  0x00005588d13e33b8 in qede_dev_start (eth_dev=0x5588d1c08340 <rte_eth_devices+16512>)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_ethdev.c:1161
#3  0x00005588d127a0d5 in rte_eth_dev_start (port_id=1)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/lib/librte_ether/rte_ethdev.c:1021
#4  0x00005588d1554c4c in dpdk_eth_dev_init (dev=0x7f5c3f800900) at lib/netdev-dpdk.c:869
#5  netdev_dpdk_reconfigure (netdev=0x7f5c3f8019c0) at lib/netdev-dpdk.c:3632
#6  0x00005588d1479147 in port_reconfigure (port=0x5588d2a78840) at lib/dpif-netdev.c:3341
#7  reconfigure_datapath (dp=dp@entry=0x5588d2b00040) at lib/dpif-netdev.c:3822
#8  0x00005588d1479cc7 in do_add_port (dp=dp@entry=0x5588d2b00040, devname=devname@entry=0x5588d2ac79e0 "dpdk-10", 
    type=0x5588d15adaaf "dpdk", port_no=port_no@entry=2) at lib/dpif-netdev.c:1584
#9  0x00005588d1479e4d in dpif_netdev_port_add (dpif=<optimized out>, netdev=0x7f5c3f8019c0, port_nop=0x7ffe5370b38c)
    at lib/dpif-netdev.c:1610
#10 0x00005588d14803fe in dpif_port_add (dpif=0x5588d2a7d990, netdev=netdev@entry=0x7f5c3f8019c0, 
    port_nop=port_nop@entry=0x7ffe5370b3ec) at lib/dpif.c:580
#11 0x00005588d1431130 in port_add (ofproto_=0x5588d2a5afc0, netdev=0x7f5c3f8019c0) at ofproto/ofproto-dpif.c:3649
#12 0x00005588d1427ae1 in ofproto_port_add (ofproto=0x5588d2a5afc0, netdev=0x7f5c3f8019c0, ofp_portp=0x7ffe5370b4e8)
    at ofproto/ofproto.c:2006
#13 0x00005588d14159d5 in bridge_add_ports__ (br=br@entry=0x5588d2a5c440, 
    wanted_ports=wanted_ports@entry=0x5588d2a5c520, with_requested_port=with_requested_port@entry=true)
    at vswitchd/bridge.c:1799
#14 0x00005588d1417638 in bridge_add_ports (wanted_ports=0x5588d2a5c520, br=0x5588d2a5c440) at vswitchd/bridge.c:942
#15 bridge_reconfigure (ovs_cfg=0x5588d2a81930) at vswitchd/bridge.c:663
#16 0x00005588d141ab99 in bridge_run () at vswitchd/bridge.c:3018
---Type <return> to continue, or q <return> to quit---
#17 0x00005588d1256dad in main (argc=12, argv=0x7ffe5370b9f8) at vswitchd/ovs-vswitchd.c:119
(gdb) quit
[root@netqe9 ~]# 


Version-Release number of selected component (if applicable):
[root@netqe9 ~]# uname -a
Linux netqe9.knqe.lab.eng.bos.redhat.com 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@netqe9 ~]# rpm -q openvswitch
openvswitch-2.9.0-36.el7fdp.x86_64

How reproducible: Reproducible


Steps to Reproduce:
1. Install openvswitch-2.9.0-36.el7fdp.x86_64
2. Systemctl start openvswitch
3. Use a script to configure an OVS-dpdk bridge with qede dpdk interfaces. Attached right below is my script example.
4. Use del-br to delete the bridge just configured.
5. Run the same script again --- it will hang.

{
### Jean's script example --- Should modify to match the test bed.

ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x000002
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="4096,1"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true

ovs-vsctl --if-exists del-br ovsbr0
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xa00a00
ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
ovs-vsctl add-port ovsbr0 dpdk-10 \
    -- set interface dpdk-10 type=dpdk ofport_request=10 options:dpdk-devargs=0000:85:00.0
ovs-vsctl add-port ovsbr0 dpdk-11 \
    -- set interface dpdk-11 type=dpdk ofport_request=11 options:dpdk-devargs=0000:85:00.1

ovs-vsctl --timeout 10 set Interface dpdk-10 options:n_rxq=2
ovs-vsctl --timeout 10 set Interface dpdk-11 options:n_rxq=2

ovs-ofctl del-flows ovsbr0
ovs-ofctl add-flow ovsbr0 in_port=10,actions=output:11
ovs-ofctl add-flow ovsbr0 in_port=11,actions=output:10
ovs-ofctl dump-flows ovsbr0
}

Actual results: The OVS daemon got a segfault.


Expected results: The OVS daemon should not crash.


Additional info:

Comment 2 Paolo Abeni 2018-05-17 17:27:25 UTC
It looks like the issue is still present upstream (as I don't see any change in the relevant driver).

The crash happens in qede_rss_hash_update:

  rss_params.rss_ind_table[i] = qdev->fp_array[idx].rxq->handle;
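
For illustration only, here is a self-contained mock of that pattern (this is not the driver source; only the qdev/fp_array/rxq/handle names come from the trace above, everything else is hypothetical). An RSS indirection table rebuilt from cached per-queue pointers goes stale as soon as the queues are torn down and re-created, which is exactly what a port reconfiguration does:

/* Mock of the stale-handle hazard; NOT the qede driver code. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_QUEUES 2

struct mock_rxq { void *handle; };           /* stand-in for the per-queue HW handle */
struct mock_fp  { struct mock_rxq *rxq; };   /* stand-in for qdev->fp_array[idx]     */

int main(void)
{
    struct mock_fp fp_array[NUM_QUEUES];
    void *rss_ind_table[NUM_QUEUES];
    int i;

    /* Initial port configuration: allocate RX queues, build the RSS table. */
    for (i = 0; i < NUM_QUEUES; i++) {
        fp_array[i].rxq = calloc(1, sizeof(*fp_array[i].rxq));
        fp_array[i].rxq->handle = fp_array[i].rxq;
        rss_ind_table[i] = fp_array[i].rxq->handle;
    }

    /* Reconfiguration (del-br, or an n_rxq change): the queues are released,
     * so the cached fp_array entries no longer refer to valid queues. */
    for (i = 0; i < NUM_QUEUES; i++) {
        free(fp_array[i].rxq);
        fp_array[i].rxq = NULL;
    }

    /* Rebuilding the table from the cached entries, as the statement at
     * qede_ethdev.c:1953 does, dereferences stale queue state. */
    for (i = 0; i < NUM_QUEUES; i++) {
        if (fp_array[i].rxq == NULL) {
            /* The driver performs no such check, so it segfaults here. */
            printf("queue %d: stale entry, would crash in the driver\n", i);
            continue;
        }
        rss_ind_table[i] = fp_array[i].rxq->handle;
    }
    printf("first table entry: %p\n", rss_ind_table[0]);
    return 0;
}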

Rasesh, can you please have a look? The issue is very bad for us; we expect to run the same steps described in the script quite often.

Paolo

Comment 3 Rasesh Mody 2018-05-18 22:25:34 UTC
Hi Paolo,

We tried to recreate the issue in-house using the script mentioned here with openvswitch-2.9.0-36.el7fdp.x86_64.rpm. We tried about 5 iterations but couldn't reproduce a crash. Please let us know if we are missing some step here.

Thanks!
-Rasesh

Comment 4 Jean-Tsung Hsiao 2018-05-21 02:32:43 UTC
(In reply to Rasesh Mody from comment #3)
> Hi Paolo,
> 
> We tried to recreate the issue in-house using the script mentioned here with
> openvswitch-2.9.0-36.el7fdp.x86_64.rpm. We tried about 5 iterations but
> couldn't reproduce a crash. Please let us know if we are missing some step
> here.
> 
> Thanks!
> -Rasesh

Hi Rasesh,

Make sure you run "del-br" before executing the script again.

Thanks!

Jean

Comment 5 Jean-Tsung Hsiao 2018-05-21 02:59:37 UTC
Hi Rasesh,

I just reproduced the issue again.

Make sure the bridge gets built correctly --- check the log and run "ovs-vsctl show".

[root@netqe9 jhsiao]# ovs-vsctl show
5d023537-fac3-46fe-9af4-cee010d6bc24
    Bridge "ovsbr0"
        Port "dpdk-11"
            Interface "dpdk-11"
                type: dpdk
                options: {dpdk-devargs="0000:85:00.1", n_rxq="2"}
        Port "dpdk-10"
            Interface "dpdk-10"
                type: dpdk
                options: {dpdk-devargs="0000:85:00.0", n_rxq="2"}
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
    ovs_version: "2.9.0"
[root@netqe9 jhsiao]# ovs-vsctl del-br ovsbr0
[root@netqe9 jhsiao]# ovs-vsctl show
5d023537-fac3-46fe-9af4-cee010d6bc24
    ovs_version: "2.9.0"
[root@netqe9 jhsiao]# sh /home/jhsiao/ovs_loopback_qede.sh

Comment 6 Rasesh Mody 2018-05-22 05:54:52 UTC
Hi Jean,

We are re-trying a repro in-house. We might need your help in collecting the PMD debug logs from your setup. We are not sure if we can collect DPDK logs with the log level set to debug and with QEDE PMD debug logs enabled with the current FDP build. There is a general log level for DPDK, and it does not default to debug. There is a config option for the QEDE PMD that enables driver debug logs.

Thanks!
-Rasesh

Comment 7 Jean-Tsung Hsiao 2018-05-23 01:08:54 UTC
(In reply to Rasesh Mody from comment #6)
> Hi Jean,
> 
> We are re-trying a repro in-house. We might need your help in collecting the
> PMD debug logs from your setup. Not sure if we can collect DPDK logs with
> log level set to debug and with QEDE PMD debug logs enabled with current FDP
> build. There is a general log level for DPDK and it is not defaulted to
> debug. There is config option for QEDE PMD that enables driver debug logs.
> 
> Thanks!
> -Rasesh

Hi Rasesh,

I just discovered this: If I put "options:n_rxq=2" at the end of each "ovs-vsctl add-port ovsbr0 dpdk-X" line, then there is no daemon crash. See the modified script attached below.

Not sure if you were setting rxq this way.

Thanks!

Jean

{
ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x000002
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="4096,1"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true

ovs-vsctl --if-exists del-br ovsbr0
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xa00a00
ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
ovs-vsctl add-port ovsbr0 dpdk-10 \
    -- set interface dpdk-10 type=dpdk ofport_request=10 options:dpdk-devargs=0000:85:00.0 \
                options:n_rxq=2

ovs-vsctl add-port ovsbr0 dpdk-11 \
    -- set interface dpdk-11 type=dpdk ofport_request=11 options:dpdk-devargs=0000:85:00.1 \
                options:n_rxq=2


ovs-ofctl del-flows ovsbr0
ovs-ofctl add-flow ovsbr0 in_port=10,actions=output:11
ovs-ofctl add-flow ovsbr0 in_port=11,actions=output:10
ovs-ofctl dump-flows ovsbr0
}

Comment 8 Rasesh Mody 2018-05-24 05:29:57 UTC
We were using the original version of the script as is, without modification.

One thing we noticed in our setup was that the OVS daemon, once crashed, automatically gets restarted. There was no indication in the vswitchd logs; however, the kernel logs do show a segfault in the vswitchd process when trying to do the add-port a second time after deletion of the bridge.

Can you help collect driver debug logs with a test build from your setup?

Thanks!
-Rasesh

Comment 9 Paolo Abeni 2018-05-24 08:11:22 UTC
(In reply to Rasesh Mody from comment #8)
> We were using the original version of the script as it without modification.

As noted in the script itself, the script must be modified to be effective. Specifically, the 'options:dpdk-devargs' must match your host's values (and the running kernel needs the appropriate boot parameters, etc.).

May I assume you actually handled the above?

Thanks,

Paolo

Comment 10 Jean-Tsung Hsiao 2018-05-24 18:24:01 UTC
(In reply to Rasesh Mody from comment #8)
> We were using the original version of the script as it without modification.
> 
> One thing we noticed in our setup was that the OVS demon once crashed
> automatically gets restarted. There was no indication in vswitchd logs,
> however kernel logs does show segfault in vswitchd process when trying to do
> add-port the second time after deletion of the bridge.
> 
> Can you help collect driver debug logs with a test build from your setup?
> 
> Thanks!
> -Rasesh

Hi Rasesh,

Yes, there should be no more messages in the log as the daemon already died.

Per the conversation between you and Chris, I can work with you this afternoon on my test bed.

I'll email you the login info once I set up the test bed.

Thanks!

Jean

Comment 11 Rasesh Mody 2018-05-24 18:41:48 UTC
(In reply to Paolo Abeni from comment #9)
> (In reply to Rasesh Mody from comment #8)
> > We were using the original version of the script as it without modification.
> 
> as noted in the script itself, the script must be modified to be effective.
> Specifically the 'options:dpdk-devargs' must match your hosts value (and the
> running kernel needs the appropriate bool parameter, etc.).
> 
> May I guess you actually handled the above?
> 
> Thanks,
> 
> Paolo

What I meant is, we modified only the configuration params that are specific to our host, like the cores and 'options:dpdk-devargs' to be used.
We didn't modify any steps in the script.

Thanks!
-Rasesh

Comment 12 Jean-Tsung Hsiao 2018-05-24 21:22:45 UTC
Hi Rasesh,
Does the following gdb bt provide more info than before?
Thanks!
Jean
============================
Program received signal SIGSEGV, Segmentation fault.
0x00005574e872dab2 in qede_rss_hash_update (
    eth_dev=eth_dev@entry=0x5574e8f56340 <rte_eth_devices+16512>, 
    rss_conf=rss_conf@entry=0x7ffde4992a00)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_ethdev.c:1953
1953			rss_params.rss_ind_table[i] = qdev->fp_array[idx].rxq->handle;
(gdb) bt
#0  0x00005574e872dab2 in qede_rss_hash_update (
    eth_dev=eth_dev@entry=0x5574e8f56340 <rte_eth_devices+16512>, 
    rss_conf=rss_conf@entry=0x7ffde4992a00)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_ethdev.c:1953
#1  0x00005574e8730d94 in qede_config_rss (
    eth_dev=eth_dev@entry=0x5574e8f56340 <rte_eth_devices+16512>)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_ethdev.c:1095
#2  0x00005574e87313b8 in qede_dev_start (
    eth_dev=0x5574e8f56340 <rte_eth_devices+16512>)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_ethdev.c:1161
#3  0x00005574e85c80d5 in rte_eth_dev_start (port_id=1)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/lib/librte_ether/rte_ethdev.c:1021
#4  0x00005574e88a2c4c in dpdk_eth_dev_init (dev=0x7f6fbf9c2ec0)
    at lib/netdev-dpdk.c:869
#5  netdev_dpdk_reconfigure (netdev=0x7f6fbf9c3f80) at lib/netdev-dpdk.c:3632
#6  0x00005574e87c7147 in port_reconfigure (port=0x5574e9801f00)
    at lib/dpif-netdev.c:3341
#7  reconfigure_datapath (dp=dp@entry=0x5574e9882030) at lib/dpif-netdev.c:3822
#8  0x00005574e87c7cc7 in do_add_port (dp=dp@entry=0x5574e9882030, 
---Type <return> to continue, or q <return> to quit---
    devname=devname@entry=0x5574e97fde60 "dpdk-10", 
    type=0x5574e88fbaaf "dpdk", port_no=port_no@entry=2)
    at lib/dpif-netdev.c:1584
#9  0x00005574e87c7e4d in dpif_netdev_port_add (dpif=<optimized out>, 
    netdev=0x7f6fbf9c3f80, port_nop=0x7ffde4992fbc) at lib/dpif-netdev.c:1610
#10 0x00005574e87ce3fe in dpif_port_add (dpif=0x5574e97fbda0, 
    netdev=netdev@entry=0x7f6fbf9c3f80, port_nop=port_nop@entry=0x7ffde499301c)
    at lib/dpif.c:580
#11 0x00005574e877f130 in port_add (ofproto_=0x5574e985d360, 
    netdev=0x7f6fbf9c3f80) at ofproto/ofproto-dpif.c:3649
#12 0x00005574e8775ae1 in ofproto_port_add (ofproto=0x5574e985d360, 
    netdev=0x7f6fbf9c3f80, ofp_portp=0x7ffde4993118) at ofproto/ofproto.c:2006
#13 0x00005574e87639d5 in bridge_add_ports__ (br=br@entry=0x5574e97dbe10, 
    wanted_ports=wanted_ports@entry=0x5574e97dbef0, 
    with_requested_port=with_requested_port@entry=true)
    at vswitchd/bridge.c:1799
#14 0x00005574e8765638 in bridge_add_ports (wanted_ports=0x5574e97dbef0, 
    br=0x5574e97dbe10) at vswitchd/bridge.c:942
#15 bridge_reconfigure (ovs_cfg=0x5574e98027b0) at vswitchd/bridge.c:663
#16 0x00005574e8768b99 in bridge_run () at vswitchd/bridge.c:3018
#17 0x00005574e85a4dad in main (argc=12, argv=0x7ffde4993628)
    at vswitchd/ovs-vswitchd.c:119
(gdb)

Comment 13 Jean-Tsung Hsiao 2018-05-25 00:15:37 UTC
NOTE: Sometimes a new OVS daemon will be spawned, but sometimes the sh command will hang.

Comment 14 Rasesh Mody 2018-05-25 06:03:18 UTC
Hi Jean,

For us, systemd always restarts the OVS daemon after a failure and doesn't leave the setup in a failed state. The OVS daemon recovers from the previous failure caused by the segfault and runs fine after recovery. We could not see the daemon crash clearly in the vswitchd logs. Due to the recovery, the logs showed OVS running successfully with the bridge and all other configuration added in successive iterations.

Disabling restart-on-failure in the systemd service settings for the OVS daemon leaves the system in the failed state. We do have a repro and should be able to collect additional debug logs.

Thanks for setting up the test bed for remote debug. We might need it at a later point.

Comment 15 Rasesh Mody 2018-06-01 06:59:28 UTC
Created attachment 1446541 [details]
Potential fix

Comment 16 Rasesh Mody 2018-06-01 07:02:43 UTC
There is a port reconfiguration because there is a change in the number of RXQs. The RSS hash update is part of the port configuration. During the port RXQ reconfiguration, the RSS hash update is using stale RXQ handles for the redirection table. We're modifying the logic to use the proper RXQ handles.
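
A rough sketch of that direction (not the attached patch; all names here are invented for illustration): rebuild each redirection-table entry from the queues the port currently owns at configuration time, so a change in RXQ count cannot leave stale handles behind.

/* Illustrative sketch only -- not the submitted qede patch. */
#include <stddef.h>

#define MOCK_RSS_IND_TABLE_SIZE 128

struct mock_rxq  { void *handle; };
struct mock_port {
    struct mock_rxq **rx_queues;        /* queues as currently configured */
    unsigned int      nb_rx_queues;
    void             *rss_ind_table[MOCK_RSS_IND_TABLE_SIZE];
};

/* Build the redirection table from the port's *current* RX queues rather than
 * from handles cached before the queue count changed. */
static int mock_config_rss(struct mock_port *port)
{
    unsigned int i, idx;

    if (port->nb_rx_queues == 0)
        return -1;

    for (i = 0; i < MOCK_RSS_IND_TABLE_SIZE; i++) {
        idx = i % port->nb_rx_queues;
        if (port->rx_queues[idx] == NULL)   /* queue not (or no longer) set up */
            return -1;
        port->rss_ind_table[i] = port->rx_queues[idx]->handle;
    }
    return 0;
}

int main(void)
{
    struct mock_rxq q0 = { &q0 }, q1 = { &q1 };
    struct mock_rxq *queues[] = { &q0, &q1 };
    struct mock_port port = { queues, 2, { NULL } };

    return mock_config_rss(&port) == 0 ? 0 : 1;
}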

Attached is a potential fix.

Thanks!
-Rasesh

Comment 18 Jean-Tsung Hsiao 2018-06-01 14:04:52 UTC
Hi Rasesh,

So far so good. There have been no segfaults after many sh's and del-br's.

NOTE: The daemon pid got changed once at the start, but I couldn't reproduce it after a reboot.

Below is the test rpm that I've been testing:

[root@netqe9 ~]# rpm -q openvswitch
openvswitch-2.9.0-37.el7fdp.bz1578981.1.x86_64

Thanks!

Jean

Comment 19 Rasesh Mody 2018-06-04 22:48:55 UTC
(In reply to Jean-Tsung Hsiao from comment #18)
> Hi Rasesh,
> 
> So far so good. There have been no segfaults after many sh's and del-br's.
> 
> NOTE: The daemon pid got changed once at the start. But, I couldn't
> reproduce it after reboot.
> 
> Below is the test rpm that I've been testing:
> 
> [root@netqe9 ~]# rpm -q openvswitch
> openvswitch-2.9.0-37.el7fdp.bz1578981.1.x86_64
> Thanks!
> 
> Jean

Thanks.

Comment 20 Rasesh Mody 2018-06-04 22:53:04 UTC
The fix was submitted to dpdk.org and marked for stable.

Comment 21 Jean-Tsung Hsiao 2018-06-06 18:19:03 UTC
There is a new crash with the .1 and .2 test builds.
Please check the following back trace. It's different from the original issue.

*** The crash happened at the 5th run every time ***
for i in {1..5}; do echo Test $i; sh /home/jhsiao/ovs_loopback_qede.sh; sleep 30; ps -elf | grep ovs-vs; dmesg | grep segfault; done

*** gdb back trace ***

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f67a0fe9700 (LWP 76666)]
qede_recv_pkts (p_rxq=0x0, rx_pkts=0x7f67a0fe8770, nb_pkts=32)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_rxtx.c:1275
1275		struct qede_dev *qdev = rxq->qdev;
(gdb) bt
#0  qede_recv_pkts (p_rxq=0x0, rx_pkts=0x7f67a0fe8770, nb_pkts=32)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_rxtx.c:1275
#1  0x0000563bc5025d4b in rte_eth_rx_burst (nb_pkts=32, rx_pkts=0x7f67a0fe8770, queue_id=0, port_id=2)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2897
#2  netdev_dpdk_rxq_recv (rxq=<optimized out>, batch=0x7f67a0fe8760) at lib/netdev-dpdk.c:1923
#3  0x0000563bc4f73281 in netdev_rxq_recv (rx=<optimized out>, batch=batch@entry=0x7f67a0fe8760) at lib/netdev.c:701
#4  0x0000563bc4f4c82f in dp_netdev_process_rxq_port (pmd=pmd@entry=0x563bc6b4d850, rxq=0x563bc65f3490, port_no=3)
    at lib/dpif-netdev.c:3279
#5  0x0000563bc4f4cc3a in pmd_thread_main (f_=<optimized out>) at lib/dpif-netdev.c:4145
#6  0x0000563bc4fc98c6 in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:348
#7  0x00007f67bf19add5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f67be598b3d in clone () from /lib64/libc.so.6

Comment 22 Rasesh Mody 2018-06-06 18:36:26 UTC
(In reply to Jean-Tsung Hsiao from comment #21)
> There is a new crash with .1 and .2 test build.
> Please check the following back trace. It's different from the original
> issue.
> 
> *** The crash happened at the 5th run all the time ***
> for i in {1..5}; do echo Test $i; sh /home/jhsiao/ovs_loopback_qede.sh;
> sleep 30; ps -elf | grep ovs-vs; dmesg | grep segfault; done
> 
> *** gdb back trace ***
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7f67a0fe9700 (LWP 76666)]
> qede_recv_pkts (p_rxq=0x0, rx_pkts=0x7f67a0fe8770, nb_pkts=32)
>     at
> /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_rxtx.c:1275
> 1275		struct qede_dev *qdev = rxq->qdev;
> (gdb) bt
> #0  qede_recv_pkts (p_rxq=0x0, rx_pkts=0x7f67a0fe8770, nb_pkts=32)
>     at
> /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/qede/qede_rxtx.c:1275
> #1  0x0000563bc5025d4b in rte_eth_rx_burst (nb_pkts=32,
> rx_pkts=0x7f67a0fe8770, queue_id=0, port_id=2)
>     at
> /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/x86_64-native-linuxapp-gcc/
> include/rte_ethdev.h:2897
> #2  netdev_dpdk_rxq_recv (rxq=<optimized out>, batch=0x7f67a0fe8760) at
> lib/netdev-dpdk.c:1923
> #3  0x0000563bc4f73281 in netdev_rxq_recv (rx=<optimized out>,
> batch=batch@entry=0x7f67a0fe8760) at lib/netdev.c:701
> #4  0x0000563bc4f4c82f in dp_netdev_process_rxq_port
> (pmd=pmd@entry=0x563bc6b4d850, rxq=0x563bc65f3490, port_no=3)
>     at lib/dpif-netdev.c:3279
> #5  0x0000563bc4f4cc3a in pmd_thread_main (f_=<optimized out>) at
> lib/dpif-netdev.c:4145
> #6  0x0000563bc4fc98c6 in ovsthread_wrapper (aux_=<optimized out>) at
> lib/ovs-thread.c:348
> #7  0x00007f67bf19add5 in start_thread () from /lib64/libpthread.so.0
> #8  0x00007f67be598b3d in clone () from /lib64/libc.so.6

Do you see other error messages in the OVS vswitchd logs during the 5th iteration before we see the crash? I am wondering why it happens only in the 5th iteration.

Thanks!
-Rasesh

Comment 23 Rasesh Mody 2018-06-06 18:55:39 UTC
The revised fix has been accepted upstream. Attaching the test build with the revised fix.

Hi Jean,

Can you verify the original issue?

Thanks!
-Rasesh

Comment 24 Rasesh Mody 2018-06-06 18:56:30 UTC
Created attachment 1448413 [details]
test build

Comment 25 Rasesh Mody 2018-06-06 18:57:42 UTC
Created attachment 1448414 [details]
upstream fix

Comment 26 Rasesh Mody 2018-06-07 00:20:03 UTC
Created attachment 1448523 [details]
test build

Regarding the new issue reported above, we've observed a similar issue in our internal testing. It can happen due to multiple port-reconfiguration attempts, which can lead to DMA allocation failures and result in a segfault.
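
As a rough illustration of that failure chain (mock code, not the driver or the submitted fix; all names besides rxq/qdev are hypothetical): if queue setup fails part-way through a reconfiguration, for example because a DMA allocation fails, the datapath can be left polling a NULL queue pointer, and the first RX burst dereferences it, which matches the p_rxq=0x0 / qede_rxtx.c:1275 trace above.

/* Mock of the NULL-queue failure chain; NOT the qede driver or its fix. */
#include <stdio.h>
#include <stdlib.h>

struct mock_qdev { int dummy; };
struct mock_rxq  { struct mock_qdev *qdev; };

/* Queue setup that can fail, e.g. when a DMA allocation fails after repeated
 * port reconfigurations. On failure the caller is left with a NULL queue. */
static struct mock_rxq *mock_rx_queue_setup(int dma_alloc_fails)
{
    if (dma_alloc_fails)
        return NULL;
    return calloc(1, sizeof(struct mock_rxq));
}

/* Shape of an RX burst: the first thing it does is dereference the queue,
 * mirroring "struct qede_dev *qdev = rxq->qdev;" at qede_rxtx.c:1275. */
static int mock_recv_pkts(struct mock_rxq *rxq)
{
    struct mock_qdev *qdev = rxq->qdev;   /* segfaults if rxq == NULL */
    (void)qdev;
    return 0;
}

int main(void)
{
    /* Pretend the DMA allocation fails on a later reconfiguration attempt. */
    struct mock_rxq *rxq = mock_rx_queue_setup(/* dma_alloc_fails = */ 1);

    if (rxq == NULL) {
        /* The PMD thread has no such check; it polls the NULL queue and
         * crashes. Fixing the allocation failure itself (as the attached
         * patch aims to) prevents reaching this state. */
        printf("rx queue setup failed; polling it would segfault\n");
        return 1;
    }
    return mock_recv_pkts(rxq);
}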

We have a potential fix that we are submitting upstream.

Hi Jean,

Can you try the new test build, which has an added fix for the new issue?

Thanks!
-Rasesh

Comment 29 Jean-Tsung Hsiao 2018-06-07 01:46:30 UTC
Still seeing a segfault at the 5th run:

[476125.764833] pmd569[89356]: segfault at 548 ip 00005616255a6da7 sp 00007f1b64ff05e0 error 4
[476125.765013] ovs-vswitchd[88568]: segfault at 130 ip 00005616255a1951 sp 00007ffd9ec2a130 error 4

NOTE: This time there was a companion segfault from a pmd process.

Comment 30 Jean-Tsung Hsiao 2018-06-07 14:42:50 UTC
Hi Rasesh,

Sorry, I was running the wrong test build.

With openvswitch-2.9.0-36.bz1578981.4.el7 there were no more segfaults after 10 sh runs.

Thanks!

Jean

Comment 31 Jean-Tsung Hsiao 2018-06-07 14:48:05 UTC
Created attachment 1448735 [details]
Console log for qede stress testing

No more segfaults after a loop of 10 "sh /home/jhsiao/ovs_loopback_qede.sh".

See the attached console log.

Comment 32 Jean-Tsung Hsiao 2018-06-07 19:42:34 UTC
Hi Tim,

Based on my testing, Rasesh's openvswitch-2.9.0-36.bz1578981.4.el7 package fixes the original crash and the second crash caused by the "sh /home/jhsiao/ovs_loopback_qede.sh" loop. It also fixes Bug #1578590.

Thanks!

Jean

Comment 33 Rasesh Mody 2018-06-08 19:07:04 UTC
Created attachment 1449202 [details]
fix dma allocation failure

The fix for the DMA allocation failure has been accepted upstream.

Comment 34 Jean-Tsung Hsiao 2018-06-18 23:56:17 UTC
The fix for this bug has been verified with OVS 2.9.0-47.

Comment 36 errata-xmlrpc 2018-06-21 13:36:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1962