Bug 1589264
Summary: [OVS] [bnxt] OVS daemon got segfault when adding bnxt dpdk interface to OVS-dpdk bridge

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | openvswitch |
| Version | 7.5 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Jean-Tsung Hsiao <jhsiao> |
| Assignee | Davide Caratti <dcaratti> |
| QA Contact | Jean-Tsung Hsiao <jhsiao> |
| CC | ajit.khaparde, aloughla, atragler, ctrautma, jhsiao, kzhang, pvauter |
| Target Milestone | rc |
| Fixed In Version | openvswitch-2.9.0-51.el7fdn |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2018-08-15 13:53:04 UTC |
Description (Jean-Tsung Hsiao, 2018-06-08 14:41:55 UTC)
The daemon also got a segfault. Below is a gdb backtrace:

```
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f3bfa7fc700 (LWP 3667)]
bnxt_recv_pkts (rx_queue=0x0, rx_pkts=0x7f3bfa7fb770, nb_pkts=32)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/bnxt/bnxt_rxr.c:536
536             struct bnxt_rx_ring_info *rxr = rxq->rx_ring;
(gdb) bt
#0  bnxt_recv_pkts (rx_queue=0x0, rx_pkts=0x7f3bfa7fb770, nb_pkts=32)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/bnxt/bnxt_rxr.c:536
#1  0x00005627206d6d4b in rte_eth_rx_burst (nb_pkts=32, rx_pkts=0x7f3bfa7fb770, queue_id=1, port_id=0)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2897
#2  netdev_dpdk_rxq_recv (rxq=<optimized out>, batch=0x7f3bfa7fb760) at lib/netdev-dpdk.c:1923
#3  0x0000562720624281 in netdev_rxq_recv (rx=<optimized out>, batch=batch@entry=0x7f3bfa7fb760) at lib/netdev.c:701
#4  0x00005627205fd82f in dp_netdev_process_rxq_port (pmd=pmd@entry=0x7f3cb8358010, rxq=0x562721bd3110, port_no=2)
    at lib/dpif-netdev.c:3279
#5  0x00005627205fdc3a in pmd_thread_main (f_=<optimized out>) at lib/dpif-netdev.c:4145
#6  0x000056272067a8c6 in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:348
#7  0x00007f3cd7c78dd5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f3cd7076b3d in clone () from /lib64/libc.so.6
(gdb) q
A debugging session is active.
        Inferior 1 [process 3506] will be detached.
Quit anyway? (y or n) y
Quitting: Can't detach Thread 0x7f3c197fa700 (LWP 3665): No such process
[root@netqe22 ~]#
```

For bnxt with a different firmware there is no such issue:

```
[root@netqe16 jhsiao]# ethtool -i p7p1
driver: bnxt_en
version: 1.8.0
firmware-version: 20.6.55.0
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no
[root@netqe16 jhsiao]#
```

Davide Caratti:
hello Jean-Tsung,
- is this issue systematic?
- do you have a reproducer script for the segfault at comment #2?

thank you in advance!
-- davide

Jean-Tsung Hsiao:
Nothing special. Just add bnxt to an OVS-dpdk bridge. Below is my example.

```
# Please change parameters accordingly
ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x000004
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="4096,4096"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl --if-exists del-br ovsbr0
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x500500
ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
ovs-vsctl add-port ovsbr0 dpdk-10 \
    -- set interface dpdk-10 type=dpdk ofport_request=10 options:dpdk-devargs=0000:05:00.0
ovs-vsctl add-port ovsbr0 dpdk-11 \
    -- set interface dpdk-11 type=dpdk ofport_request=11 options:dpdk-devargs=0000:05:00.1
ovs-vsctl --timeout 10 set Interface dpdk-10 options:n_rxq=2
ovs-vsctl --timeout 10 set Interface dpdk-11 options:n_rxq=2
ovs-ofctl del-flows ovsbr0
ovs-ofctl add-flow ovsbr0 in_port=10,actions=output:11
ovs-ofctl add-flow ovsbr0 in_port=11,actions=output:10
ovs-ofctl dump-flows ovsbr0
```

Ajit Khaparde:
And this is a VF and not a PF. Right?

Davide Caratti:
(In reply to Ajit Khaparde from comment #6)
> And this is a VF and not a PF. Right?

It does not look like a VF, see comment #3. But I don't have FW 20.x, can you please check if this happens with version 20.8.x?

thanks!
-- davide

Davide Caratti:
(In reply to Davide Caratti from comment #7)
Scratch my question.
I was assuming that the fault was reproducible on old FWs, not new FWs, but now I read correctly:

FW 20.x -> no segfault
FW 212.x -> segfault

Jean-Tsung Hsiao:
(In reply to Davide Caratti from comment #8)
Correct! I am very surprised.

NOTE: I don't own the server at this moment. It's being used for other testing.

Ajit Khaparde:
When you get your server back, can you try a patch? I am yet to see a segfault with the firmware I have on my setup, so I am trying to get to the exact version you are using and try again.

Davide Caratti:
(In reply to Ajit Khaparde from comment #10)
hello Ajit, thanks for looking at this! I can reproduce the segfault and the reported errors on netdev90, using the latest FDP openvswitch. If you share the code, I can build/test the patch you mention in comment #10 and give you feedback; please let me know how you want to proceed.

regards,
-- davide

Jean-Tsung Hsiao:
(In reply to Ajit Khaparde from comment #10)
Where can I get the test build?

Ajit Khaparde:
Created attachment 1452067 [details]: If new MTU is not greater than mbuf size don't update HW

Can you try the attached patch.
Jean-Tsung Hsiao:
(In reply to Ajit Khaparde from comment #13)
Sorry, I am waiting for a test build, not a patch.

Davide Caratti:
hello, I made a build applying the attached patch on top of the latest FDN and did a quick retest. There are still some ERR messages in the PMD log:

```
2018-06-18T10:03:14.528Z|00466|dpdk|INFO|PMD: Force Link Down
2018-06-18T10:03:14.529Z|00467|dpdk|ERR|PMD: bnxt_hwrm_port_clr_stats error 65535:0:00000000:0000
2018-06-18T10:03:14.530Z|00468|dpdk|ERR|PMD: bnxt_hwrm_vnic_tpa_cfg error 2:0:00000000:01f2
2018-06-18T10:03:14.550Z|00469|dpdk|INFO|PMD: New MTU is 1500
2018-06-18T10:03:14.571Z|00470|dpdk|ERR|PMD: bnxt_hwrm_vnic_tpa_cfg error 2:0:00000000:01f2
2018-06-18T10:03:14.576Z|00471|dpdk|ERR|PMD: bnxt_hwrm_vnic_tpa_cfg error 2:0:00000000:01f2
2018-06-18T10:03:14.580Z|00472|dpdk|INFO|PMD: bnxt_init_chip(): intr_vector = 2
2018-06-18T10:03:14.589Z|00473|dpdk|INFO|PMD: Port 1 Link Down
```

but the segfault does not seem to happen anymore. @Jean, can you confirm?

Ajit Khaparde:
Thanks for the update Davide. We have a firmware fix and a PMD change for bnxt_hwrm_port_clr_stats. I will take a look at the bnxt_hwrm_vnic_tpa_cfg error.

Jean-Tsung Hsiao:
Hi Davide, yes, I got the same result as you did. Thanks for the test build.

Jean

Ajit Khaparde:
(In reply to Davide Caratti from comment #20)
> https://mails.dpdk.org/archives/dev/2018-June/104698.html

Thanks for updating Davide. I was planning to update the bug once the patch was applied. But this will work as well.

Jean-Tsung Hsiao:
Waiting for netqe22 to verify the fix.

Jean-Tsung Hsiao:
The fix has been verified using OVS 2.9.0-55.

The openvswitch component is delivered through the fast datapath channel; it is not documented in release notes. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2432