Bug 1589264

Summary: [OVS] [bnxt] OVS daemon got segfault when adding bnxt dpdk interface to OVS-dpdk bridge
Product: Red Hat Enterprise Linux 7              Reporter: Jean-Tsung Hsiao <jhsiao>
Component: openvswitch                           Assignee: Davide Caratti <dcaratti>
Status: CLOSED ERRATA                            QA Contact: Jean-Tsung Hsiao <jhsiao>
Severity: high                                   Docs Contact:
Priority: high
Version: 7.5                                     CC: ajit.khaparde, aloughla, atragler, ctrautma, jhsiao, kzhang, pvauter
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: openvswitch-2.9.0-51.el7fdn    Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-08-15 13:53:04 UTC             Type: Bug
Regression: ---                                  Mount Type: ---
Documentation: ---                               CRM:
Verified Versions:                               Category: ---
oVirt Team: ---                                  RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---                             Target Upstream Version:
Embargoed:
Attachments:
Description                                                Flags
If new MTU is not greater than mbuf size don't update HW   none

Description Jean-Tsung Hsiao 2018-06-08 14:41:55 UTC
Description of problem: [OVS] [bnxt] Encountered bnxt-specific ERR messages while configuring OVS-dpdk with a bnxt dpdk interface

2018-06-08T14:17:42.467Z|00151|dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 10 destroyed.
2018-06-08T14:17:42.469Z|00152|dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 20 destroyed.
2018-06-08T14:17:42.471Z|00153|dpif_netdev|INFO|PMD thread on numa_id: 0, core id:  8 destroyed.
2018-06-08T14:17:42.472Z|00154|dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 22 destroyed.
2018-06-08T14:17:42.474Z|00155|dpif_netdev|INFO|PMD thread on numa_id: 0, core id:  0 created.
2018-06-08T14:17:42.475Z|00156|dpif_netdev|INFO|PMD thread on numa_id: 1, core id:  1 created.
2018-06-08T14:17:42.475Z|00157|dpif_netdev|INFO|There are 1 pmd threads on numa node 0
2018-06-08T14:17:42.475Z|00158|dpif_netdev|INFO|There are 1 pmd threads on numa node 1
2018-06-08T14:17:42.475Z|00159|dpdk|INFO|PMD: Force Link Down
2018-06-08T14:17:42.477Z|00160|dpdk|ERR|PMD: bnxt_hwrm_port_clr_stats error 65535:0:00000000:0000
2018-06-08T14:17:42.482Z|00161|dpdk|ERR|PMD: bnxt_hwrm_vnic_tpa_cfg error 2:0:00000000:01f2
2018-06-08T14:17:42.502Z|00162|dpdk|INFO|PMD: New MTU is 1500
2018-06-08T14:17:42.503Z|00163|dpdk|ERR|PMD: bnxt_hwrm_vnic_plcmode_cfg error 2:0:00000000:01cb
2018-06-08T14:17:42.503Z|00164|netdev_dpdk|ERR|Interface dpdk-10 MTU (1500) setup error: Unknown error -2
2018-06-08T14:17:42.503Z|00165|netdev_dpdk|ERR|Interface dpdk-10(rxq:1 txq:3) configure error: Unknown error -2
2018-06-08T14:17:42.503Z|00166|dpif_netdev|ERR|Failed to set interface dpdk-10 new configuration



Version-Release number of selected component (if applicable):
[root@netqe22 jhsiao]# ethtool -i p5p1
driver: bnxt_en
version: 1.8.0
firmware-version: 212.0.92.0
expansion-rom-version: 
bus-info: 0000:07:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no


[root@netqe22 jhsiao]# rpm -q openvswitch
openvswitch-2.9.0-37.el7fdp.x86_64
[root@netqe22 jhsiao]# uname -a
Linux netqe22.knqe.lab.eng.bos.redhat.com 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux


How reproducible: Reproducible


Steps to Reproduce:
1. Configure an OVS-dpdk bridge using bnxt with the same firmware mentioned above.

Actual results:
Got ERR messages that resulted in a failure to add the bnxt dpdk interface to the OVS-dpdk bridge.

Expected results:
Should succeed!

Additional info:

Comment 2 Jean-Tsung Hsiao 2018-06-08 15:59:54 UTC
The daemon also got a segfault. Below is a gdb backtrace.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f3bfa7fc700 (LWP 3667)]
bnxt_recv_pkts (rx_queue=0x0, rx_pkts=0x7f3bfa7fb770, nb_pkts=32)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/bnxt/bnxt_rxr.c:536
536		struct bnxt_rx_ring_info *rxr = rxq->rx_ring;
(gdb) bt
#0  bnxt_recv_pkts (rx_queue=0x0, rx_pkts=0x7f3bfa7fb770, nb_pkts=32)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/drivers/net/bnxt/bnxt_rxr.c:536
#1  0x00005627206d6d4b in rte_eth_rx_burst (nb_pkts=32, rx_pkts=0x7f3bfa7fb770, 
    queue_id=1, port_id=0)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2897
#2  netdev_dpdk_rxq_recv (rxq=<optimized out>, batch=0x7f3bfa7fb760)
    at lib/netdev-dpdk.c:1923
#3  0x0000562720624281 in netdev_rxq_recv (rx=<optimized out>, 
    batch=batch@entry=0x7f3bfa7fb760) at lib/netdev.c:701
#4  0x00005627205fd82f in dp_netdev_process_rxq_port (pmd=pmd@entry=0x7f3cb8358010, 
    rxq=0x562721bd3110, port_no=2) at lib/dpif-netdev.c:3279
#5  0x00005627205fdc3a in pmd_thread_main (f_=<optimized out>)
    at lib/dpif-netdev.c:4145
#6  0x000056272067a8c6 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:348
#7  0x00007f3cd7c78dd5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f3cd7076b3d in clone () from /lib64/libc.so.6
(gdb) q
A debugging session is active.

	Inferior 1 [process 3506] will be detached.

Quit anyway? (y or n) y
Quitting: Can't detach Thread 0x7f3c197fa700 (LWP 3665): No such process
[root@netqe22 ~]#
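
For context on the trace: rte_eth_rx_burst() hands dev->data->rx_queues[queue_id] directly to the PMD's receive handler, and the faulting line (bnxt_rxr.c:536, "rxr = rxq->rx_ring") dereferences that pointer right away, so a queue slot presumably left NULL by the failed reconfiguration in the description faults on the first poll. A minimal, self-contained illustration of that shape; the sketch_* names and simplified types are made up for illustration and are not the DPDK sources:

#include <stdint.h>

struct sketch_rx_ring_info { uint32_t rx_prod; };
struct sketch_rx_queue     { struct sketch_rx_ring_info *rx_ring; };

/* Stand-in for a PMD rx burst callback: it trusts the queue pointer
 * handed to it and dereferences it immediately. */
static uint16_t
sketch_recv_pkts(void *rx_queue, void **rx_pkts, uint16_t nb_pkts)
{
    struct sketch_rx_queue *rxq = rx_queue;
    /* Equivalent of bnxt_rxr.c:536 -- faults when rx_queue == NULL. */
    struct sketch_rx_ring_info *rxr = rxq->rx_ring;

    (void)rxr;
    (void)rx_pkts;
    (void)nb_pkts;
    return 0; /* a real handler would fill rx_pkts from the ring */
}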

Comment 3 Jean-Tsung Hsiao 2018-06-08 16:19:38 UTC
For a bnxt NIC with different firmware there is no such issue.

[root@netqe16 jhsiao]# ethtool -i p7p1
driver: bnxt_en
version: 1.8.0
firmware-version: 20.6.55.0
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no
[root@netqe16 jhsiao]#

Comment 4 Davide Caratti 2018-06-13 07:21:20 UTC
hello Jean-Tsung,

- is this issue systematic?
- do you have a reproducer script for the segfault at comment #2?

thank you in advance!
-- 
davide

Comment 5 Jean-Tsung Hsiao 2018-06-13 15:28:19 UTC
Nothing special. Just add the bnxt interfaces to an OVS-dpdk bridge. Below is my example.


# Please change parameters accordingly
# DPDK initialization: reset other_config, then enable DPDK (lcore mask, socket memory, dpdk-init)
ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x000004
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="4096,4096"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true

# Recreate the bridge and add the two bnxt dpdk ports by PCI address
ovs-vsctl --if-exists del-br ovsbr0
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x500500
ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
ovs-vsctl add-port ovsbr0 dpdk-10 \
    -- set interface dpdk-10 type=dpdk ofport_request=10 options:dpdk-devargs=0000:05:00.0
ovs-vsctl add-port ovsbr0 dpdk-11 \
    -- set interface dpdk-11 type=dpdk ofport_request=11 options:dpdk-devargs=0000:05:00.1

# Two rx queues per port
ovs-vsctl --timeout 10 set Interface dpdk-10 options:n_rxq=2
ovs-vsctl --timeout 10 set Interface dpdk-11 options:n_rxq=2

# Simple port-to-port forwarding between the two dpdk ports
ovs-ofctl del-flows ovsbr0
ovs-ofctl add-flow ovsbr0 in_port=10,actions=output:11
ovs-ofctl add-flow ovsbr0 in_port=11,actions=output:10
ovs-ofctl dump-flows ovsbr0

Comment 6 Ajit Khaparde 2018-06-13 15:39:56 UTC
And this is a VF and not a PF. Right?

Comment 7 Davide Caratti 2018-06-13 15:47:03 UTC
(In reply to Ajit Khaparde from comment #6)
> And this is a VF and not a PF. Right?

It does not look like a VF, see comment #3. But I don't have FW 20.x, can you please check if this happens with version 20.8.x ?

thanks!
-- 
davide

Comment 8 Davide Caratti 2018-06-13 15:51:23 UTC
(In reply to Davide Caratti from comment #7)
> It does not look like a VF, see comment #3. But I don't have FW 20.x, can
> you please check if this happens with version 20.8.x ?

scratch my question. I was assuming that the fault was reproducible on old FWs, not new FWs, but now I read correctly:

FW 20.x -> no segfault 
FW 212.x -> segfault

Comment 9 Jean-Tsung Hsiao 2018-06-13 16:00:27 UTC
(In reply to Davide Caratti from comment #8)
> scratch my question. I was assuming that the fault was reproducible on old
> FWs, not new FWs, but now I read correctly:
> 
> FW 20.x -> no segfault 
> FW 212.x -> segfault

Correct! I am very surprised.

NOTE: I don't own the server at this moment. It's being used for other testing.

Comment 10 Ajit Khaparde 2018-06-15 00:51:24 UTC
(In reply to Jean-Tsung Hsiao from comment #9)
> Correct! I am very surprised.
> 
> NOTE: I don't own the server at this moment. It's being used for other
> testing.

When you get your server back, can you try a patch?
I am yet to see a segfault with the firmware I have on my setup.
So I am trying to get to the exact version you are using and try again.

Comment 11 Davide Caratti 2018-06-15 10:22:49 UTC
(In reply to Ajit Khaparde from comment #10)
> When you get your server back, can you try a patch?
> I am yet to see a segfault with the firmware I have on my setup.
> So I am trying to get to the exact version you are using and try again.

hello Ajit,

thanks for looking at this! I can reproduce the segfault and the reported errors on netdev90 using the latest FDP openvswitch. If you share the code, I can build/test the patch you mention in comment #10 and give you feedback; please let me know how you want to proceed.

regards,
-- 
davide

Comment 12 Jean-Tsung Hsiao 2018-06-15 18:32:06 UTC
(In reply to Ajit Khaparde from comment #10)
> When you get your server back, can you try a patch?
> I am yet to see a segfault with the firmware I have on my setup.
> So I am trying to get to the exact version you are using and try again.

Where can I get the test build?

Comment 13 Ajit Khaparde 2018-06-15 20:58:23 UTC
Created attachment 1452067 [details]
If new MTU is not greater than mbuf size don't update HW

Can you try the attached patch.
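
Roughly, the idea in the attachment is: only reprogram the HW receive-buffer placement when the frame implied by the new MTU no longer fits in the mbufs the rx queues already have; with the default 1500-byte MTU used here, that should avoid the bnxt_hwrm_vnic_plcmode_cfg call that fails in the log in the description. A simplified, self-contained sketch of that check; the sketch_* names and constants are illustrative only, the attached patch is the actual change:

#include <stdbool.h>
#include <stdint.h>

#define SKETCH_ETHER_HDR_LEN  14u
#define SKETCH_ETHER_CRC_LEN   4u
#define SKETCH_VLAN_TAG_SIZE   4u

/* Frame length implied by an MTU, allowing for two VLAN tags. */
static uint32_t sketch_frame_len(uint16_t mtu)
{
    return mtu + SKETCH_ETHER_HDR_LEN + SKETCH_ETHER_CRC_LEN +
           2u * SKETCH_VLAN_TAG_SIZE;
}

/* True only when the new frame no longer fits in the mbuf data room the
 * rx queues were created with, i.e. when the HW really must be updated. */
static bool sketch_mtu_needs_hw_update(uint16_t new_mtu,
                                       uint32_t mbuf_data_room)
{
    return sketch_frame_len(new_mtu) > mbuf_data_room;
}

If the HW update is skipped in that case, the MTU path should no longer hit the error that aborts the interface reconfiguration, so the PMD threads should not be left polling a half-configured port.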

Comment 14 Jean-Tsung Hsiao 2018-06-16 18:06:41 UTC
(In reply to Ajit Khaparde from comment #13)
> Created attachment 1452067 [details]
> If new MTU is not greater than mbuf size don't update HW
> 
> Can you try the attached patch.

Sorry, I am waiting for a test build, not a patch.

Comment 15 Davide Caratti 2018-06-18 10:10:44 UTC
(In reply to Jean-Tsung Hsiao from comment #14)

hello, I made a build applying the attached patch on top of the latest FDN and did a quick retest. There are still some ERR messages in the PMD log:

2018-06-18T10:03:14.528Z|00466|dpdk|INFO|PMD: Force Link Down
2018-06-18T10:03:14.529Z|00467|dpdk|ERR|PMD: bnxt_hwrm_port_clr_stats error 65535:0:00000000:0000
2018-06-18T10:03:14.530Z|00468|dpdk|ERR|PMD: bnxt_hwrm_vnic_tpa_cfg error 2:0:00000000:01f2
2018-06-18T10:03:14.550Z|00469|dpdk|INFO|PMD: New MTU is 1500
2018-06-18T10:03:14.571Z|00470|dpdk|ERR|PMD: bnxt_hwrm_vnic_tpa_cfg error 2:0:00000000:01f2
2018-06-18T10:03:14.576Z|00471|dpdk|ERR|PMD: bnxt_hwrm_vnic_tpa_cfg error 2:0:00000000:01f2
2018-06-18T10:03:14.580Z|00472|dpdk|INFO|PMD: bnxt_init_chip(): intr_vector = 2
2018-06-18T10:03:14.589Z|00473|dpdk|INFO|PMD: Port 1 Link Down

but the segfault does not seem to happen anymore. @Jean, can you confirm?

Comment 17 Ajit Khaparde 2018-06-18 13:47:58 UTC
Thanks for the update, Davide.
We have a firmware fix and a PMD change for bnxt_hwrm_port_clr_stats.
I will take a look at the bnxt_hwrm_vnic_tpa_cfg error.

Comment 18 Jean-Tsung Hsiao 2018-06-18 14:20:30 UTC
(In reply to Davide Caratti from comment #15)
> but the segfault does not seem to happen anymore. @Jean, can you confirm?

Hi Davide,
Yes, I got the same result as you did.
Thanks for the test build.
Jean

Comment 22 Ajit Khaparde 2018-06-21 21:21:32 UTC
(In reply to Davide Caratti from comment #20)
> https://mails.dpdk.org/archives/dev/2018-June/104698.html

Thanks for updating, Davide.
I was planning to update the bug once the patch was applied.
But this will work as well.

Comment 24 Jean-Tsung Hsiao 2018-07-25 02:58:03 UTC
Waiting for netqe22 to become available to verify the fix.

Comment 25 Jean-Tsung Hsiao 2018-07-25 14:04:28 UTC
The fix has been verified using OVS 2.9.0-55.

Comment 26 Timothy Redaelli 2018-08-10 13:45:33 UTC
The openvswitch component is delivered through the fast datapath channel, so it is not documented in the release notes.

Comment 28 errata-xmlrpc 2018-08-15 13:53:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2432