Bug 1376217

Summary: [fdBeta] OVS daemon crashed when guests running pktgen over OVS-dpdk bond
Product: Red Hat Enterprise Linux 7 Reporter: Jean-Tsung Hsiao <jhsiao>
Component: openvswitch Assignee: Kevin Traynor <ktraynor>
Status: CLOSED CURRENTRELEASE QA Contact: Jean-Tsung Hsiao <jhsiao>
Severity: high Docs Contact:
Priority: high    
Version: 7.3 CC: atragler, fleitner, jhsiao, ktraynor, kzhang, rcain
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openvswitch-2.5.0-20.git20160727.el7fdb Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1397196 1397197 Environment:
Last Closed: 2017-01-12 15:54:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1397196, 1397197    
Attachments:
Description                                               Flags
gdb back trace                                            none
core dump gzip file                                       none
Proposed fix RPM                                          none
Patch to stop incorrect calculation of free descriptors   none
Patch to add checks for invalid memory conversion result  none

Description Jean-Tsung Hsiao 2016-09-14 21:52:21 UTC
Created attachment 1200985 [details]
gdb back trace

Description of problem: OVS daemon crashed when guests running pktgen over OVS-dpdk bond

When pktgen runs with -m "[2:3].0", it can deliver more than 1 Mpps to the other side with the rate for port 0 set to 10%. But with -m "[1/2:3]", the OVS daemon dumps core consistently under the same traffic.
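
As a reading aid for the -m argument: in "[1/2:3].0" the lcores before the colon handle RX, the lcore after it handles TX, and the number after the dot is the port. This matches the "lcores: RX( 1 2 )TX( 3 ) ports: RX( 0 )TX( 0 )" line that pktgen prints in the logs further down. A minimal sketch of that decomposition follows; parse_map is a hypothetical helper, not part of pktgen, and it only handles this simple form.

```shell
#!/bin/sh
# parse_map is a hypothetical helper, not part of pktgen. It only handles
# the simple "[RX:TX].PORT" form with "/"-separated lcore lists.
parse_map() {
    map=$1
    port=${map##*.}      # text after the last "." is the port number
    cores=${map%.*}      # e.g. "[1/2:3]"
    cores=${cores#\[}    # strip the leading "["
    cores=${cores%\]}    # strip the trailing "]"
    rx=${cores%%:*}      # before ":" are the RX lcores
    tx=${cores#*:}       # after ":" are the TX lcores
    echo "port=$port rx=$(echo "$rx" | tr '/' ' ') tx=$(echo "$tx" | tr '/' ' ')"
}

parse_map "[1/2:3].0"    # prints: port=0 rx=1 2 tx=3
parse_map "[2:3].0"      # prints: port=0 rx=2 tx=3
```

The first mapping asks for two RX queues and one TX queue on port 0, which is the unsymmetrical case discussed later in this bug; the second asks for one of each.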


Version-Release number of selected component (if applicable):

[root@netqe10 /]# rpm -qa | grep openvswitch
kernel-kernel-networking-openvswitch-perf_check-1.0-51.noarch
openvswitch-2.5.0-10.git20160727.el7fdb.x86_64
openvswitch-debuginfo-2.5.0-10.git20160727.el7fdb.x86_64
openvswitch-devel-2.5.0-10.git20160727.el7fdb.x86_64
[root@netqe10 /]# rpm -qa | grep dpdk
dpdk-16.04-4.el7fdb.x86_64
dpdk-tools-16.04-4.el7fdb.x86_64
dpdk-debuginfo-16.04-4.el7fdb.x86_64
[root@netqe10 /]# uname -a
Linux netqe10.knqe.lab.eng.bos.redhat.com 3.10.0-505.el7.x86_64 #1 SMP Tue Sep 6 11:05:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux


How reproducible: reproducible


Steps to Reproduce:
1. Set up an OVS-dpdk bond between two hosts, each having a vhostuser guest.
2. Run pktgen between the guests.
3. See the description above for RX/TX core assignments.

Actual results: ovs-vswitchd dumps core when two cores are assigned to RX.


Expected results: Should not crash


Additional info:

Comment 1 Jean-Tsung Hsiao 2016-09-14 21:53:49 UTC
Created attachment 1200986 [details]
core dump gzip file

Comment 4 Kevin Traynor 2016-09-29 10:18:03 UTC
Hi Jean-Tsung,

A few questions on this:

* Pktgen running in the guest: what version of pktgen and dpdk are you using here?                                  

* Can you list your complete pktgen command line for pass/fail cases?

* What QEMU version are you using? 

* When you say there is an "OVS-DPDK bond between two hosts" can you clarify? 

Thanks,
Kevin.

Comment 5 Kevin Traynor 2016-09-29 16:55:24 UTC
Also, what do you have set for 

OVSDB: other_config:n-dpdk-rxqs and corresponding qemu params: queues and vectors?

thanks,
Kevin.
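
For reference, the settings asked about above can be inspected roughly as follows. This is a sketch under assumptions: the ovs-vsctl call reads the key exactly as named in the question, and the vectors = 2 * queues + 2 relation is the commonly documented sizing rule for virtio-net multiqueue, not something stated in this report. The function names are hypothetical.

```shell
#!/bin/sh
# Read the host-side RX queue setting named in the comment above.
# Defined but not invoked here, because it needs a running ovsdb-server.
show_n_dpdk_rxqs() {
    ovs-vsctl get Open_vSwitch . other_config:n-dpdk-rxqs
}

# Commonly documented sizing rule for virtio-net multiqueue:
# one vector per RX queue, one per TX queue, plus two extra.
vectors_for_queues() {
    echo $(( 2 * $1 + 2 ))
}

# Print the queue-related QEMU parameters for a given queue count.
qemu_mq_params() {
    q=$1
    echo "queues=$q mq=on vectors=$(vectors_for_queues "$q")"
}

qemu_mq_params 4    # prints: queues=4 mq=on vectors=10
```

On the host, show_n_dpdk_rxqs would be run as root; ovs-vsctl reports an error if the key has never been set.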

Comment 6 Kevin Traynor 2016-09-30 17:05:02 UTC
I've crashed vswitchd by using pktgen in the guest in a manner similar to the one described (pktgen with multiple cores/queues for Rx). The crash does not occur when using OVS 2.6/DPDK 16.07 on the host.

Comment 7 Kevin Traynor 2016-10-07 13:21:22 UTC
Created attachment 1208153 [details]
Proposed fix RPM

RPM with proposed fix for this issue.

Comment 8 Kevin Traynor 2016-10-07 13:27:16 UTC
Hi Jean-Tsung,

Can you please test with the attached openvswitch rpm and let me know if it resolves this issue with your setup?

Please consider it a draft; e.g. the change log etc. still needs to be updated.

If you would prefer that I send the proposed fix in another format (source, patch, etc.), please let me know.

Kevin.

Comment 9 Jean-Tsung Hsiao 2016-10-07 18:43:57 UTC
Hi Kevin,

Not sure why pktgen failed to start with -m "[1/2:3].0".

Can you take a look at the following log?

Thanks!

Jean

-------------------------------------------------------------

 Copyright (c) <2010-2016>, Intel Corporation. All rights reserved. Powered by Intel® DPDK
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 0 on socket 0
EAL: Detected lcore 2 as core 0 on socket 0
EAL: Detected lcore 3 as core 0 on socket 0
EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 4 lcore(s)
EAL: Probing VFIO support...
EAL: Module /sys/module/vfio_pci not found! error 2 (No such file or directory)
EAL: VFIO modules not loaded, skipping VFIO support...
EAL: Setting up physically contiguous memory...
EAL: Ask a virtual area of 0x1000000 bytes
EAL: Virtual area found at 0x7fa46ae00000 (size = 0x1000000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fa46aa00000 (size = 0x200000)
EAL: Ask a virtual area of 0x48400000 bytes
EAL: Virtual area found at 0x7fa422400000 (size = 0x48400000)
EAL: Ask a virtual area of 0x36400000 bytes
EAL: Virtual area found at 0x7fa3ebe00000 (size = 0x36400000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fa3eba00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fa3eb600000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fa3eb200000 (size = 0x200000)
EAL: Requesting 512 pages of size 2MB from socket 0
EAL: TSC frequency is ~3399994 KHz
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ixgbe.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_e1000.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_null.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_fm10k.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vhost.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_enic.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_pcap.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_i40e.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_cxgbe.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vmxnet3_uio.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_af_packet.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ring.so.2
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_bond.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_virtio.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ena.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_bnx2x.so.1
EAL: Master lcore 0 is ready (tid=729f38c0;cpuset=[0])
EAL: lcore 2 is ready (tid=699fd700;cpuset=[2])
EAL: lcore 1 is ready (tid=6a1fe700;cpuset=[1])
EAL: lcore 3 is ready (tid=691fc700;cpuset=[3])
EAL: PCI device 0000:00:03.0 on NUMA socket -1
EAL:   probe driver: 1af4:1000 rte_virtio_pmd
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI Port IO found start=0xc040
[1/2:3].0        lcores: RX( 1 2 )TX( 3 ) ports: RX( 0 )TX( 0 )

   Copyright (c) <2010-2016>, Intel Corporation. All rights reserved.
   Pktgen created by: Keith Wiles -- >>> Powered by Intel® DPDK <<<

Lua 5.3.2  Copyright (C) 1994-2015 Lua.org, PUC-Rio
>>> Packet Burst 32, RX Desc 512, TX Desc 512, mbufs/port 4096, mbuf cache 512

=== port to lcore mapping table (# lcores 4) ===
   lcore:     0     1     2     3 
port   0:  D: T  1: 0  1: 0  0: 1 =  2: 1
Total   :  0: 0  1: 0  1: 0  0: 1
    Display and Timer on lcore 0, rx:tx counts per port/lcore

Configuring 1 ports, MBUF Size 1920, MBUF Cache Size 512
Lcore:
    1, RX-Only
                RX( 1): ( 0: 0) 
    2, RX-Only
                RX( 1): ( 0: 1) 
    3, TX-Only
                TX( 1): ( 0: 0) 

Port :
    0, nb_lcores  3, private 0x704e10, lcores:  1  2  3 



** Dev Info (rte_virtio_pmd:0) **
   max_vfs        :   0 min_rx_bufsize    :  64 max_rx_pktlen :  9728 max_rx_queues         :   1 max_tx_queues:   1
   max_mac_addrs  :  64 max_hash_mac_addrs:   0 max_vmdq_pools:     0
   rx_offload_capa:   0 tx_offload_capa   :   0 reta_size     :     0 flow_type_rss_offloads:0000000000000000
   vmdq_queue_base:   0 vmdq_queue_num    :   0 vmdq_pool_base:     0
** RX Conf **
   pthreash       :   0 hthresh          :   0 wthresh        :     0
   Free Thresh    :   0 Drop Enable      :   0 Deferred Start :     0
** TX Conf **
   pthreash       :   0 hthresh          :   0 wthresh        :     0
   Free Thresh    :   0 RS Thresh        :   0 Deferred Start :     0 TXQ Flags:00000f00

!PANIC!: Cannot configure device: port=0, Num queues 2,1 (2)Invalid argument
PANIC in pktgen_config_ports():
Cannot configure device: port=0, Num queues 2,1 (2)Invalid argument6: [pktgen() [0x4129c5]]
5: [/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fa46c586af5]]
4: [pktgen(main+0x4d9) [0x4123f9]]
3: [pktgen(pktgen_ipv6_ctor+0) [0x42f700]]
2: [/lib64/librte_eal.so.2(__rte_panic+0xd0) [0x7fa46f2f9940]]
1: [/lib64/librte_eal.so.2(rte_dump_stack+0x2d) [0x7fa46f301e4d]]
Aborted (core dumped)
[root@localhost ~]#

Comment 10 Kevin Traynor 2016-10-10 10:41:30 UTC
EAL: PCI device 0000:00:03.0 on NUMA socket -1
EAL:   probe driver: 1af4:1000 rte_virtio_pmd
EAL:   Not managed by a supported kernel driver, skipped

This indicates that the virtio device is not bound to dpdk. You can use:

modprobe uio
modprobe uio-pci-generic
dpdk_nic_bind -b uio_pci_generic 00:03.0

thanks,
Kevin.
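
Before and after binding, the driver that currently owns the device can be read from sysfs. A minimal sketch, assuming the standard Linux sysfs layout; pci_driver is a hypothetical helper, and its optional second argument (an alternate sysfs root) exists only so the function can be exercised without real hardware.

```shell
#!/bin/sh
# pci_driver is a hypothetical helper: print the kernel driver bound to a
# PCI device, or "none" if the device has no driver (or does not exist).
pci_driver() {
    dev=$1
    root=${2:-/sys}
    link="$root/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Device address taken from the EAL log above.
pci_driver 0000:00:03.0
```

If the bind succeeded, this should print uio_pci_generic for the virtio device; dpdk_nic_bind --status gives a similar per-device overview.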

Comment 11 Jean-Tsung Hsiao 2016-10-11 00:57:11 UTC
(In reply to Kevin Traynor from comment #10)
> EAL: PCI device 0000:00:03.0 on NUMA socket -1
> EAL:   probe driver: 1af4:1000 rte_virtio_pmd
> EAL:   Not managed by a supported kernel driver, skipped
> 
> This indicates that the virtio device is not bound to dpdk. You can use:
> 
> modprobe uio
> modprobe uio-pci-generic
> dpdk_nic_bind -b uio_pci_generic 00:03.0
> 
> thanks,
> Kevin.

But it worked with -m "[2:3]" as it used to.

Anyway, I'll try your suggestion.

Thanks!

Jean

Comment 12 Jean-Tsung Hsiao 2016-10-11 01:20:15 UTC
Following your advice I am still seeing the same failure:

 modprobe uio
 modprobe uio-pci-generic
 dpdk_nic_bind -b uio_pci_generic 00:03.0
 . pktgen.sh

Copyright (c) <2010-2016>, Intel Corporation. All rights reserved. Powered by Intel® DPDK
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 0 on socket 0
EAL: Detected lcore 2 as core 0 on socket 0
EAL: Detected lcore 3 as core 0 on socket 0
EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 4 lcore(s)
EAL: Probing VFIO support...
EAL: Module /sys/module/vfio_pci not found! error 2 (No such file or directory)
EAL: VFIO modules not loaded, skipping VFIO support...
EAL: Setting up physically contiguous memory...
EAL: Ask a virtual area of 0x1000000 bytes
EAL: Virtual area found at 0x7f4582e00000 (size = 0x1000000)
EAL: Ask a virtual area of 0x48800000 bytes
EAL: Virtual area found at 0x7f453a400000 (size = 0x48800000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7f453a000000 (size = 0x200000)
EAL: Ask a virtual area of 0x36000000 bytes
EAL: Virtual area found at 0x7f4503e00000 (size = 0x36000000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7f4503a00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7f4503600000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7f4503200000 (size = 0x200000)
EAL: Requesting 512 pages of size 2MB from socket 0
EAL: TSC frequency is ~3399998 KHz
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ixgbe.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_e1000.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_null.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_fm10k.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vhost.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_enic.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_pcap.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_i40e.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_cxgbe.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vmxnet3_uio.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_af_packet.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ring.so.2
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_bond.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_virtio.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ena.so.1
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_bnx2x.so.1
EAL: Master lcore 0 is ready (tid=8a9808c0;cpuset=[0])
EAL: lcore 1 is ready (tid=825fe700;cpuset=[1])
EAL: lcore 2 is ready (tid=81dfd700;cpuset=[2])
EAL: lcore 3 is ready (tid=815fc700;cpuset=[3])
EAL: PCI device 0000:00:03.0 on NUMA socket -1
EAL:   probe driver: 1af4:1000 rte_virtio_pmd
EAL:   PCI memory mapped at 0x7f4583e00000
EAL: PCI Port IO found start=0xc040
[1/2:3].0        lcores: RX( 1 2 )TX( 3 ) ports: RX( 0 )TX( 0 )

   Copyright (c) <2010-2016>, Intel Corporation. All rights reserved.
   Pktgen created by: Keith Wiles -- >>> Powered by Intel® DPDK <<<

Lua 5.3.2  Copyright (C) 1994-2015 Lua.org, PUC-Rio
>>> Packet Burst 32, RX Desc 512, TX Desc 512, mbufs/port 4096, mbuf cache 512

=== port to lcore mapping table (# lcores 4) ===
   lcore:     0     1     2     3 
port   0:  D: T  1: 0  1: 0  0: 1 =  2: 1
Total   :  0: 0  1: 0  1: 0  0: 1
    Display and Timer on lcore 0, rx:tx counts per port/lcore

Configuring 1 ports, MBUF Size 1920, MBUF Cache Size 512
Lcore:
    1, RX-Only
                RX( 1): ( 0: 0) 
    2, RX-Only
                RX( 1): ( 0: 1) 
    3, TX-Only
                TX( 1): ( 0: 0) 

Port :
    0, nb_lcores  3, private 0x704e10, lcores:  1  2  3 



** Dev Info (rte_virtio_pmd:0) **
   max_vfs        :   0 min_rx_bufsize    :  64 max_rx_pktlen :  9728 max_rx_queues         :   1 max_tx_queues:   1
   max_mac_addrs  :  64 max_hash_mac_addrs:   0 max_vmdq_pools:     0
   rx_offload_capa:   0 tx_offload_capa   :   0 reta_size     :     0 flow_type_rss_offloads:0000000000000000
   vmdq_queue_base:   0 vmdq_queue_num    :   0 vmdq_pool_base:     0
** RX Conf **
   pthreash       :   0 hthresh          :   0 wthresh        :     0
   Free Thresh    :   0 Drop Enable      :   0 Deferred Start :     0
** TX Conf **
   pthreash       :   0 hthresh          :   0 wthresh        :     0
   Free Thresh    :   0 RS Thresh        :   0 Deferred Start :     0 TXQ Flags:00000f00

!PANIC!: Cannot configure device: port=0, Num queues 2,1 (2)Invalid argument
PANIC in pktgen_config_ports():
Cannot configure device: port=0, Num queues 2,1 (2)Invalid argument6: [pktgen() [0x4129c5]]
5: [/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f4584513af5]]
4: [pktgen(main+0x4d9) [0x4123f9]]
3: [pktgen(pktgen_ipv6_ctor+0) [0x42f700]]
2: [/lib64/librte_eal.so.2(__rte_panic+0xd0) [0x7f4587286940]]
1: [/lib64/librte_eal.so.2(rte_dump_stack+0x2d) [0x7f458728ee4d]]
Aborted (core dumped)
[root@localhost ~]#

Comment 13 Jean-Tsung Hsiao 2016-10-11 03:23:56 UTC
(In reply to Jean-Tsung Hsiao from comment #12)
> Following your advice I am still seeing the same failure:
>

Hi Kevin,

I used multiple Q to bypass this pktgen failure issue --- using 4Q.

Then, I was able to reproduce the ovs-vswitchd crash issue using ovs-2.5.0-14.

Next, with the ovs-2.5.0-15 pkg you provided, pktgen was running successfully without crashing the daemon.

So, I'll wait for the formal ovs-2.5.0-15 release to re-run the test.

Thanks!

Jean
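
One plausible reading of the pktgen PANIC in comments 9 and 12: the message says "Num queues 2,1" while the Dev Info block reports max_rx_queues 1 and max_tx_queues 1, so the mapping asked for more RX queues than the single-queue virtio device exposed. That would also be consistent with multiqueue working around it. The check can be sketched as below; queues_fit is a hypothetical helper, and the 4,4 maximum in the second call is an assumption standing in for the 4Q setup.

```shell
#!/bin/sh
# queues_fit is a hypothetical helper: compare requested RX/TX queue
# counts against the maximums a device reports.
queues_fit() {
    req_rx=$1; req_tx=$2; max_rx=$3; max_tx=$4
    if [ "$req_rx" -gt "$max_rx" ] || [ "$req_tx" -gt "$max_tx" ]; then
        echo "reject: requested $req_rx,$req_tx exceeds max $max_rx,$max_tx"
        return 1
    fi
    echo "ok: requested $req_rx,$req_tx within max $max_rx,$max_tx"
}

queues_fit 2 1 1 1 || true   # values from the failing log: 2,1 against 1,1
queues_fit 2 1 4 4           # assumed maximums for a 4-queue device
```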

Comment 14 Kevin Traynor 2016-10-11 10:16:55 UTC
(In reply to Jean-Tsung Hsiao from comment #13)
> (In reply to Jean-Tsung Hsiao from comment #12)
> > Following your advice I am still seeing the same failure:
> >
> 
> Hi Kevin,
> 
> I used multiple Q to bypass this pktgen failure issue --- using 4Q.
> 
> Then, I was able to reproduce the ovs-vswitchd crash issue using
> ovs-2.5.0-14.

I don't see an issue with ovs-2.5.0-14 with 4Q's (2Rx and 2Tx) in my setup. The issue I've reproduced, and for which I added a proposed fix in ovs-2.5.0-15, occurs when the Q's are unsymmetrical (e.g. 2 Rx and 1 Tx). If you are seeing other new issues in ovs-2.5.0-14, please report them separately so the issues don't get mixed together.

> 
> Next, with the ovs-2.5.0-15 pkg you provided, pktgen was running successfully
> without crashing the daemon.

Good! so can you confirm that when using ovs-2.5.0-15 the daemon did not crash with unsymmetrical Q's (e.g. [1/2:3].0 => 2 Rx and 1 Tx)?

> 
> So, I'll wait for the formal ovs-2.5.0-15 release to re-run the test.
> 
> Thanks!
> 
> Jean

Comment 17 Jean-Tsung Hsiao 2016-10-11 19:56:43 UTC
(In reply to Kevin Traynor from comment #14)
> (In reply to Jean-Tsung Hsiao from comment #13)
> > (In reply to Jean-Tsung Hsiao from comment #12)
> > > Following your advice I am still seeing the same failure:
> > >
> > 
> > Hi Kevin,
> > 
> > I used multiple Q to bypass this pktgen failure issue --- using 4Q.
> > 
> > Then, I was able to reproduce the ovs-vswitchd crash issue using
> > ovs-2.5.0-14.
> 
> I don't see an issue with ovs-2.5.0-14 with 4Q's (2Rx and 2Tx) in my setup.
> The issue I've reproduced and added a proposed fix in ovs-2.5.0-15 is for
> when the Q's are unsymmetrical (e.g. 2 Rx and 1 Tx). If you are seeing other
> new issues in ovs-2.5.0-14 please report separately so we won't get issue
> mixed together.
> 
> > 
> > Next, with the ovs-2.5.0-15 pkg you provided, pktgen was running successfully
> > without crashing the daemon.
> 
> Good! so can you confirm that when using ovs-2.5.0-15 the daemon did not
> crash with unsymmetrical Q's (e.g. [1/2:3].0 => 2 Rx and 1 Tx)?

That's what I did. I was using [1/2:3].0 for both sides with only one side sending the traffic --- without crashing the daemon on the receiving host.

> 
> > 
> > So, I'll wait for the formal ovs-2.5.0-15 release to re-run the test.
> > 
> > Thanks!
> > 
> > Jean

Comment 18 Kevin Traynor 2016-10-19 20:53:18 UTC
Created attachment 1212269 [details]
Patch to stop incorrect calculation of free descriptors

This is the patch that Jean tested and validated as resolving this issue.

Comment 19 Kevin Traynor 2016-10-19 20:54:23 UTC
Created attachment 1212270 [details]
Patch to add checks for invalid memory conversion result

Comment 20 Flavio Leitner 2016-11-21 21:01:49 UTC
The first patch is enough to fix this bugzilla, and that is what will be applied here for fdBeta.

I will clone another bug to apply the same patch to fdProd.

Finally, I will clone another bug to handle the second patch for fdBeta which requires more work.

Doing so will allow us to push the fix into the next fdProd batch.
Thanks,
fbl

Comment 23 Jean-Tsung Hsiao 2016-11-23 03:04:33 UTC
There were no OVS daemon crashes observed while running the reproducer with openvswitch-2.5.0-20.git20160727.el7fdb. So, the fix has been verified.