Bug 1506700 - Intel XL710 and OVS-DPDK bond have a fixed 0.01% frame loss
Summary: Intel XL710 and OVS-DPDK bond have a fixed 0.01% frame loss
Keywords:
Status: CLOSED DUPLICATE of bug 1559612
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch
Version: 7.5
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: pre-dev-freeze
Target Release: ---
Assignee: Eelco Chaudron
QA Contact: Hekai Wang
URL:
Whiteboard:
Duplicates: 1522625
Depends On: 1551761
Blocks: 1339866
 
Reported: 2017-10-26 15:18 UTC by Federico Iezzi
Modified: 2021-09-09 12:45 UTC
CC List: 41 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1551761 1553786
Environment:
Last Closed: 2018-03-23 08:04:49 UTC
Target Upstream Version:
Embargoed:


Attachments
pcap file (329.81 KB, application/octet-stream)
2017-12-22 08:57 UTC, Federico Iezzi
no flags
wireshark I/O analyzer (69.60 KB, image/png)
2017-12-22 08:58 UTC, Federico Iezzi
no flags
script for generating random unicast traffic (1.35 KB, application/x-shellscript)
2017-12-22 08:59 UTC, Federico Iezzi
no flags
Picture of recreation lab at RedHat (529.75 KB, image/png)
2017-12-22 10:56 UTC, Eelco Chaudron
no flags

Description Federico Iezzi 2017-10-26 15:18:01 UTC
Description of problem:
I have a number of environments with both OSP10z5 and OSP11z2, both with the latest updates, and I'm using OVS 2.6.1-16 (due to a previous customer case for jumbo frames).
All of the compute nodes have both Intel X520 10Gbps and Intel XL710 40Gbps NICs, connected at 10Gbps and 40Gbps respectively.

When two XL710 NICs are bonded with any OVS-DPDK bonding technology (static A/B, LACP A/P, LACP A/A; SLB not tested, though), I've observed a fixed 0.01% frame loss on the DPDK interfaces.
On the other hand, in the same environment, with the same configurations, using the Intel X520 instead of the XL710, not a single frame is lost over a 15-hour test with the same bonding.

The frame drop happens at the DPDK NIC RX level, which would suggest not enough PMDs/isolation/tuning, but that's not the case given the X520 result as well as the environment setup.
Even stranger, when using one XL710 port without any bonding, the frame loss is zero.

I run "perf record -F 99 -g -C 2,30 -- sleep 2h" on the PMD threads and there were not even a single interrupt or something is not pmd.

The traffic drop happens in bursts: the traffic can run for a few seconds/minutes, then a burst of lost packets occurs, and then the traffic is stable again for a few seconds/minutes before another burst of lost packets.

You will see in the following output that the X520 and XL710 are on different NUMA nodes.
During the tests, the VNF was local to the DPDK PHY NUMA node. 
We even moved the XL710 to NUMA node 1, but the same issue was experienced.

The guest is running a proprietary Packet Forwarding VNF using DPDK.
The host is RHEL 7.4. 

It's very important to underline:
 - No frame drop for over 15 hours on any bonding configuration with X520
 - No frame drop with a single XL710
 - The frame loss happens with any bonding config, with dedicated or non-dedicated PMDs, and with 1 or more queues.

The traffic flow is 2x PHY-VM-PHY.

The traffic generated is 4Mpps, 2Mpps, and 1Mpps, each with a 100-byte (L2) frame size, as well as with 512, 1024, 1500, and 2048 bytes.
The Ethernet payload is generic IP packets.

The bonded XL710 drops frames even at 100Kpps with 100-byte frames...

The host configuration and details follow.

##########################################
# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping:              1
CPU MHz:               2400.185
BogoMIPS:              4800.37
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

##########################################
# cat /etc/tuned/cpu-partitioning-variables.conf | grep isolated
isolated_cores=2,4,6,8,10,12,14,16,18,20,22,24,26,3,5,7,9,11,13,15,17,19,21,23,25,27,30,32,34,36,38,40,42,44,46,48,50,52,54,31,33,35,37,39,41,43,45,47,49,51,53,55

##########################################
# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.1.1.el7.x86_64 root=UUID=5e3423c9-0507-4f85-96db-022b4a18ef69 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=64 iommu=pt intel_iommu=on isolcpus=2,4,6,8,10,12,14,16,18,20,22,24,26,3,5,7,9,11,13,15,17,19,21,23,25,27,30,32,34,36,38,40,42,44,46,48,50,52,54,31,33,35,37,39,41,43,45,47,49,51,53,55 nohz=on nohz_full=2,4,6,8,10,12,14,16,18,20,22,24,26,3,5,7,9,11,13,15,17,19,21,23,25,27,30,32,34,36,38,40,42,44,46,48,50,52,54,31,33,35,37,39,41,43,45,47,49,51,53,55 rcu_nocbs=2,4,6,8,10,12,14,16,18,20,22,24,26,3,5,7,9,11,13,15,17,19,21,23,25,27,30,32,34,36,38,40,42,44,46,48,50,52,54,31,33,35,37,39,41,43,45,47,49,51,53,55 tuned.non_isolcpus=30000003 intel_pstate=disable nosoftlockup

##########################################
The issue has been experienced w/ and w/o SMT (aka HT).
The VNF is RT and SMT is just enabled in this specific case, but usually it's not.
##########################################

##########################################
# lspci | grep Eth
01:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
03:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
81:00.0 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
81:00.1 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
82:00.0 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
82:00.1 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)

##########################################
# tail /sys/bus/pci/devices/0000\:82\:00.*/numa_node 
==> /sys/bus/pci/devices/0000:82:00.0/numa_node <==
1

==> /sys/bus/pci/devices/0000:82:00.1/numa_node <==
1

##########################################
# tail /sys/bus/pci/devices/0000\:03\:00.*/numa_node 
==> /sys/bus/pci/devices/0000:03:00.0/numa_node <==
0

==> /sys/bus/pci/devices/0000:03:00.1/numa_node <==
0

##########################################
As I wrote above, the X520 and XL710 have even been exchanged, but the same issue has been experienced.
##########################################

##########################################
# driverctl list-overrides
0000:03:00.0 vfio-pci
0000:03:00.1 vfio-pci
0000:82:00.0 vfio-pci
0000:82:00.1 vfio-pci

##########################################
# lspci -vvv -s 0000:03:00.0
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter XL710-Q2
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 36
        NUMA node: 0
        Region 0: Memory at 91000000 (64-bit, prefetchable) [size=16M]
        Region 3: Memory at 92008000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at 92100000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00001000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <2us, L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [e0] Vital Product Data
                Product Name: XL710 40GbE Controller
                Read-only fields:
                        [V0] Vendor specific: FFV18.0.16
                        [PN] Part number: KF46X
                        [MN] Manufacture ID: 31 30 32 38
                        [V1] Vendor specific: DSV1028VPDR.VER2.0
                        [V3] Vendor specific: DTINIC
                        [V4] Vendor specific: DCM1001FFFFFF2101FFFFFF1202FFFFFF2302FFFFFF1403FFFFFF2503FFFFFF1604FFFFFF2704FFFFFF1805FFFFFF2905FFFFFF1A06FFFFFF2B06FFFFFF1C07FFFFFF2D07FFFFFF1E08FFFFFF2F08FFFFFF
                        [V5] Vendor specific: NPY2
                        [V6] Vendor specific: PMTA
                        [V7] Vendor specific: NMVIntel Corp
                        [V8] Vendor specific: L1D0
                        [RV] Reserved: checksum good, 1 byte(s) reserved
                Read/write fields:
                        [Y1] System specific: CCF1
                End
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
        Capabilities: [140 v1] Device Serial Number 80-de-21-ff-ff-fe-fd-3c
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                No steering table available
        Capabilities: [1b0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [1d0 v1] #19
        Kernel driver in use: vfio-pci
        Kernel modules: i40e

##########################################
# lspci -vvv -s 0000:82:00.0
82:00.0 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
        Subsystem: Intel Corporation 10GbE 2P X520 Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 187
        NUMA node: 1
        Region 0: Memory at c8100000 (64-bit, non-prefetchable) [size=1M]
        Region 2: I/O ports at 8020 [size=32]
        Region 4: Memory at c8204000 (64-bit, non-prefetchable) [size=16K]
        Expansion ROM at c8280000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [e0] Vital Product Data
                Product Name: X520 10GbE Controller
                Read-only fields:
                        [PN] Part number: G73129
                        [MN] Manufacture ID: 31 30 32 38
                        [V0] Vendor specific: FFV18.0.16
                        [V1] Vendor specific: DSV1028VPDR.VER1.0
                        [V3] Vendor specific: DTINIC
                        [V4] Vendor specific: DCM10010081D521010081D5
                        [V5] Vendor specific: NPY2
                        [V6] Vendor specific: PMT12345678
                        [V7] Vendor specific: NMVIntel Corp
                        [RV] Reserved: checksum good, 3 byte(s) reserved
                End
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
        Capabilities: [140 v1] Device Serial Number a0-36-9f-ff-ff-d8-5e-78
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 128, stride: 2, Device ID: 10ed
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 000003c000400000 (64-bit, prefetchable)
                Region 3: Memory at 000003c000500000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: vfio-pci
        Kernel modules: ixgbe

##########################################
# ovs-vsctl get open_vswitch . other_config
{dpdk-extra="-n 4", dpdk-init="true", dpdk-lcore-mask="30000003", dpdk-socket-mem="4096,4096", pmd-cpu-mask="fc00000fc"}
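
For reference, the pmd-cpu-mask above selects host CPUs 2-7 and 30-35, which matches the pmd-rxq-show output further down. A small shell helper to decode such a mask, added here for illustration only (not part of the original report):

    # decode an OVS pmd-cpu-mask into CPU ids
    mask=0xfc00000fc
    for cpu in $(seq 0 55); do
        (( (mask >> cpu) & 1 )) && printf '%d ' "$cpu"
    done
    echo    # prints: 2 3 4 5 6 7 30 31 32 33 34 35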

##########################################
The issue has been expirenced w/ and w/o SMT (aka HT).
The VNF is RT and SMT is just enabled in this specific case, but usually it's not.
##########################################

##########################################
# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 2:
        isolated : true
        port: dpdk0     queue-id: 0
pmd thread numa_id 0 core_id 30:
        isolated : true
        port: dpdk0     queue-id: 1
pmd thread numa_id 0 core_id 34:
        isolated : false
        port: vhu4d717b90-39    queue-id: 0
        port: vhu066565d2-14    queue-id: 0
pmd thread numa_id 0 core_id 6:
        isolated : false
        port: vhu9ba1d7e7-2b    queue-id: 0
pmd thread numa_id 0 core_id 4:
        isolated : true
        port: dpdk1     queue-id: 0
pmd thread numa_id 0 core_id 32:
        isolated : true
        port: dpdk1     queue-id: 1

##########################################
As you can see, that's the best possible scenario:
One CPU core (2 PMDs) isolated for each PHY. The PHY NIC has two queues (even with only one queue there are drops).
The VNF is spread over two PMDs (one CPU core), but again the drop is on the DPDK PHY NIC.
##########################################

##########################################
# ovs-appctl dpif-netdev/pmd-stats-show|grep -A9 -E "core_id (2|4|30|32):"
pmd thread numa_id 0 core_id 2:
        emc hits:6971154401
        megaflow hits:18787
        avg. subtable lookups per hit:1.58
        miss:375
        lost:0
        polling cycles:2726409361980 (25.74%)
        processing cycles:7865260979784 (74.26%)
        avg cycles per packet: 1519.35 (10591670341764/6971203155)
        avg processing cycles per packet: 1128.25 (7865260979784/6971203155)
--
pmd thread numa_id 0 core_id 30:
        emc hits:6971365714
        megaflow hits:10
        avg. subtable lookups per hit:1.00
        miss:404
        lost:0
        polling cycles:2785958014653 (26.32%)
        processing cycles:7799614633899 (73.68%)
        avg cycles per packet: 1518.44 (10585572648552/6971366138)
        avg processing cycles per packet: 1118.81 (7799614633899/6971366138)
--
pmd thread numa_id 0 core_id 4:
        emc hits:52213
        megaflow hits:18823
        avg. subtable lookups per hit:1.00
        miss:650
        lost:0
        polling cycles:7167842875131 (99.98%)
        processing cycles:1346919081 (0.02%)
        avg cycles per packet: 79208814.43 (7169189794212/90510)
        avg processing cycles per packet: 14881.44 (1346919081/90510)
--
pmd thread numa_id 0 core_id 32:
        emc hits:29149
        megaflow hits:3
        avg. subtable lookups per hit:1.00
        miss:238
        lost:0
        polling cycles:7157381605371 (99.98%)
        processing cycles:1257975648 (0.02%)
        avg cycles per packet: 243549130.10 (7158639581019/29393)
        avg processing cycles per packet: 42798.48 (1257975648/29393)


##########################################
# bash check-drop.sh 
PHY DPDK0 Stats
rx_65_to_127_packets=14826589744
rx_dropped=104516692
tx_65_to_127_packets=14822039076
tx_dropped=0

PHY DPDK1 Stats
rx_65_to_127_packets=120275495
rx_dropped=2510269
tx_65_to_127_packets=116879771
tx_dropped=0

VHU vhu066565d2-14 Stats
rx_65_to_127_packets=4416367190
rx_dropped=0
tx_dropped=39

VHU vhu4d717b90-39 Stats
rx_65_to_127_packets=611
rx_dropped=0
tx_dropped=33

VHU vhu9ba1d7e7-2b Stats
rx_65_to_127_packets=4416393053
rx_dropped=0
tx_dropped=11
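
The check-drop.sh script itself is not shown in the report; a minimal sketch of such a counter dump, assuming the port names above and the standard OVS "statistics" column, might look like this (illustrative only, not the actual script):

    #!/bin/bash
    # Dump rx/tx packet and drop counters for the DPDK and vhost-user ports.
    for port in dpdk0 dpdk1 vhu066565d2-14 vhu4d717b90-39 vhu9ba1d7e7-2b; do
        echo "Stats for $port"
        ovs-vsctl get interface "$port" statistics \
            | tr ',' '\n' | tr -d '{} ' \
            | grep -E 'rx_65_to_127_packets|rx_dropped|tx_dropped'
        echo
    done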

Version-Release number of selected component (if applicable):
RH-OSP10z5 and RH-OSP11z2, both using OVS 2.6.1-16 due to a jumbo frame issue with the XL710 (granted through a support exception a few weeks ago).

How reproducible:
##########################################
# X520 A/B Bond config
#!/bin/bash

ovs-vsctl --may-exist add-bond br-provider bond3 dpdk2 dpdk3 -- set interface dpdk2 type=dpdk -- set interface dpdk3 type=dpdk -- set port bond3 bond_mode="active-backup"

ovs-vsctl set Interface br-provider mtu_request=9004

ovs-vsctl set Interface dpdk2 mtu_request=9004
ovs-vsctl set interface dpdk2 options:n_rxq="2"
ovs-vsctl set interface dpdk2 other_config:pmd-rxq-affinity="0:3,1:31"

ovs-vsctl set Interface dpdk3 mtu_request=9004
ovs-vsctl set interface dpdk3 options:n_rxq="2"
ovs-vsctl set interface dpdk3 other_config:pmd-rxq-affinity="0:5,1:33"

ovs-appctl dpif-netdev/pmd-rxq-show

# XL710 A/B Bond config
#!/bin/bash

ovs-vsctl --may-exist add-bond br-provider bond2 dpdk0 dpdk1 -- set interface dpdk0 type=dpdk -- set interface dpdk1 type=dpdk -- set port bond2 bond_mode="active-backup"

ovs-vsctl set Interface br-provider mtu_request=9004

ovs-vsctl set Interface dpdk0 mtu_request=9004
ovs-vsctl set interface dpdk0 options:n_rxq="2"
ovs-vsctl set interface dpdk0 other_config:pmd-rxq-affinity="0:2,1:30"

ovs-vsctl set Interface dpdk1 mtu_request=9004
ovs-vsctl set interface dpdk1 options:n_rxq="2"
ovs-vsctl set interface dpdk1 other_config:pmd-rxq-affinity="0:4,1:32"

ovs-appctl dpif-netdev/pmd-rxq-show
##########################################

Run some traffic and you should see frame drops.

Actual results:
Frame drops even with the most simple active/backup bond with the XL710.
Not a single frame drop with any of the possible bond configs using the X520 after a 15-hour load test.
Not a single frame drop using a single XL710 port.

Expected results:
Even using the most simple active/backup bond with the XL710, I shouldn't have any traffic drop.

Additional info:

Comment 2 Franck Baudin 2017-10-26 16:38:16 UTC
The XL710 requires large queue sizes to avoid drops; this is a well-known issue. Adding Kevin to the loop to see if we can provide a hotfix with 3072-byte queue sizes in OVS-DPDK.

Comment 3 Federico Iezzi 2017-10-26 20:54:24 UTC
Hi Franck

are you talking about the following?

- https://communities.intel.com/community/tech/wired/blog/2017/01/09/intel-ethernet-x520-to-xl710-tuning-the-buffers-a-practical-guide-to-reduce-or-avoid-packet-loss-in-dpdk-applications
 - https://github.com/openvswitch/ovs/commit/b685696b8c813c9b7869eace797974b8ca69db10#diff-8d3414bfe59e75fbc75067f772ee89da

If that's the case and I understand it correctly, I would expect frame drops at some Mpps, not at just 100Kpps. The drop even happens with LACP, so at the end of the day each NIC is processing 50Kpps.

Comment 4 Kevin Traynor 2017-10-27 15:27:44 UTC
(In reply to Franck Baudin from comment #2)
> XL710 requires large queues size to avoid drops, this is a well known issue,
> adding Kevin in the loop to see if we can provide a hotfix with 3072 bytes
> queues sizes in OVS-DPDK.

If you want to test out this theory, you can just change the default define

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 8adc723..5d2b1f0 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -133,6 +133,6 @@ BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF))
 #define SOCKET0              0
 
-#define NIC_PORT_RX_Q_SIZE 2048  /* Size of Physical NIC RX Queue, Max (n+32<=4096)*/
-#define NIC_PORT_TX_Q_SIZE 2048  /* Size of Physical NIC TX Queue, Max (n+32<=4096)*/
+#define NIC_PORT_RX_Q_SIZE 3072  /* Size of Physical NIC RX Queue, Max (n+32<=4096)*/
+#define NIC_PORT_TX_Q_SIZE 3072  /* Size of Physical NIC TX Queue, Max (n+32<=4096)*/
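
As an alternative to rebuilding, OVS builds that expose the descriptor counts as interface options (n_rxq_desc is used this way later in comment 46) allow the same experiment from the command line; a hedged example, assuming both options are available in the installed package:

    ovs-vsctl set Interface dpdk0 options:n_rxq_desc=4096 options:n_txq_desc=4096
    ovs-vsctl set Interface dpdk1 options:n_rxq_desc=4096 options:n_txq_desc=4096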

Comment 6 Federico Iezzi 2017-11-03 17:16:43 UTC
Tested using OVS 2.7.2-4 (GA for OSP12) and the same issue is present.

Next week a test with Intel XL710 using original Intel firmware is planned.

Comment 7 Federico Iezzi 2017-11-07 16:06:11 UTC
Tested with an original Intel XL710 using the latest NVM (6.0.1) and it didn't fix the problem.

# lspci -s 03:00.0 -vv
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
        Subsystem: Intel Corporation Ethernet Converged Network Adapter XL710-Q2
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 64
        NUMA node: 0
        Region 0: Memory at 92800000 (64-bit, prefetchable) [disabled] [size=8M]
        Region 3: Memory at 93008000 (64-bit, prefetchable) [disabled] [size=32K]
        Expansion ROM at 93900000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable- Count=129 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00001000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <2us, L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [e0] Vital Product Data
                Product Name: XL710 40GbE Controller
                Read-only fields:
                        [PN] Part number:
                        [EC] Engineering changes:
                        [FG] Unknown:
                        [LC] Unknown:
                        [MN] Manufacture ID:
                        [PG] Unknown:
                        [SN] Serial number:
                        [V0] Vendor specific:
                        [RV] Reserved: checksum good, 0 byte(s) reserved
                Read/write fields:
                        [V1] Vendor specific:
                End
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
        Capabilities: [140 v1] Device Serial Number d0-53-9f-ff-ff-fe-fd-3c
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 16, stride: 1, Device ID: 154c
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000000093a00000 (64-bit, prefetchable)
                Region 3: Memory at 0000000094200000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                No steering table available
        Capabilities: [1b0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [1d0 v1] #19
        Kernel driver in use: vfio-pci
        Kernel modules: i40e

Comment 8 Franck Baudin 2017-11-10 18:28:32 UTC
The PMD threads are sharing a lock and are using system calls. I didn't dig into the code; however, this is likely to end up with packet loss. I'm using two XL710s on RHOSP10.z5, deployed by director, and I have re-pinned the emulator thread. I'm running the VM cross-NUMA, in a similar configuration (9K MTU, bonded DPDK interfaces). I was not expecting to see syscalls or locks on PMD threads. Is it related to the bonding? To the XL710?


 8266 ?        S<Lsl 105:27 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach

cd /proc/8266
grep -i Cpus_allowed_list task/*/status

task/8739/status:Cpus_allowed_list:	1
task/8740/status:Cpus_allowed_list:	45
task/8741/status:Cpus_allowed_list:	23
task/8742/status:Cpus_allowed_list:	67
[root@overcloud-compute-0 8266]# strace -p 8740
strace: Process 8740 attached
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 0
write(55, "\0", 1)                      = 1
^Cstrace: Process 8740 detached
[root@overcloud-compute-0 8266]# strace -p 8742
strace: Process 8742 attached
write(55, "\0", 1)                      = 1
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 1
write(55, "\0", 1)                      = 1
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 1
write(55, "\0", 1)                      = 1
^Cstrace: Process 8742 detached
[root@overcloud-compute-0 8266]# strace -p 8739
strace: Process 8739 attached
write(111, "\1\0^\0\0\r\\E'\377S\241\201\0\0\2\10\0E\300\0006h\307\0\0\1g\266\247\n'"..., 72) = 72
write(145, "\1\0\0\0\0\0\0\0", 8)       = 8
write(106, "\1\0^\0\0\r\\E'\377S\241\10\0E\300\0006h\307\0\0\1g\266\247\n'\256\376\340\0"..., 68) = 68
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(111, "33\0\0\0\r\\E'\377S\241\201\0\0\2\206\335l\0\0\0\0008g\1\376\200\0R\0\0"..., 114) = 114
write(145, "\1\0\0\0\0\0\0\0", 8)       = 8
write(106, "33\0\0\0\r\\E'\377S\241\206\335l\0\0\0\0008g\1\376\200\0R\0\0'\256\0\0"..., 110) = 110
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8)       = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8)       = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8)       = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 0
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8)       = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8)       = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8)       = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(55, "\0", 1)                      = 1
write(55, "\0", 1)                      = 1
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 0
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8)       = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60

perf top -C 67 -z
Samples: 8K of event 'cycles', Event count (approx.): 5346503026 [z]
Overhead  Shared Object       Symbol
  30.57%  ovs-vswitchd        [.] rte_vhost_enqueue_burst
  26.79%  ovs-vswitchd        [.] dp_netdev_process_rxq_port.isra.31
  12.32%  ovs-vswitchd        [.] netdev_dpdk_rxq_recv
  10.41%  ovs-vswitchd        [.] i40e_recv_pkts_vec
   4.27%  ovs-vswitchd        [.] miniflow_extract
   3.57%  ovs-vswitchd        [.] __netdev_dpdk_vhost_send
   3.26%  ovs-vswitchd        [.] pmd_thread_main
   2.74%  ovs-vswitchd        [.] dp_netdev_input__
   1.86%  ovs-vswitchd        [.] netdev_rxq_recv
   1.10%  ovs-vswitchd        [.] non_atomic_ullong_add
   0.79%  libc-2.17.so        [.] __memcmp_sse4_1
   0.60%  ovs-vswitchd        [.] is_vhost_running
   0.30%  [vdso]              [.] __vdso_clock_gettime
   0.25%  ovs-vswitchd        [.] virtio_enqueue_offload
   0.12%  ovs-vswitchd        [.] time_msec
   0.12%  ovs-vswitchd        [.] dp_execute_cb
   0.10%  ovs-vswitchd        [.] memcmp@plt
   0.10%  ovs-vswitchd        [.] __popcountdi2
   0.10%  ovs-vswitchd        [.] odp_execute_actions
   0.10%  ovs-vswitchd        [.] tx_port_lookup
   0.08%  ovs-vswitchd        [.] rte_mov128
   0.06%  ovs-vswitchd        [.] rte_mov32
   0.06%  ovs-vswitchd        [.] time_timespec__
   0.04%  ovs-vswitchd        [.] rte_mov64
   0.04%  libpthread-2.17.so  [.] pthread_once
   0.04%  libc-2.17.so        [.] __clock_gettime
   0.03%  libpthread-2.17.so  [.] pthread_getspecific
   0.03%  ovs-vswitchd        [.] ovsrcu_init_module
   0.03%  ovs-vswitchd        [.] ovsrcu_try_quiesce
   0.03%  ovs-vswitchd        [.] netdev_send
   0.01%  ovs-vswitchd        [.] xclock_gettime
   0.01%  ovs-vswitchd        [.] get_device
   0.01%  ovs-vswitchd        [.] ovs_mutex_trylock_at
   0.01%  ovs-vswitchd        [.] seq_read_protected
   0.01%  ovs-vswitchd        [.] nl_attr_type
   0.01%  ovs-vswitchd        [.] nl_attr_get_u32
   0.01%  libpthread-2.17.so  [.] pthread_mutex_trylock

[root@overcloud-compute-0 8266]# ovs-vsctl --column=other_config list open_vswitch
other_config        : {dpdk-init="true", dpdk-lcore-mask="40000100000400001", dpdk-socket-mem="2048,2048", pmd-cpu-mask="80000200000800002"}

In the VM
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...

  ---------------------- Forward statistics for port 0  ----------------------
  RX-packets: 6944578778     RX-dropped: 0             RX-total: 6944578778
  TX-packets: 6944578191     TX-dropped: 587           TX-total: 6944578778
  ----------------------------------------------------------------------------

  +++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
  RX-packets: 6944578778     RX-dropped: 0             RX-total: 6944578778
  TX-packets: 6944578191     TX-dropped: 587           TX-total: 6944578778
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

So the OVS-DPDK PMD seems preempted...

Comment 9 Timothy Redaelli 2017-11-13 11:22:32 UTC
Is this problem only present on multi-NUMA machines?

Comment 10 Federico Iezzi 2017-11-13 12:06:38 UTC
UMA has been deprecated by Intel for any datacenter use case since the Nehalem microarchitecture (2007), and even a few years earlier with AMD.

In datacenter and telco environments, all of the servers are at least SMP 2P.

I have no such hardware to test with and possibly reproduce this problem.

Comment 32 Eelco Chaudron 2017-12-20 14:21:39 UTC
I have not been able to replicate the problem in-house; however, our architect was able to reduce the setup's complexity at the customer site.

All they have is a simple rule that matches traffic ingressing on the first
port of the bond (the other port is on standby) and sends it back out.
The traffic being sent in is 2Mpps and has the following pattern:

  - 78bytes packet, VLAN 304, SMAC:=00:00:dd:c8:12:0a, DMAC: 24:6e:96:5d:4a:54,
    IP(1.1.1.1/2.2.2.2)/UDP(69/69)

Here are some of the configuration/flow dumps:
  NOTE: br-int is not used for replication, and no VMs are running!!

# ovs-vsctl show
0acb4b20-7d91-4e2c-b290-54a1fe56775a
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port int-br-provider
            Interface int-br-provider
                type: patch
                options: {peer=phy-br-provider}
        Port br-int
            Interface br-int
                type: internal
        Port "vhu07828ecb-01"
            tag: 1
            Interface "vhu07828ecb-01"
                type: dpdkvhostuser
        Port "vhu2dc05752-6e"
            tag: 2
            Interface "vhu2dc05752-6e"
                type: dpdkvhostuser
        Port "vhu046f0de6-d2"
            tag: 3
            Interface "vhu046f0de6-d2"
                type: dpdkvhostuser
        Port "vhu76d6839c-a0"
            tag: 1
            Interface "vhu76d6839c-a0"
                type: dpdkvhostuser
    Bridge br-provider
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-provider
            Interface br-provider
                type: internal
        Port phy-br-provider
            Interface phy-br-provider
                type: patch
                options: {peer=int-br-provider}
        Port "bond2"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:19:00.0"}
            Interface "dpdk1"
                type: dpdk
                options: {dpdk-devargs="0000:19:00.1"}
    ovs_version: "2.8.0.nowdg"

# ovs-vsctl list Open_vSwitch
_uuid               : 0acb4b20-7d91-4e2c-b290-54a1fe56775a
bridges             : [8b47bb6b-384d-4397-a157-f946f8fc7ee2, ef19c71d-de76-459e-946f-5ec264bde2f6]
cur_cfg             : 372
datapath_types      : [netdev, system]
db_version          : "7.15.0"
external_ids        : {hostname="overcloud-compute-dpdk-1.localdomain", rundir="/var/run/openvswitch", system-id="7a73873f-edd2-42e5-a128-ff0a1f32b061"}
iface_types         : [dpdk, dpdkr, dpdkvhostuser, dpdkvhostuserclient, geneve, gre, internal, lisp, patch, stt, system, tap, vxlan]
manager_options     : [61c9fe00-880e-47b1-96da-729a9005cceb]
next_cfg            : 372
other_config        : {dpdk-init="true", dpdk-lcore-mask="3", dpdk-socket-mem="4096,4096", pmd-cpu-mask="1554"}
ovs_version         : "2.8.0.nowdg"
ssl                 : []
statistics          : {}
system_type         : rhel
system_version      : "7.4"

    # ovs-ofctl dump-flows br-provider
    NXST_FLOW reply (xid=0x4):
     cookie=0x0, duration=1337.315s, table=0, n_packets=288476187, n_bytes=21347237838, idle_age=0, priority=5000,dl_src=00:00:05:68:76:ef,dl_dst=fa:16:3e:48:f6:16 actions=mod_dl_src:fa:16:3e:b3:8c:02,mod_dl_dst:00:00:dd:c8:12:0a,mod_vlan_vid:304,IN_PORT
     cookie=0xbf45ee45528cfba2, duration=2220.097s, table=0, n_packets=368, n_bytes=31588, idle_age=521, priority=4,in_port=3,dl_vlan=2 actions=mod_vlan_vid:305,NORMAL
     cookie=0xbf45ee45528cfba2, duration=2220.052s, table=0, n_packets=2462, n_bytes=227320, idle_age=1, priority=4,in_port=3,dl_vlan=1 actions=mod_vlan_vid:306,NORMAL
     cookie=0xbf45ee45528cfba2, duration=2220.006s, table=0, n_packets=371, n_bytes=31714, idle_age=523, priority=4,in_port=3,dl_vlan=3 actions=mod_vlan_vid:304,NORMAL
     cookie=0xbf45ee45528cfba2, duration=2221.159s, table=0, n_packets=0, n_bytes=0, idle_age=10245, priority=2,in_port=3 actions=drop
     cookie=0xbf45ee45528cfba2, duration=2221.205s, table=0, n_packets=699381, n_bytes=49820119, idle_age=0, priority=0 actions=NORMAL

    # ovs-appctl fdb/show br-provider
     port  VLAN  MAC                Age
        1   304  00:04:96:8f:ca:d6    2
        1   305  00:04:96:98:1a:be    1
        1   306  00:04:96:8f:ca:d6    1
        1   304  00:04:96:98:1a:be    1
        1   213  00:00:5e:00:01:01    1
        1   373  00:00:5e:00:01:01    1
        1   304  00:00:5e:00:02:64    1
        1   213  00:e0:2b:00:00:01    1
        1   304  00:00:dd:c8:12:0a    0
        1   304  00:00:5e:00:01:64    0
        1   370  00:00:5e:00:01:01    0
        1   306  00:04:96:98:1a:be    0
        1   307  00:04:96:98:1a:be    0
        1   371  00:00:5e:00:01:01    0
        1   369  00:00:5e:00:01:01    0
        1   307  00:04:96:8f:ca:d6    0
        1   305  00:00:5e:00:01:64    0
        1   305  00:00:5e:00:02:64    0
        1   372  00:00:5e:00:01:01    0
        1   306  00:00:5e:00:01:64    0
        1   306  00:00:5e:00:02:64    0
        1   305  00:04:96:8f:ca:d6    0
        1   307  00:00:5e:00:01:64    0
        1   371  40:a6:77:4b:f7:c5    0
        1   370  40:a6:77:4b:07:c5    0

The only difference compared to my replication efforts is that there is more
traffic than just what comes from the traffic generator. I've tested enabling LLDP and
RSTP on my Juniper, but still no luck replicating. I've asked Federico to
shut off all extra traffic and try again.

The customer was using the 5.x version of the firmware when doing these tests.

Going over the code, the only real difference between an active-backup bond and
just one non-bonded port is the frequent link checks. So I modified the code
to skip these to see if it solved the drops, and it did (note this shows the diff
on 2.6, but the customer did the actual test with 2.8.0):

    diff -r -U5 openvswitch-2.6.1_org/ofproto/bond.c openvswitch-2.6.1/ofproto/bond.c
    --- openvswitch-2.6.1_org/ofproto/bond.c	2016-09-28 02:26:58.044647850 -0400
    +++ openvswitch-2.6.1/ofproto/bond.c	2017-12-20 02:18:53.902070806 -0500
    @@ -1667,11 +1667,12 @@
     bond_link_status_update(struct bond_slave *slave)
     {
         struct bond *bond = slave->bond;
         bool up;

    -    up = netdev_get_carrier(slave->netdev) && slave->may_enable;
    +    // up = netdev_get_carrier(slave->netdev) && slave->may_enable;
    +    up = true && slave->may_enable;
         if ((up == slave->enabled) != (slave->delay_expires == LLONG_MAX)) {
             static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
             VLOG_INFO_RL(&rl, "interface %s: link state %s",
                          slave->name, up ? "up" : "down");
             if (up == slave->enabled) {


I did one additional change, as I noticed two threads are executing the link
checks in parallel (watchdog and bond). So I tried the below change on its own, but
it did not solve the issue:

diff -p -r -U5 openvswitch-2.6.1_org/lib/netdev-dpdk.c openvswitch-2.6.1/lib/netdev-dpdk.c
--- openvswitch-2.6.1_org/lib/netdev-dpdk.c	2016-09-28 02:26:57.968641194 -0400
+++ openvswitch-2.6.1/lib/netdev-dpdk.c	2017-12-20 04:37:41.599983888 -0500
@@ -600,13 +600,13 @@ dpdk_watchdog(void *dummy OVS_UNUSED)
     for (;;) {
         ovs_mutex_lock(&dpdk_mutex);
         LIST_FOR_EACH (dev, list_node, &dpdk_list) {
             ovs_mutex_lock(&dev->mutex);
-            if (dev->type == DPDK_DEV_ETH) {
-                check_link_status(dev);
-            }
+            /* if (dev->type == DPDK_DEV_ETH) { */
+            /*     check_link_status(dev); */
+            /* } */
             ovs_mutex_unlock(&dev->mutex);
         }
         ovs_mutex_unlock(&dpdk_mutex);
         xsleep(DPDK_PORT_WATCHDOG_INTERVAL);
     }

Comment 34 Eelco Chaudron 2017-12-21 08:53:24 UTC
*** Bug 1522625 has been marked as a duplicate of this bug. ***

Comment 35 Jason Dian 2017-12-21 09:02:27 UTC
From https://bugzilla.redhat.com/show_bug.cgi?id=1522625#c35: the driver didn't pick them up, so packets drop.

Comment 37 Federico Iezzi 2017-12-22 08:56:59 UTC
Hi all,

Over the past days I worked with the customer to understand why we struggle to reproduce the issue in-house.
All of their environments are production-comparable.

One major difference is noise traffic coming from other devices (L2 switches, multiple VRRPs, LLDP, EDP, etc).

We've analyzed the traffic multiple times without finding anything suspicious.
The most "weird" thing is a number of broadcasts, about a hundred per minute.

It turns out that by generating fake random unicast traffic with the ping command, using 5 different sources, we reproduced the issue.

Even more interesting, it's not the frequency: once we spotted the pattern, we thought that hundreds of broadcasts could have an effect, but generating a few thousand from the same source didn't trigger the issue, while increasing the number of sources, even with very few requests per second, made the OVS-DPDK bond drop.

The unicast traffic has been generated from a 3rd node on the same network.

I'm going to attach the relevant PCAP that matches the initial drops (output below). The pcap has a lot of junk in it, but the compute node that receives the IXIA traffic (and the ARP) sees the right stuff (no VRRP, no ICMPv6, etc.).

Thu Dec 21 17:11:03 CET 2017 dropped:1984334
Thu Dec 21 17:12:14 CET 2017 dropped:1989426
Thu Dec 21 17:13:29 CET 2017 dropped:2002305
Thu Dec 21 17:14:40 CET 2017 dropped:2012338
Thu Dec 21 17:14:42 CET 2017 dropped:2018647
Thu Dec 21 17:15:53 CET 2017 dropped:2036265
Thu Dec 21 17:15:55 CET 2017 dropped:2056995
Thu Dec 21 17:17:06 CET 2017 dropped:2059542
Thu Dec 21 17:19:32 CET 2017 dropped:2084900
Thu Dec 21 17:20:44 CET 2017 dropped:2096477
Thu Dec 21 17:20:45 CET 2017 dropped:2112150
Thu Dec 21 17:23:12 CET 2017 dropped:2113076

I'm going to attach the script too.
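
The attached script (comment 40) is not reproduced here; a minimal sketch of the idea described above, with a placeholder interface name and addresses that are not taken from the attachment:

    #!/bin/bash
    # Generate unicast/ARP traffic from several different source addresses.
    # IFACE and the 192.0.2.x addresses are placeholders, not the real lab values.
    IFACE=eno1
    for i in 1 2 3 4 5; do
        ip addr add 192.0.2.$((200 + i))/24 dev "$IFACE"
    done
    while true; do
        for i in 1 2 3 4 5; do
            # one short ping per source toward a random host on the subnet
            ping -c 1 -W 1 -I 192.0.2.$((200 + i)) \
                192.0.2.$((RANDOM % 100 + 1)) >/dev/null 2>&1 &
        done
        sleep 0.2
    done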

Comment 38 Federico Iezzi 2017-12-22 08:57:46 UTC
Created attachment 1371211 [details]
pcap file

Comment 39 Federico Iezzi 2017-12-22 08:58:21 UTC
Created attachment 1371212 [details]
wireshark I/O analyzer

Comment 40 Federico Iezzi 2017-12-22 08:59:03 UTC
Created attachment 1371213 [details]
script for generating random unicast traffic

Comment 41 Eelco Chaudron 2017-12-22 10:56:24 UTC
Created attachment 1371248 [details]
Picture of recreation lab at RedHat

Comment 42 Eelco Chaudron 2017-12-22 10:59:44 UTC
Finally, we have also been able to replicate this issue in-house at Red Hat.
I've attached a diagram, rhlab.png, that represents the setup we are using.

Switch configuration:
=====================
  The switch has three VLANs configured. VLANs 304 and 305 have the respective
  VLAN tags configured and are assigned to port dpdk0 and the two test devices.

  The second XL710 port, dpdk1, has its own isolated VLAN; its only purpose is
  to keep the link up.


Xena (traffic generator):
=========================
  The Xena generates a steady 2Mpps, 78 bytes packet stream with the following
  characteristics:

    - SMAC:=00:00:dd:c8:12:0a, DMAC: 24:6e:96:5d:4a:54, VLAN 304, \
      IP(1.1.1.1/2.2.2.2)/UDP(69/69)

ARP Generator:
==============
  This is a system with a basic OVS configuration. Bridge br-provider has a
  single physical port connected to the Juniper switch. It also has a single
  ovs rule to forward all traffic out of this port (8):
    ovs-ofctl add-flow ovs_pvp_br0 "priority=0 actions=output:8"

  Then you run the make_drops.sh, which should result in ARP requests reaching
  dpdk0.


DUT configuration:
==================

  Configuration with the bond interface. Note that the first OpenFlow rule will
  reflect all the traffic back on dpdk0:

    ovs-vsctl del-br br-provider
    ovs-vsctl --may-exist add-br br-provider -- set bridge br-provider datapath_type=netdev
    ovs-vsctl --may-exist add-bond br-provider bond2 dpdk0 dpdk1 -- set interface dpdk0 type=dpdk -- set interface dpdk1 type=dpdk -- set port bond2 bond_mode="active-backup"
    ovs-vsctl set Interface dpdk0 options:dpdk-devargs=0000:05:00.0
    ovs-vsctl set Interface dpdk1 options:dpdk-devargs=0000:05:00.1

    ovs-ofctl del-flows br-provider
    ovs-ofctl add-flow br-provider "priority=5000,dl_src=00:00:dd:c8:12:0a,dl_dst=24:6e:96:5d:4a:54 actions=mod_dl_src:fa:16:3e:c4:29:42,mod_dl_dst:00:00:dd:c8:12:0a,mod_vlan_vid:305,output:IN_PORT"
    ovs-ofctl add-flow br-provider "priority=0 actions=NORMAL"


Configuration without the bond interface:

    ovs-vsctl del-br br-provider
    ovs-vsctl --may-exist add-br br-provider -- set bridge br-provider datapath_type=netdev
    ovs-vsctl --may-exist add-port br-provider dpdk0 -- set interface dpdk0 type=dpdk
    ovs-vsctl set Interface dpdk0 options:dpdk-devargs=0000:05:00.0

    ovs-ofctl del-flows br-provider
    ovs-ofctl add-flow br-provider "priority=5000,dl_src=00:00:dd:c8:12:0a,dl_dst=24:6e:96:5d:4a:54 actions=mod_dl_src:fa:16:3e:c4:29:42,mod_dl_dst:00:00:dd:c8:12:0a,mod_vlan_vid:305,output:IN_PORT"
    ovs-ofctl add-flow br-provider "priority=0 actions=NORMAL"


Replication:
============
Configure the DUT with the bond configuration, and start the traffic generator.
After a couple of seconds stop/clear the counters and start again (to avoid any
learning loss). Now, in addition, start the make_drops.sh script, and watch for
packet drops.

Sometimes the drops occur immediately, sometimes it takes a couple of minutes.
I've always seen it fail within 15 minutes.

If you repeat the test without the bond, you will not see the failures.
Longest I've waited was an hour.

You will also not see the failures with the bond interface when using the
patched version of OVS where it does not check for the bond link status.

Comment 43 Jason Dian 2017-12-26 02:13:21 UTC
Are there any updates? Huawei STC requests that we solve this issue this month, otherwise the commercial release for Phase I will be delayed. This is the first priority for now, so we need your full support.

Comment 45 Andrew Theurer 2017-12-27 17:41:14 UTC
(In reply to Eelco Chaudron from comment #42)
> Finally, we have been able to also replicate this issue in-house at RedHat.
> I've attached a diagram rhlab.png that represents the setup we are using.
> 

Eelco, do you think it's possible to use testpmd with the same configuration as OVS?  If so, I am curious if it shows the same outcome.

Comment 46 Eelco Chaudron 2018-01-02 15:37:21 UTC
I continued debugging where I left off, as I thought I noticed a slight increase in the per-packet cycles (pmd-stats-show). I did some profiling of the i40e tx/rx functions, but found nothing odd there.

After looking at the stats I noticed that the packets being dropped end up increasing rte_stats.imissed on the XL710. Increasing the rxq descriptors to the maximum did not help:

      ovs-vsctl set Interface dpdk0 options:n_rxq_desc=4096

I did find the following thread:
  http://dpdk.org/ml/archives/dev/2017-January/054105.html

However, I do not have any configurable options like this on my Dell PowerEdge; maybe the customer can check whether he has anything similar on his server and try it.

I've been testing with OVS 2.8.0, but to be sure I also tried OVS 2.9.0 with DPDK 17.11; the same problem occurs.

Comment 47 Eelco Chaudron 2018-01-02 15:45:13 UTC
> Eelco, do you think it's possible to use testpmd with the same configuration
> as OVS?  If so, I am curious if it shows the same outcome.

You could probably loop traffic back in testpmd and continuously run the "show port info" command, which checks the link status.
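
A rough sketch of such a testpmd loopback, using the XL710 PCI addresses from this report but a placeholder core list and DPDK 17.x-era testpmd options (an assumption, not an actual tested command line):

    testpmd -l 1,2,3 -n 4 -w 0000:03:00.0 -w 0000:03:00.1 --socket-mem 1024,1024 \
            -- -i --forward-mode=macswap --rxq=1 --txq=1
    # at the interactive prompt:
    #   testpmd> start
    #   testpmd> show port info 0    <-- repeat; this queries the link status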

Comment 57 Mariusz Stachura 2018-01-12 15:15:56 UTC
Hello,

I'm trying to understand the issue, so my questions might not be very accurate.

Is it possible to check this without DPDK (without the DPDK version of i40e)?

Are there any logs from the i40e? Like PMD_DRV_LOG?

Comment 58 Eelco Chaudron 2018-01-12 15:50:15 UTC
(In reply to Mariusz Stachura from comment #57)
> Hello,
> 
> I'm trying to understand the issue, so my questions might not be very
> accurate.
> 
> Is it possible to check this with out DPDK (with out DPKD version of i40e).
> 
> Are there any logs from the i40e? Like PMD_DRV_LOG?

Hi Mariusz,

This is only an issue with OVS-DPDK, and it is related to link polling. You guys (Intel) are already aware of this and are working on it. You might want to reach out to Michael Brennan if you need more details.

Comment 63 carolyn.wyborny 2018-01-20 00:25:33 UTC
Is this a different issue than the one being worked on with Mike Brennan of the DPDK team? I don't see that interaction in this BZ. Should this be linked to that? If we need additional base driver work on this, I will have someone from my team work on it.

Comment 64 Eelco Chaudron 2018-01-22 07:26:49 UTC
Currently, Mike Brennan and Roy Fan Zhang from Intel are working on finding the root cause.

Comment 65 Jason Dian 2018-01-25 03:28:57 UTC
Can we release a workaround if this cannot be fixed by Intel this week? Huawei needs this workaround.

Comment 66 Eelco Chaudron 2018-01-25 10:15:36 UTC
(In reply to Jason Dian from comment #65)
> Can we release a workaround, if we can not fixed by Intel in this week.
> huawei need this workaround.

The workaround is still being discussed upstream, and until it has been approved it cannot be applied to our OVS package. Secondly, this workaround will only apply to OVS 2.8 and up due to DPDK version-specific dependencies; that version is only available in the fast datapath beta.

In the meantime, Intel has identified the root cause:

In a bonded link running in “active-backup” mode, the ovs-vswitchd daemon periodically checks the link status of the ports in a bond; as part of this check, it needs to take a write lock on a hashmap that stores the details of all bonds. DPDK PMDs take a read lock on the same hashmap when handling an upcall for unrecognized traffic streams, as part of the NORMAL action. The more unrecognized traffic streams that ingress a bonded port, the more frequently PMDs need to acquire the read lock on the bond hashmap. Since ovs-vswitchd invokes the i40e poll mode callback function (which can take 20-40ms) while it holds the write lock, PMDs are unable to acquire the read lock that is required to process upcalls for ingress packets during that time; this effectively stalls the PMD, resulting in packet drops.


Having said this, fixing this might not be as straightforward for various reasons. I'll keep updating this BZ as progress is made.
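
The locking interaction described above can be illustrated with a simplified sketch using a plain pthread rwlock; the function names are placeholders and this is not the actual OVS code:

    #include <pthread.h>
    #include <unistd.h>

    static pthread_rwlock_t bond_rwlock = PTHREAD_RWLOCK_INITIALIZER;

    /* ovs-vswitchd main thread: periodic bond maintenance. */
    void bond_run_sketch(void)
    {
        pthread_rwlock_wrlock(&bond_rwlock);
        /* the carrier check ends up in the i40e link-status callback,
         * which can take 20-40 ms while the write lock is held */
        usleep(30000);
        pthread_rwlock_unlock(&bond_rwlock);
    }

    /* PMD thread: admissibility check done as part of the NORMAL action. */
    int bond_check_admissibility_sketch(void)
    {
        /* blocks for the whole 20-40 ms above; meanwhile the NIC rx ring
         * overflows and rx_dropped/imissed increases */
        pthread_rwlock_rdlock(&bond_rwlock);
        int admissible = 1;    /* actual decision elided */
        pthread_rwlock_unlock(&bond_rwlock);
        return admissible;
    }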

Comment 67 Federico Iezzi 2018-01-25 14:35:20 UTC
Hi Eelco,

Very nice to see Intel found the root cause!
I've thought about your comment; wouldn't it be better to take the write lock only when a link status change is detected?

Thanks,
Federico

Comment 68 Eelco Chaudron 2018-01-27 10:23:02 UTC
(In reply to fiezzi from comment #67)

> Very nice to see Intel found the root cause!
> I’ve thought about your comment, wouldn’t be better having a write lock if a link status change is detected?

Maybe I do not get your question, as we do take a write lock for link state detection by the hardware in bond_run(). The PMD threads call bond_check_admissibility().

Comment 80 Eelco Chaudron 2018-03-23 08:04:49 UTC
Closing this BZ, as it will be fixed in the 7.4 fast datapath through BZ 1559612.

*** This bug has been marked as a duplicate of bug 1559612 ***

