Bug 1506700
| Field | Value |
|---|---|
| Summary | Intel XL710 and OVS-DPDK bond have a fixed 0.01% frame loss |
| Product | Red Hat Enterprise Linux 7 |
| Component | openvswitch |
| Version | 7.5 |
| Hardware | x86_64 |
| OS | Unspecified |
| Status | CLOSED DUPLICATE |
| Severity | urgent |
| Priority | urgent |
| Reporter | Federico Iezzi <fiezzi> |
| Assignee | Eelco Chaudron <echaudro> |
| QA Contact | Hekai Wang <hewang> |
| CC | aglotov, akaris, akarlsso, apevec, atelang, atheurer, atragler, carolyn.wyborny, chrisw, ctrautma, dbayly, djuran, echaudro, fbaudin, fherrman, fiezzi, jean-mickael.guerin, jraju, jshortt, knakai, ktraynor, kzhang, linville, mariusz.stachura, marjones, nhorman, ovs-qe, pablo.iranzo, patryk.malek, pvauter, rdian, rhos-maint, rkhan, sambhu.kalaga, sassmann, srevivo, supadhya, tjackson, tredaelli, vchundur, weiyongjun |
| Target Milestone | pre-dev-freeze |
| Keywords | Triaged, ZStream |
| Clones | 1551761, 1553786 (view as bug list) |
| Bug Depends On | 1551761 |
| Bug Blocks | 1339866 |
| Type | Bug |
| Last Closed | 2018-03-23 08:04:49 UTC |
XL710 requires large queue sizes to avoid drops; this is a well-known issue. Adding Kevin to the loop to see if we can provide a hotfix with queue sizes of 3072 in OVS-DPDK.

Hi Franck, are you talking about the following?

- https://communities.intel.com/community/tech/wired/blog/2017/01/09/intel-ethernet-x520-to-xl710-tuning-the-buffers-a-practical-guide-to-reduce-or-avoid-packet-loss-in-dpdk-applications
- https://github.com/openvswitch/ovs/commit/b685696b8c813c9b7869eace797974b8ca69db10#diff-8d3414bfe59e75fbc75067f772ee89da

If that's the case, and I understand it correctly, I would expect frame drops at some Mpps, not just at 100Kpps. The drops even happen with LACP, so at the end of the day each NIC is processing only 50Kpps.

(In reply to Franck Baudin from comment #2)
> XL710 requires large queues size to avoid drops, this is a well known issue,
> adding Kevin in the loop to see if we can provide a hotfix with 3072 bytes
> queues sizes in OVS-DPDK.

If you want to test out this theory, you can just change the default defines:

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 8adc723..5d2b1f0 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -133,6 +133,6 @@ BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF))
 #define SOCKET0 0
-#define NIC_PORT_RX_Q_SIZE 2048 /* Size of Physical NIC RX Queue, Max (n+32<=4096) */
-#define NIC_PORT_TX_Q_SIZE 2048 /* Size of Physical NIC TX Queue, Max (n+32<=4096) */
+#define NIC_PORT_RX_Q_SIZE 3072 /* Size of Physical NIC RX Queue, Max (n+32<=4096) */
+#define NIC_PORT_TX_Q_SIZE 3072 /* Size of Physical NIC TX Queue, Max (n+32<=4096) */

Tested using OVS 2.7.2-4 (GA for OSP12) and the same issue is present. Next week a test with an Intel XL710 using original Intel firmware is planned.

Tested with an original XL710 from Intel with the latest NVM (6.0.1); it did not fix the problem.
# lspci -s 03:00.0 -vv
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
Subsystem: Intel Corporation Ethernet Converged Network Adapter XL710-Q2
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 64
NUMA node: 0
Region 0: Memory at 92800000 (64-bit, prefetchable) [disabled] [size=8M]
Region 3: Memory at 93008000 (64-bit, prefetchable) [disabled] [size=32K]
Expansion ROM at 93900000 [disabled] [size=512K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable- Count=129 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00001000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <2us, L1 <16us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [e0] Vital Product Data
Product Name: XL710 40GbE Controller
Read-only fields:
[PN] Part number:
[EC] Engineering changes:
[FG] Unknown:
[LC] Unknown:
[MN] Manufacture ID:
[PG] Unknown:
[SN] Serial number:
[V0] Vendor specific:
[RV] Reserved: checksum good, 0 byte(s) reserved
Read/write fields:
[V1] Vendor specific:
End
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
Capabilities: [140 v1] Device Serial Number d0-53-9f-ff-ff-fe-fd-3c
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
VF offset: 16, stride: 1, Device ID: 154c
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 0000000093a00000 (64-bit, prefetchable)
Region 3: Memory at 0000000094200000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1a0 v1] Transaction Processing Hints
Device specific mode supported
No steering table available
Capabilities: [1b0 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [1d0 v1] #19
Kernel driver in use: vfio-pci
Kernel modules: i40e
The PMD threads are sharing a lock and are using system calls. I didn't dig into the code; however, this is likely to end up with packet loss. I'm using two XL710s on RHOSP10.z5, deployed by director, and I have re-pinned the emulator thread. I'm running the VM cross-NUMA, in a similar configuration (9K MTU, bonded DPDK interfaces). I was not expecting to see syscalls or locks on PMD threads. Is this related to the bonding, or to the XL710?
8266 ? S<Lsl 105:27 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
cd /proc/8266
grep -i Cpus_allowed_list task/*/status
task/8739/status:Cpus_allowed_list: 1
task/8740/status:Cpus_allowed_list: 45
task/8741/status:Cpus_allowed_list: 23
task/8742/status:Cpus_allowed_list: 67
[root@overcloud-compute-0 8266]# strace -p 8740
strace: Process 8740 attached
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 0
write(55, "\0", 1) = 1
^Cstrace: Process 8740 detached
[root@overcloud-compute-0 8266]# strace -p 8742
strace: Process 8742 attached
write(55, "\0", 1) = 1
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 1
write(55, "\0", 1) = 1
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 1
write(55, "\0", 1) = 1
^Cstrace: Process 8742 detached
[root@overcloud-compute-0 8266]# strace -p 8739
strace: Process 8739 attached
write(111, "\1\0^\0\0\r\\E'\377S\241\201\0\0\2\10\0E\300\0006h\307\0\0\1g\266\247\n'"..., 72) = 72
write(145, "\1\0\0\0\0\0\0\0", 8) = 8
write(106, "\1\0^\0\0\r\\E'\377S\241\10\0E\300\0006h\307\0\0\1g\266\247\n'\256\376\340\0"..., 68) = 68
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(111, "33\0\0\0\r\\E'\377S\241\201\0\0\2\206\335l\0\0\0\0008g\1\376\200\0R\0\0"..., 114) = 114
write(145, "\1\0\0\0\0\0\0\0", 8) = 8
write(106, "33\0\0\0\r\\E'\377S\241\206\335l\0\0\0\0008g\1\376\200\0R\0\0'\256\0\0"..., 110) = 110
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8) = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8) = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8) = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 0
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8) = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8) = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8) = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
write(55, "\0", 1) = 1
write(55, "\0", 1) = 1
futex(0x561cdfc8b3a0, FUTEX_WAKE_PRIVATE, 1) = 0
write(111, "\377\377\377\377\377\377\212\266\307/\30@\201\0\0\2\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@"..., 64) = 64
write(145, "\1\0\0\0\0\0\0\0", 8) = 8
write(106, "\377\377\377\377\377\377\212\266\307/\30@\10\6\0\1\10\0\6\4\0\1\212\266\307/\30@\300\0\2\n"..., 60) = 60
perf top -C 67 -z
Samples: 8K of event 'cycles', Event count (approx.): 5346503026 [z]
Overhead Shared Object Symbol
30.57% ovs-vswitchd [.] rte_vhost_enqueue_burst
26.79% ovs-vswitchd [.] dp_netdev_process_rxq_port.isra.31
12.32% ovs-vswitchd [.] netdev_dpdk_rxq_recv
10.41% ovs-vswitchd [.] i40e_recv_pkts_vec
4.27% ovs-vswitchd [.] miniflow_extract
3.57% ovs-vswitchd [.] __netdev_dpdk_vhost_send
3.26% ovs-vswitchd [.] pmd_thread_main
2.74% ovs-vswitchd [.] dp_netdev_input__
1.86% ovs-vswitchd [.] netdev_rxq_recv
1.10% ovs-vswitchd [.] non_atomic_ullong_add
0.79% libc-2.17.so [.] __memcmp_sse4_1
0.60% ovs-vswitchd [.] is_vhost_running
0.30% [vdso] [.] __vdso_clock_gettime
0.25% ovs-vswitchd [.] virtio_enqueue_offload
0.12% ovs-vswitchd [.] time_msec
0.12% ovs-vswitchd [.] dp_execute_cb
0.10% ovs-vswitchd [.] memcmp@plt
0.10% ovs-vswitchd [.] __popcountdi2
0.10% ovs-vswitchd [.] odp_execute_actions
0.10% ovs-vswitchd [.] tx_port_lookup
0.08% ovs-vswitchd [.] rte_mov128
0.06% ovs-vswitchd [.] rte_mov32
0.06% ovs-vswitchd [.] time_timespec__
0.04% ovs-vswitchd [.] rte_mov64
0.04% libpthread-2.17.so [.] pthread_once
0.04% libc-2.17.so [.] __clock_gettime
0.03% libpthread-2.17.so [.] pthread_getspecific
0.03% ovs-vswitchd [.] ovsrcu_init_module
0.03% ovs-vswitchd [.] ovsrcu_try_quiesce
0.03% ovs-vswitchd [.] netdev_send
0.01% ovs-vswitchd [.] xclock_gettime
0.01% ovs-vswitchd [.] get_device
0.01% ovs-vswitchd [.] ovs_mutex_trylock_at
0.01% ovs-vswitchd [.] seq_read_protected
0.01% ovs-vswitchd [.] nl_attr_type
0.01% ovs-vswitchd [.] nl_attr_get_u32
0.01% libpthread-2.17.so [.] pthread_mutex_trylock
[root@overcloud-compute-0 8266]# ovs-vsctl --column=other_config list open_vswitch
other_config : {dpdk-init="true", dpdk-lcore-mask="40000100000400001", dpdk-socket-mem="2048,2048", pmd-cpu-mask="80000200000800002"}
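The hex CPU masks above map one set bit to one CPU id. A short helper (an illustration only, not part of OVS) decodes them; note that the decoded pmd-cpu-mask matches the Cpus_allowed_list output shown earlier for the PMD tasks (CPUs 1, 23, 45, 67):

```python
def cpu_mask_to_list(mask_hex: str) -> list:
    """Expand an OVS CPU mask (hex string) into the CPU ids it selects:
    bit N set in the mask means CPU N is included."""
    mask = int(mask_hex, 16)
    cpus = []
    bit = 0
    while mask:
        if mask & 1:
            cpus.append(bit)
        mask >>= 1
        bit += 1
    return cpus

# Masks copied from the other_config output above.
print(cpu_mask_to_list("80000200000800002"))   # pmd-cpu-mask    -> [1, 23, 45, 67]
print(cpu_mask_to_list("40000100000400001"))   # dpdk-lcore-mask -> [0, 22, 44, 66]
```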
In the VM
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...
---------------------- Forward statistics for port 0 ----------------------
RX-packets: 6944578778 RX-dropped: 0 RX-total: 6944578778
TX-packets: 6944578191 TX-dropped: 587 TX-total: 6944578778
----------------------------------------------------------------------------
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 6944578778 RX-dropped: 0 RX-total: 6944578778
TX-packets: 6944578191 TX-dropped: 587 TX-total: 6944578778
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
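For scale, the TX drops reported by testpmd inside the VM are a tiny fraction of the forwarded traffic (values copied from the statistics above), several orders of magnitude below the 0.01% loss reported at the NIC level:

```python
# Values from the testpmd forward statistics above.
rx_total = 6_944_578_778
tx_dropped = 587

loss_pct = tx_dropped / rx_total * 100
print(f"in-VM TX loss: {loss_pct:.2e} %")  # ~8.45e-06 %
```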
So the OVS-DPDK PMD seems preempted...
Is this problem only present on multi-NUMA machines? UMA has been deprecated by Intel for datacenter use cases since the Nehalem microarchitecture (2007), and even a few years earlier by AMD. In datacenter and telco environments, all of the servers are at least 2-socket SMP. I have no hardware on which to test and possibly reproduce this problem.

I have not been able to replicate the problem in-house; however, our architect was able to reduce the setup's complexity at the customer site.
All they have is a simple rule that matches traffic ingressing on the first
port of the bond (the other port is on standby) and sends it back out.
The traffic being sent in is 2Mpps, with the following pattern:
- 78-byte packets, VLAN 304, SMAC 00:00:dd:c8:12:0a, DMAC 24:6e:96:5d:4a:54,
  IP(1.1.1.1/2.2.2.2)/UDP(69/69)
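For reference, the 78-byte test frame described above can be reconstructed with a stdlib-only sketch. The IP/UDP checksums are left at zero for brevity, and whether the generator counts the 4-byte FCS within the 78 bytes is an assumption (here it is not counted):

```python
import struct

def mac(s):
    """Convert 'aa:bb:cc:dd:ee:ff' notation into 6 raw bytes."""
    return bytes(int(b, 16) for b in s.split(":"))

def build_test_frame():
    """Build the test frame described above: VLAN 304,
    SMAC 00:00:dd:c8:12:0a, DMAC 24:6e:96:5d:4a:54,
    IP 1.1.1.1 -> 2.2.2.2, UDP 69 -> 69, 78 bytes total."""
    eth = mac("24:6e:96:5d:4a:54") + mac("00:00:dd:c8:12:0a")
    dot1q = struct.pack("!HH", 0x8100, 304)           # TPID, TCI (PCP/DEI 0, VID 304)
    payload = b"\x00" * 32                            # pad so the frame totals 78 bytes
    udp = struct.pack("!HHHH", 69, 69, 8 + len(payload), 0)  # sport, dport, len, csum 0
    ip_len = 20 + len(udp) + len(payload)
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45, 0, ip_len, 0, 0, 64, 17, 0,       # ver/ihl..proto=UDP, csum 0
                     bytes([1, 1, 1, 1]), bytes([2, 2, 2, 2]))
    return eth + dot1q + struct.pack("!H", 0x0800) + ip + udp + payload

frame = build_test_frame()
print(len(frame))  # 78
```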
Here are some of the configuration/flow dumps:
NOTE: br-int is not used for replication, and no VMs are running!!
# ovs-vsctl show
0acb4b20-7d91-4e2c-b290-54a1fe56775a
Manager "ptcp:6640:127.0.0.1"
is_connected: true
Bridge br-int
Controller "tcp:127.0.0.1:6633"
is_connected: true
fail_mode: secure
Port int-br-provider
Interface int-br-provider
type: patch
options: {peer=phy-br-provider}
Port br-int
Interface br-int
type: internal
Port "vhu07828ecb-01"
tag: 1
Interface "vhu07828ecb-01"
type: dpdkvhostuser
Port "vhu2dc05752-6e"
tag: 2
Interface "vhu2dc05752-6e"
type: dpdkvhostuser
Port "vhu046f0de6-d2"
tag: 3
Interface "vhu046f0de6-d2"
type: dpdkvhostuser
Port "vhu76d6839c-a0"
tag: 1
Interface "vhu76d6839c-a0"
type: dpdkvhostuser
Bridge br-provider
Controller "tcp:127.0.0.1:6633"
is_connected: true
fail_mode: secure
Port br-provider
Interface br-provider
type: internal
Port phy-br-provider
Interface phy-br-provider
type: patch
options: {peer=int-br-provider}
Port "bond2"
Interface "dpdk0"
type: dpdk
options: {dpdk-devargs="0000:19:00.0"}
Interface "dpdk1"
type: dpdk
options: {dpdk-devargs="0000:19:00.1"}
ovs_version: "2.8.0.nowdg"
# ovs-vsctl list Open_vSwitch
_uuid : 0acb4b20-7d91-4e2c-b290-54a1fe56775a
bridges : [8b47bb6b-384d-4397-a157-f946f8fc7ee2, ef19c71d-de76-459e-946f-5ec264bde2f6]
cur_cfg : 372
datapath_types : [netdev, system]
db_version : "7.15.0"
external_ids : {hostname="overcloud-compute-dpdk-1.localdomain", rundir="/var/run/openvswitch", system-id="7a73873f-edd2-42e5-a128-ff0a1f32b061"}
iface_types : [dpdk, dpdkr, dpdkvhostuser, dpdkvhostuserclient, geneve, gre, internal, lisp, patch, stt, system, tap, vxlan]
manager_options : [61c9fe00-880e-47b1-96da-729a9005cceb]
next_cfg : 372
other_config : {dpdk-init="true", dpdk-lcore-mask="3", dpdk-socket-mem="4096,4096", pmd-cpu-mask="1554"}
ovs_version : "2.8.0.nowdg"
ssl : []
statistics : {}
system_type : rhel
system_version : "7.4"
# ovs-ofctl dump-flows br-provider
NXST_FLOW reply (xid=0x4):
cookie=0x0, duration=1337.315s, table=0, n_packets=288476187, n_bytes=21347237838, idle_age=0, priority=5000,dl_src=00:00:05:68:76:ef,dl_dst=fa:16:3e:48:f6:16 actions=mod_dl_src:fa:16:3e:b3:8c:02,mod_dl_dst:00:00:dd:c8:12:0a,mod_vlan_vid:304,IN_PORT
cookie=0xbf45ee45528cfba2, duration=2220.097s, table=0, n_packets=368, n_bytes=31588, idle_age=521, priority=4,in_port=3,dl_vlan=2 actions=mod_vlan_vid:305,NORMAL
cookie=0xbf45ee45528cfba2, duration=2220.052s, table=0, n_packets=2462, n_bytes=227320, idle_age=1, priority=4,in_port=3,dl_vlan=1 actions=mod_vlan_vid:306,NORMAL
cookie=0xbf45ee45528cfba2, duration=2220.006s, table=0, n_packets=371, n_bytes=31714, idle_age=523, priority=4,in_port=3,dl_vlan=3 actions=mod_vlan_vid:304,NORMAL
cookie=0xbf45ee45528cfba2, duration=2221.159s, table=0, n_packets=0, n_bytes=0, idle_age=10245, priority=2,in_port=3 actions=drop
cookie=0xbf45ee45528cfba2, duration=2221.205s, table=0, n_packets=699381, n_bytes=49820119, idle_age=0, priority=0 actions=NORMAL
# ovs-appctl fdb/show br-provider
port VLAN MAC Age
1 304 00:04:96:8f:ca:d6 2
1 305 00:04:96:98:1a:be 1
1 306 00:04:96:8f:ca:d6 1
1 304 00:04:96:98:1a:be 1
1 213 00:00:5e:00:01:01 1
1 373 00:00:5e:00:01:01 1
1 304 00:00:5e:00:02:64 1
1 213 00:e0:2b:00:00:01 1
1 304 00:00:dd:c8:12:0a 0
1 304 00:00:5e:00:01:64 0
1 370 00:00:5e:00:01:01 0
1 306 00:04:96:98:1a:be 0
1 307 00:04:96:98:1a:be 0
1 371 00:00:5e:00:01:01 0
1 369 00:00:5e:00:01:01 0
1 307 00:04:96:8f:ca:d6 0
1 305 00:00:5e:00:01:64 0
1 305 00:00:5e:00:02:64 0
1 372 00:00:5e:00:01:01 0
1 306 00:00:5e:00:01:64 0
1 306 00:00:5e:00:02:64 0
1 305 00:04:96:8f:ca:d6 0
1 307 00:00:5e:00:01:64 0
1 371 40:a6:77:4b:f7:c5 0
1 370 40:a6:77:4b:07:c5 0
The only difference compared to my replication efforts is that there is more
traffic than just the traffic generator's. I've tested enabling LLDP and
RSTP on my Juniper, but still had no luck replicating. I've asked Federico to
shut off all extra traffic and try again.
The customer was using the 5.x version of the firmware when doing these tests.
Going over the code, the only real difference between an active-backup bond and
a single non-bonded port is the frequent link checks. So I modified the code to
skip these and see whether it solved the drops, and it did (note this diff is
against 2.6, but the customer ran the actual test with 2.8.0):
diff -r -U5 openvswitch-2.6.1_org/ofproto/bond.c openvswitch-2.6.1/ofproto/bond.c
--- openvswitch-2.6.1_org/ofproto/bond.c 2016-09-28 02:26:58.044647850 -0400
+++ openvswitch-2.6.1/ofproto/bond.c 2017-12-20 02:18:53.902070806 -0500
@@ -1667,11 +1667,12 @@
bond_link_status_update(struct bond_slave *slave)
{
struct bond *bond = slave->bond;
bool up;
- up = netdev_get_carrier(slave->netdev) && slave->may_enable;
+ // up = netdev_get_carrier(slave->netdev) && slave->may_enable;
+ up = true && slave->may_enable;
if ((up == slave->enabled) != (slave->delay_expires == LLONG_MAX)) {
static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
VLOG_INFO_RL(&rl, "interface %s: link state %s",
slave->name, up ? "up" : "down");
if (up == slave->enabled) {
I made one additional change, as I noticed two threads executing the link
checks in parallel (watchdog and bond). So I also tried the change below on its
own, but it did not solve the issue:
diff -p -r -U5 openvswitch-2.6.1_org/lib/netdev-dpdk.c openvswitch-2.6.1/lib/netdev-dpdk.c
--- openvswitch-2.6.1_org/lib/netdev-dpdk.c 2016-09-28 02:26:57.968641194 -0400
+++ openvswitch-2.6.1/lib/netdev-dpdk.c 2017-12-20 04:37:41.599983888 -0500
@@ -600,13 +600,13 @@ dpdk_watchdog(void *dummy OVS_UNUSED)
for (;;) {
ovs_mutex_lock(&dpdk_mutex);
LIST_FOR_EACH (dev, list_node, &dpdk_list) {
ovs_mutex_lock(&dev->mutex);
- if (dev->type == DPDK_DEV_ETH) {
- check_link_status(dev);
- }
+ /* if (dev->type == DPDK_DEV_ETH) { */
+ /* check_link_status(dev); */
+ /* } */
ovs_mutex_unlock(&dev->mutex);
}
ovs_mutex_unlock(&dpdk_mutex);
xsleep(DPDK_PORT_WATCHDOG_INTERVAL);
}
*** Bug 1522625 has been marked as a duplicate of this bug. ***

From https://bugzilla.redhat.com/show_bug.cgi?id=1522625#c35: the driver didn't pick them up, so the packets were dropped.

Hi all, over the past days I worked with the customer on understanding why we struggle to reproduce the issue in-house. All of their environments are production-comparable. One major difference is the noise traffic coming from other devices (L2 switches, multiple VRRPs, LLDP, EDP, etc.). We've analyzed the traffic multiple times without finding anything suspicious. The most "weird" thing is the number of broadcasts, about a hundred per minute.

It turns out that by generating fake random unicast traffic through the ping command, using 5 different sources, we reproduced the issue. Even more telling, it's not the frequency: once we spotted the pattern, we thought that hundreds of broadcasts could have an effect, but generating a few thousand from the same source did not trigger the issue, while increasing the number of sources, each with very few requests per second, made the OVS-DPDK bond drop. The unicast traffic was generated from a third node on the same network.

I'm going to attach the relevant PCAP that matches the initial drops (output below). The pcap has a lot of junk in it, but the compute node that receives the IXIA traffic (and the ARP) sees the right stuff (no VRRP, no ICMPv6, etc.).

Thu Dec 21 17:11:03 CET 2017 dropped:1984334
Thu Dec 21 17:12:14 CET 2017 dropped:1989426
Thu Dec 21 17:13:29 CET 2017 dropped:2002305
Thu Dec 21 17:14:40 CET 2017 dropped:2012338
Thu Dec 21 17:14:42 CET 2017 dropped:2018647
Thu Dec 21 17:15:53 CET 2017 dropped:2036265
Thu Dec 21 17:15:55 CET 2017 dropped:2056995
Thu Dec 21 17:17:06 CET 2017 dropped:2059542
Thu Dec 21 17:19:32 CET 2017 dropped:2084900
Thu Dec 21 17:20:44 CET 2017 dropped:2096477
Thu Dec 21 17:20:45 CET 2017 dropped:2112150
Thu Dec 21 17:23:12 CET 2017 dropped:2113076

I'm going to attach the script too.

Created attachment 1371211 [details]
pcap file
Created attachment 1371212 [details]
wireshark I/O analyzer
Created attachment 1371213 [details]
script for generating random unicast traffic
Created attachment 1371248 [details]
Picture of recreation lab at Red Hat
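The attached monitoring script is not reproduced in this report; a hypothetical equivalent that would produce the timestamped "dropped:" lines shown above could look like the following. The interface name and the statistics key are assumptions about what the original script monitored:

```python
import datetime
import subprocess
import time

def parse_counter(raw: bytes) -> int:
    """Parse ovs-vsctl output such as b'1984334\n' into an int."""
    return int(raw.strip().strip(b'"'))

def get_rx_dropped(iface: str = "dpdk0") -> int:
    """Read the rx_dropped statistic of a DPDK port via ovs-vsctl."""
    out = subprocess.check_output(
        ["ovs-vsctl", "get", "interface", iface, "statistics:rx_dropped"])
    return parse_counter(out)

def watch_drops(iface: str = "dpdk0", interval: float = 1.0) -> None:
    """Print a timestamped line every time the drop counter changes."""
    last = get_rx_dropped(iface)
    while True:
        time.sleep(interval)
        cur = get_rx_dropped(iface)
        if cur != last:
            stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
            print(f"{stamp} dropped:{cur}")
            last = cur
```

You would run `watch_drops("dpdk0")` on the compute node while the test traffic is flowing.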
Finally, we have also been able to replicate this issue in-house at Red Hat.
I've attached a diagram, rhlab.png, that represents the setup we are using.
Switch configuration:
=====================
The switch has three VLANs configured. VLANs 304 and 305, have the respective
VLAN tags configured and are assigned to port dpdk0 and the two test devices.
The second XL710 port, dpdk1, has its own isolated VLAN and the only purpose
is to keep the link up.
Xena (traffic generator):
=========================
The Xena generates a steady 2Mpps stream of 78-byte packets with the following
characteristics:
- SMAC 00:00:dd:c8:12:0a, DMAC 24:6e:96:5d:4a:54, VLAN 304, \
  IP(1.1.1.1/2.2.2.2)/UDP(69/69)
ARP Generator:
==============
This is a system with a basic OVS configuration. Bridge br-provider has a
single physical port connected to the Juniper switch. It also has a single
ovs rule to forward all traffic out of this port(8):
ovs-ofctl add-flow ovs_pvp_br0 "priority=0 actions=output:8").
Then you run the make_drops.sh, which should result in ARP requests reaching
dpdk0.
DUT configuration:
==================
Configuration with the bond interface. Note that the first OpenFlow rule will
reflect all the traffic back on dpdk0:
ovs-vsctl del-br br-provider
ovs-vsctl --may-exist add-br br-provider -- set bridge br-provider datapath_type=netdev
ovs-vsctl --may-exist add-bond br-provider bond2 dpdk0 dpdk1 -- set interface dpdk0 type=dpdk -- set interface dpdk1 type=dpdk -- set port bond2 bond_mode="active-backup"
ovs-vsctl set Interface dpdk0 options:dpdk-devargs=0000:05:00.0
ovs-vsctl set Interface dpdk1 options:dpdk-devargs=0000:05:00.1
ovs-ofctl del-flows br-provider
ovs-ofctl add-flow br-provider "priority=5000,dl_src=00:00:dd:c8:12:0a,dl_dst=24:6e:96:5d:4a:54 actions=mod_dl_src:fa:16:3e:c4:29:42,mod_dl_dst:00:00:dd:c8:12:0a,mod_vlan_vid:305,output:IN_PORT"
ovs-ofctl add-flow br-provider "priority=0 actions=NORMAL"
Configuration without the bond interface:
ovs-vsctl del-br br-provider
ovs-vsctl --may-exist add-br br-provider -- set bridge br-provider datapath_type=netdev
ovs-vsctl --may-exist add-port br-provider dpdk0 -- set interface dpdk0 type=dpdk
ovs-vsctl set Interface dpdk0 options:dpdk-devargs=0000:05:00.0
ovs-ofctl del-flows br-provider
ovs-ofctl add-flow br-provider "priority=5000,dl_src=00:00:dd:c8:12:0a,dl_dst=24:6e:96:5d:4a:54 actions=mod_dl_src:fa:16:3e:c4:29:42,mod_dl_dst:00:00:dd:c8:12:0a,mod_vlan_vid:305,output:IN_PORT"
ovs-ofctl add-flow br-provider "priority=0 actions=NORMAL"
Replication:
============
Configure the DUT with the bond configuration, and start the traffic generator.
After a couple of seconds stop/clear the counters and start again (to avoid any
learning loss). Now, in addition, start the make_drops.sh script, and watch for
packet drops.
Sometimes the drops occur immediately, sometimes it takes a couple of minutes.
I've always seen it fail within 15 minutes.
If you repeat the test without the bond, you will not see the failures.
Longest I've waited was an hour.
You will also not see the failures with the bond interface when using the
patched version of OVS where it does not check for the bond link status.
Are there any updates? Huawei STC requests that we solve this issue this month; otherwise the commercial release for Phase I will be delayed. This is our first priority for now, so we need your full support.

(In reply to Eelco Chaudron from comment #42)
> Finally, we have been able to also replicate this issue in-house at RedHat.
> I've attached a diagram rhlab.png that represents the setup we are using.

Eelco, do you think it's possible to use testpmd with the same configuration as OVS? If so, I am curious whether it shows the same outcome.

I continued debugging where I left off. I thought I noticed a slight increase in the per-packet cycles (pmd-stats-show), and did some profiling of the i40e tx/rx functions, but found nothing odd there.
After looking at the stats, I noticed that the dropped packets end up increasing rte_stats.imissed on the XL710. Increasing the rxq descriptors to the maximum did not help:
ovs-vsctl set Interface dpdk0 options:n_rxq_desc=4096
I did find the following thread:
http://dpdk.org/ml/archives/dev/2017-January/054105.html
However, I do not have any configurable options like this on my Dell PowerEdge; maybe the customer can check whether there is anything similar on his server and try it.
I've been testing with OVS 2.8.0, but to be sure I also tried OVS 2.9.0 with DPDK 17.11; the same problem occurs.
> Eelco, do you think it's possible to use testpmd with the same configuration
> as OVS? If so, I am curious if it shows the same outcome.
You could probably run traffic in loopback in testpmd and continuously issue the "show port info" command, which checks the link status.
Hello, I'm trying to understand the issue, so my questions might not be very accurate. Is it possible to check this without DPDK (without the DPDK version of i40e)? Are there any logs from the i40e, like PMD_DRV_LOG?

(In reply to Mariusz Stachura from comment #57)
> Is it possible to check this with out DPDK (with out DPKD version of i40e).
> Are there any logs from the i40e? Like PMD_DRV_LOG?

Hi Mariusz, this is only an issue with OVS-DPDK, and it is related to link polling. You guys (Intel) are already aware of this and are working on it. You might want to reach out to Michael Brennan if you need more details.

Is this a different issue than the one being worked on with Mike Brennan of the DPDK team? I don't see that interaction in this BZ; should this be linked to it? If we need additional base driver work on this, I will have someone from my team work on it.

Currently, Mike Brennan and Roy Fan Zhang from Intel are working on finding the root cause.

Can we release a workaround if this cannot be fixed by Intel this week? Huawei needs the workaround.

(In reply to Jason Dian from comment #65)
> Can we release a workaround, if we can not fixed by Intel in this week.
> huawei need this workaround.

The workaround is still being discussed upstream, and until it has been approved it cannot be applied to our OVS package. Secondly, the workaround will only apply to OVS 2.8 and up, due to DPDK version-specific dependencies; this version is only available in the fast datapath beta.

In the meantime, Intel has identified the root cause: in a bonded link running in "active-backup" mode, the ovs-vswitchd daemon periodically checks the link status of the ports in a bond; as part of this check, it needs to take a write lock on a hashmap that stores the details of all bonds.
DPDK PMDs take a read lock on the same hashmap when handling an upcall for unrecognized traffic streams, as part of the NORMAL action. The more unrecognized traffic streams that ingress a bonded port, the more frequently the PMDs need to acquire the read lock on the bond hashmap. Since ovs-vswitchd invokes the i40e poll-mode callback function (which can take 20-40ms) while it holds the write lock, the PMDs are unable to acquire the read lock required to process upcalls for ingress packets during that time; this effectively stalls the PMDs, resulting in packet drops. Having said this, fixing it might not be as straightforward, for various reasons. I'll keep updating this BZ as progress is made.

Hi Eelco, very nice to see Intel found the root cause! I've thought about your comment: wouldn't it be better to take the write lock only when a link status change is detected? Thanks, Federico

(In reply to fiezzi from comment #67)
> Very nice to see Intel found the root cause!
> I've thought about your comment, wouldn't be better having a write lock if a
> link status change is detected?

Maybe I am not getting your question, as we do take a write lock for link state detection by hardware in bond_run(); the PMD threads call bond_check_admissibility().

Closing this BZ, as it will be fixed in 7.4 fast datapath through BZ 1559612.

*** This bug has been marked as a duplicate of bug 1559612 ***
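The stall mechanism described in the root-cause analysis above can be illustrated with a small self-contained sketch. This is an analogy, not OVS code: Python's stdlib has no reader-writer lock, so a plain lock stands in, but the effect on the blocked reader is the same, and the arithmetic in the comments shows why a 20-40ms hold overwhelms the NIC's descriptor ring:

```python
import threading
import time

bond_lock = threading.Lock()   # stands in for OVS's bond rwlock
holding = threading.Event()    # signals that the link check holds the lock
STALL = 0.030                  # the i40e link-status poll can take 20-40 ms

def link_check():
    """ovs-vswitchd main thread: holds the lock across the slow NIC poll."""
    with bond_lock:
        holding.set()
        time.sleep(STALL)      # the i40e poll-mode callback runs here

def pmd_upcall() -> float:
    """PMD thread: needs the lock for every NORMAL-action upcall;
    returns how long it was blocked."""
    t0 = time.monotonic()
    with bond_lock:
        pass
    return time.monotonic() - t0

checker = threading.Thread(target=link_check)
checker.start()
holding.wait()                 # ensure the link check grabs the lock first
blocked = pmd_upcall()         # the "PMD" now stalls behind the link check
checker.join()

# While the PMD is blocked, packets keep arriving: at 2 Mpps, a 25 ms stall
# (mid-range of 20-40 ms) means ~50,000 packets, far more than the 4096 rx
# descriptors the NIC can buffer, hence rte_stats.imissed increments.
backlog = int(2_000_000 * 0.025)
print(f"PMD blocked ~{blocked * 1e3:.0f} ms; backlog at 2 Mpps: {backlog} packets")
```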
Description of problem: I have a number of environments with both OSP10z5 and an OSP11z2 as well latest updates and I'm using OVS 2.6.1-16 (due to a previous customer case for jumbo frame). All of the compute nodes have both Intel X520 10Gbps and Intel XL710 40Gbps connected at both 10Gbps and 40Gbps. When two XL710 NICs are bonded with any OVS-DPDK bonding technology (static A/B, LACP A/P, LACP A/A, SLB not tested though) I've observed a fixed 0.01% frame loss on the DPDK interfaces. On the other hand, in the same environment, with the same configurations, using Intel X520 instead of the XL710, there is not even a single frame lost over a 15 hours test with the same bonding. The frame drop happens at DPDK NIC RX level that would suggest not enough PMDs/isolation/tuning but that's not the case given the X520 result as well as the environment setup. Even more weird, if using one XL710 port without any bonding, the frame loss is zero. I run "perf record -F 99 -g -C 2,30 -- sleep 2h" on the PMD threads and there were not even a single interrupt or something is not pmd. The traffic drop happens in a burst, the traffic can run for a few seconds/minutes and then a burst of lost packets will happen and then again the traffic will be stable for a few seconds/minutes before another burst of lost packets. You will see in the following output, that X520 and XL710 are on different NUMA node. During the tests, the VNF was local to the DPDK PHY NUMA node. We even moved the XL710 to the NUMA1 but the same issue has been experienced. The guest is running a proprietary Packet Forwarding VNF using DPDK. The host is RHEL 7.4. It's very important to underline: - No frame drop for over 15 hours on any bonding configuration with X520 - No frame drop for with a single XL710 - The frame loss happens with any bonding config as well as dedicated/non-dedicated PMD as well as with 1 or more QUEUES. The traffic flow is 2x PHY-VM-PHY. 
The traffic generated is 4 Mpps, 2 Mpps, and 1 Mpps, each at frame sizes of 100 bytes (L2) as well as 512, 1024, 1500, and 2048 bytes. The Ethernet payload is generic IP packets. The bonded XL710 drops frames even at 100 Kpps with 100-byte frames...

The host configuration and details follow:

##########################################
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping:              1
CPU MHz:               2400.185
BogoMIPS:              4800.37
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
##########################################
# cat /etc/tuned/cpu-partitioning-variables.conf | grep isolated
isolated_cores=2,4,6,8,10,12,14,16,18,20,22,24,26,3,5,7,9,11,13,15,17,19,21,23,25,27,30,32,34,36,38,40,42,44,46,48,50,52,54,31,33,35,37,39,41,43,45,47,49,51,53,55
##########################################
# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-693.1.1.el7.x86_64 root=UUID=5e3423c9-0507-4f85-96db-022b4a18ef69 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=64 iommu=pt intel_iommu=on isolcpus=2,4,6,8,10,12,14,16,18,20,22,24,26,3,5,7,9,11,13,15,17,19,21,23,25,27,30,32,34,36,38,40,42,44,46,48,50,52,54,31,33,35,37,39,41,43,45,47,49,51,53,55 nohz=on nohz_full=2,4,6,8,10,12,14,16,18,20,22,24,26,3,5,7,9,11,13,15,17,19,21,23,25,27,30,32,34,36,38,40,42,44,46,48,50,52,54,31,33,35,37,39,41,43,45,47,49,51,53,55 rcu_nocbs=2,4,6,8,10,12,14,16,18,20,22,24,26,3,5,7,9,11,13,15,17,19,21,23,25,27,30,32,34,36,38,40,42,44,46,48,50,52,54,31,33,35,37,39,41,43,45,47,49,51,53,55 tuned.non_isolcpus=30000003 intel_pstate=disable nosoftlockup
##########################################
The issue has been experienced with and without SMT (aka HT). The VNF is RT, and SMT is enabled only in this specific case; usually it is not.
##########################################
##########################################
# lspci | grep Eth
01:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
03:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
81:00.0 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
81:00.1 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
82:00.0 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
82:00.1 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
##########################################
# tail /sys/bus/pci/devices/0000\:82\:00.*/numa_node
==> /sys/bus/pci/devices/0000:82:00.0/numa_node <==
1
==> /sys/bus/pci/devices/0000:82:00.1/numa_node <==
1
##########################################
# tail /sys/bus/pci/devices/0000\:03\:00.*/numa_node
==> /sys/bus/pci/devices/0000:03:00.0/numa_node <==
0
==> /sys/bus/pci/devices/0000:03:00.1/numa_node <==
0
##########################################
As I wrote above, the X520 and XL710 have even been exchanged, but the same issue was experienced.
##########################################
##########################################
# driverctl list-overrides
0000:03:00.0 vfio-pci
0000:03:00.1 vfio-pci
0000:82:00.0 vfio-pci
0000:82:00.1 vfio-pci
##########################################
# lspci -vvv -s 0000:03:00.0
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
	Subsystem: Intel Corporation Ethernet Converged Network Adapter XL710-Q2
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 36
	NUMA node: 0
	Region 0: Memory at 91000000 (64-bit, prefetchable) [size=16M]
	Region 3: Memory at 92008000 (64-bit, prefetchable) [size=32K]
	Expansion ROM at 92100000 [disabled] [size=512K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00001000
	Capabilities: [a0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <2us, L1 <16us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [e0] Vital Product Data
		Product Name: XL710 40GbE Controller
		Read-only fields:
			[V0] Vendor specific: FFV18.0.16
			[PN] Part number: KF46X
			[MN] Manufacture ID: 31 30 32 38
			[V1] Vendor specific: DSV1028VPDR.VER2.0
			[V3] Vendor specific: DTINIC
			[V4] Vendor specific: DCM1001FFFFFF2101FFFFFF1202FFFFFF2302FFFFFF1403FFFFFF2503FFFFFF1604FFFFFF2704FFFFFF1805FFFFFF2905FFFFFF1A06FFFFFF2B06FFFFFF1C07FFFFFF2D07FFFFFF1E08FFFFFF2F08FFFFFF
			[V5] Vendor specific: NPY2
			[V6] Vendor specific: PMTA
			[V7] Vendor specific: NMVIntel Corp
			[V8] Vendor specific: L1D0
			[RV] Reserved: checksum good, 1 byte(s) reserved
		Read/write fields:
			[Y1] System specific: CCF1
		End
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
	Capabilities: [140 v1] Device Serial Number 80-de-21-ff-ff-fe-fd-3c
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1a0 v1] Transaction Processing Hints
		Device specific mode supported
		No steering table available
	Capabilities: [1b0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [1d0 v1] #19
	Kernel driver in use: vfio-pci
	Kernel modules: i40e
##########################################
# lspci -vvv -s 0000:82:00.0
82:00.0 Ethernet controller: Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01)
	Subsystem: Intel Corporation 10GbE 2P X520 Adapter
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 187
	NUMA node: 1
	Region 0: Memory at c8100000 (64-bit, non-prefetchable) [size=1M]
	Region 2: I/O ports at 8020 [size=32]
	Region 4: Memory at c8204000 (64-bit, non-prefetchable) [size=16K]
	Expansion ROM at c8280000 [disabled] [size=512K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=4 offset=00000000
		PBA: BAR=4 offset=00002000
	Capabilities: [a0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 <8us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [e0] Vital Product Data
		Product Name: X520 10GbE Controller
		Read-only fields:
			[PN] Part number: G73129
			[MN] Manufacture ID: 31 30 32 38
			[V0] Vendor specific: FFV18.0.16
			[V1] Vendor specific: DSV1028VPDR.VER1.0
			[V3] Vendor specific: DTINIC
			[V4] Vendor specific: DCM10010081D521010081D5
			[V5] Vendor specific: NPY2
			[V6] Vendor specific: PMT12345678
			[V7] Vendor specific: NMVIntel Corp
			[RV] Reserved: checksum good, 3 byte(s) reserved
		End
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
	Capabilities: [140 v1] Device Serial Number a0-36-9f-ff-ff-d8-5e-78
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
		IOVCap:	Migration-, Interrupt Message Number: 000
		IOVCtl:	Enable- Migration- Interrupt- MSE- ARIHierarchy+
		IOVSta:	Migration-
		Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
		VF offset: 128, stride: 2, Device ID: 10ed
		Supported Page Size: 00000553, System Page Size: 00000001
		Region 0: Memory at 000003c000400000 (64-bit, prefetchable)
		Region 3: Memory at 000003c000500000 (64-bit, prefetchable)
		VF Migration: offset: 00000000, BIR: 0
	Kernel driver in use: vfio-pci
	Kernel modules: ixgbe
##########################################
# ovs-vsctl get open_vswitch . other_config
{dpdk-extra="-n 4", dpdk-init="true", dpdk-lcore-mask="30000003", dpdk-socket-mem="4096,4096", pmd-cpu-mask="fc00000fc"}
##########################################
##########################################
##########################################
# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 2:
	isolated : true
	port: dpdk0	queue-id: 0
pmd thread numa_id 0 core_id 30:
	isolated : true
	port: dpdk0	queue-id: 1
pmd thread numa_id 0 core_id 34:
	isolated : false
	port: vhu4d717b90-39	queue-id: 0
	port: vhu066565d2-14	queue-id: 0
pmd thread numa_id 0 core_id 6:
	isolated : false
	port: vhu9ba1d7e7-2b	queue-id: 0
pmd thread numa_id 0 core_id 4:
	isolated : true
	port: dpdk1	queue-id: 0
pmd thread numa_id 0 core_id 32:
	isolated : true
	port: dpdk1	queue-id: 1
##########################################
As you can see, this is the best possible scenario: one CPU core (two PMDs) isolated for each PHY NIC. Each PHY NIC has two queues (drops occur even with only one queue). The VNF is spread over two PMDs (one CPU core), but again, the drops are on the DPDK PHY NIC.
##########################################
##########################################
# ovs-appctl dpif-netdev/pmd-stats-show | grep -A9 -E "core_id (2|4|30|32):"
pmd thread numa_id 0 core_id 2:
	emc hits:6971154401
	megaflow hits:18787
	avg. subtable lookups per hit:1.58
	miss:375
	lost:0
	polling cycles:2726409361980 (25.74%)
	processing cycles:7865260979784 (74.26%)
	avg cycles per packet: 1519.35 (10591670341764/6971203155)
	avg processing cycles per packet: 1128.25 (7865260979784/6971203155)
--
pmd thread numa_id 0 core_id 30:
	emc hits:6971365714
	megaflow hits:10
	avg. subtable lookups per hit:1.00
	miss:404
	lost:0
	polling cycles:2785958014653 (26.32%)
	processing cycles:7799614633899 (73.68%)
	avg cycles per packet: 1518.44 (10585572648552/6971366138)
	avg processing cycles per packet: 1118.81 (7799614633899/6971366138)
--
pmd thread numa_id 0 core_id 4:
	emc hits:52213
	megaflow hits:18823
	avg. subtable lookups per hit:1.00
	miss:650
	lost:0
	polling cycles:7167842875131 (99.98%)
	processing cycles:1346919081 (0.02%)
	avg cycles per packet: 79208814.43 (7169189794212/90510)
	avg processing cycles per packet: 14881.44 (1346919081/90510)
--
pmd thread numa_id 0 core_id 32:
	emc hits:29149
	megaflow hits:3
	avg. subtable lookups per hit:1.00
	miss:238
	lost:0
	polling cycles:7157381605371 (99.98%)
	processing cycles:1257975648 (0.02%)
	avg cycles per packet: 243549130.10 (7158639581019/29393)
	avg processing cycles per packet: 42798.48 (1257975648/29393)
##########################################
# bash check-drop.sh
PHY DPDK0 Stats
rx_65_to_127_packets=14826589744
rx_dropped=104516692
tx_65_to_127_packets=14822039076
tx_dropped=0

PHY DPDK1 Stats
rx_65_to_127_packets=120275495
rx_dropped=2510269
tx_65_to_127_packets=116879771
tx_dropped=0

VHU vhu066565d2-14 Stats
rx_65_to_127_packets=4416367190
rx_dropped=0
tx_dropped=39

VHU vhu4d717b90-39 Stats
rx_65_to_127_packets=611
rx_dropped=0
tx_dropped=33

VHU vhu9ba1d7e7-2b Stats
rx_65_to_127_packets=4416393053
rx_dropped=0
tx_dropped=11
##########################################

Version-Release number of selected component (if applicable):
RH-OSP10z5 and RH-OSP11z2, both using OVS 2.6.1-16 due to a jumbo frame issue with the XL710 (granted through a support exception a few weeks ago).
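The check-drop.sh script referenced above is not attached to this BZ; it presumably polls the per-interface statistics map printed by "ovs-vsctl get interface <port> statistics". A minimal, hypothetical sketch of the parsing step only (the counter names match the output above; the function name is mine):

```python
import re

def parse_ovs_statistics(stats_map: str) -> dict:
    """Parse an OVSDB statistics map string such as
    '{rx_dropped=104516692, tx_dropped=0}' into a dict of int counters."""
    return {key: int(val) for key, val in re.findall(r"([\w-]+)=(\d+)", stats_map)}

# Sample taken from the check-drop.sh output above.
sample = "{rx_65_to_127_packets=14826589744, rx_dropped=104516692, tx_dropped=0}"
stats = parse_ovs_statistics(sample)
print(stats["rx_dropped"])  # 104516692
```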
How reproducible:

##########################################
# X520 A/B bond config
#!/bin/bash
ovs-vsctl --may-exist add-bond br-provider bond3 dpdk2 dpdk3 -- set interface dpdk2 type=dpdk -- set interface dpdk3 type=dpdk -- set port bond3 bond_mode="active-backup"
ovs-vsctl set Interface br-provider mtu_request=9004
ovs-vsctl set Interface dpdk2 mtu_request=9004
ovs-vsctl set interface dpdk2 options:n_rxq="2"
ovs-vsctl set interface dpdk2 other_config:pmd-rxq-affinity="0:3,1:31"
ovs-vsctl set Interface dpdk3 mtu_request=9004
ovs-vsctl set interface dpdk3 options:n_rxq="2"
ovs-vsctl set interface dpdk3 other_config:pmd-rxq-affinity="0:5,1:33"
ovs-appctl dpif-netdev/pmd-rxq-show

# XL710 A/B bond config
#!/bin/bash
ovs-vsctl --may-exist add-bond br-provider bond2 dpdk0 dpdk1 -- set interface dpdk0 type=dpdk -- set interface dpdk1 type=dpdk -- set port bond2 bond_mode="active-backup"
ovs-vsctl set Interface br-provider mtu_request=9004
ovs-vsctl set Interface dpdk0 mtu_request=9004
ovs-vsctl set interface dpdk0 options:n_rxq="2"
ovs-vsctl set interface dpdk0 other_config:pmd-rxq-affinity="0:2,1:30"
ovs-vsctl set Interface dpdk1 mtu_request=9004
ovs-vsctl set interface dpdk1 options:n_rxq="2"
ovs-vsctl set interface dpdk1 other_config:pmd-rxq-affinity="0:4,1:32"
ovs-appctl dpif-netdev/pmd-rxq-show
##########################################

Run some traffic and you should see frame drops.

Actual results:
Frame drops occur even with the simplest active/backup bond on the XL710. Not a single frame was dropped with any of the possible bond configs on the X520 after a 15-hour load test. Not a single frame was dropped using a single XL710 port without bonding.

Expected results:
Even with the simplest active/backup bond, the XL710 should not drop any traffic.

Additional info: