Bug 2179366 - [17.1][DPDK] packet drop caused by vga controller induced latency on memory bus
Summary: [17.1][DPDK] packet drop caused by vga controller induced latency on memory bus
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 17.1
Assignee: Steve Baker
QA Contact: Jason Paroly
URL:
Whiteboard:
Duplicates: 2178958
Depends On:
Blocks: 2293368 2317182
 
Reported: 2023-03-17 14:05 UTC by Miguel Angel Nieto
Modified: 2024-10-11 07:22 UTC (History)
35 users

Fixed In Version: openstack-tripleo-common-15.4.1-1.20230518211051.cbb03c0.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-16 01:14:22 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 880008 0 None MERGED Allow custom console=tty0 argument 2023-05-31 13:39:45 UTC
OpenStack gerrit 880009 0 None MERGED Avoid repeated console= kernel arguments 2023-05-31 13:39:44 UTC
OpenStack gerrit 880010 0 None MERGED Suppress console=tty0 in kernel arguments 2023-06-04 08:46:01 UTC
OpenStack gerrit 881388 0 None MERGED Remove nomodeset from usage 2023-05-09 08:22:34 UTC
OpenStack gerrit 884644 0 None NEW Change DIB_BOOTLOADER_DEFAULT_CMDLINE empty string behaviour 2023-06-06 19:34:38 UTC
OpenStack gerrit 884780 0 None MERGED Ensure non-empty DIB_BOOTLOADER_DEFAULT_CMDLINE 2023-06-04 08:45:56 UTC
Red Hat Bugzilla 1848102 0 urgent CLOSED Restarting a docker container on the compute host introduces latency spikes in a RT KVM VM 2023-10-06 20:41:02 UTC
Red Hat Bugzilla 2139901 1 None None None 2024-02-11 04:29:04 UTC
Red Hat Issue Tracker NFV-2792 0 None None None 2023-03-27 13:57:30 UTC
Red Hat Issue Tracker OSP-23217 0 None None None 2023-03-17 14:06:57 UTC
Red Hat Product Errata RHEA-2023:4577 0 None None None 2023-08-16 01:15:01 UTC

Description Miguel Angel Nieto 2023-03-17 14:05:53 UTC
Description of problem:
Performance is very low when using an E810 NIC in an OVS-DPDK scenario. Even when sending very low throughput, we can see packet loss on the E810 NIC.

***********************
OVS configuration
***********************
    Bridge br-dpdk1
        fail_mode: standalone
        datapath_type: netdev
        Port dpdk3
            Interface dpdk3
                type: dpdk
                options: {dpdk-devargs="0000:3b:00.1"}
        Port patch-provnet-f6ce6a34-0212-4c3d-b69c-7eed2b8f056e-to-br-int
            Interface patch-provnet-f6ce6a34-0212-4c3d-b69c-7eed2b8f056e-to-br-int
                type: patch
                options: {peer=patch-br-int-to-provnet-f6ce6a34-0212-4c3d-b69c-7eed2b8f056e}
        Port br-dpdk1
            Interface br-dpdk1
                type: internal
    Bridge br-dpdk0
        fail_mode: standalone
        datapath_type: netdev
        Port patch-provnet-c2fd3e23-f6db-427b-94ee-6b3749cd1863-to-br-int
            Interface patch-provnet-c2fd3e23-f6db-427b-94ee-6b3749cd1863-to-br-int
                type: patch
                options: {peer=patch-br-int-to-provnet-c2fd3e23-f6db-427b-94ee-6b3749cd1863}
        Port dpdk2
            Interface dpdk2
                type: dpdk
                options: {dpdk-devargs="0000:3b:00.0"}
        Port br-dpdk0
            Interface br-dpdk0
                type: internal

[root@computeovsdpdksriov-r740 tripleo-admin]# lspci | grep 3b
3b:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
3b:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)

**********************
OVS stats
**********************

There are drops on the DPDK ports (dpdk2 and dpdk3) in both OVS bridges. I am not injecting a very high rate (1 Mpps); there should not be any drops.

[root@computeovsdpdksriov-r740 tripleo-admin]# ovs-ofctl dump-ports br-dpdk0
OFPST_PORT reply (xid=0x2): 3 ports
  port  "patch-provnet-c": rx pkts=1874674292, bytes=112480457896, drop=?, errs=?, frame=?, over=?, crc=?
           tx pkts=1888115927, bytes=120839419708, drop=?, errs=?, coll=?
  port  dpdk2: rx pkts=1888134892, bytes=148675414617, drop=434902752, errs=0, frame=?, over=?, crc=?
           tx pkts=1874685878, bytes=119979897034, drop=0, errs=0, coll=?
  port LOCAL: rx pkts=42, bytes=3172, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=88128057, bytes=5640511868, drop=0, errs=0, coll=0

[root@computeovsdpdksriov-r740 tripleo-admin]# ovs-ofctl dump-ports br-dpdk1
OFPST_PORT reply (xid=0x2): 3 ports
  port  "patch-provnet-f": rx pkts=1874876125, bytes=112492567990, drop=?, errs=?, frame=?, over=?, crc=?
           tx pkts=1888349808, bytes=120854388032, drop=?, errs=?, coll=?
  port  dpdk3: rx pkts=1888358904, bytes=148805956909, drop=436718434, errs=0, frame=?, over=?, crc=?
           tx pkts=1874877823, bytes=119992181586, drop=0, errs=0, coll=?
  port LOCAL: rx pkts=42, bytes=3172, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=88544048, bytes=5667134976, drop=0, errs=0, coll=0
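
To correlate the drops with other events on the host, a minimal sketch of a polling loop around the same command (bridge and port names are the ones from this deployment):

```
# Sample the dpdk2 counters once per second; the drop counter only moves in
# bursts, which helps correlate the loss with events elsewhere on the host.
while true; do
    date +%T
    ovs-ofctl dump-ports br-dpdk0 dpdk2
    sleep 1
done
```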


Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230315.n.1


How reproducible:
1. Deploy the performance templates (OVN-DPDK) for E810 testing
2. Run the performance test
3. Observe that performance is less than 0.5 Mpps


Actual results:
Packet loss on the E810 NIC with very low throughput


Expected results:
There should not be packet loss on the E810 with low throughput


Additional info:

Comment 1 Robin Jarry 2023-03-17 15:18:59 UTC
I have made some additional debugging on that platform.

Running testpmd on the same E810 NICs (I had to manually unpack some firmware images to get it to work, see bug #2179024), I get good throughput performance, but there is a small amount of packet loss which eventually leads to very bad zero-loss performance test results.

With X710 NICs on the same machine, same setup, same testpmd configuration, I don't see the issue. It may be related to a DPDK driver problem.

========
SOFTWARE
========

BOOT_IMAGE=/vmlinuz-5.14.0-283.el9.x86_64 root=LABEL=img-rootfs ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M default_hugepagesz=1GB hugepagesz=1G hugepages=64 iommu=pt intel_iommu=on tsx=off isolcpus=2-19,22-39 console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 console=ttyS0,115200 audit=1 nousb skew_tick=1 nohz=on nohz_full=2-19,22-39 rcu_nocbs=2-19,22-39 tuned.non_isolcpus=00300003 intel_pstate=disable nosoftlockup

openvswitch3.0-3.0.0-36.el9fdp.x86_64
dpdk-22.11-1.el9.x86_64

========
HARDWARE
========

NUMA 0
======

Memory: 187GB
2MB hugepages: 0
1GB hugepages: 32

CPUs
----

Model name:                      Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Cores IDs:
0,20    2,22    4,24    6,26    8,28    10,30   12,32   14,34   16,36   18,38

NICs
----

SLOT          DRIVER     IFNAME     MAC                LINK/STATE  SPEED   DEVICE
0000:3b:00.0  ice        ens1f0     40:a6:b7:18:e0:60  1/up        10Gb/s  Ethernet Controller E810-C for QSFP
0000:3b:00.1  ice        ens1f1     40:a6:b7:18:e0:61  1/up        10Gb/s  Ethernet Controller E810-C for QSFP

[root@computeovsdpdksriov-r740 ~]# driver: ice
version: 5.14.0-283.el9.x86_64
firmware-version: 4.20 0x80017785 1.3346.0
expansion-rom-version: 
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

NUMA 1
======

Memory: 188GB
2MB hugepages: 0
1GB hugepages: 32

CPUs
----

Model name:                      Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Cores IDs:
1,21    3,23    5,25    7,27    9,29    11,31   13,33   15,35   17,37   19,39

NICs
----

SLOT          DRIVER  IFNAME  MAC                LINK/STATE  SPEED   DEVICE
0000:af:00.0  i40e    ens4f0  f8:f2:1e:42:5d:70  1/up        10Gb/s  Ethernet Controller X710 for 10GbE SFP+
0000:af:00.1  i40e    ens4f1  f8:f2:1e:42:5d:70  1/up        10Gb/s  Ethernet Controller X710 for 10GbE SFP+
0000:af:00.2  i40e    ens4f2  f8:f2:1e:42:5d:72  1/up        10Gb/s  Ethernet Controller X710 for 10GbE SFP+
0000:af:00.3  i40e    ens4f3  f8:f2:1e:42:5d:73  1/up        10Gb/s  Ethernet Controller X710 for 10GbE SFP+

[root@computeovsdpdksriov-r740 ~]# ethtool -i ens4f2
driver: i40e
version: 5.14.0-283.el9.x86_64
firmware-version: 6.00 0x80003a29 18.3.6
expansion-rom-version: 
bus-info: 0000:af:00.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

========
COMMANDS
========

dpdk-testpmd -l 0,2 -a 0000:3b:00.0 -a 0000:3b:00.1 -n 4 -- --nb-cores=1 -i --rxd=4096 --txd=4096 --forward-mode=mac --eth-peer=0,3c:fd:fe:33:a8:44 --eth-peer=1,3c:fd:fe:33:a8:46

dpdk-testpmd -l 1,3 -a 0000:af:00.2 -a 0000:af:00.3 -n 4 -- --nb-cores=1 -i --rxd=4096 --txd=4096 --forward-mode=mac --eth-peer=0,3c:fd:fe:33:a8:44 --eth-peer=1,3c:fd:fe:33:a8:46

====
TREX
====

./ndr --stl --opt-bin-search --bi-dir --ports 0 1 -v --profile stl/udp_1pkt.py

With E810:

{'BW per core [Gbit/sec @100% per core]': '12.75Gbit/Sec @100% per core',
 'CPU Utilization [%]': '1.29',
 'Drop Rate [%]': '0.03',
 'Elapsed Time [Sec]': '350.62',
 'Line Utilization [%]': '4.3',
 'Max Rate [bps]': '7.63 Gbps',
 'NDR points': ['655.56 Mbps'],
 'OPT RX Rate [bps]': '327.73 Mbps',
 'OPT Rate (Multiplier) [%]': '4.29604435115832',
 'OPT TX Rate [bps]': '327.78 Mbps',
 'Queue Full [%]': '0.0%',
 'RX [MPPS]': '1.28 MPPS',
 'RX [bps]': '655.46 Mbps',
 'TX [MPPS]': '1.28 MPPS',
 'TX [bps]': '655.56 Mbps',
 'Title': 'udp_1pkt',
 'Total Iterations': 6,
 'Total RX L1': '860.29 Mbps',
 'Total TX L1': '860.42 Mbps',
 'latency': {},
 'valid_latency': True}


With X710:

{'BW per core [Gbit/sec @100% per core]': '12.4Gbit/Sec @100% per core',
 'CPU Utilization [%]': '22.78',
 'Drop Rate [%]': '0.04',
 'Elapsed Time [Sec]': '210.36',
 'Line Utilization [%]': '74.16',
 'Max Rate [bps]': '7.62 Gbps',
 'NDR points': ['11.3 Gbps'],
 'OPT RX Rate [bps]': '5.65 Gbps',
 'OPT Rate (Multiplier) [%]': '74.18179379040447',
 'OPT TX Rate [bps]': '5.65 Gbps',
 'Queue Full [%]': '0.0%',
 'RX [MPPS]': '22.07 MPPS',
 'RX [bps]': '11.3 Gbps',
 'TX [MPPS]': '22.07 MPPS',
 'TX [bps]': '11.3 Gbps',
 'Title': 'udp_1pkt',
 'Total Iterations': 2,
 'Total RX L1': '14.83 Gbps',
 'Total TX L1': '14.83 Gbps',
 'latency': {},
 'valid_latency': True}

The throughput performance is comparable with both NICs (~22M pkts/s).

Comment 2 David Marchand 2023-03-21 08:11:59 UTC
I had a look at this system with rjarry, summarizing here what we did yesterday.


We can see some random imiss on the physical ports, even at low rate.
Yet, the testpmd datapath thread does not seem to be interrupted by any kernel task or interrupt (no context switches in /proc/XXX/task/YYY/status, no interrupt count changing in /proc/interrupts during the test, nothing that stands out in perf record).
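
For reference, a minimal sketch of those checks (the PID/TID below is hypothetical; substitute the testpmd datapath thread):

```
# Snapshot the PMD thread's context-switch counters and /proc/interrupts,
# run traffic for a while, then compare; constant counters mean the thread
# was not preempted and no extra interrupts landed on its CPU.
pid=217762; tid=217763        # hypothetical PID/TID of the rte-worker thread
grep ctxt_switches /proc/$pid/task/$tid/status
cp /proc/interrupts /tmp/interrupts.before
sleep 60                      # traffic test runs meanwhile
grep ctxt_switches /proc/$pid/task/$tid/status
diff /tmp/interrupts.before /proc/interrupts | head -n 40
```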

My current gut feeling is that there is some issue at the hw level: cpu, pci, nic... ?

The NIC firmware and loaded DDP seem recent enough.
This NIC is plugged into an x8 slot but I don't see a reason for random hiccups because of it.

We looked at the cpu side.
One odd thing is that cpupower reports that the core (running this thread) is not spending 100% of its time in C0.
Yet, looking at this core's /sys/ C-state counters, there is no change in them during the test.
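
A minimal sketch of reading those per-core C-state counters (CPU 2 is the core running the PMD thread here):

```
# Dump the cpuidle usage/time counters for CPU 2; if they do not move during
# the test, the core never actually left C0 despite what cpupower reports.
for s in /sys/devices/system/cpu/cpu2/cpuidle/state*; do
    printf '%-8s usage=%-10s time=%s\n' "$(cat $s/name)" "$(cat $s/usage)" "$(cat $s/time)"
done
```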

Robin disabled turbo boost, which was enabled in the BIOS, but this did not change the outcome much.

Since a NIC on this same system works fine on the other NUMA node, switching the NICs could be something to try, but again, I have no decisive clue about the issue.

Comment 3 David Marchand 2023-03-21 10:20:26 UTC
Additionally, something to try would be to (ex)change the cables for the E810 NIC.

Comment 4 David Marchand 2023-03-21 10:21:54 UTC
I reproduced the issue on this system with the dpdk 20.11 package, which seems to support the idea that this is not a software issue.

Comment 5 Miguel Angel Nieto 2023-03-23 08:31:13 UTC
I deployed OSP 16.2 and I have not been able to reproduce the issue. I get much higher performance (2.5 Mpps), so I think this is not related to the hardware or cables. It looks like something related to the kernel, drivers, or DPDK; I would say it is not hardware.

This is the version I have tested:

RHOS-16.2-RHEL-8-20230301.n.1
Red Hat Enterprise Linux release 8.4 (Ootpa)
openvswitch2.15-2.15.0-130.el8fdp.x86_64
dpdk-20.11-4.el8_4.x86_64

Comment 6 Robin Jarry 2023-03-24 16:13:34 UTC
After some more debugging today, I may have found what is causing the intermittent packet loss. The eno1 interface (which is not used in the deployment nor related in any way to the E810 ports) is up and receives vrrp traffic. This causes random "martian packet" logs in the kernel buffer:

[88487.734192] IPv4: martian source 10.46.174.3 from 10.46.174.126, on dev eno1
[88487.826907] ll header: 00000000: ff ff ff ff ff ff 00 00 5e 00 01 01 08 06

I have made a direct correlation between these martian packet logs and the packet loss in the E810 NIC using testpmd.

0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
[88188.539303] IPv4: martian source 10.46.174.39 from 10.46.174.126, on dev eno1
[88188.633818] ll header: 00000000: ff ff ff ff ff ff 00 00 5e 00 01 01 08 06
---
0: RX=4.9M pkt/s DROP=21.5K pkt/s TX=4.9M pkt/s
1: RX=4.9M pkt/s DROP=21.6K pkt/s TX=4.9M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
[88196.057719] IPv4: martian source 10.46.174.6 from 10.46.174.126, on dev eno1
[88196.150947] ll header: 00000000: ff ff ff ff ff ff 00 00 5e 00 01 01 08 06
---
0: RX=4.9M pkt/s DROP=1.0K pkt/s TX=4.9M pkt/s
1: RX=4.9M pkt/s DROP=1.0K pkt/s TX=4.9M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s

Running tcpdump on eno1 shows these packets:

[root@computeovsdpdksriov-r740 ~]# tcpdump --no-promiscuous-mode -nnei eno1
16:03:48.938699 00:00:5e:00:01:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.46.174.25 tell 10.46.174.126, length 46
16:03:58.557135 00:00:5e:00:01:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.46.174.40 tell 10.46.174.126, length 46
16:04:08.171754 00:00:5e:00:01:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.46.174.7 tell 10.46.174.126, length 46

When enabling promisc, we see all the vrrp traffic:

[root@computeovsdpdksriov-r740 ~]# tcpdump -nnei eno1
16:05:03.781562 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:04.270320 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:04.627667 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:05.048363 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:05.559675 54:1e:56:0c:c6:c0 > 01:80:c2:00:00:00, 802.3, length 39: LLC, dsap STP (0x42) Individual, ssap STP (0x42) Command, ctrl 0x03: STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 81b9.54:1e:56:0c:c6:c0.81f3, length 36
16:05:05.607348 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:05.901500 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:06.372737 08:05:e2:75:7b:20 > 01:00:5e:00:00:0d, ethertype IPv4 (0x0800), length 72: 10.46.174.125 > 224.0.0.13: PIMv2, Hello, length 38
16:05:06.442372 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:06.877408 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:07.301527 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:07.389763 54:1e:56:0c:c6:c0 > 01:80:c2:00:00:00, 802.3, length 39: LLC, dsap STP (0x42) Individual, ssap STP (0x42) Command, ctrl 0x03: STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 81b9.54:1e:56:0c:c6:c0.81f3, length 36
16:05:07.658379 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:08.280440 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:08.489423 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:08.551678 08:05:e2:75:7b:20 > 33:33:00:00:00:0d, ethertype IPv6 (0x86dd), length 114: fe80::a05:e201:b975:7b20 > ff02::d: PIMv2, Hello, length 60
16:05:09.066432 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:09.272567 00:00:5e:00:01:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.46.174.2 tell 10.46.174.126, length 46
16:05:09.285456 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20

For some obscure reason, which is yet to be discovered, disabling the log_martians option on the "all" interface fixes the random packet loss on the DPDK ports.

echo 0 > /proc/sys/net/ipv4/conf/all/log_martians

This really does not make any sense. The "martian packet" logs also show up on the 16.2 deployment and they do not affect DPDK performance there. Yesterday, we also tried updating a 16.2 compute node to a RHEL 9.2 kernel and there was no packet drop, in spite of the "martian packet" logs.

Maybe the kernel folks could shine a light on this?

Comment 7 Robin Jarry 2023-03-24 16:15:27 UTC
Obviously, setting the eno1 interface down also fixes the issue.

Comment 8 Miguel Angel Nieto 2023-03-27 06:47:14 UTC
Could it be that, for some reason, CPUs are interrupted when receiving these packets, and that this causes the packet losses?

Comment 9 Robin Jarry 2023-03-28 15:28:38 UTC
Here are some additional observations:

None of the {non,}voluntary_ctxt_switches stats are increasing on the dpdk
lcore bound thread when I observe packet drop.

voluntary_ctxt_switches:        1
nonvoluntary_ctxt_switches:     283 <--- remains mostly constant

For the next two outputs, I used this script for reference:
https://git.sr.ht/~rjarry/dotfiles/tree/main/item/bin/irqstat

irqbalance does not seem able to properly isolate the requested CPUs. This may
be due to an excessive number of VFs compared to the number of available slots
in the interrupt controllers (max 224 per core). However, the specific
interface that is causing the "martian source" messages is not assigned to CPU 2.

# ./irqstat -c 2
IRQ                                                                            AFFINITY  EFFECTIVE-CPUs
205:i40e-eno2-TxRx-16            0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38               2
218:i40e-eno2-TxRx-29            0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38               2
251:mlx5_comp1@pci:0000:5f:00.2                                                       2               2
263:mlx5_comp1@pci:0000:5f:00.3                                                       2               2
275:mlx5_comp1@pci:0000:5f:00.4                                                       2               2
311:mlx5_comp1@pci:0000:5f:00.7                                                       2               2
323:mlx5_comp1@pci:0000:5f:01.0                                                       2               2
655:mlx5_comp1@pci:0000:5f:01.4                                                       2               2
691:mlx5_comp1@pci:0000:5f:01.7                                                       2               2
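
As a side note, a minimal sketch of moving one of these stray IRQs off the isolated core by hand (the IRQ number and housekeeping CPU list below are only examples taken from this output, not a validated remediation):

```
# Check where IRQ 205 (i40e-eno2-TxRx-16) is allowed to fire, then restrict it
# to the housekeeping CPUs so it can no longer land on the isolated PMD core 2.
cat /proc/irq/205/smp_affinity_list
echo 0-1,20-21 > /proc/irq/205/smp_affinity_list
```

Note that irqbalance may rewrite the affinity again unless it is configured to leave that IRQ alone.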

Also, there are no eno1 interrupts firing on the lcore CPU 2. The interrupt
counters (that are wrongfully bound to CPU2) remain mostly constant:

# ./irqstat -se -c 2
IRQ                              CPU-2
205:i40e-eno2-TxRx-16               63
218:i40e-eno2-TxRx-29               49
251:mlx5_comp1@pci:0000:5f:00.2    132
263:mlx5_comp1@pci:0000:5f:00.3    144
275:mlx5_comp1@pci:0000:5f:00.4    134
287:mlx5_comp1@pci:0000:5f:00.5     11
299:mlx5_comp1@pci:0000:5f:00.6     11
311:mlx5_comp1@pci:0000:5f:00.7    124
323:mlx5_comp1@pci:0000:5f:01.0    460
619:mlx5_comp1@pci:0000:5f:01.1     12
631:mlx5_comp1@pci:0000:5f:01.2      6
643:mlx5_comp1@pci:0000:5f:01.3      8
667:mlx5_comp1@pci:0000:5f:01.5     10
679:mlx5_comp1@pci:0000:5f:01.6     10
691:mlx5_comp1@pci:0000:5f:01.7    140
703:mlx5_comp1@pci:0000:5f:02.0     11
715:mlx5_comp1@pci:0000:5f:02.1      7
727:mlx5_comp1@pci:0000:5f:02.2      5
739:mlx5_comp1@pci:0000:5f:02.3      7
751:mlx5_comp1@pci:0000:5f:02.4      3
763:mlx5_comp1@pci:0000:5f:02.5      2

Here are some excerpts from perf captures on the lcore CPU:

> rte-worker-2 217762 [002] 57183.022352:     608650   cycles:
>                    49613 ice_recv_pkts_vec_avx2+0x133 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>                    795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
> 
> rte-worker-2 217762 [002] 57183.022608:     611186   cycles: 
>                    796ce pkt_burst_mac_forward.lto_priv.0+0x17e (/usr/bin/dpdk-testpmd)
> 
> rte-worker-2 217762 [002] 57183.022916:     609154   cycles: 
>                    49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>                    795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
> 
> rte-worker-2 217762 [002] 57183.023265:     594908   cycles: 
>         ffffffffa9371cb7 vprintk_emit+0x187 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>                    49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>                    795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
> 
> [REPEATS 25 TIMES]
> 
> rte-worker-2 217762 [002] 57183.029195:     599669   cycles: 
>         ffffffffa9371cbe vprintk_emit+0x18e ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>                    49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>                    795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
> 
> rte-worker-2 217762 [002] 57183.029682:     599248   cycles: 
>         ffffffffa98d60b7 io_serial_out+0x17 ([kernel.kallsyms])
>         ffffffffa98da5e8 serial8250_console_write+0x208 ([kernel.kallsyms])
>         ffffffffa93705ef console_emit_next_record.constprop.0+0x13f ([kernel.kallsyms])
>         ffffffffa9370825 console_unlock+0x105 ([kernel.kallsyms])
>         ffffffffa9371cd2 vprintk_emit+0x1a2 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>                    49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>                    795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
> 
> [REPEATS 20 TIMES]
> 
> rte-worker-2 217762 [002] 57183.034892:     613613   cycles: 
>         ffffffffa98d6095 io_serial_in+0x15 ([kernel.kallsyms])
>         ffffffffa98d678d wait_for_lsr+0x3d ([kernel.kallsyms])
>         ffffffffa98da77c serial8250_console_write+0x39c ([kernel.kallsyms])
>         ffffffffa93705ef console_emit_next_record.constprop.0+0x13f ([kernel.kallsyms])
>         ffffffffa9370825 console_unlock+0x105 ([kernel.kallsyms])
>         ffffffffa9371cd2 vprintk_emit+0x1a2 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>                    49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>                    795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
> 
> rte-worker-2 217762 [002] 57183.035142:     613387   cycles: 
>         ffffffffa97c2163 delay_tsc+0x33 ([kernel.kallsyms])
>         ffffffffa98d6775 wait_for_lsr+0x25 ([kernel.kallsyms])
>         ffffffffa98da77c serial8250_console_write+0x39c ([kernel.kallsyms])
>         ffffffffa93705ef console_emit_next_record.constprop.0+0x13f ([kernel.kallsyms])
>         ffffffffa9370825 console_unlock+0x105 ([kernel.kallsyms])
>         ffffffffa9371cd2 vprintk_emit+0x1a2 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>                    49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>                    795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
> 
> rte-worker-2 217762 [002] 57183.147546:    1521803   cycles: 
>                    49785 ice_recv_pkts_vec_avx2+0x2a5 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>                    795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
> 
> rte-worker-2 217762 [002] 57183.148160:    1377089   cycles: 
>                    494e5 ice_recv_pkts_vec_avx2+0x5 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
> 
> rte-worker-2 217762 [002] 57183.148714:    1275019   cycles: 
>                    82477 run_pkt_fwd_on_lcore+0x97 (/usr/bin/dpdk-testpmd)
>         8148535554415541 [unknown] ([unknown])
> 
> rte-worker-2 217762 [002] 57183.149229:    1116662   cycles: 
>                    4d5c2 ice_xmit_pkts_vec_avx2+0x7b2 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>                    79760 pkt_burst_mac_forward.lto_priv.0+0x210 (/usr/bin/dpdk-testpmd)
>                168690980 [unknown] ([unknown])
> 
> rte-worker-2 217762 [002] 57183.149679:    1044895   cycles: 
>                    796d7 pkt_burst_mac_forward.lto_priv.0+0x187 (/usr/bin/dpdk-testpmd)
> 
> rte-worker-2 217762 [002] 57183.150101:     940919   cycles: 
>                    7970b pkt_burst_mac_forward.lto_priv.0+0x1bb (/usr/bin/dpdk-testpmd)
> 
> rte-worker-2 217762 [002] 57183.150480:     892927   cycles: 
>                    79556 pkt_burst_mac_forward.lto_priv.0+0x6 (/usr/bin/dpdk-testpmd)
>         8148535554415541 [unknown] ([unknown])

And on the CPUs that are bound to the eno1 interrupts at the time the martian
source packets are printed on the console:

> swapper     0 [000] 58566.812242:     409661   cycles:
>         ffffffffa98d6095 io_serial_in+0x15 ([kernel.kallsyms])
>         ffffffffa98d678d wait_for_lsr+0x3d ([kernel.kallsyms])
>         ffffffffa98da77c serial8250_console_write+0x39c ([kernel.kallsyms])
>         ffffffffa93705ef console_emit_next_record.constprop.0+0x13f ([kernel.kallsyms])
>         ffffffffa9370825 console_unlock+0x105 ([kernel.kallsyms])
>         ffffffffa9371cd2 vprintk_emit+0x1a2 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa9d21bbf ip_handle_martian_source.cold+0x17 ([kernel.kallsyms])
>         ffffffffa9b8e507 __mkroute_input+0x337 ([kernel.kallsyms])
>         ffffffffa9b8fa3e ip_route_input_slow+0x66e ([kernel.kallsyms])
>         ffffffffa9b900ed ip_route_input_noref+0x4d ([kernel.kallsyms])
>         ffffffffa9bd87e9 arp_process+0x499 ([kernel.kallsyms])
>         ffffffffa9ae739d __netif_receive_skb_list_core+0x26d ([kernel.kallsyms])
>         ffffffffa9ae7a36 netif_receive_skb_list_internal+0x1c6 ([kernel.kallsyms])
>         ffffffffa9ae819f napi_complete_done+0x6f ([kernel.kallsyms])
>         ffffffffc0df27bc i40e_napi_poll+0x19c ([kernel.kallsyms])
>         ffffffffa9ae82f7 __napi_poll+0x27 ([kernel.kallsyms])
>         ffffffffa9ae87d3 net_rx_action+0x233 ([kernel.kallsyms])
>         ffffffffaa0000c7 __softirqentry_text_start+0xc7 ([kernel.kallsyms])
>         ffffffffa92fe1d5 __irq_exit_rcu+0xb5 ([kernel.kallsyms])
>         ffffffffa9d265c0 common_interrupt+0x80 ([kernel.kallsyms])
>         ffffffffa9e00ce2 asm_common_interrupt+0x22 ([kernel.kallsyms])
>         ffffffffa982c06f mwait_idle_with_hints.constprop.0+0x4f ([kernel.kallsyms])
>         ffffffffa9d37170 intel_idle_irq+0x20 ([kernel.kallsyms])
>         ffffffffa9a637e6 cpuidle_enter_state+0x86 ([kernel.kallsyms])
>         ffffffffa9a63ba9 cpuidle_enter+0x29 ([kernel.kallsyms])
>         ffffffffa934e45c cpuidle_idle_call+0x12c ([kernel.kallsyms])
>         ffffffffa934e56b do_idle+0x7b ([kernel.kallsyms])
>         ffffffffa934e789 cpu_startup_entry+0x19 ([kernel.kallsyms])
>         ffffffffa9d2916a rest_init+0xca ([kernel.kallsyms])
>         ffffffffab6124d0 arch_call_rest_init+0xa ([kernel.kallsyms])
>         ffffffffab61298c start_kernel+0x49e ([kernel.kallsyms])
>         ffffffffa920015a secondary_startup_64_no_verify+0xe5 ([kernel.kallsyms])

Removing console=tty0 fixes packet drops and gets rid of the "printk"
interrupts on CPU 2.

I don't know what could be causing these "printk" interruptions on isolated
cores. But they seem directly correlated with actual printk on serial console
triggered by martian packets during hardware RX interrupts on other CPUs.
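
For reference, a minimal sketch of how the captures above can be reproduced (CPU 2 is the lcore CPU here; the duration is arbitrary):

```
# Record stack samples on CPU 2 only for 10 seconds, then print the samples
# with their call chains, as excerpted above.
perf record -C 2 -g -- sleep 10
perf script
```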

Comment 10 Robin Jarry 2023-03-28 15:49:04 UTC
Out of curiosity, I ran this as well:

> # ./irqstat -se | grep -e ^CPU -e '^2 '
> CPU     NMI      LOC  SPU     PMI    IWI  RTR    RES     CAL    TLB  TRM  THR  DFR  MCE  MCP  PIN  NPI  PIW
> 2    104394    75127    0  104394     57    0     10   17736      0    0    0    0    0    2    0    0    0

These additional interrupts cannot be "bound" to specific CPUs.

When console=tty0 is removed from the boot command line, none of these counters are increasing.

When console=tty0 is present, the CAL (Function Call Interrupts) counter increases,
but not in direct correlation with the "martian source" logs or the packet drops.

Comment 11 Robin Jarry 2023-03-30 12:46:20 UTC
This does not look specific to the E810. I noticed the same intermittent packet drop when running the same test on a CX-5 NIC on the same NUMA node.

Note that setting eno1 down (where the martian packets are received) fixes the issue. Simply echoing garbage to /dev/kmsg does not trigger packet drop.
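
A minimal sketch of those two quick checks (the interface name is the one from this system):

```
# Taking eno1 down stops the drops entirely, whereas writing an arbitrary
# line straight to the kernel log buffer does not reproduce them.
ip link set eno1 down
echo "test: injected kernel log line" > /dev/kmsg
```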

Comment 12 Marcelo Ricardo Leitner 2023-04-04 11:20:48 UTC
Can someone please elaborate on why this is marked as a Regression and (with it) why this would block 17.1?
I mean, did this used to work with 16.2?

Comment 13 Vadim Khitrin 2023-04-04 11:54:32 UTC
Hi Marcelo,
Yes, in the 16.2 release, we had great performance numbers for NFV workloads.
For NFV customers performance is super critical.

Comment 14 Antoine Tenart 2023-04-05 11:06:16 UTC
We narrowed down the packets dropping issue with Robin.

What happens is that packets are dropped when fbcon sends pixels to the VGA controller. We could reproduce the packet drops not only when receiving martian packets, but with any kernel log message or data written to the virtual console associated with fbcon (here, tty0 > fbcon > vga). Some reproducers below:

1. Triggering martian packets,

```
#!/bin/bash
# Reproducer: trigger "martian source" kernel log messages using two netns.

ns="ns1 ns2"

for n in $ns; do
	ip netns add $n
	ip -net $n link set lo up
done

# Log martian packets in ns2.
ip netns exec ns2 sysctl -q net.ipv4.conf.all.log_martians=1
ip netns exec ns2 sysctl -q net.ipv4.conf.default.log_martians=1

# veth pair between ns1 and ns2; only the ns1 end gets an address.
ip link add veth12 netns ns1 type veth peer name veth21 netns ns2
ip -net ns1 link set veth12 up; ip -net ns2 link set veth21 up
ip -net ns1 a a 10.0.42.1/24 dev veth12

# Put 10.0.42.2 on a dummy interface in ns2, so that the route back to the
# sender points to d0 rather than veth21.
ip -net ns2 link add d0 type dummy
ip -net ns2 link set d0 up
ip -net ns2 a a 10.0.42.2/24 dev d0

sleep 1

# The echo requests arrive on veth21, but their source is only reachable via
# d0, so the kernel logs them as martian sources.
ip netns exec ns1 ping -I veth12 -c2 10.0.42.2 &>/dev/null

for n in $ns; do
	ip netns pids $n | xargs -r kill
	ip netns del $n
done
exit 0
```

2. Generating logs in dmesg, e.g. `ip link set ethX up` (if the interface has a carrier). This also works well with a veth pair and the above script without the ping.

3. Directly writing to the virtual console associated with fbcon: `while read -n 128 foo; do printf '%s\n' "$foo"; done < /dev/urandom > /dev/tty0`.

4. Rotating the fbcon, `echo 1 > /sys/devices/virtual/graphics/fbcon/rotate`.

Etc.

While we can't disable fbcon at boot time, any of the following made the packet drops stop:

1. Disabling the vtcon associated with fbcon, here vtcon1: `echo 0 > /sys/class/vtconsole/vtcon1/bind`. Re-binding the vtcon makes the drops appear again.

2. Removing `console=tty0` from the kernel cmdline and rebooting. This stops console output from being sent to tty0, and ultimately to the VGA controller.

I reached out to Jocelyn from kernel graphics and (in addition to nicely answering our questions) he pointed out one very similar bug they have, bz2139901, impacting specific h/w, which we were using[1]. The graphics folks can probably explain this better than I do, but my understanding is that when pixels are sent to the VGA card, bus transfer latency rises 10x, which prevents CPUs on the same NUMA node from accessing the NIC for a period of time, so some packets can be dropped. This correlates with what Robin saw, where drops were not seen on the other NUMA node.

We did not see the issue previously because 'console=tty0' was not in the kernel cmdline in previous versions of OSP (using RHEL 8). But we did test a RHEL 8.6 kernel with 'console=tty0' and were able to reproduce the packet drops. Robin is looking into the history of this.

My understanding is that this can't be fixed in the kernel; it's a h/w issue. My recommendation would be to remove 'console=tty0' from the kernel cmdline as a workaround. As a side note, the VGA output can still be used; only the console output isn't redirected there. This means a user accessing a virtual terminal on the VGA or an application displaying something can still trigger a similar issue. This is something we also tested, and while we were not able to trigger the same amount of drops as with fbcon, we could see some.
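
A minimal sketch of checking whether a node is affected (standard procfs/sysfs paths, nothing deployment-specific):

```
# List the console= and nomodeset arguments the kernel booted with and the
# console drivers currently active; tty0 in either output means kernel
# messages still go through fbcon to the VGA controller.
tr ' ' '\n' < /proc/cmdline | grep -E '^(console=|nomodeset)'
cat /sys/class/tty/console/active
```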

On a different topic, we briefly performed some tests on the interrupts issue and our *very initial* thought is that this *might* be specific to mlx NICs. Robin offered to perform more tests in this regard and we'll follow up. This will probably be moved to a dedicated bz.

[1] $ lspci | grep VGA
    03:00.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller [102b:0536] (rev 04)

Comment 15 Harald Jensås 2023-04-05 14:25:00 UTC
Not sure why this changed, but the `console=tty0` is in `/etc/default/grub` for OSP 17, but not in OSP 16.

overcloud-full-16.2-20221124.2.x86_64
  $ guestfish -a usr/share/rhosp-director-images/overcloud-full-16.2-20221124.2.x86_64/overcloud-full.qcow2
  ><fs> run
  ><fs> mount /dev/sda /
  ><fs> grep GRUB_CMDLINE_LINUX /etc/default/grub
  GRUB_CMDLINE_LINUX="console=ttyS0 console=ttyS0,115200n81 no_timer_check  crashkernel=auto rhgb quiet"

overcloud-full-17.0-20230111.1.x86_64
  $ guestfish -a usr/share/rhosp-director-images/overcloud-full-17.0-20230111.1.x86_64/overcloud-full.qcow2
  ><fs> run
  ><fs> list-filesystems
  /dev/sda: xfs
  ><fs> mount /dev/sda /
  ><fs> grep GRUB_CMDLINE_LINUX /etc/default/grub
  GRUB_CMDLINE_LINUX="console=ttyS0 console=ttyS0,115200n81 no_timer_check  crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M"
  GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset gfxpayload=text"

overcloud-hardened-uefi-full-17.0-20230111.1.x86_64
  $ guestfish -a usr/share/rhosp-director-images/overcloud-hardened-uefi-full-17.0-20230111.1.x86_64/overcloud-hardened-uefi-full.qcow2 
  ><fs> run
  ><fs> mount /dev/vg/lv_root /
  ><fs> grep GRUB_CMDLINE_LINUX /etc/default/grub
  GRUB_CMDLINE_LINUX="console=ttyS0 console=ttyS0,115200n81 no_timer_check  crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M"
  GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 console=ttyS0,115200 audit=1 nousb"

Comment 17 Robin Jarry 2023-04-05 14:38:49 UTC
If I am not mistaken, this has been that way for a long time: https://opendev.org/openstack/tripleo-common/commit/fc07b696ac2c45c3f914b85b750e471fceb5c390

Comment 18 Robin Jarry 2023-04-05 14:40:07 UTC
The question is: why/how was this console=tty0 removed in earlier versions?

Comment 19 Flavio Leitner 2023-04-05 15:10:47 UTC
Since neither the workaround nor the issue applies to OVS, can we reassign this to the proper component?
Thanks
fbl

Comment 20 Steve Baker 2023-04-05 21:45:48 UTC
I think the reworking of the diskimage-builder bootloader element resulted in default kernel args being set as intended, which includes console=ttyS0.

Assuming that console=ttyS0 is fine in the general case, we could handle this in one of the following ways:
- in documentation to specify custom kernelargs which excludes console=ttyS0, or
- in a DPDK tripleo-ansible kernel args workaround to remove it, or
- in an image-customization workaround to remove it

I can't see the current state of the DPDK documentation for 17.x [1], so I can't recommend the best approach yet.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.0/html/network_functions_virtualization_planning_and_configuration_guide/pr01

Comment 21 Robin Jarry 2023-04-05 21:58:20 UTC
Hi Steve,

There is no issue with ttyS0. The problem is specific to the VGA console: *tty0*. Also, this is not a problem only for DPDK. This bug effectively introduces a 150us memory bus latency on the whole NUMA node where the VGA controller is. This could affect any running VM on any CPU.

Unless I am mistaken, there is no way to fix this in the kernel. The only workaround is to get rid of console=tty0 in all deployments until there is a fix for the affected generations of hardware. Once we have a better view of the affected VGA controllers / CPU models, we plan to issue a KB article.

NB: 17.0 was released *without NFV support* partly because of this bug.

Comment 22 Robin Jarry 2023-04-06 09:26:01 UTC
Here is an interesting test I ran today which clearly shows that writing 128 bytes to the tty0 console every second causes the max thread timer latency to rise significantly:

> ~# timerlat
>                                      Timer Latency
>   0 00:00:12   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
> CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
>   0 #11629     |        1         0         1        41 |        3         2         3       677
>   1 #11625     |        1         0         1        61 |        4         2         4      1328
>   2 #11463     |        1         0         8     83812 |        3         2        17    166085
>   3 #11629     |        1         0         1         1 |        3         3         3         7
>   4 #11629     |        1         1         1         1 |        3         3         3         4
>   5 #11629     |        1         1         1         1 |        3         3         3         4
>   6 #11629     |        1         1         1         1 |        4         3         4         5
>   7 #11629     |        1         1         1         1 |        3         3         3         4
>   8 #11629     |        1         1         1         1 |        4         3         3         4
>   9 #11629     |        1         1         1         1 |        3         3         3         4
>  10 #11628     |        1         1         1         1 |        3         3         3         4
>  11 #11628     |        1         1         1         1 |        3         3         3         4
>  12 #11628     |        1         1         1         1 |        4         3         3         4
>  13 #11628     |        1         1         1         1 |        3         3         3         4
>  14 #11628     |        1         1         1         1 |        3         3         3         4
>  15 #11628     |        1         1         1         1 |        3         3         3         7
>  16 #11629     |        1         1         1         1 |        3         3         3         5
>  17 #11629     |        1         1         1         3 |        3         3         3         6
>  18 #11629     |        1         1         1         1 |        3         3         3         5
>  19 #11629     |        1         1         1         1 |        3         3         3         4
>  20 #11622     |        1         0         1       177 |        3         3         3      1199
>  21 #11627     |        1         0         1         8 |        3         3         3      1416
>  22 #11629     |        1         0         1         3 |        3         3         3         5
>  23 #11629     |        1         1         1         1 |        3         3         3         9
>  24 #11629     |        1         1         1         1 |        3         3         3         6
>  25 #11629     |        1         1         1         1 |        3         3         3         4
>  26 #11629     |        1         1         1         1 |        3         3         3         4
>  27 #11629     |        1         1         1         1 |        3         3         3         4
>  28 #11629     |        1         1         1         1 |        3         3         3         4
>  29 #11629     |        1         1         1         1 |        3         3         3         4
>  30 #11629     |        1         1         1         1 |        3         3         3         4
>  31 #11629     |        1         1         1         1 |        4         3         4         5
>  32 #11629     |        1         1         1         1 |        4         3         3         4
>  33 #11629     |        1         1         1         1 |        3         3         3         4
>  34 #11629     |        1         1         1         1 |        4         3         3         4
>  35 #11628     |        1         1         1         1 |        3         3         3         4
>  36 #11628     |        1         1         1         1 |        4         3         3         4
>  37 #11628     |        1         1         1         1 |        3         3         3         4
>  38 #11628     |        1         1         1         1 |        3         3         3         7
>  39 #11628     |        1         1         1         1 |        3         3         3         4

while read -n 128 toto; do printf '%s\n' "$toto"; sleep 1; done < /dev/urandom > /dev/tty0 &

> ~# timerlat
>                                      Timer Latency
>   0 00:00:34   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
> CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
>   0 #33651     |        1         0         3     85105 |        3         2         8    167377
>   1 #33815     |        1         0         1        61 |        4         2         3      1328
>   2 #33653     |        1         0         3     83812 |        3         2         8    166085
>   3 #33819     |        1         0         1        39 |        3         3         3        72
>   4 #33819     |        1         1         1         1 |        3         3         3         5
>   5 #33819     |        1         1         1        32 |        3         3         3        76
>   6 #33819     |        1         1         1         1 |        4         3         4         5
>   7 #33819     |        1         1         1        38 |        3         3         3        80
>   8 #33819     |        1         1         1         1 |        3         3         3         5
>   9 #33819     |        1         1         1        29 |        3         3         3        81
>  10 #33819     |        1         1         1         1 |        3         3         3         5
>  11 #33819     |        1         1         1        36 |        3         3         3        73
>  12 #33818     |        1         1         1         1 |        3         3         3         5
>  13 #33818     |        1         1         1        81 |        3         3         3       117
>  14 #33818     |        1         1         1         1 |        3         3         3         5
>  15 #33818     |        1         1         1        40 |        3         3         3        80
>  16 #33819     |        1         1         1         1 |        3         3         3         5
>  17 #33819     |        1         1         1         3 |        4         3         3         6
>  18 #33819     |        1         1         1         1 |        3         3         3         5
>  19 #33819     |        1         1         1        62 |        3         3         3        84
>  20 #33812     |        1         0         1       177 |        3         2         3      1199
>  21 #33817     |        1         0         1        61 |        3         2         3      1416
>  22 #33819     |        1         0         1         3 |        3         3         3         5
>  23 #33819     |        1         0         1        34 |        3         3         3        83
>  24 #33819     |        1         1         1         1 |        3         3         3         6
>  25 #33819     |        1         1         1        61 |        3         3         3       113
>  26 #33819     |        1         1         1         1 |        3         3         3         5
>  27 #33819     |        1         1         1        57 |        3         3         3        92
>  28 #33819     |        1         1         1         1 |        3         3         3         5
>  29 #33819     |        1         1         1        48 |        3         3         3        89
>  30 #33819     |        1         1         1         1 |        3         3         3         5
>  31 #33819     |        1         1         1        40 |        4         3         4       111
>  32 #33819     |        1         1         1         1 |        3         3         3         5
>  33 #33819     |        1         1         1        52 |        3         3         3       105
>  34 #33819     |        1         1         1         1 |        4         3         3         5
>  35 #33819     |        1         1         1        66 |        3         3         3       111
>  36 #33819     |        1         1         1         1 |        4         3         3         5
>  37 #33818     |        1         1         1        45 |        3         3         3       116
>  38 #33818     |        1         1         1         1 |        3         3         3         7
>  39 #33818     |        1         1         1        28 |        3         3         3       108

Comment 23 Steve Baker 2023-04-11 02:30:01 UTC
Let's consider the implications of removing console=tty0 from the default kernel arguments[1].

- Virt development and CI environments would not show console output in the graphical console, but would continue to show it in the serial console
- Customers who need to see the kernel boot through an actual video card can enable that temporarily at the grub menu. We could even document this

I'm not seeing much downside to removing console=tty0. Any thoughts from anyone else?

[1] https://opendev.org/openstack/tripleo-common/src/branch/stable/wallaby/image-yaml/overcloud-hardened-images-uefi-python3.yaml#L41

Comment 32 Robin Jarry 2023-04-21 15:30:55 UTC
Hi all,

I have performed some additional debugging after reading BZ 1848102#c110. It turns out that the real culprit is the nomodeset option. **Removing** nomodeset actually fixes the issue, even when console=tty0 is present. I could not reproduce any packet drops and/or latency spikes while writing to the VGA controller (via tty0 or via fb0), nor by rotating the fb console.
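
A minimal sketch of checking which framebuffer and console drivers are bound (standard paths; on this system the DRM driver is mgag200):

```
# With nomodeset removed, /proc/fb should show the mga DRM fbdev emulation
# instead of the firmware-provided efifb, and fbcon appears among the vtcons.
cat /proc/fb
for c in /sys/class/vtconsole/vtcon*; do
    echo "$c: $(cat "$c"/name)"
done
```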

Here is what the kernel docs have to say about nomodeset (I have underlined the important details):

>        nomodeset       Disable kernel modesetting. Most systems' firmware
>                        sets up a display mode and provides framebuffer memory
>                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                        for output. With nomodeset, DRM and fbdev drivers will
>                        not load if they could possibly displace the pre-
>                        initialized output. Only the system framebuffer will
>                        be available for use. The respective drivers will not
>                        perform display-mode changes or accelerated rendering.
>                                                        ^^^^^^^^^^^^^^^^^^^^^
>
>                        Useful as error fallback, or for testing and debugging.
>                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

https://github.com/torvalds/linux/blob/2af3e53a4dc08657f1b46f97f04ff4a0ab3cad8d/Documentation/admin-guide/kernel-parameters.txt#L3813-L3821

This nomodeset option was added, without specific explanation, at the same time as console=tty0: https://opendev.org/openstack/tripleo-common/commit/fc07b696ac2c45c3f914b85b750e471fceb5c390. It is possible that it was left there by mistake.

I tried removing nofb and vga=normal as well. These had absolutely no effect. I didn't even find any documentation about nofb; it is possible that this option has been deprecated for a long time. Since it does not hurt, it can be left there.

This nomodeset idea was raised by Marcelo Tosatti via a separate email thread (quoting him here):

> Wonder if this is not related to the memory type assigned to the memory-mapped I/O region for the graphics card:
> 
> 11.3.2 Choosing a Memory Type
> 
> The simplest system memory model does not use memory-mapped I/O with read or
> write side effects, does not include a frame buffer, and uses the write-back
> memory type for all memory. An I/O agent can perform direct memory access (DMA)
> to write-back memory and the cache protocol maintains cache coherency.
> 
> A system can use strong uncacheable memory for other memory-mapped I/O, and
> should always use strong uncacheable memory for memory-mapped I/O with read
> side effects. Dual-ported memory can be considered a write side effect, making
> relatively prompt writes desirable, because those writes cannot be observed at
> the other port until they reach the memory agent.
> 
> A system can use strong uncacheable, write-through, or write-combining memory
> for frame buffers or dual-ported memory that contains pixel values displayed on
> a screen. Frame buffer memory is typically large (a few megabytes) and is
> usually written more than it is read by the processor. Using strong uncacheable
> memory for a frame buffer gener- ates very large amounts of bus traffic,
> because operations on the entire buffer are implemented using partial writes
> rather than line writes.
> 
> Using write-through memory for a frame buffer can displace almost all other
> useful cached lines in the processor's L2 and L3 caches and L1 data cache.
> Therefore, systems should use write-combining memory for frame buffers whenever
> possible.
> 
> And: https://download.intel.com/design/PentiumII/applnots/24442201.pdf
> 
> /proc/mtrr and /sys/kernel/debug/x86/pat_memtype_list can be used to find the
> memory type assigned to the memory I/O regions (/proc/iomem and lspci can be
> used to map that to specific devices). Along with the combination rules:
> 
> https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-427.html

Here is the info I could extract.

Without nomodeset:

/proc/iomem:
>     91000000-91ffffff : PCI Bus 0000:03
>       91000000-91ffffff : 0000:03:00.0
>         91000000-91ffffff : mgadrmfb_vram
>     92000000-928fffff : PCI Bus 0000:03
>       92000000-927fffff : 0000:03:00.0
>       92808000-9280bfff : 0000:03:00.0
>         92808000-9280bfff : mgadrmfb_mmio

/sys/kernel/debug/x86/pat_memtype_list:
> PAT: [mem 0x0000000091000000-0x0000000092000000] write-combining
> PAT: [mem 0x0000000091000000-0x0000000092000000] write-combining

/proc/mtrr:
> reg00: base=0x000000000 (    0MB), size=262144MB, count=1: write-back

With nomodeset:

/proc/iomem:
>     91000000-91ffffff : PCI Bus 0000:03
>       91000000-91ffffff : 0000:03:00.0
>         91000000-912fffff : efifb
>     92000000-928fffff : PCI Bus 0000:03
>       92000000-927fffff : 0000:03:00.0
>       92808000-9280bfff : 0000:03:00.0

/sys/kernel/debug/x86/pat_memtype_list:
> PAT: [mem 0x0000000091000000-0x0000000091300000] write-combining

/proc/mtrr:
> reg00: base=0x000000000 (    0MB), size=262144MB, count=1: write-back

I'm not sure if any of this is relevant, but from what I can read, efifb memory is mapped when using nomodeset, instead of mgadrmfb_vram. It could very well be a firmware issue that makes writing to this efifb memory very slow; I have no idea. In any case, removing nomodeset makes writing pixels to the VGA controller much faster and it does not hold the memory bus on the whole NUMA node.
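
For completeness, a minimal sketch of collecting the same information on another system (paths as referenced above; debugfs must be mounted for pat_memtype_list):

```
# Locate the VGA controller, then dump the I/O memory map, PAT entries and
# MTRRs to see how its framebuffer BAR is mapped.
lspci | grep -i vga
cat /proc/iomem
cat /sys/kernel/debug/x86/pat_memtype_list
cat /proc/mtrr
```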

In conclusion, I would suggest the following:

1) Remove nomodeset: this option is for debugging only and should NOT be used in production.

2) Remove console=tty0: since nomodeset is removed, we *may* encounter issues with exotic VGA controllers. Let's be safe and avoid logging kernel messages on the VGA console. This has proven to be quite an effective workaround.

3) Limit the printk log level to KERN_EMERG: kernel messages are available via dmesg and journalctl anyway. We might as well avoid logging to the serial consoles to shield against future issues (echo "kernel.printk = 0 4 0 0" > /etc/sysctl.d/95-printk.conf).
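
A minimal sketch of applying these recommendations on an already-deployed node (assumes a BLS/grubby-managed RHEL 9 system; the proper fix on the TripleO side is the change to the image kernel args tracked in the linked gerrit reviews):

```
# Drop the offending arguments from every installed kernel's command line and
# stop routing kernel messages to the consoles, then reboot to apply.
grubby --update-kernel=ALL --remove-args="nomodeset console=tty0"
echo "kernel.printk = 0 4 0 0" > /etc/sysctl.d/95-printk.conf
sysctl -p /etc/sysctl.d/95-printk.conf
reboot
```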

Comment 33 Robin Jarry 2023-04-27 11:55:55 UTC
*** Bug 2178958 has been marked as a duplicate of this bug. ***

Comment 38 Jocelyn Falempe 2023-05-15 09:43:58 UTC
FYI, I also pushed an RFC patch series to use DMA/IRQ with mgag200:

https://patchwork.freedesktop.org/series/117380/

It should have a good impact on latency.

Latency, measured with cyclictest -s -l 10000:
Without DMA:
# /dev/cpu_dma_latency set to 0us
policy: other/other: loadavg: 1.52 0.52 0.33 3/358 2025          
T: 0 ( 1977) P: 0 I:1000 C:  10000 Min:      7 Act:   56 Avg:   85 Max:    2542

With DMA:
# /dev/cpu_dma_latency set to 0us
policy: other/other: loadavg: 1.27 0.48 0.18 2/363 2498          
T: 0 ( 2403) P: 0 I:1000 C:  10000 Min:      8 Act:   62 Avg:   59 Max:     339


I still have some re-work to do before it can be merged upstream.

Comment 39 Marcelo Ricardo Leitner 2023-05-15 13:04:39 UTC
Who should be the assignee now? It's nobody@.

Comment 40 Robin Jarry 2023-05-15 13:18:23 UTC
It looks like the assignee got reset when changing components (tripleo-ansible → openstack-tripleo-common).

Julia, was this intentional?

Comment 41 Julia Kreger 2023-05-15 13:24:17 UTC
Apologies, I'll reset it to Steve Baker. When fixing the component, it auto-reset.

Comment 47 Vadim Khitrin 2023-05-29 08:01:23 UTC
Looking at the shipped overcloud UEFI image in RPM `rhosp-director-images-uefi-x86_64-17.1-20230525.2.el9ost.noarch`, I can still see the kernel args we wanted to remove:

```
><fs> cat /boot/loader/entries/1565da793eb146cb86b44147abba75dc-5.14.0-284.11.1.el9_2.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-284.11.1.el9_2.x86_64) 9.2 (Plow)
version 5.14.0-284.11.1.el9_2.x86_64
linux /vmlinuz-5.14.0-284.11.1.el9_2.x86_64
initrd /initramfs-5.14.0-284.11.1.el9_2.x86_64.img
options root=LABEL=img-rootfs ro console=ttyS0 console=ttyS0,115200n81 no_timer_check  crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M  console=ttyS0,115200 no_timer_check nofb nomodeset gfxpayload=text boot=LABEL=mkfs_boot
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
```

That makes sense, since we still saw low performance in our CI.

Moving back to ASSIGNED. Correct me if I am wrong, but we would also want this change to be propagated to the shipped overcloud images.

Comment 66 Vadim Khitrin 2023-06-08 05:11:08 UTC
Thanks, everyone.
We can confirm that on compose `RHOS-17.1-RHEL-9-20230607.n.2`, the default kernelargs do not include `nomodeset` and `tty0`

```
rpm -qa | grep rhosp-director
rhosp-director-images-base-17.1-20230606.6.el9ost.noarch
rhosp-director-images-metadata-17.1-20230606.6.el9ost.noarch
rhosp-director-images-ipa-x86_64-17.1-20230606.6.el9ost.noarch
rhosp-director-images-x86_64-17.1-20230606.6.el9ost.noarch
rhosp-director-images-17.1-20230606.6.el9ost.noarch
rhosp-director-images-minimal-17.1-20230606.6.el9ost.noarch
rhosp-director-images-uefi-x86_64-17.1-20230606.6.el9ost.noarch
```

```
><fs> cat /boot/loader/entries/73e33ffe5206403fa7854e9b58507275-5.14.0-284.11.1.el9_2.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-284.11.1.el9_2.x86_64) 9.2 (Plow)
version 5.14.0-284.11.1.el9_2.x86_64
linux /vmlinuz-5.14.0-284.11.1.el9_2.x86_64
initrd /initramfs-5.14.0-284.11.1.el9_2.x86_64.img
options root=LABEL=img-rootfs ro console=ttyS0 console=ttyS0,115200n81 no_timer_check  crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M  console=ttyS0,115200 no_timer_check memtest=0 boot=LABEL=mkfs_boot
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
```

Comment 79 errata-xmlrpc 2023-08-16 01:14:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577

