Description of problem:

Performance is very low when using an E810 NIC in an OVS-DPDK scenario. Even when sending a very low throughput we can see packet loss on the E810 NIC.

***********************
OVS configuration
***********************

    Bridge br-dpdk1
        fail_mode: standalone
        datapath_type: netdev
        Port dpdk3
            Interface dpdk3
                type: dpdk
                options: {dpdk-devargs="0000:3b:00.1"}
        Port patch-provnet-f6ce6a34-0212-4c3d-b69c-7eed2b8f056e-to-br-int
            Interface patch-provnet-f6ce6a34-0212-4c3d-b69c-7eed2b8f056e-to-br-int
                type: patch
                options: {peer=patch-br-int-to-provnet-f6ce6a34-0212-4c3d-b69c-7eed2b8f056e}
        Port br-dpdk1
            Interface br-dpdk1
                type: internal
    Bridge br-dpdk0
        fail_mode: standalone
        datapath_type: netdev
        Port patch-provnet-c2fd3e23-f6db-427b-94ee-6b3749cd1863-to-br-int
            Interface patch-provnet-c2fd3e23-f6db-427b-94ee-6b3749cd1863-to-br-int
                type: patch
                options: {peer=patch-br-int-to-provnet-c2fd3e23-f6db-427b-94ee-6b3749cd1863}
        Port dpdk2
            Interface dpdk2
                type: dpdk
                options: {dpdk-devargs="0000:3b:00.0"}
        Port br-dpdk0
            Interface br-dpdk0
                type: internal

[root@computeovsdpdksriov-r740 tripleo-admin]# lspci | grep 3b
3b:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
3b:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)

**********************
OVS stats
**********************

There are RX drops on the DPDK port (dpdk2 and dpdk3) of both OVS bridges. I am not injecting a very high rate (1 Mpps); there should not be drops.

[root@computeovsdpdksriov-r740 tripleo-admin]# ovs-ofctl dump-ports br-dpdk0
OFPST_PORT reply (xid=0x2): 3 ports
  port "patch-provnet-c": rx pkts=1874674292, bytes=112480457896, drop=?, errs=?, frame=?, over=?, crc=?
           tx pkts=1888115927, bytes=120839419708, drop=?, errs=?, coll=?
  port dpdk2: rx pkts=1888134892, bytes=148675414617, drop=434902752, errs=0, frame=?, over=?, crc=?
           tx pkts=1874685878, bytes=119979897034, drop=0, errs=0, coll=?
  port LOCAL: rx pkts=42, bytes=3172, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=88128057, bytes=5640511868, drop=0, errs=0, coll=0

[root@computeovsdpdksriov-r740 tripleo-admin]# ovs-ofctl dump-ports br-dpdk1
OFPST_PORT reply (xid=0x2): 3 ports
  port "patch-provnet-f": rx pkts=1874876125, bytes=112492567990, drop=?, errs=?, frame=?, over=?, crc=?
           tx pkts=1888349808, bytes=120854388032, drop=?, errs=?, coll=?
  port dpdk3: rx pkts=1888358904, bytes=148805956909, drop=436718434, errs=0, frame=?, over=?, crc=?
           tx pkts=1874877823, bytes=119992181586, drop=0, errs=0, coll=?
  port LOCAL: rx pkts=42, bytes=3172, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=88544048, bytes=5667134976, drop=0, errs=0, coll=0

Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230315.n.1

How reproducible:
1. Deploy performance templates (ovn-dpdk) for E810 testing
2. Run the performance test
3. Performance is less than 0.5 Mpps

Actual results:
Packet loss on the E810 NIC with very low throughput

Expected results:
There should not be packet loss on the E810 with low throughput

Additional info:
I have done some additional debugging on that platform. Running testpmd on the same E810 NICs (I had to manually unpack some firmware images to get it to work, see bug #2179024), I get good throughput performance, but there is a small amount of packet loss which eventually leads to very bad zero-loss performance test results. With X710 NICs on the same machine, same setup, same testpmd configuration, I don't see the issue. It may be related to a DPDK driver problem.

======== SOFTWARE ========

BOOT_IMAGE=/vmlinuz-5.14.0-283.el9.x86_64 root=LABEL=img-rootfs ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M default_hugepagesz=1GB hugepagesz=1G hugepages=64 iommu=pt intel_iommu=on tsx=off isolcpus=2-19,22-39 console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 console=ttyS0,115200 audit=1 nousb skew_tick=1 nohz=on nohz_full=2-19,22-39 rcu_nocbs=2-19,22-39 tuned.non_isolcpus=00300003 intel_pstate=disable nosoftlockup

openvswitch3.0-3.0.0-36.el9fdp.x86_64
dpdk-22.11-1.el9.x86_64

======== HARDWARE ========

NUMA 0
======
Memory: 187GB
2MB hugepages: 0
1GB hugepages: 32

CPUs
----
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Cores IDs: 0,20 2,22 4,24 6,26 8,28 10,30 12,32 14,34 16,36 18,38

NICs
----
SLOT         DRIVER IFNAME MAC               LINK/STATE SPEED  DEVICE
0000:3b:00.0 ice    ens1f0 40:a6:b7:18:e0:60 1/up       10Gb/s Ethernet Controller E810-C for QSFP
0000:3b:00.1 ice    ens1f1 40:a6:b7:18:e0:61 1/up       10Gb/s Ethernet Controller E810-C for QSFP

[root@computeovsdpdksriov-r740 ~]#
driver: ice
version: 5.14.0-283.el9.x86_64
firmware-version: 4.20 0x80017785 1.3346.0
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

NUMA 1
======
Memory: 188GB
2MB hugepages: 0
1GB hugepages: 32

CPUs
----
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Cores IDs: 1,21 3,23 5,25 7,27 9,29 11,31 13,33 15,35 17,37 19,39

NICs
----
SLOT         DRIVER IFNAME MAC               LINK/STATE SPEED  DEVICE
0000:af:00.0 i40e   ens4f0 f8:f2:1e:42:5d:70 1/up       10Gb/s Ethernet Controller X710 for 10GbE SFP+
0000:af:00.1 i40e   ens4f1 f8:f2:1e:42:5d:70 1/up       10Gb/s Ethernet Controller X710 for 10GbE SFP+
0000:af:00.2 i40e   ens4f2 f8:f2:1e:42:5d:72 1/up       10Gb/s Ethernet Controller X710 for 10GbE SFP+
0000:af:00.3 i40e   ens4f3 f8:f2:1e:42:5d:73 1/up       10Gb/s Ethernet Controller X710 for 10GbE SFP+

[root@computeovsdpdksriov-r740 ~]# ethtool -i ens4f2
driver: i40e
version: 5.14.0-283.el9.x86_64
firmware-version: 6.00 0x80003a29 18.3.6
expansion-rom-version:
bus-info: 0000:af:00.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

======== COMMANDS ========

dpdk-testpmd -l 0,2 -a 0000:3b:00.0 -a 0000:3b:00.1 -n 4 -- --nb-cores=1 -i --rxd=4096 --txd=4096 --forward-mode=mac --eth-peer=0,3c:fd:fe:33:a8:44 --eth-peer=1,3c:fd:fe:33:a8:46

dpdk-testpmd -l 1,3 -a 0000:af:00.2 -a 0000:af:00.3 -n 4 -- --nb-cores=1 -i --rxd=4096 --txd=4096 --forward-mode=mac --eth-peer=0,3c:fd:fe:33:a8:44 --eth-peer=1,3c:fd:fe:33:a8:46

==== TREX ====

./ndr --stl --opt-bin-search --bi-dir --ports 0 1 -v --profile stl/udp_1pkt.py

With E810:

{'BW per core [Gbit/sec @100% per core]': '12.75Gbit/Sec @100% per core',
 'CPU Utilization [%]': '1.29',
 'Drop Rate [%]': '0.03',
 'Elapsed Time [Sec]': '350.62',
 'Line Utilization [%]': '4.3',
 'Max Rate [bps]': '7.63 Gbps',
 'NDR points': ['655.56 Mbps'],
 'OPT RX Rate [bps]': '327.73 Mbps',
 'OPT Rate (Multiplier) [%]': '4.29604435115832',
 'OPT TX Rate [bps]': '327.78 Mbps',
 'Queue Full [%]': '0.0%',
 'RX [MPPS]': '1.28 MPPS',
 'RX [bps]': '655.46 Mbps',
 'TX [MPPS]': '1.28 MPPS',
 'TX [bps]': '655.56 Mbps',
 'Title': 'udp_1pkt',
 'Total Iterations': 6,
 'Total RX L1': '860.29 Mbps',
 'Total TX L1': '860.42 Mbps',
 'latency': {},
 'valid_latency': True}

With X710:

{'BW per core [Gbit/sec @100% per core]': '12.4Gbit/Sec @100% per core',
 'CPU Utilization [%]': '22.78',
 'Drop Rate [%]': '0.04',
 'Elapsed Time [Sec]': '210.36',
 'Line Utilization [%]': '74.16',
 'Max Rate [bps]': '7.62 Gbps',
 'NDR points': ['11.3 Gbps'],
 'OPT RX Rate [bps]': '5.65 Gbps',
 'OPT Rate (Multiplier) [%]': '74.18179379040447',
 'OPT TX Rate [bps]': '5.65 Gbps',
 'Queue Full [%]': '0.0%',
 'RX [MPPS]': '22.07 MPPS',
 'RX [bps]': '11.3 Gbps',
 'TX [MPPS]': '22.07 MPPS',
 'TX [bps]': '11.3 Gbps',
 'Title': 'udp_1pkt',
 'Total Iterations': 2,
 'Total RX L1': '14.83 Gbps',
 'Total TX L1': '14.83 Gbps',
 'latency': {},
 'valid_latency': True}

The raw throughput is comparable with both NICs (~22M pkt/s); it is the zero-loss (NDR) result that is much worse with the E810.
I had a look at this system with rjarry, summarizing here what we did yesterday.

We can see some random imiss on the physical ports, even at low rate. Yet, the testpmd datapath thread does not seem to be interrupted by any kernel task or interrupt (no context switch in /proc/XXX/task/YYY/status, no interrupt count that changes in /proc/interrupts during the test, nothing that stands out in perf record).

My current gut feeling is that there is some issue at the hw level: cpu, pci, nic...? The NIC firmware and the loaded DDP seem recent enough. This NIC is plugged into an x8 slot, but I don't see how that would cause random hiccups.

We looked at the CPU side. One odd thing is that cpupower reports that the core (running this thread) is not in C0 100% of its time. Yet, looking at this core's C-state counters in /sys/, there is no change in them during the test. Robin disabled turbo boost, which was enabled in the BIOS, but this does not change the outcome much.

Since a NIC on this same system works fine on the other NUMA node, switching the NICs could be something to try, but again, I have no decisive clue about the issue.
Additionally, something else to try would be to change or swap the cables on the E810 NIC.
I reproduced the issue on this system with the dpdk 20.11 package, which seems to support the idea that this is not a software issue.
I deployed OSP 16.2 and I have not been able to reproduce the issue. I get much higher performance (2.5 Mpps), so I think this is not related to the hardware or cables. It looks like something related to the kernel, drivers, or DPDK, but I would say it is not hardware.

This is the version I have tested:
RHOS-16.2-RHEL-8-20230301.n.1
Red Hat Enterprise Linux release 8.4 (Ootpa)
openvswitch2.15-2.15.0-130.el8fdp.x86_64
dpdk-20.11-4.el8_4.x86_64
After some more debugging today, I may have found what is causing the intermittent packet loss.

The eno1 interface (which is not used in the deployment nor related in any way to the E810 ports) is up and receives VRRP traffic. This causes random "martian packet" logs in the kernel buffer:

[88487.734192] IPv4: martian source 10.46.174.3 from 10.46.174.126, on dev eno1
[88487.826907] ll header: 00000000: ff ff ff ff ff ff 00 00 5e 00 01 01 08 06

I have made a direct correlation between these martian packet logs and the packet loss on the E810 NIC using testpmd:

0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s

[88188.539303] IPv4: martian source 10.46.174.39 from 10.46.174.126, on dev eno1
[88188.633818] ll header: 00000000: ff ff ff ff ff ff 00 00 5e 00 01 01 08 06

---
0: RX=4.9M pkt/s DROP=21.5K pkt/s TX=4.9M pkt/s
1: RX=4.9M pkt/s DROP=21.6K pkt/s TX=4.9M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s

[88196.057719] IPv4: martian source 10.46.174.6 from 10.46.174.126, on dev eno1
[88196.150947] ll header: 00000000: ff ff ff ff ff ff 00 00 5e 00 01 01 08 06

---
0: RX=4.9M pkt/s DROP=1.0K pkt/s TX=4.9M pkt/s
1: RX=4.9M pkt/s DROP=1.0K pkt/s TX=4.9M pkt/s
---
0: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s
1: RX=5.0M pkt/s DROP=0.0 pkt/s TX=5.0M pkt/s

Running tcpdump on eno1 shows these packets:

[root@computeovsdpdksriov-r740 ~]# tcpdump --no-promiscuous-mode -nnei eno1
16:03:48.938699 00:00:5e:00:01:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.46.174.25 tell 10.46.174.126, length 46
16:03:58.557135 00:00:5e:00:01:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.46.174.40 tell 10.46.174.126, length 46
16:04:08.171754 00:00:5e:00:01:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.46.174.7 tell 10.46.174.126, length 46

When enabling promisc, we see all the VRRP traffic:

[root@computeovsdpdksriov-r740 ~]# tcpdump -nnei eno1
16:05:03.781562 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:04.270320 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:04.627667 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:05.048363 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:05.559675 54:1e:56:0c:c6:c0 > 01:80:c2:00:00:00, 802.3, length 39: LLC, dsap STP (0x42) Individual, ssap STP (0x42) Command, ctrl 0x03: STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 81b9.54:1e:56:0c:c6:c0.81f3, length 36
16:05:05.607348 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:05.901500 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:06.372737 08:05:e2:75:7b:20 > 01:00:5e:00:00:0d, ethertype IPv4 (0x0800), length 72: 10.46.174.125 > 224.0.0.13: PIMv2, Hello, length 38
16:05:06.442372 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:06.877408 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:07.301527 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:07.389763 54:1e:56:0c:c6:c0 > 01:80:c2:00:00:00, 802.3, length 39: LLC, dsap STP (0x42) Individual, ssap STP (0x42) Command, ctrl 0x03: STP 802.1w, Rapid STP, Flags [Learn, Forward], bridge-id 81b9.54:1e:56:0c:c6:c0.81f3, length 36
16:05:07.658379 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:08.280440 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:08.489423 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20
16:05:08.551678 08:05:e2:75:7b:20 > 33:33:00:00:00:0d, ethertype IPv6 (0x86dd), length 114: fe80::a05:e201:b975:7b20 > ff02::d: PIMv2, Hello, length 60
16:05:09.066432 00:00:5e:00:02:01 > 33:33:00:00:00:12, ethertype IPv6 (0x86dd), length 94: fe80::a05:e201:b975:8a20 > ff02::12: VRRPv3, Advertisement, vrid 1, prio 254, intvl 100cs, length 40
16:05:09.272567 00:00:5e:00:01:01 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 10.46.174.2 tell 10.46.174.126, length 46
16:05:09.285456 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 60: 10.46.174.124 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 254, authtype none, intvl 1s, length 20

For some obscure reason which is yet to be discovered, disabling the log_martians option on the "all" interface fixes the random packet loss on the DPDK ports:

echo 0 > /proc/sys/net/ipv4/conf/all/log_martians

This really does not make any sense. The "martian packet" logs also show up on the 16.2 deployment and they do not affect DPDK performance. Yesterday, we also tried to update a 16.2 compute node with a RHEL 9.2 kernel and there was no packet drop, in spite of the "martian packet" logs.

Maybe the kernel folks could shine a light on this?
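For reference, a minimal sketch of making that workaround persistent across reboots (the sysctl.d file name is an arbitrary example, not something shipped by the product):

```
# Disable martian-packet logging now and keep it disabled after reboot.
sysctl -w net.ipv4.conf.all.log_martians=0
cat > /etc/sysctl.d/95-no-martian-log.conf <<'EOF'
net.ipv4.conf.all.log_martians = 0
net.ipv4.conf.default.log_martians = 0
EOF
sysctl --system
```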
Obviously, setting the eno1 interface down also fixes the issue.
Could it be that, for some reason, the CPUs are interrupted when receiving these packets, causing the packet loss?
Here are some additional observations:

None of the {non,}voluntary_ctxt_switches stats are increasing on the DPDK lcore-bound thread when I observe packet drops:

voluntary_ctxt_switches: 1
nonvoluntary_ctxt_switches: 283 <--- remains mostly constant

For the next two outputs, I used this script for reference:
https://git.sr.ht/~rjarry/dotfiles/tree/main/item/bin/irqstat

irqbalance does not seem able to properly isolate the requested CPUs. This may be due to an excessive number of VFs compared to the number of available slots in the interrupt controllers (max 224 per core). However, the specific interface that is causing the "martian source" messages is not assigned to CPU 2.

# ./irqstat -c 2
IRQ                              AFFINITY                                                EFFECTIVE-CPUs
205:i40e-eno2-TxRx-16            0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38  2
218:i40e-eno2-TxRx-29            0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38  2
251:mlx5_comp1@pci:0000:5f:00.2  2                                                       2
263:mlx5_comp1@pci:0000:5f:00.3  2                                                       2
275:mlx5_comp1@pci:0000:5f:00.4  2                                                       2
311:mlx5_comp1@pci:0000:5f:00.7  2                                                       2
323:mlx5_comp1@pci:0000:5f:01.0  2                                                       2
655:mlx5_comp1@pci:0000:5f:01.4  2                                                       2
691:mlx5_comp1@pci:0000:5f:01.7  2                                                       2

Also, there are no eno1 interrupts firing on the lcore CPU 2. The interrupt counters (that are wrongly bound to CPU 2) remain mostly constant:

# ./irqstat -se -c 2
IRQ                              CPU-2
205:i40e-eno2-TxRx-16            63
218:i40e-eno2-TxRx-29            49
251:mlx5_comp1@pci:0000:5f:00.2  132
263:mlx5_comp1@pci:0000:5f:00.3  144
275:mlx5_comp1@pci:0000:5f:00.4  134
287:mlx5_comp1@pci:0000:5f:00.5  11
299:mlx5_comp1@pci:0000:5f:00.6  11
311:mlx5_comp1@pci:0000:5f:00.7  124
323:mlx5_comp1@pci:0000:5f:01.0  460
619:mlx5_comp1@pci:0000:5f:01.1  12
631:mlx5_comp1@pci:0000:5f:01.2  6
643:mlx5_comp1@pci:0000:5f:01.3  8
667:mlx5_comp1@pci:0000:5f:01.5  10
679:mlx5_comp1@pci:0000:5f:01.6  10
691:mlx5_comp1@pci:0000:5f:01.7  140
703:mlx5_comp1@pci:0000:5f:02.0  11
715:mlx5_comp1@pci:0000:5f:02.1  7
727:mlx5_comp1@pci:0000:5f:02.2  5
739:mlx5_comp1@pci:0000:5f:02.3  7
751:mlx5_comp1@pci:0000:5f:02.4  3
763:mlx5_comp1@pci:0000:5f:02.5  2

Here are some excerpts from perf captures on the lcore CPU:

> rte-worker-2 217762 [002] 57183.022352: 608650 cycles:
>         49613 ice_recv_pkts_vec_avx2+0x133 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>         795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
>
> rte-worker-2 217762 [002] 57183.022608: 611186 cycles:
>         796ce pkt_burst_mac_forward.lto_priv.0+0x17e (/usr/bin/dpdk-testpmd)
>
> rte-worker-2 217762 [002] 57183.022916: 609154 cycles:
>         49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>         795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
>
> rte-worker-2 217762 [002] 57183.023265: 594908 cycles:
>         ffffffffa9371cb7 vprintk_emit+0x187 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>         49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>         795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
>
> [REPEATS 25 TIMES]
>
> rte-worker-2 217762 [002] 57183.029195: 599669 cycles:
>         ffffffffa9371cbe vprintk_emit+0x18e ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>         49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>         795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
>
> rte-worker-2 217762 [002] 57183.029682: 599248 cycles:
>         ffffffffa98d60b7 io_serial_out+0x17 ([kernel.kallsyms])
>         ffffffffa98da5e8 serial8250_console_write+0x208 ([kernel.kallsyms])
>         ffffffffa93705ef console_emit_next_record.constprop.0+0x13f ([kernel.kallsyms])
>         ffffffffa9370825 console_unlock+0x105 ([kernel.kallsyms])
>         ffffffffa9371cd2 vprintk_emit+0x1a2 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>         49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>         795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
>
> [REPEATS 20 TIMES]
>
> rte-worker-2 217762 [002] 57183.034892: 613613 cycles:
>         ffffffffa98d6095 io_serial_in+0x15 ([kernel.kallsyms])
>         ffffffffa98d678d wait_for_lsr+0x3d ([kernel.kallsyms])
>         ffffffffa98da77c serial8250_console_write+0x39c ([kernel.kallsyms])
>         ffffffffa93705ef console_emit_next_record.constprop.0+0x13f ([kernel.kallsyms])
>         ffffffffa9370825 console_unlock+0x105 ([kernel.kallsyms])
>         ffffffffa9371cd2 vprintk_emit+0x1a2 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>         49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>         795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
>
> rte-worker-2 217762 [002] 57183.035142: 613387 cycles:
>         ffffffffa97c2163 delay_tsc+0x33 ([kernel.kallsyms])
>         ffffffffa98d6775 wait_for_lsr+0x25 ([kernel.kallsyms])
>         ffffffffa98da77c serial8250_console_write+0x39c ([kernel.kallsyms])
>         ffffffffa93705ef console_emit_next_record.constprop.0+0x13f ([kernel.kallsyms])
>         ffffffffa9370825 console_unlock+0x105 ([kernel.kallsyms])
>         ffffffffa9371cd2 vprintk_emit+0x1a2 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa945c2c0 irq_work_single+0x20 ([kernel.kallsyms])
>         ffffffffa945c326 irq_work_run_list+0x26 ([kernel.kallsyms])
>         ffffffffa945c354 irq_work_run+0x14 ([kernel.kallsyms])
>         ffffffffa922b1b9 __sysvec_irq_work+0x29 ([kernel.kallsyms])
>         ffffffffa9d26ca7 sysvec_irq_work+0x37 ([kernel.kallsyms])
>         ffffffffa9e00ec6 asm_sysvec_irq_work+0x16 ([kernel.kallsyms])
>         49969 ice_recv_pkts_vec_avx2+0x489 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>         795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
>
> rte-worker-2 217762 [002] 57183.147546: 1521803 cycles:
>         49785 ice_recv_pkts_vec_avx2+0x2a5 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>         795e9 pkt_burst_mac_forward.lto_priv.0+0x99 (/usr/bin/dpdk-testpmd)
>
> rte-worker-2 217762 [002] 57183.148160: 1377089 cycles:
>         494e5 ice_recv_pkts_vec_avx2+0x5 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>
> rte-worker-2 217762 [002] 57183.148714: 1275019 cycles:
>         82477 run_pkt_fwd_on_lcore+0x97 (/usr/bin/dpdk-testpmd)
>         8148535554415541 [unknown] ([unknown])
>
> rte-worker-2 217762 [002] 57183.149229: 1116662 cycles:
>         4d5c2 ice_xmit_pkts_vec_avx2+0x7b2 (/usr/lib64/dpdk-pmds/librte_net_ice.so.23.0)
>         79760 pkt_burst_mac_forward.lto_priv.0+0x210 (/usr/bin/dpdk-testpmd)
>         168690980 [unknown] ([unknown])
>
> rte-worker-2 217762 [002] 57183.149679: 1044895 cycles:
>         796d7 pkt_burst_mac_forward.lto_priv.0+0x187 (/usr/bin/dpdk-testpmd)
>
> rte-worker-2 217762 [002] 57183.150101: 940919 cycles:
>         7970b pkt_burst_mac_forward.lto_priv.0+0x1bb (/usr/bin/dpdk-testpmd)
>
> rte-worker-2 217762 [002] 57183.150480: 892927 cycles:
>         79556 pkt_burst_mac_forward.lto_priv.0+0x6 (/usr/bin/dpdk-testpmd)
>         8148535554415541 [unknown] ([unknown])

And on the CPUs that are bound to the eno1 interrupts at the time the martian source packets are printed on the console:

> swapper 0 [000] 58566.812242: 409661 cycles:
>         ffffffffa98d6095 io_serial_in+0x15 ([kernel.kallsyms])
>         ffffffffa98d678d wait_for_lsr+0x3d ([kernel.kallsyms])
>         ffffffffa98da77c serial8250_console_write+0x39c ([kernel.kallsyms])
>         ffffffffa93705ef console_emit_next_record.constprop.0+0x13f ([kernel.kallsyms])
>         ffffffffa9370825 console_unlock+0x105 ([kernel.kallsyms])
>         ffffffffa9371cd2 vprintk_emit+0x1a2 ([kernel.kallsyms])
>         ffffffffa9cde020 _printk+0x58 ([kernel.kallsyms])
>         ffffffffa9d21bbf ip_handle_martian_source.cold+0x17 ([kernel.kallsyms])
>         ffffffffa9b8e507 __mkroute_input+0x337 ([kernel.kallsyms])
>         ffffffffa9b8fa3e ip_route_input_slow+0x66e ([kernel.kallsyms])
>         ffffffffa9b900ed ip_route_input_noref+0x4d ([kernel.kallsyms])
>         ffffffffa9bd87e9 arp_process+0x499 ([kernel.kallsyms])
>         ffffffffa9ae739d __netif_receive_skb_list_core+0x26d ([kernel.kallsyms])
>         ffffffffa9ae7a36 netif_receive_skb_list_internal+0x1c6 ([kernel.kallsyms])
>         ffffffffa9ae819f napi_complete_done+0x6f ([kernel.kallsyms])
>         ffffffffc0df27bc i40e_napi_poll+0x19c ([kernel.kallsyms])
>         ffffffffa9ae82f7 __napi_poll+0x27 ([kernel.kallsyms])
>         ffffffffa9ae87d3 net_rx_action+0x233 ([kernel.kallsyms])
>         ffffffffaa0000c7 __softirqentry_text_start+0xc7 ([kernel.kallsyms])
>         ffffffffa92fe1d5 __irq_exit_rcu+0xb5 ([kernel.kallsyms])
>         ffffffffa9d265c0 common_interrupt+0x80 ([kernel.kallsyms])
>         ffffffffa9e00ce2 asm_common_interrupt+0x22 ([kernel.kallsyms])
>         ffffffffa982c06f mwait_idle_with_hints.constprop.0+0x4f ([kernel.kallsyms])
>         ffffffffa9d37170 intel_idle_irq+0x20 ([kernel.kallsyms])
>         ffffffffa9a637e6 cpuidle_enter_state+0x86 ([kernel.kallsyms])
>         ffffffffa9a63ba9 cpuidle_enter+0x29 ([kernel.kallsyms])
>         ffffffffa934e45c cpuidle_idle_call+0x12c ([kernel.kallsyms])
>         ffffffffa934e56b do_idle+0x7b ([kernel.kallsyms])
>         ffffffffa934e789 cpu_startup_entry+0x19 ([kernel.kallsyms])
>         ffffffffa9d2916a rest_init+0xca ([kernel.kallsyms])
>         ffffffffab6124d0 arch_call_rest_init+0xa ([kernel.kallsyms])
>         ffffffffab61298c start_kernel+0x49e ([kernel.kallsyms])
>         ffffffffa920015a secondary_startup_64_no_verify+0xe5 ([kernel.kallsyms])

Removing console=tty0 fixes the packet drops and gets rid of the "printk" interrupts on CPU 2. I don't know what could be causing these "printk" interruptions on isolated cores, but they seem directly correlated with actual printk calls on the serial console, triggered by martian packets during hardware RX interrupts on other CPUs.
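For reference, a minimal sketch of how a per-CPU capture like the excerpts above can be taken (the duration and output path are arbitrary examples):

```
# Sample only the isolated lcore CPU (CPU 2 here) with call graphs for 10 seconds,
# then dump the samples with their stack traces.
perf record -C 2 -g -o /tmp/lcore2.perf.data -- sleep 10
perf script -i /tmp/lcore2.perf.data | less
```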
Out of curiosity, I ran this as well:

> # ./irqstat -se | grep -e ^CPU -e '^2 '
> CPU  NMI     LOC    SPU  PMI     IWI  RTR  RES  CAL    TLB  TRM  THR  DFR  MCE  MCP  PIN  NPI  PIW
> 2    104394  75127  0    104394  57   0    10   17736  0    0    0    0    0    2    0    0    0

These additional interrupts cannot be "bound" to specific CPUs. When console=tty0 is removed from the boot command line, none of these counters increase. When console=tty0 is present, the CAL (function call interrupts) counter increases, but not in direct correlation with the "martian source" logs nor with the packet drops.
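A simple way to watch these summary counters without the irqstat helper is to poll the corresponding rows of /proc/interrupts directly; a minimal sketch (the 1-second interval is arbitrary):

```
# IWI = IRQ work interrupts (the irq_work vector seen in the perf traces above),
# CAL = function call interrupts. Each column is one CPU, in the same order as
# the header line of /proc/interrupts.
watch -n1 "grep -E '^ *(CAL|IWI):' /proc/interrupts"
```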
This does not look specific to the E810. I noticed the same intermittent packet drops running the same test on a CX-5 NIC on the same NUMA node. Note that setting eno1 down (where the martian packets are received) fixes the issue. Simply echoing garbage to /dev/kmsg does not trigger packet drops.
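For completeness, a minimal sketch of that /dev/kmsg check (the message text is arbitrary); it injects a record into the kernel ring buffer from userspace, without involving the network stack:

```
# Write a test record to the kernel log while testpmd is forwarding;
# on this system it did not cause any packet drops.
echo "packet-drop test: userspace write to /dev/kmsg" > /dev/kmsg
dmesg | tail -n 1
```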
Can someone please elaborate on why the Regression flag was set and (with it) why this would block 17.1? I mean, did this use to work with 16.2?
Hi Marcelo,

Yes, in the 16.2 release we had great performance numbers for NFV workloads. For NFV customers, performance is super critical.
We narrowed down the packet-dropping issue with Robin. What happens is that packets are dropped when fbcon sends pixels to the VGA controller. We could reproduce the packet drops not only when receiving martian packets, but with any kernel log message or data written on the virtual console associated with fbcon (here, tty0 > fbcon > vga).

Some reproducers below:

1. Triggering martian packets:

```
#!/bin/bash

ns="ns1 ns2"

for n in $ns; do
    ip netns add $n
    ip -net $n link set lo up
done

ip netns exec ns2 sysctl -q net.ipv4.conf.all.log_martians=1
ip netns exec ns2 sysctl -q net.ipv4.conf.default.log_martians=1

ip link add veth12 netns ns1 type veth peer name veth21 netns ns2
ip -net ns1 link set veth12 up; ip -net ns2 link set veth21 up
ip -net ns1 a a 10.0.42.1/24 dev veth12

ip -net ns2 link add d0 type dummy
ip -net ns2 link set d0 up
ip -net ns2 a a 10.0.42.2/24 dev d0

sleep 1

ip netns exec ns1 ping -I veth12 -c2 10.0.42.2 &>/dev/null

for n in $ns; do
    ip netns pids $n | xargs -r kill
    ip netns del $n
done

exit 0
```

2. Generating logs in dmesg, e.g. `ip link set ethX up` (if the interface has a carrier). Also works well with veth and the above script w/o the ping.

3. Directly writing to the virtual console associated with fbcon: `while read -n 128 foo; do printf '%s\n' "$foo"; done < /dev/urandom > /dev/tty0`.

4. Rotating the fbcon: `echo 1 > /sys/devices/virtual/graphics/fbcon/rotate`.

Etc.

While we can't disable fbcon at boot time, any of the following made the packet drops stop:

1. Disabling the vtcon associated with fbcon, here vtcon1: `echo 0 > /sys/class/vtconsole/vtcon1/bind`. Re-binding the vtcon makes the drops appear again.

2. Removing `console=tty0` from the kernel cmdline and rebooting. This stops the console from being sent to tty0, and ultimately to the VGA controller.

I reached out to Jocelyn from kernel graphics, who (in addition to nicely answering our questions) pointed out one very similar bug they have, bz2139901, impacting specific h/w, which we were using [1]. The graphics folks can probably explain this better than I do, but my understanding is "when pixels are sent to the VGA card, bus transfer latency rises 10x", which prevents CPUs on the same NUMA node from accessing the NIC for a period of time, so some packets can be dropped. This correlates with what Robin saw, where drops were not seen in other NUMA nodes.

We did not see the issue previously because 'console=tty0' was not used in the kernel cmdline in previous versions of OSP (using RHEL 8). But we did test a RHEL 8.6 kernel with 'console=tty0' and were able to reproduce the packet drops. Robin is looking into the history of this.

My understanding is that this can't be fixed in the kernel; it's an h/w issue. My recommendation would be to remove 'console=tty0' from the kernel cmdline as a workaround. As a side note, the VGA output can still be used, only the console output isn't redirected there. This means a user accessing a virtual terminal on the VGA, or an application displaying something, can still trigger a similar issue. This is something we also tested, and while we were not able to trigger the same amount of drops as with fbcon, we could see some.

On a different topic, we briefly performed some tests about the interrupts issue and our *very initial* thought was that this *might* be specific to mlx NICs. Robin offered to perform more tests in this regard and we'll follow up. This will probably be moved to a dedicated bz.

[1]
$ lspci | grep VGA
03:00.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller [102b:0536] (rev 04)
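For anyone repeating the vtcon workaround above, a minimal sketch to confirm which vtconsole entry is the framebuffer console before unbinding it (vtcon1 is what this system showed; the index can differ elsewhere):

```
# List registered virtual console drivers; the fbcon one reports "frame buffer device".
for c in /sys/class/vtconsole/vtcon*; do
    echo "$c: $(cat "$c/name")"
done

# Unbind the framebuffer console (write 1 to re-bind it).
echo 0 > /sys/class/vtconsole/vtcon1/bind
```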
Not sure why this changed, but the `console=tty0` is in `/etc/default/grub` for OSP 17, but not in OSP 16.

overcloud-full-16.2-20221124.2.x86_64

$ guestfish -a usr/share/rhosp-director-images/overcloud-full-16.2-20221124.2.x86_64/overcloud-full.qcow2
><fs> run
><fs> mount /dev/sda /
><fs> grep GRUB_CMDLINE_LINUX /etc/default/grub
GRUB_CMDLINE_LINUX="console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet"

overcloud-full-17.0-20230111.1.x86_64

$ guestfish -a usr/share/rhosp-director-images/overcloud-full-17.0-20230111.1.x86_64/overcloud-full.qcow2
><fs> run
><fs> list-filesystems
/dev/sda: xfs
><fs> mount /dev/sda /
><fs> grep GRUB_CMDLINE_LINUX /etc/default/grub
GRUB_CMDLINE_LINUX="console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M"
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset gfxpayload=text"

overcloud-hardened-uefi-full-17.0-20230111.1.x86_64

$ guestfish -a usr/share/rhosp-director-images/overcloud-hardened-uefi-full-17.0-20230111.1.x86_64/overcloud-hardened-uefi-full.qcow2
><fs> run
><fs> mount /dev/vg/lv_root /
><fs> grep GRUB_CMDLINE_LINUX /etc/default/grub
GRUB_CMDLINE_LINUX="console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M"
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200 no_timer_check nofb nomodeset vga=normal console=tty0 console=ttyS0,115200 audit=1 nousb"
If I am not mistaken, this has been that way for a long time: https://opendev.org/openstack/tripleo-common/commit/fc07b696ac2c45c3f914b85b750e471fceb5c390
The question is: why/how was this console=tty0 removed in earlier versions?
Since neither the workaround nor the issue applies to OVS, can we reassign it to the proper component?

Thanks,
fbl
I think the reworking of the diskimage-builder bootloader element resulted in default kernel args being set as intended, which includes console=ttyS0. Assuming that console=ttyS0 is fine in the general case, we could handle this either:

- in documentation, by specifying custom kernel args which exclude console=ttyS0, or
- in a DPDK tripleo-ansible kernel args workaround to remove it, or
- in an image-customization workaround to remove it

I can't see the current state of the DPDK documentation in 17.x [1], so I can't recommend the best approach yet.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.0/html/network_functions_virtualization_planning_and_configuration_guide/pr01
Hi Steve,

There is no issue with ttyS0. The problem is specific to the VGA console: *tty0*.

Also, this is not a problem only for DPDK. This bug effectively introduces a 150us memory bus latency on the whole NUMA node where the VGA controller is. This could affect any running VM on any CPU.

Unless I am mistaken, there is no way to fix it in the kernel. The only workaround is to get rid of console=tty0 for all deployments until there is a fix for the affected generations of hardware. Once we have a better view of the list of these affected VGA controllers / CPU models, we plan to issue a KB article.

NB: 17.0 was released *without NFV support* partly because of this bug.
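For nodes that are already deployed, a minimal sketch of how the argument could be dropped without rebuilding the image (not a validated procedure; a reboot is still required afterwards):

```
# Remove the VGA console from every installed kernel entry, then verify.
grubby --update-kernel=ALL --remove-args="console=tty0"
grubby --info=ALL | grep ^args
```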
Here is an interesting test I ran today, which clearly shows that writing 128 bytes to the tty0 console every second causes the max thread timer latency to rise significantly:

> ~# timerlat
>                                      Timer Latency
>   0 00:00:12   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
> CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
>   0 #11629     |        1         0         1        41 |        3         2         3       677
>   1 #11625     |        1         0         1        61 |        4         2         4      1328
>   2 #11463     |        1         0         8     83812 |        3         2        17    166085
>   3 #11629     |        1         0         1         1 |        3         3         3         7
>   4 #11629     |        1         1         1         1 |        3         3         3         4
>   5 #11629     |        1         1         1         1 |        3         3         3         4
>   6 #11629     |        1         1         1         1 |        4         3         4         5
>   7 #11629     |        1         1         1         1 |        3         3         3         4
>   8 #11629     |        1         1         1         1 |        4         3         3         4
>   9 #11629     |        1         1         1         1 |        3         3         3         4
>  10 #11628     |        1         1         1         1 |        3         3         3         4
>  11 #11628     |        1         1         1         1 |        3         3         3         4
>  12 #11628     |        1         1         1         1 |        4         3         3         4
>  13 #11628     |        1         1         1         1 |        3         3         3         4
>  14 #11628     |        1         1         1         1 |        3         3         3         4
>  15 #11628     |        1         1         1         1 |        3         3         3         7
>  16 #11629     |        1         1         1         1 |        3         3         3         5
>  17 #11629     |        1         1         1         3 |        3         3         3         6
>  18 #11629     |        1         1         1         1 |        3         3         3         5
>  19 #11629     |        1         1         1         1 |        3         3         3         4
>  20 #11622     |        1         0         1       177 |        3         3         3      1199
>  21 #11627     |        1         0         1         8 |        3         3         3      1416
>  22 #11629     |        1         0         1         3 |        3         3         3         5
>  23 #11629     |        1         1         1         1 |        3         3         3         9
>  24 #11629     |        1         1         1         1 |        3         3         3         6
>  25 #11629     |        1         1         1         1 |        3         3         3         4
>  26 #11629     |        1         1         1         1 |        3         3         3         4
>  27 #11629     |        1         1         1         1 |        3         3         3         4
>  28 #11629     |        1         1         1         1 |        3         3         3         4
>  29 #11629     |        1         1         1         1 |        3         3         3         4
>  30 #11629     |        1         1         1         1 |        3         3         3         4
>  31 #11629     |        1         1         1         1 |        4         3         4         5
>  32 #11629     |        1         1         1         1 |        4         3         3         4
>  33 #11629     |        1         1         1         1 |        3         3         3         4
>  34 #11629     |        1         1         1         1 |        4         3         3         4
>  35 #11628     |        1         1         1         1 |        3         3         3         4
>  36 #11628     |        1         1         1         1 |        4         3         3         4
>  37 #11628     |        1         1         1         1 |        3         3         3         4
>  38 #11628     |        1         1         1         1 |        3         3         3         7
>  39 #11628     |        1         1         1         1 |        3         3         3         4

while read -n 128 toto; do printf '%s\n' "$toto"; sleep 1; done < /dev/urandom > /dev/tty0 &

> ~# timerlat
>                                      Timer Latency
>   0 00:00:34   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
> CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
>   0 #33651     |        1         0         3     85105 |        3         2         8    167377
>   1 #33815     |        1         0         1        61 |        4         2         3      1328
>   2 #33653     |        1         0         3     83812 |        3         2         8    166085
>   3 #33819     |        1         0         1        39 |        3         3         3        72
>   4 #33819     |        1         1         1         1 |        3         3         3         5
>   5 #33819     |        1         1         1        32 |        3         3         3        76
>   6 #33819     |        1         1         1         1 |        4         3         4         5
>   7 #33819     |        1         1         1        38 |        3         3         3        80
>   8 #33819     |        1         1         1         1 |        3         3         3         5
>   9 #33819     |        1         1         1        29 |        3         3         3        81
>  10 #33819     |        1         1         1         1 |        3         3         3         5
>  11 #33819     |        1         1         1        36 |        3         3         3        73
>  12 #33818     |        1         1         1         1 |        3         3         3         5
>  13 #33818     |        1         1         1        81 |        3         3         3       117
>  14 #33818     |        1         1         1         1 |        3         3         3         5
>  15 #33818     |        1         1         1        40 |        3         3         3        80
>  16 #33819     |        1         1         1         1 |        3         3         3         5
>  17 #33819     |        1         1         1         3 |        4         3         3         6
>  18 #33819     |        1         1         1         1 |        3         3         3         5
>  19 #33819     |        1         1         1        62 |        3         3         3        84
>  20 #33812     |        1         0         1       177 |        3         2         3      1199
>  21 #33817     |        1         0         1        61 |        3         2         3      1416
>  22 #33819     |        1         0         1         3 |        3         3         3         5
>  23 #33819     |        1         0         1        34 |        3         3         3        83
>  24 #33819     |        1         1         1         1 |        3         3         3         6
>  25 #33819     |        1         1         1        61 |        3         3         3       113
>  26 #33819     |        1         1         1         1 |        3         3         3         5
>  27 #33819     |        1         1         1        57 |        3         3         3        92
>  28 #33819     |        1         1         1         1 |        3         3         3         5
>  29 #33819     |        1         1         1        48 |        3         3         3        89
>  30 #33819     |        1         1         1         1 |        3         3         3         5
>  31 #33819     |        1         1         1        40 |        4         3         4       111
>  32 #33819     |        1         1         1         1 |        3         3         3         5
>  33 #33819     |        1         1         1        52 |        3         3         3       105
>  34 #33819     |        1         1         1         1 |        4         3         3         5
>  35 #33819     |        1         1         1        66 |        3         3         3       111
>  36 #33819     |        1         1         1         1 |        4         3         3         5
>  37 #33818     |        1         1         1        45 |        3         3         3       116
>  38 #33818     |        1         1         1         1 |        3         3         3         7
>  39 #33818     |        1         1         1        28 |        3         3         3       108
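For anyone wanting to repeat this, a minimal sketch of the measurement, assuming the output above comes from rtla's timerlat tool (the background-writer loop is the same one shown above):

```
# Start the tty0 console writer in the background.
while read -n 128 junk; do printf '%s\n' "$junk"; sleep 1; done < /dev/urandom > /dev/tty0 &
writer=$!

# Observe per-CPU IRQ/thread timer latency while the writer runs (stop with Ctrl+C).
rtla timerlat top

kill "$writer"
```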
Let's consider the implications of removing console=tty0 from the default kernel arguments [1].

- Virt development and CI environments would not show console output in the graphical console, but would continue to show it in the serial console.
- Customers who need to see the kernel boot through an actual video card can enable that temporarily at the grub menu. We could even document this.

I'm not seeing much downside from removing console=tty0. Any thoughts from anyone else?

[1] https://opendev.org/openstack/tripleo-common/src/branch/stable/wallaby/image-yaml/overcloud-hardened-images-uefi-python3.yaml#L41
Hi all,

I have performed some additional debugging after reading BZ 1848102#c110. It turns out that the real culprit is the nomodeset option. **Removing** nomodeset actually fixes the issue, even when console=tty0 is present. I could not reproduce any packet drops and/or latency spikes while writing to the VGA controller (via tty0 or via fb0), nor by rotating the fb console.

Here is what the kernel docs have to say about nomodeset (I have underlined the important details):

> nomodeset       Disable kernel modesetting. Most systems' firmware
>                 sets up a display mode and provides framebuffer memory
>                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                 for output. With nomodeset, DRM and fbdev drivers will
>                 not load if they could possibly displace the pre-
>                 initialized output. Only the system framebuffer will
>                 be available for use. The respective drivers will not
>                 perform display-mode changes or accelerated rendering.
>                                                 ^^^^^^^^^^^^^^^^^^^^^
>
>                 Useful as error fallback, or for testing and debugging.
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

https://github.com/torvalds/linux/blob/2af3e53a4dc08657f1b46f97f04ff4a0ab3cad8d/Documentation/admin-guide/kernel-parameters.txt#L3813-L3821

This nomodeset option was added without specific explanation at the same time as console=tty0: https://opendev.org/openstack/tripleo-common/commit/fc07b696ac2c45c3f914b85b750e471fceb5c390. It is possible that it was left here by mistake.

I tried removing nofb and vga=normal as well. These had absolutely no effect. I didn't even find any documentation about nofb. It is possible that this option has been deprecated for a long time. Since it does not hurt, it can be left here.

This nomodeset idea was raised by Marcelo Tosatti via a separate email thread (quoting him here):

> Wonder if this is not related to the memory type assigned to the memory-mapped I/O region for the graphics card:
>
> 11.3.2 Choosing a Memory Type
>
> The simplest system memory model does not use memory-mapped I/O with read or
> write side effects, does not include a frame buffer, and uses the write-back
> memory type for all memory. An I/O agent can perform direct memory access (DMA)
> to write-back memory and the cache protocol maintains cache coherency.
>
> A system can use strong uncacheable memory for other memory-mapped I/O, and
> should always use strong uncacheable memory for memory-mapped I/O with read
> side effects. Dual-ported memory can be considered a write side effect, making
> relatively prompt writes desirable, because those writes cannot be observed at
> the other port until they reach the memory agent.
>
> A system can use strong uncacheable, write-through, or write-combining memory
> for frame buffers or dual-ported memory that contains pixel values displayed on
> a screen. Frame buffer memory is typically large (a few megabytes) and is
> usually written more than it is read by the processor. Using strong uncacheable
> memory for a frame buffer generates very large amounts of bus traffic,
> because operations on the entire buffer are implemented using partial writes
> rather than line writes.
>
> Using write-through memory for a frame buffer can displace almost all other
> useful cached lines in the processor's L2 and L3 caches and L1 data cache.
> Therefore, systems should use write-combining memory for frame buffers whenever
> possible.
>
> And: https://download.intel.com/design/PentiumII/applnots/24442201.pdf
>
> /proc/mtrr and /sys/kernel/debug/x86/pat_memtype_list can be used to find the
> memory type assigned to the memory I/O regions (/proc/iomem and lspci can be
> used to map that to specific devices). Along with the combination rules:
>
> https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-427.html

Here is the info I could extract.

Without nomodeset:

/proc/iomem:
> 91000000-91ffffff : PCI Bus 0000:03
>   91000000-91ffffff : 0000:03:00.0
>     91000000-91ffffff : mgadrmfb_vram
> 92000000-928fffff : PCI Bus 0000:03
>   92000000-927fffff : 0000:03:00.0
>   92808000-9280bfff : 0000:03:00.0
>     92808000-9280bfff : mgadrmfb_mmio

/sys/kernel/debug/x86/pat_memtype_list:
> PAT: [mem 0x0000000091000000-0x0000000092000000] write-combining
> PAT: [mem 0x0000000091000000-0x0000000092000000] write-combining

/proc/mtrr:
> reg00: base=0x000000000 ( 0MB), size=262144MB, count=1: write-back

With nomodeset:

/proc/iomem:
> 91000000-91ffffff : PCI Bus 0000:03
>   91000000-91ffffff : 0000:03:00.0
>     91000000-912fffff : efifb
> 92000000-928fffff : PCI Bus 0000:03
>   92000000-927fffff : 0000:03:00.0
>   92808000-9280bfff : 0000:03:00.0

/sys/kernel/debug/x86/pat_memtype_list:
> PAT: [mem 0x0000000091000000-0x0000000091300000] write-combining

/proc/mtrr:
> reg00: base=0x000000000 ( 0MB), size=262144MB, count=1: write-back

I'm not sure if any of it is relevant, but from what I can read, efifb memory is mapped when using nomodeset, instead of mgadrmfb_vram. It could very well be a firmware issue that makes writing to this efifb memory very slow. I have no idea. In any case, removing nomodeset makes writing pixels to the VGA controller much faster, and it does not hold the memory bus on the whole NUMA node.

In conclusion, I would suggest the following:

1) Remove nomodeset: this option is for debugging only and should NOT be used in production.

2) Remove console=tty0: since nomodeset is removed, we *may* encounter issues with exotic VGA controllers. Let's be safe and avoid logging kernel messages on the VGA console. This has proven to be quite an effective workaround.

3) Limit the printk log level to KERN_EMERG: kernel messages are available with dmesg and journalctl anyway. We might as well avoid logging to the serial consoles to shield us from future issues (echo "kernel.printk = 0 4 0 0" > /etc/sysctl.d/95-printk.conf).
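A quick sketch of how a node can be checked against recommendations 1 and 2, and how the printk limit from recommendation 3 can be applied (file name and values taken from the suggestion above):

```
# 1+2) Confirm that neither nomodeset nor the VGA console is on the running cmdline
#      (no output means both are absent).
grep -oE 'nomodeset|console=[^ ]+' /proc/cmdline

# 3) Limit printk to KERN_EMERG on the consoles, persistently, then verify.
echo "kernel.printk = 0 4 0 0" > /etc/sysctl.d/95-printk.conf
sysctl -p /etc/sysctl.d/95-printk.conf
cat /proc/sys/kernel/printk
```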
*** Bug 2178958 has been marked as a duplicate of this bug. ***
FYI, I also pushed an RFC patch series to use DMA/IRQ with mgag200:
https://patchwork.freedesktop.org/series/117380/

It should have a good impact on latency. Latency, measured with cyclictest -s -l 10000:

Without DMA:
# /dev/cpu_dma_latency set to 0us
policy: other/other: loadavg: 1.52 0.52 0.33 3/358 2025
T: 0 ( 1977) P: 0 I:1000 C:  10000 Min:    7 Act:   56 Avg:   85 Max:  2542

With DMA:
# /dev/cpu_dma_latency set to 0us
policy: other/other: loadavg: 1.27 0.48 0.18 2/363 2498
T: 0 ( 2403) P: 0 I:1000 C:  10000 Min:    8 Act:   62 Avg:   59 Max:   339

I still have some rework to do before it can be merged upstream.
Who should be the assignee now? It's nobody@.
It looks like the assignee got reset when changing components (tripleo-ansible → openstack-tripleo-common). Julia, was this intentional?
Apologies, I'll reset it to Steve Baker. When fixing the component, it auto-reset.
Looking at the shipped Overcloud UEFI image in RPM `rhosp-director-images-uefi-x86_64-17.1-20230525.2.el9ost.noarch`, I can see the kernel args we wanted to remove:

```
><fs> cat /boot/loader/entries/1565da793eb146cb86b44147abba75dc-5.14.0-284.11.1.el9_2.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-284.11.1.el9_2.x86_64) 9.2 (Plow)
version 5.14.0-284.11.1.el9_2.x86_64
linux /vmlinuz-5.14.0-284.11.1.el9_2.x86_64
initrd /initramfs-5.14.0-284.11.1.el9_2.x86_64.img
options root=LABEL=img-rootfs ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M console=ttyS0,115200 no_timer_check nofb nomodeset gfxpayload=text boot=LABEL=mkfs_boot
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
```

It makes sense, since we still saw low performance in our CI. Moving back to ASSIGNED; correct me if I am wrong, but we would also want this change to be propagated to the shipped Overcloud images.
Thanks, everyone. We can confirm that on compose `RHOS-17.1-RHEL-9-20230607.n.2`, the default kernel args do not include `nomodeset` and `tty0`:

```
rpm -qa | grep rhosp-director
rhosp-director-images-base-17.1-20230606.6.el9ost.noarch
rhosp-director-images-metadata-17.1-20230606.6.el9ost.noarch
rhosp-director-images-ipa-x86_64-17.1-20230606.6.el9ost.noarch
rhosp-director-images-x86_64-17.1-20230606.6.el9ost.noarch
rhosp-director-images-17.1-20230606.6.el9ost.noarch
rhosp-director-images-minimal-17.1-20230606.6.el9ost.noarch
rhosp-director-images-uefi-x86_64-17.1-20230606.6.el9ost.noarch
```

```
><fs> cat /boot/loader/entries/73e33ffe5206403fa7854e9b58507275-5.14.0-284.11.1.el9_2.x86_64.conf /
error: incorrect number of arguments
usage: cat path
type 'help cat' for more help on cat
><fs> cat /boot/loader/entries/73e33ffe5206403fa7854e9b58507275-5.14.0-284.11.1.el9_2.x86_64.conf
title Red Hat Enterprise Linux (5.14.0-284.11.1.el9_2.x86_64) 9.2 (Plow)
version 5.14.0-284.11.1.el9_2.x86_64
linux /vmlinuz-5.14.0-284.11.1.el9_2.x86_64
initrd /initramfs-5.14.0-284.11.1.el9_2.x86_64.img
options root=LABEL=img-rootfs ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M console=ttyS0,115200 no_timer_check memtest=0 boot=LABEL=mkfs_boot
grub_users $grub_users
grub_arg --unrestricted
grub_class rhel
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2023:4577