Description of problem:

I have seen high latency on NICs when using both the X710 PF and the E810 VF. I was not aware of this issue in the previous puddle.

64 bytes from 10.10.178.181: icmp_seq=56 ttl=64 time=319 ms
64 bytes from 10.10.178.181: icmp_seq=57 ttl=64 time=1159 ms
64 bytes from 10.10.178.181: icmp_seq=59 ttl=64 time=1254 ms
64 bytes from 10.10.178.181: icmp_seq=58 ttl=64 time=2256 ms
64 bytes from 10.10.178.181: icmp_seq=60 ttl=64 time=4160 ms
64 bytes from 10.10.178.181: icmp_seq=62 ttl=64 time=2112 ms
64 bytes from 10.10.178.181: icmp_seq=61 ttl=64 time=3159 ms
64 bytes from 10.10.178.181: icmp_seq=63 ttl=64 time=1088 ms
64 bytes from 10.10.178.181: icmp_seq=64 ttl=64 time=150 ms
64 bytes from 10.10.178.181: icmp_seq=65 ttl=64 time=1767 ms
64 bytes from 10.10.178.181: icmp_seq=66 ttl=64 time=928 ms
64 bytes from 10.10.178.181: icmp_seq=67 ttl=64 time=553 ms
64 bytes from 10.10.178.181: icmp_seq=68 ttl=64 time=447 ms
64 bytes from 10.10.178.181: icmp_seq=69 ttl=64 time=106 ms

I have also seen the following behaviour: the first time I start a VM, or when I reboot it with openstack reboot, latency is fine. But when I reboot the VM from inside the VM (sudo reboot), once the VM is back up I see this latency issue.
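To quantify the latency rather than eyeballing the ping output, a small awk filter can summarize the RTTs of a captured run. This is just an illustrative sketch: the sample file below inlines a few of the lines from this report, but on a real system you would pipe `ping -c 60 <address>` into the same awk.

```shell
# Illustrative sample: a few RTT lines captured from the affected VF interface.
cat > /tmp/ping_sample.txt <<'EOF'
64 bytes from 10.10.178.181: icmp_seq=56 ttl=64 time=319 ms
64 bytes from 10.10.178.181: icmp_seq=57 ttl=64 time=1159 ms
64 bytes from 10.10.178.181: icmp_seq=60 ttl=64 time=4160 ms
EOF

# Summarize reply count, average RTT, and worst RTT from the capture.
awk -F'time=' '/icmp_seq/ {
    split($2, a, " "); rtt = a[1] + 0   # strip the trailing " ms"
    sum += rtt; n++
    if (rtt > max) max = rtt
}
END { printf "replies=%d avg=%.1fms max=%.1fms\n", n, sum/n, max }' /tmp/ping_sample.txt
```

With the healthy interface the same filter should report an average well under 1 ms, which makes the regression easy to spot in automated runs.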
With the E810 NIC VF, I also see this error in dmesg:

[ 15.245107] ------------[ cut here ]------------
[ 15.246093] NETDEV WATCHDOG: eth2 (iavf): transmit queue 12 timed out
[ 15.247442] WARNING: CPU: 12 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x1f9/0x200
[ 15.249159] Modules linked in: rfkill intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm kvm_intel kvm irqbypass rapl vfat iTCO_wdt iTCO_vendor_support fat virtio_gpu virtio_dma_buf drm_shmem_helper drm_kms_helper i2c_i801 pcspkr i2c_smbus syscopyarea virtio_balloon sysfillrect lpc_ich sysimgblt fb_sys_fops joydev drm fuse xfs libcrc32c sr_mod cdrom sg ahci crct10dif_pclmul crc32_pclmul crc32c_intel libahci virtio_net libata ghash_clmulni_intel net_failover iavf failover virtio_blk serio_raw sunrpc dm_mirror dm_region_hash dm_log dm_mod
[ 15.259187] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 5.14.0-284.11.1.el9_2.x86_64 #1
[ 15.260829] Hardware name: Red Hat OpenStack Compute/RHEL, BIOS 1.16.1-1.el9 04/01/2014
[ 15.262468] RIP: 0010:dev_watchdog+0x1f9/0x200
[ 15.263400] Code: 00 e9 40 ff ff ff 48 89 ef c6 05 ca 83 5c 01 01 e8 2c 75 f9 ff 44 89 e9 48 89 ee 48 c7 c7 10 36 4c 92 48 89 c2 e8 15 3a 18 00 <0f> 0b e9 22 ff ff ff 0f 1f 44 00 00 55 53 48 89 fb 48 8b 6f 18 0f
[ 15.267191] RSP: 0018:ffffad56c032ceb0 EFLAGS: 00010282
[ 15.268277] RAX: 0000000000000000 RBX: ffff8a7046cb4488 RCX: 000000000000083f

This bug reminds me of another one that was opened with Mellanox NICs. It does not take as long there, but long enough to produce this watchdog log in dmesg.
https://bugzilla.redhat.com/show_bug.cgi?id=2182371

Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230613.n.1

How reproducible:
Deploy OSP 17
Spawn a VM with an E810 VF or an X710 PF (those are the combinations I have tested)
Ping using those interfaces

Actual results:
High latency with ping when using an E810 VF or an X710 PF

Expected results:
Latency should be low

Additional info:
I have reproduced the issue in both ml2-ovs and OVN deployments.
Some time ago I opened a similar BZ that was solved by updating the guest image to RHEL 9.2, but this time I am already using RHEL 9.2:
https://bugzilla.redhat.com/show_bug.cgi?id=2175802

[cloud-user@trex ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
[cloud-user@trex ~]$ uname -a
Linux trex 5.14.0-284.11.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 12 10:45:03 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
Miguel,

Can you confirm whether this behaviour is 100% reproducible? We need to get a kernel trace to understand it better.

Regards,
Vijay
We have run more tests today to narrow down the problem. On the compute node, the host was on RHEL 9.2 with kernel 5.14.0-284.11.1.el9_2.x86_64.

1st test - (Guest = Host = RHEL 9.2):
-------------------------------------
We spawned several VMs with the same version as the host (RHEL 9.2, kernel 5.14.0-284.11.1.el9_2.x86_64) and the results were:
- First boot: the ping shows low latency.
- After a reboot from inside the VM (systemctl reboot): we saw the same kernel error message stated above (NETDEV WATCHDOG) and the ping latency was high.

2nd test - (Guest: Debian unstable / Host: RHEL 9.2):
-----------------------------------------------------
We spawned several VMs with Debian unstable (kernel 6.3) and the results were:
- First boot: the ping shows low latency.
- After a reboot from inside the VM (systemctl reboot): we saw no kernel error, but the ping latency was high.

=> We can reproduce the bug with both old and recent kernels on the guest.

3rd test - (Guest: Debian unstable / Host: Fedora 38 kernel):
-------------------------------------------------------------
We swapped the host kernel for the latest Fedora 38 kernel (6.3). We spawned several VMs with Debian unstable (kernel 6.3) and the results were:
- First boot: the ping shows low latency.
- After a reboot from inside the VM (systemctl reboot): we saw no kernel error and the ping latency was low.

=> With a recent kernel on the host, we no longer see high latency.

Next steps:
- A 4th test could be done with an older RHEL 9 kernel (e.g. 5.14.0-168) to check whether there is a regression.
- Create a simple reproducer without the OSP configuration (i.e. spawn VMs with only QEMU/libvirt).
Running on a bare-metal RHEL 9.2 machine (no OpenStack), we managed to reproduce the issue regardless of the guest kernel version. Downgrading the kernel to 5.14.0-162.23.1.el9_1.x86_64 fixed the high latency with VFs. We are currently running a binary search on the rhel-9 repo to try and pinpoint the patch that introduced the regression.
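The binary search over kernel builds can be sketched as below. The build list and the is_bad predicate are illustrative placeholders: in a real bisection, is_bad would install that kernel build, reboot into it, and re-run the VF ping test; the real rhel-9 history also has far more builds between the known-good and known-bad versions.

```shell
#!/bin/sh
# Bisect a chronologically ordered list of kernel builds to find the first bad one.
# Both the list and is_bad are mock placeholders for illustration only.
set -eu

builds="5.14.0-162.23.1 5.14.0-200 5.14.0-250 5.14.0-284.11.1"
set -- $builds

# Placeholder predicate: pretend the regression landed in the 5.14.0-250 build.
# A real check would boot the build and measure VF ping latency.
is_bad() {
    case "$1" in
        5.14.0-250|5.14.0-284*) return 0 ;;  # high latency observed
        *) return 1 ;;                       # latency OK
    esac
}

lo=1; hi=$#
while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    eval "build=\${$mid}"
    if is_bad "$build"; then hi=$mid; else lo=$((mid + 1)); fi
done
eval "first_bad=\${$lo}"
echo "first bad build: $first_bad"
```

Each probe halves the remaining range, so even a long build history needs only log2(N) install/reboot/ping cycles.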
Important note: the issue also appears with Mellanox CX6 VFs, so it does not seem related to the iavf or i40e drivers.
Tested RHOS-17.1-RHEL-9-20230712.n.1

[tripleo-admin@computeovsdpdksriov-r730 ~]$ uname -a
Linux computeovsdpdksriov-r730 5.14.0-284.23.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jul 5 10:07:42 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

Reboot with ssh:
* VM with 2 E810 VFs running on dellr740: up in 30 seconds
* VM with 2 X710 PFs running on dellr730: up in 15 seconds

[cloud-user@trex ~]$ ping 10.10.179.158
PING 10.10.179.158 (10.10.179.158) 56(84) bytes of data.
64 bytes from 10.10.179.158: icmp_seq=1 ttl=64 time=0.501 ms
64 bytes from 10.10.179.158: icmp_seq=2 ttl=64 time=0.078 ms
64 bytes from 10.10.179.158: icmp_seq=3 ttl=64 time=0.071 ms
^C

In the VM I only see this error, which seems unrelated:

[cloud-user@testpmd-sriov-vf-dut ~]$ dmesg | grep error
[ 2.897591] shpchp 0000:01:00.0: pci_hp_register failed with error -16

Reboot with the openstack API:
* VM with 2 E810 VFs running on dellr740: up in 34 seconds
* VM with 2 X710 PFs running on dellr730: up in 12 seconds

[cloud-user@trex ~]$ ping 10.10.179.158
PING 10.10.179.158 (10.10.179.158) 56(84) bytes of data.
64 bytes from 10.10.179.158: icmp_seq=1 ttl=64 time=0.385 ms
64 bytes from 10.10.179.158: icmp_seq=2 ttl=64 time=0.077 ms
64 bytes from 10.10.179.158: icmp_seq=3 ttl=64 time=0.244 ms
64 bytes from 10.10.179.158: icmp_seq=5 ttl=64 time=0.057 ms
64 bytes from 10.10.179.158: icmp_seq=6 ttl=64 time=0.058 ms
64 bytes from 10.10.179.158: icmp_seq=7 ttl=64 time=0.065 ms
64 bytes from 10.10.179.158: icmp_seq=8 ttl=64 time=0.058 ms
64 bytes from 10.10.179.158: icmp_seq=9 ttl=64 time=0.044 ms

In the VM I only see this error, which seems unrelated:

[cloud-user@trex ~]$ dmesg | grep error
[ 2.122680] shpchp 0000:01:00.0: pci_hp_register failed with error -16

On the compute node I see these errors, but I think they are unrelated:

[ 154.017502] ACPI Error: No handler for Region [SYSI] (0000000020c097ed) [IPMI] (20211217/evregion-130)
[ 154.128869] ACPI Error: Region IPMI (ID=7) has no handler (20211217/exfldio-261)
[ 154.217432] ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20211217/psparse-529)
[ 154.341354] ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20211217/psparse-529)
[ 154.481882] ACPI: \_SB_.PMI0: _PMC evaluation failed: AE_NOT_EXIST
[ 154.623151] mei_me 0000:00:16.0: Device doesn't have valid ME Interface
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577