Bug 2216441
Summary: | [17.1] kvm/sriov: high latency after soft reboot | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Miguel Angel Nieto <mnietoji> |
Component: | os-net-config | Assignee: | Robin Jarry <rjarry> |
Status: | CLOSED ERRATA | QA Contact: | Miguel Angel Nieto <mnietoji> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 17.1 (Wallaby) | CC: | bcafarel, bfournie, ekuris, gurpsing, hakhande, hbrock, jslagle, mburns, njohnston, pgrist, prgutier, rdiazcam, rjarry, vchundur, vkhitrin, yanghliu, yusokada |
Target Milestone: | ga | Keywords: | Regression, Triaged |
Target Release: | 17.1 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | os-net-config-14.2.1-1.20230412012160.el9ost | Doc Type: | No Doc Update |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2023-08-16 01:15:43 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 2217881, 2218871 | ||
Bug Blocks: | |||
Description
Miguel Angel Nieto
2023-06-21 11:42:06 UTC
I have reproduced the issue in both deployments, ml2-ovs and ovn.

Some time ago I opened a similar BZ that was solved by updating the guest image to RHEL 9.2, but this time I am already using RHEL 9.2:
https://bugzilla.redhat.com/show_bug.cgi?id=2175802

[cloud-user@trex ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
[cloud-user@trex ~]$ uname -a
Linux trex 5.14.0-284.11.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 12 10:45:03 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

Miguel,
Can you confirm whether this behaviour is 100% reproducible? We need a kernel trace to understand it better.
Regards,
Vijay

We ran more tests today to narrow down the problem. On the compute node, the host was on RHEL 9.2 with kernel 5.14.0-284.11.1.el9_2.x86_64.

1st test - (Guest = Host = rhel 9.2):
-------------------------------------
We spawned several VMs with the same version as the host (RHEL 9.2, kernel 5.14.0-284.11.1.el9_2.x86_64) and the results were:
- First boot: the ping shows low latency.
- After a reboot from the VM (systemctl reboot): we saw the same kernel error message stated above (NETDEV WATCHDOG) and the ping latency was high.

2nd test - (Guest: Debian unstable / Host: rhel 9.2):
-----------------------------------------------------
We spawned several VMs with Debian unstable (kernel 6.3) and the results were:
- First boot: the ping shows low latency.
- After a reboot from the VM (systemctl reboot): we saw no kernel error and the ping latency was high.
=> We can reproduce the bug with both old and recent kernels on the guest.

3rd test - (Guest: Debian unstable / Host: Fedora 38 kernel):
-------------------------------------------------------------
We switched the host kernel to the latest Fedora 38 kernel (6.3). We spawned several VMs with Debian unstable (kernel 6.3) and the results were:
- First boot: the ping shows low latency.
- After a reboot from the VM (systemctl reboot): we saw no kernel error and the ping latency was low.
=> With a recent kernel on the host, we no longer see high latency.

Next steps:
- A 4th test could be done with an older version of RHEL 9.2 (e.g. 5.14.0-168) to check whether there is a regression.
- Create a simple reproducer without OSP configuration (i.e. spawning VMs with only QEMU / libvirt).

Running on a bare-metal RHEL 9.2 machine (no OpenStack), we managed to reproduce the issue regardless of the guest kernel version. Downgrading the kernel to 5.14.0-162.23.1.el9_1.x86_64 fixed the high latency with VFs. We are currently running a binary search on the rhel-9 repo to try to pinpoint the patch that introduced the regression.

Important note: the issue also appears with Mellanox CX6 VFs. It does not seem to be related to the iavf or i40e drivers.

Tested RHOS-17.1-RHEL-9-20230712.n.1

[tripleo-admin@computeovsdpdksriov-r730 ~]$ uname -a
Linux computeovsdpdksriov-r730 5.14.0-284.23.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jul 5 10:07:42 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

Reboot with ssh
* vm with 2 e810 vfs running in dellr740: up in 30 seconds
* vm with 2 x710 pfs running in dellr730: up in 15 seconds

[cloud-user@trex ~]$ ping 10.10.179.158
PING 10.10.179.158 (10.10.179.158) 56(84) bytes of data.
64 bytes from 10.10.179.158: icmp_seq=1 ttl=64 time=0.501 ms
64 bytes from 10.10.179.158: icmp_seq=2 ttl=64 time=0.078 ms
64 bytes from 10.10.179.158: icmp_seq=3 ttl=64 time=0.071 ms
^C

In the VM I only see this error, which seems unrelated:
[cloud-user@testpmd-sriov-vf-dut ~]$ dmesg | grep error
[ 2.897591] shpchp 0000:01:00.0: pci_hp_register failed with error -16

Reboot with openstack api:
Reboot with ssh
* vm with 2 e810 vfs running in dellr740: up in 34 seconds
* vm with 2 x710 pfs running in dellr730: up in 12 seconds

[cloud-user@trex ~]$ ping 10.10.179.158
PING 10.10.179.158 (10.10.179.158) 56(84) bytes of data.
64 bytes from 10.10.179.158: icmp_seq=1 ttl=64 time=0.385 ms
64 bytes from 10.10.179.158: icmp_seq=2 ttl=64 time=0.077 ms
64 bytes from 10.10.179.158: icmp_seq=3 ttl=64 time=0.244 ms
64 bytes from 10.10.179.158: icmp_seq=5 ttl=64 time=0.057 ms
64 bytes from 10.10.179.158: icmp_seq=6 ttl=64 time=0.058 ms
64 bytes from 10.10.179.158: icmp_seq=7 ttl=64 time=0.065 ms
64 bytes from 10.10.179.158: icmp_seq=8 ttl=64 time=0.058 ms
64 bytes from 10.10.179.158: icmp_seq=9 ttl=64 time=0.044 ms

In the VM I only see this error, which seems unrelated:
[cloud-user@trex ~]$ dmesg | grep error
[ 2.122680] shpchp 0000:01:00.0: pci_hp_register failed with error -16

On the compute node I see these errors, but I think they are unrelated:
[ 154.017502] ACPI Error: No handler for Region [SYSI] (0000000020c097ed) [IPMI] (20211217/evregion-130)
[ 154.128869] ACPI Error: Region IPMI (ID=7) has no handler (20211217/exfldio-261)
[ 154.217432] ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20211217/psparse-529)
[ 154.341354] ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20211217/psparse-529)
[ 154.481882] ACPI: \_SB_.PMI0: _PMC evaluation failed: AE_NOT_EXIST
[ 154.623151] mei_me 0000:00:16.0: Device doesn't have valid ME Interface

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577
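
The verification in the comments above relies on reading ping round-trip times by eye before and after a reboot. A minimal sketch of how that check could be scripted is shown below. It is not part of the fix or of any Red Hat tooling: the function name `summarize_ping` and the 1.0 ms threshold are assumptions made for illustration, since the report never states what value counts as "high" latency.

```python
import re
from statistics import mean

# Matches reply lines such as:
#   64 bytes from 10.10.179.158: icmp_seq=2 ttl=64 time=0.078 ms
REPLY_RE = re.compile(r"icmp_seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s*ms")


def summarize_ping(output, threshold_ms=1.0):
    """Summarize captured `ping` output.

    threshold_ms is a hypothetical cut-off for "high latency";
    adjust it to whatever baseline the first boot shows.
    """
    replies = [(int(seq), float(rtt)) for seq, rtt in REPLY_RE.findall(output)]
    if not replies:
        raise ValueError("no ping replies found in output")

    seqs = [seq for seq, _ in replies]
    rtts = [rtt for _, rtt in replies]
    # Sequence numbers absent between the first and last reply seen.
    missing = sorted(set(range(min(seqs), max(seqs) + 1)) - set(seqs))

    return {
        "count": len(rtts),
        "min_ms": min(rtts),
        "avg_ms": mean(rtts),
        "max_ms": max(rtts),
        "missing_seq": missing,
        "high_latency": mean(rtts) > threshold_ms,
    }
```

Feeding it the text captured from `ping 10.10.179.158` after each reboot gives a quick pass/fail signal and also lists any sequence numbers that did not get a reply.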