Bug 1976852
Summary: | [failover vf migration] The failover vf will be unregistered if canceling the migration whose status is "wait-unplug" | |
---|---|---|---
Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Yanghang Liu <yanghliu>
Component: | qemu-kvm | Assignee: | Laurent Vivier <lvivier>
qemu-kvm sub component: | Networking | QA Contact: | Yanhui Ma <yama>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | medium | CC: | aadam, chayang, jinzhao, juzhang, lvivier, virt-maint, yalzhang, yama
Version: | 8.5 | Keywords: | Triaged
Target Milestone: | rc | |
Target Release: | 8.5 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | qemu-kvm-6.0.0-27.module+el8.5.0+12121+c40c8708 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-11-16 07:54:28 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1957194 | |
Description
Yanghang Liu
2021-06-28 11:28:06 UTC
This bug still exists in qemu-kvm-6.0.0-26.module+el8.5.0+12044+525f0ebc.x86_64.

(In reply to Yanghang Liu from comment #4)
> This bug still exists in the
> qemu-kvm-6.0.0-26.module+el8.5.0+12044+525f0ebc.x86_64

Yes, the fix has not been merged yet because some ACKs are still missing on the developer side.

QE bot (pre-verify): set 'Verified:Tested,SanityOnly' as the gating/tier1 tests pass.

The verification process:

Test env:
  host:  qemu-kvm-6.0.0-27.module+el8.5.0+12121+c40c8708.x86_64
         4.18.0-324.el8.x86_64
  guest: 4.18.0-319.el8.x86_64

Test result:
  This bug is fixed by qemu-kvm-6.0.0-27.module+el8.5.0+12121+c40c8708.x86_64.

Test steps:
  Repeat step 1 - step 7 in the description.

Related virsh command:

# virsh migrate --live --verbose rhel85 qemu+ssh://10.73.73.73/system    <--- use Ctrl+C to cancel the migration
Migration: [ 0 %] error: operation aborted: migration out job: canceled by client

Related QMP:

> {"execute":"query-migrate","id":"libvirt-397"}
< {"return": {"blocked": false, "status": "wait-unplug"}, "id": "libvirt-397"}
> {"execute":"query-migrate","id":"libvirt-398"}
< {"return": {"blocked": false, "status": "wait-unplug"}, "id": "libvirt-398"}    <--- cancel the migration while its status is "wait-unplug"
! {"timestamp": {"seconds": 1628144313, "microseconds": 730402}, "event": "MIGRATION", "data": {"status": "cancelling"}}
> {"execute":"migrate_cancel","id":"libvirt-399"}
< {"return": {}, "id": "libvirt-399"}
< {"return": {"blocked": false, "expected-downtime": 300, "status": "cancelling", "setup-time": 0, "total-time": 3333, "ram": {"total": 4300021760, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "pages-per-second": 0, "page-size": 4096, "remaining": 4300021760, "mbps": 0, "transferred": 0, "duplicate": 0, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 0, "normal": 0}}, "id": "libvirt-400"}
> {"execute":"query-migrate","id":"libvirt-400"}
...
< {"return": {"blocked": false, "expected-downtime": 300, "status": "cancelling", "setup-time": 0, "total-time": 6341, "ram": {"total": 4300021760, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "pages-per-second": 0, "page-size": 4096, "remaining": 4300021760, "mbps": 0, "transferred": 0, "duplicate": 0, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 0, "normal": 0}}, "id": "libvirt-406"}
! {"timestamp": {"seconds": 1628144317, "microseconds": 446637}, "event": "MIGRATION", "data": {"status": "cancelled"}}

Related info in the VM after canceling the migration:

# ifconfig
enp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.200.86  netmask 255.255.255.0  broadcast 192.168.200.255
        inet6 2001::5aa9:2376:1944:b715  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::f3a5:85c3:d547:59f7  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 284  bytes 39121 (38.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 246  bytes 31284 (30.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp4s0nsby: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::6a87:472d:d210:be06  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 251  bytes 24308 (23.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 125  bytes 25140 (24.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp5s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500    <--- the failover VF still exists in the VM after canceling the migration whose status is "wait-unplug"
        inet 192.168.200.87  netmask 255.255.255.0  broadcast 192.168.200.255
        inet6 fe80::e5a0:ad81:3237:e01  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:aa:1c:ef  txqueuelen 1000  (Ethernet)
        RX packets 98  bytes 22458 (21.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 127  bytes 8130 (7.9 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

# dmesg
[   44.838599] pcieport 0000:00:02.4: Slot(0-4): Attention button pressed
[   44.842557] pcieport 0000:00:02.4: Slot(0-4): Powering off due to button press
[   50.165434] virtio_net virtio2 enp4s0: failover primary slave:enp5s0 unregistered
[   51.613152] pcieport 0000:00:02.4: Slot(0-4): Attention button pressed
[   51.620147] pcieport 0000:00:02.4: Slot(0-4): Powering on due to button press
[   51.625903] pcieport 0000:00:02.4: Slot(0-4): Card present
[   51.628399] pcieport 0000:00:02.4: Slot(0-4): Link Up
[   51.756229] pci 0000:05:00.0: [8086:154c] type 00 class 0x020000
[   51.768575] pci 0000:05:00.0: reg 0x10: [mem 0xfc000000-0xfc00ffff 64bit pref]
[   51.783777] pci 0000:05:00.0: reg 0x1c: [mem 0xfc010000-0xfc013fff 64bit pref]
[   51.793331] pci 0000:05:00.0: enabling Extended Tags
[   51.796540] pci 0000:05:00.0: BAR 0: assigned [mem 0xfc000000-0xfc00ffff 64bit pref]
[   51.801661] pci 0000:05:00.0: BAR 3: assigned [mem 0xfc010000-0xfc013fff 64bit pref]
[   51.808229] pcieport 0000:00:02.4: PCI bridge to [bus 05]
[   51.810696] pcieport 0000:00:02.4: bridge window [io 0x5000-0x5fff]
[   51.814593] pcieport 0000:00:02.4: bridge window [mem 0xfe000000-0xfe1fffff]
[   51.817483] pcieport 0000:00:02.4: bridge window [mem 0xfc000000-0xfc1fffff 64bit pref]
[   51.884906] iavf 0000:05:00.0: Multiqueue Enabled: Queue pair count = 4
[   51.892285] virtio_net virtio2 enp4s0: failover primary slave:eth0 registered
[   51.896293] iavf 0000:05:00.0: MAC address: 52:54:00:aa:1c:ef
[   51.900537] iavf 0000:05:00.0: GRO is enabled
[   51.909626] iavf 0000:05:00.0 enp5s0: renamed from eth0
[   51.917855] IPv6: ADDRCONF(NETDEV_UP): enp5s0: link is not ready
[   51.929122] IPv6: ADDRCONF(NETDEV_UP): enp5s0: link is not ready
[   51.980868] iavf 0000:05:00.0 enp5s0: NIC Link is Up Speed is 40 Gbps Full Duplex
[   51.986564] IPv6: ADDRCONF(NETDEV_CHANGE): enp5s0: link becomes ready

The test also passes on RHEL 9.0.0.
Packages:
  qemu-kvm-6.0.0-12.el9.x86_64
  kernel-5.14.0-0.rc7.54.el9.x86_64 (both host and guest)

Steps are the same as in the Description.

Test results:

# ifconfig
enp5s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.200.110  netmask 255.255.255.0  broadcast 192.168.200.255
        inet6 fe80::c642:4337:db94:fe07  prefixlen 64  scopeid 0x20<link>
        inet6 2001::8b19:cc98:515f:df62  prefixlen 64  scopeid 0x0<global>
        ether 32:d8:80:bb:13:2e  txqueuelen 1000  (Ethernet)
        RX packets 115  bytes 15365 (15.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 96  bytes 11097 (10.8 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp5s0nsby: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 32:d8:80:bb:13:2e  txqueuelen 1000  (Ethernet)
        RX packets 156  bytes 15618 (15.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 48  bytes 9988 (9.7 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

enp6s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.200.110  netmask 255.255.255.0  broadcast 192.168.200.255
        inet6 fe80::5ff0:85c9:dede:b9f7  prefixlen 64  scopeid 0x20<link>
        ether 32:d8:80:bb:13:2e  txqueuelen 1000  (Ethernet)
        RX packets 39  bytes 8075 (7.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 66  bytes 5233 (5.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[root@localhost ~]# dmesg | grep -i failover
[    3.108330] virtio_net virtio1 eth0: failover master:eth0 registered
[    3.112416] virtio_net virtio1 eth0: failover standby slave:eth1 registered
[    3.325045] virtio_net virtio1 enp5s0: failover primary slave:eth0 registered
[   83.618659] virtio_net virtio1 enp5s0: failover primary slave:enp6s0 unregistered
[   86.627624] virtio_net virtio1 enp5s0: failover primary slave:eth0 registered

Related QMP commands:

{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "migrate", "arguments": {"uri": "tcp:10.73.73.98:4000"}}
{"return": {}}
{"timestamp": {"seconds": 1629884328, "microseconds": 625978}, "event": "UNPLUG_PRIMARY", "data": {"device-id": "idXVst"}}
{"execute": "query-migrate"}
{"return": {"blocked": false, "status": "wait-unplug"}}
{"execute":"migrate_cancel"}
{"return": {}}
{"execute": "query-migrate"}
{"return": {"blocked": false, "status": "cancelled"}}

Hello Laurent,

The bug appears again on qemu-kvm-6.1.0-3.el9.x86_64.

But there is no "wait-unplug" status with qemu-kvm-6.1.0-3.el9.x86_64, so I cancelled the migration while its status was "active". That is slightly different. Should I re-open this bug or file a new bug for RHEL 9.0?

(In reply to Yanhui Ma from comment #13)
> Hello Laurent,
>
> The bug appears again on qemu-kvm-6.1.0-3.el9.x86_64.
>
> But there is no "wait-unplug" status with qemu-kvm-6.1.0-3.el9.x86_64, so I
> cancelled the migration while its status was "active".
> That is slightly different.
> Should I re-open this bug or file a new bug for RHEL 9.0?

I think it's a new bug that introduces a regression at this level.

Please open a new bug describing how to reproduce the problem.

Could you test with the qemu parameter
"-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off"?

We have some PCI hotplug regression introduced by:

17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")

(In reply to Laurent Vivier from comment #14)
> I think it's a new bug that introduces a regression at this level.
>
> Please open a new bug describing how to reproduce the problem.
>
> Could you test with the qemu parameter
> "-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off"?
>
> We have some PCI hotplug regression introduced by:
>
> 17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")

Patch proposed upstream:
https://patchew.org/QEMU/20210930100815.1246081-1-lvivier@redhat.com/

[PATCH v2] failover: fix unplug pending detection

Failover needs to detect the end of the PCI unplug to start the migration
after the VFIO card has been unplugged.

To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and reset in
pcie_unplug_device().

But since 17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on
Q35") we have switched to ACPI unplug, so these functions are not called
anymore and the flag is never set. As a result, failover migration cannot
detect whether the card has really been unplugged and acts as if the unplug
were done as soon as it is started, so it does not wait for the end of the
unplug before starting the migration.

We don't see any problem when we test this, because ACPI unplug is faster
than PCIe native hotplug, and by the time the migration really starts the
unplug operation is already done.

See:
c000a9bd06ea ("pci: mark device having guest unplug request pending")
a99c4da9fc2a ("pci: mark devices partially unplugged")

Signed-off-by: Laurent Vivier <lvivier>

Since the problem described in this bug report should be resolved in a
recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update),
and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684