Bug 2033279

Summary: [wrb][qemu-kvm 6.2] The hot-unplugged device can not be hot-plugged back
Product: Red Hat Enterprise Linux 8
Reporter: Yanghang Liu <yanghliu>
Component: qemu-kvm
Assignee: Kevin Wolf <kwolf>
qemu-kvm sub component: Devices
QA Contact: Yanghang Liu <yanghliu>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: ailan, alex.williamson, chayang, coli, jinzhao, juzhang, kwolf, leiyang, lizhu, mark, mst, pezhang, pkrempa, virt-maint, yafu, yalzhang, yanghliu, yicui, ymankad
Version: 8.6
Keywords: Regression, TestBlocker, Triaged
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: qemu-kvm-6.2.0-6.module+el8.6.0+14165+5e5e76ac
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-10 13:24:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yanghang Liu 2021-12-16 12:25:08 UTC
Description of problem:
The hot-unplugged PF/VF cannot be hot-plugged back.

Version-Release number of selected component (if applicable):
host:
qemu-kvm-6.2.0-1.rc2.scrmod+el8.6.0+13458+219ac088.wrb211124.x86_64
libvirt-7.9.0-1.module+el8.6.0+13150+28339563.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a VM with a PF/VF

# virt-install --machine=q35 --noreboot --name=rhel86 --memory=4096 --vcpus=4 --graphics type=vnc,port=5986,listen=0.0.0.0  --network bridge=switch,model=virtio,mac=52:54:00:00:86:86 --import --noautoconsole --disk path=/home/images/RHEL86.qcow2,bus=virtio,cache=none,format=qcow2,io=threads,size=20 --hostdev pci_0000_e3_0a_0  

The device xml:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0xe3' slot='0x0a' function='0x0'/>
      </source>
    </hostdev>


2. Hot-unplug the PF/VF

# virsh detach-device-alias rhel86 hostdev0
Device detach request sent successfully    <--- But the PF/VF xml still exists in the vm

3. Check the PF/VF info in the VM

# lspci or # ifconfig  <-- No information about the hot-unplugged PF/VF is shown

# dmesg
[   37.105546] pcieport 0000:00:02.3: pciehp: Slot(0-3): Attention button pressed
[   37.107395] pcieport 0000:00:02.3: pciehp: Slot(0-3): Powering off due to button press
[   42.634339] iavf 0000:04:00.0: Hardware reset detected

4. Hot-plug the PF/VF back into the VM

# virsh attach-device rhel86 /tmp/device/0000\:e3\:0a.0.xml 
error: Failed to attach device from /tmp/device/0000:e3:0a.0.xml
error: Requested operation is not valid: PCI device 0000:e3:0a.0 is in use by driver QEMU, domain rhel86



Actual results:
The PF/VF XML still exists in the VM after the PF/VF device is hot-unplugged.
The hot-unplugged PF/VF cannot be hot-plugged back.

Expected results:
The hot-unplugged PF/VF can be hot-plugged back successfully

Additional info:

(1) Testing the same scenario in the same test env with qemu-kvm only (without libvirt) *does not reproduce this problem*

The simplified qemu command line is as follows:
/usr/libexec/qemu-kvm -name rhel86 -M q35 -enable-kvm \
-monitor stdio \
-nodefaults \
-m 4G \
-boot menu=on \
-cpu host \
-smp 8,sockets=4,cores=2,threads=1,maxcpus=8 \
-qmp tcp:0:5555,server,nowait \
-device pcie-root-port,id=root.1,chassis=1,addr=0x2.0,multifunction=on \
-device pcie-root-port,id=root.2,chassis=2,addr=0x2.1 \
-device pcie-root-port,id=root.3,chassis=3,addr=0x2.2 \
-device pcie-root-port,id=root.4,chassis=4,addr=0x2.3 \
-device pcie-root-port,id=root.5,chassis=5,addr=0x2.4 \
-device pcie-root-port,id=root.6,chassis=6,addr=0x2.5 \
-device pcie-root-port,id=root.7,chassis=7,addr=0x2.6 \
-device pcie-root-port,id=root.8,chassis=8,addr=0x2.7 \
-blockdev node-name=back_image,driver=file,cache.direct=on,cache.no-flush=off,filename=/home/images/RHEL86.qcow2,aio=threads \
-blockdev node-name=drive-virtio-disk0,driver=qcow2,cache.direct=on,cache.no-flush=off,file=back_image \
-device virtio-blk-pci,drive=drive-virtio-disk0,id=disk0,bus=root.1 \
-device VGA,id=video1,bus=root.2 \
-vnc :0 \
-device virtio-net-pci,netdev=nic1,id=vnet0,mac=52:54:00:00:86:86,bus=root.3 \
-netdev tap,id=nic1,script=/etc/qemu-ifup,vhost=on \
-device vfio-pci,host=0000:e3:0a.0,bus=root.4,id=pf1


The related qmp:
{"execute":"device_del","arguments":{"id":"vf1"}}
{"return": {}}
{"timestamp": {"seconds": 1639658800, "microseconds": 685326}, "event": "DEVICE_DELETED", "data": {"device": "vf1", "path": "/machine/peripheral/pf1"}} 
{"execute":"device_add","arguments":{"driver":"vfio-pci","host":"0000:e3:0a.0","id":"vf1","bus":"root.4"}}
{"return": {}}

Comment 1 Yanghang Liu 2021-12-16 12:32:28 UTC
> Version-Release number of selected component (if applicable):
> qemu-kvm-6.2.0-1.rc2.scrmod+el8.6.0+13458+219ac088.wrb211124.x86_64
> libvirt-7.9.0-1.module+el8.6.0+13150+28339563.x86_64

The hot-unplugged PF/VF can be hot-plugged back successfully in the following test env:
qemu-kvm-6.1.0-5.module+el8.6.0+13430+8fdd5f85.x86_64
libvirt-7.9.0-1.module+el8.6.0+13150+28339563.x86_64

Comment 2 Yanghang Liu 2021-12-16 12:36:52 UTC
I am still not sure whether the root cause of this bug is in libvirt or qemu-kvm,

but based on comment 1, I am opening this bug against qemu-kvm first and marking it as a regression.

Feel free to move this bug to libvirt if we find that the root cause is in libvirt.

Comment 3 yalzhang@redhat.com 2021-12-17 02:15:53 UTC
I also encountered this issue when testing with wrb qemu.
There is no issue with the combination below:
libvirt-7.10.0-1.module+el8.6.0+13502+4f24a11d.x86_64
qemu-kvm-6.1.0-5.module+el8.6.0+13430+8fdd5f85.x86_64

But when I update qemu-kvm to 6.2.0-1.rc1.scrmod+el8.6.0+13325+d4e3491c.wrb21117.x86_64, the issue occurs. So I think some change in the wrb qemu-kvm is causing this libvirt 'noncooperation'.

1. Start vm with 1 interface:
# virsh domiflist rhel 
 Interface   Type      Source    Model    MAC
-------------------------------------------------------------
 vnet4       network   default   e1000e   52:54:00:c0:a0:9d

2. After the VM boots up successfully, hot-unplug the interface:
# virsh detach-interface rhel network  52:54:00:c0:a0:9d
Interface detached successfully

Checking in the guest OS, the interface is detached.
But in the guest XML the interface still exists, which is not expected.
# virsh domiflist rhel 
 Interface   Type      Source    Model    MAC
-------------------------------------------------------------
 vnet4       network   default   e1000e   52:54:00:c0:a0:9d

# virsh dumpxml rhel | grep /interface -B7
    <interface type='network'>
      <mac address='52:54:00:c0:a0:9d'/>
      <source network='default' portid='d3ed5141-8efd-4d69-be40-c8512530ea25' bridge='virbr0'/>
      <target dev='vnet4'/>
      <model type='e1000e'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
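
The "still in the XML" check can also be scripted rather than done by eye. Below is a minimal sketch in Python. It assumes the output of virsh dumpxml has already been captured into a string. The helper name has_alias and the cut-down domain XML are illustrative only, built from the fragment above.

```python
import xml.etree.ElementTree as ET


def has_alias(domain_xml, alias):
    """Return True if any device in the domain XML carries <alias name=alias/>."""
    root = ET.fromstring(domain_xml)
    return any(el.get("name") == alias for el in root.iter("alias"))


# Trimmed-down stand-in for `virsh dumpxml rhel`, built from the fragment above.
domain_xml = """
<domain type='kvm'>
  <devices>
    <interface type='network'>
      <mac address='52:54:00:c0:a0:9d'/>
      <target dev='vnet4'/>
      <model type='e1000e'/>
      <alias name='net0'/>
    </interface>
  </devices>
</domain>
"""

# After a completed hot-unplug this would be expected to report False.
print(has_alias(domain_xml, "net0"))
```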

Comment 6 Yanghang Liu 2021-12-20 07:50:10 UTC
This bug exists in the following test env:
qemu-kvm-6.2.0-1.el9.x86_64
libvirt-7.10.0-1.el9.x86_64

Comment 12 Lili Zhu 2021-12-30 06:20:36 UTC
Tested with:
qemu-kvm-6.2.0-1.el9.x86_64
libvirt-7.10.0-1.el9.x86_64

For virtiofs and watchdog devices, I hit the same issue as in Comment #3: the devices are hot-unplugged in the guest, but not removed from the guest XML.

Comment 13 yalzhang@redhat.com 2022-01-04 01:50:38 UTC
This is the same issue as Bug 2036669.

Comment 17 Yanghang Liu 2022-01-27 02:38:36 UTC
Keeping this bug open because the issue still exists in qemu-kvm-6.2.0-5.module+el8.6.0+14025+ca131e0a.x86_64.

Comment 18 Yanghang Liu 2022-01-27 03:22:59 UTC
 
>  This bug is the same with Bug 2036669 

>  This issue can still be reproduced in qemu-kvm-6.2.0-4.el9.x86_64, while it is fixed in qemu-kvm-6.2.0-5.el9.x86_64.

>  Keep this bug open for this issue still exists in qemu-kvm-6.2.0-5.module+el8.6.0+14025+ca131e0a.x86_64.


Hi Michael, Kevin and Yash

It seems to me that the same bug has been fixed in qemu-kvm-6.2.0-5.el9.x86_64.

May I ask if we can fix this bug in RHEL 8.6, as it is a Regression and a TestBlocker?

Comment 19 Kevin Wolf 2022-01-27 11:43:53 UTC
The original description of this bug doesn't contain any JSON -device in the command line, and it includes a correct DEVICE_DELETED event in the observed QMP traffic.

Is this still true?

If so, both the condition to trigger the bug and the result are different from bug 2036669, so this looks entirely unrelated.

Comment 20 Yanghang Liu 2022-01-28 04:11:28 UTC
(In reply to Kevin Wolf from comment #19)

> The original description of this bug doesn't contain any JSON -device in the command line, and it includes a correct DEVICE_DELETED event in the observed QMP traffic.
> Is this still true?

Hi Kevin,

The information I added in the description indicates that "This bug cannot be reproduced when the -device qemu cmd is not in JSON format"

I think this result is consistent with your bug.


> Additional info:

>(1) Only using qemu-kvm to test the same scenario in the same test env *does not reproduce this bug*              <--- Please pay attention to the info I highlight here.

>The Simplified qemu command line is as following:
...
>-device vfio-pci,host=0000:e3:0a.0,bus=root.4,id=pf1 \

> The related qmp:
> {"execute":"device_del","arguments":{"id":"vf1"}}
> {"return": {}}
> {"timestamp": {"seconds": 1639658800, "microseconds": 685326}, "event": "DEVICE_DELETED", "data": {"device": "vf1", "path": "/machine/peripheral/pf1"}} 
> {"execute":"device_add","arguments":{"driver":"vfio-pci","host":"0000:e3:0a.0","id":"vf1","bus":"root.4"}}
> {"return": {}}

Comment 21 Yanghang Liu 2022-01-28 05:15:03 UTC
Besides, let me translate the reproducer into a qemu command line and QMP commands to make this clearer for us.

Test env:
qemu-kvm-6.2.0-4.el9.x86_64
libvirt-7.10.0-1.el9.x86_64


> Steps to Reproduce:
> 1.start a vm with a PF/VF
> 
> # virt-install --machine=q35 --noreboot --name=rhel86 --memory=4096
> --vcpus=4 --graphics type=vnc,port=5986,listen=0.0.0.0  --network
> bridge=switch,model=virtio,mac=52:54:00:00:86:86 --import --noautoconsole
> --disk
> path=/home/images/RHEL86.qcow2,bus=virtio,cache=none,format=qcow2,io=threads,
> size=20 --hostdev pci_0000_e3_0a_0  
> 
> The device xml:
> 
>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0xe3' slot='0x0a' function='0x0'/>
>       </source>
>     </hostdev>

The related qemu cmd line:
-device {"driver":"vfio-pci","host":"0000:e3:0a.0","id":"hostdev0"} 
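
To make the difference between the two -device syntaxes concrete, here is a small Python sketch that rewrites the key=value form into the JSON object form shown on the line above. It is only an illustration: device_opts_to_json is a made-up helper, it does not handle QEMU's ",," comma escaping, and it keeps every value as a string, whereas the JSON form may need real numbers or booleans for typed properties.

```python
import json


def device_opts_to_json(optstr):
    """Convert 'driver,key=value,...' into the JSON object form of -device."""
    driver, *pairs = optstr.split(",")
    device = {"driver": driver}
    for pair in pairs:
        key, _, value = pair.partition("=")
        device[key] = value  # values are kept as strings in this sketch
    return json.dumps(device, separators=(",", ":"))


print(device_opts_to_json("vfio-pci,host=0000:e3:0a.0,id=hostdev0"))
```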

> 2.Hot-unplug the PF/VF
> 
> # virsh detach-device-alias rhel86 hostdev0
> Device detach request sent successfully    <--- But the PF/VF xml still
> exists in the vm

The related qmp:

{"execute":"device_del","arguments":{"id":"hostdev0"},"id":"libvirt-405"}
{"return": {}, "id": "libvirt-405"}

No related event is output afterwards, i.e. nothing like: {"timestamp": {"seconds": 1643339608, "microseconds": 630965}, "event": "DEVICE_DELETED", "data": {"device": "hostdev0", "path": "/machine/peripheral/hostdev0"}}


> 3.check the PF/VF info in the vm
> 
> # lspci or # ifconfig  <-- There is no any info about the hot-unplugged
> PF/VF 
> 
> # dmesg
> [   37.105546] pcieport 0000:00:02.3: pciehp: Slot(0-3): Attention button
> pressed
> [   37.107395] pcieport 0000:00:02.3: pciehp: Slot(0-3): Powering off due to
> button press
> [   42.634339] iavf 0000:04:00.0: Hardware reset detected
> 
> 4. Hot-plug the PF/VF back to the vm
> 
> # virsh attach-device rhel86 /tmp/device/0000\:e3\:0a.0.xml 
> error: Failed to attach device from /tmp/device/0000:e3:0a.0.xml
> error: Requested operation is not valid: PCI device 0000:e3:0a.0 is in use
> by driver QEMU, domain rhel86

The "Hot-plug the PF/VF back to the vm" op is blocked by libvirt because the "Hot-unplug the PF/VF" op has not finished yet.

Comment 22 Kevin Wolf 2022-01-28 08:34:22 UTC
Sorry, I missed that this information was related to the case where it does *not* reproduce.

Then yes, we can use this bug to fix it in 8.6. Note that in 9.0, the problem was first worked around in libvirt, but fixing just QEMU should be enough.

Comment 23 Peter Krempa 2022-01-28 08:48:52 UTC
rhel-8.6 will get (probably already got) libvirt-8.0, which has the workaround, as it is an upstreamed patch, so the code base is identical to rhel-9 in this regard.

Comment 27 Yanan Fu 2022-02-09 06:14:20 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 28 Yanghang Liu 2022-02-09 09:38:11 UTC
> Steps to Reproduce:
> 1.start a vm with a PF/VF
> 
> # virt-install --machine=q35 --noreboot --name=rhel86 --memory=4096
> --vcpus=4 --graphics type=vnc,port=5986,listen=0.0.0.0  --network
> bridge=switch,model=virtio,mac=52:54:00:00:86:86 --import --noautoconsole
> --disk
> path=/home/images/RHEL86.qcow2,bus=virtio,cache=none,format=qcow2,io=threads,
> size=20 --hostdev pci_0000_e3_0a_0  
> 
> The device xml:
> 
>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0xe3' slot='0x0a' function='0x0'/>
>       </source>
>     </hostdev>
> 
> 
> 2.Hot-unplug the PF/VF
> # virsh detach-device-alias rhel86 hostdev0


> 3.check the PF/VF info in the vm
> # lspci or # ifconfig
> # dmesg

> 4. Hot-plug the PF/VF back to the vm
> # virsh attach-device rhel86 /tmp/device/0000\:e3\:0a.0.xml 



Verification Result : PASS

  This bug can be reproduced in the following test env:
    qemu-kvm-6.2.0-5.module+el8.6.0+14025+ca131e0a.x86_64
    libvirt-7.10.0-1.module+el8.6.0+13502+4f24a11d.x86_64

  This bug has been fixed in the following test env:
    qemu-kvm-6.2.0-6.module+el8.6.0+14167+61b0e671.x86_64
    libvirt-7.10.0-1.module+el8.6.0+13502+4f24a11d.x86_64

Comment 30 errata-xmlrpc 2022-05-10 13:24:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1759