Bug 2182961

Summary: virtqemud coredump when hotunplug a hostdev interface
Product: Red Hat Enterprise Linux 9 Reporter: yalzhang <yalzhang>
Component: libvirt Assignee: Peter Krempa <pkrempa>
libvirt sub component: Networking QA Contact: yalzhang <yalzhang>
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: hhan, jdenemar, jtomko, lmen, pkrempa, smitterl, virt-maint, yanqzhan, yicui
Version: 9.3 Keywords: Automation, Regression, Triaged
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-9.2.0-1.el9 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-11-07 08:31:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version: 9.2.0
Embargoed:
Attachments:
Description Flags
bisection none

Description yalzhang@redhat.com 2023-03-30 04:38:48 UTC
Description of problem:
virtqemud dumps core when hot-unplugging a hostdev interface

Version-Release number of selected component (if applicable):
libvirt-9.1.0-1.el9.x86_64
qemu-kvm-7.2.0-14.el9_2.x86_64
kernel-5.14.0-289.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a VM with a hostdev interface:
# virsh dumpxml avocado-vt-vm1 --xpath //interface
<interface type="hostdev" managed="yes">
  <mac address="52:54:00:aa:5c:5a"/>
  <source>
    <address type="pci" domain="0x0000" bus="0x3b" slot="0x10" function="0x2"/>
  </source>
  <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
</interface>

2. After the VM boots successfully, hot-unplug the hostdev interface; virtqemud dumps core:
# virsh start avocado-vt-vm1
# pidof virtqemud
639569
# virsh detach-interface avocado-vt-vm1 hostdev 
error: Disconnected from qemu:///system due to end of file
error: Failed to detach interface
error: End of file while reading data: Input/output error

# coredumpctl list | grep 639569
Wed 2023-03-29 23:47:30 EDT 639569   0   0 SIGABRT present  /usr/sbin/virtqemud    1.0M

Some errors in the libvirtd log:
2023-03-30 03:47:32.772+0000: 640045: error : virPCIDeviceReset:1073 : internal error: Unable to reset PCI device 0000:3b:10.2: internal error: Active 0000:3b:00.0 devices on bus with 0000:3b:10.2, not doing bus reset
2023-03-30 03:47:32.772+0000: 640045: error : virHostdevResetAllPCIDevices:614 : Failed to reset PCI device: internal error: Unable to reset PCI device 0000:3b:10.2: internal error: Active 0000:3b:00.0 devices on bus with 0000:3b:10.2, not doing bus reset
2023-03-30 03:47:33.778+0000: 640110: error : virCgroupDenyDevicePath:2256 : Path '/dev/vfio/145' is not accessible: No such file or directory
2023-03-30 03:47:33.786+0000: 640110: error : virPCIDeviceTrySecondaryBusReset:838 : internal error: Active 0000:3b:00.0 devices on bus with 0000:3b:10.2, not doing bus reset

Actual results:
virtqemud dumps core when hot-unplugging a hostdev interface

Expected results:
virtqemud should not coredump

Additional info:
Testing with libvirt-9.0.0-10.el9_2.x86_64 and the same qemu and kernel does not reproduce the issue.

Comment 3 Peter Krempa 2023-03-30 07:26:00 UTC
Looks like a double free:

#8  0x00007f261ab425ed in g_free (mem=0x7f26080432d0) at ../glib/gmem.c:199
#9  0x00007f261a6bdd66 in virBitmapFree (bitmap=0x7f260802ff70) at ../src/util/virbitmap.c:97
#10 virBitmapFree (bitmap=0x7f260802ff70) at ../src/util/virbitmap.c:94
#11 0x00007f261a75d153 in virDomainNetDefFree (def=0x7f2604028860) at ../src/conf/domain_conf.c:2749
#12 virDomainNetDefFree (def=def@entry=0x7f2604028860) at ../src/conf/domain_conf.c:2704
#13 0x00007f26140fe3f4 in qemuDomainRemoveHostDevice (driver=0x7f25cc022310, vm=0x7f25cc08e850, hostdev=<optimized out>) at ../src/qemu/qemu_hotplug.c:4564

The free call seems to correspond with:

virBitmapFree(def->source.subsys.u.pci.origstates);

Comment 4 Han Han 2023-03-30 12:39:35 UTC
Created attachment 1954672
bisection

Bisection shows the regression comes from:
d9e4075d4e9e4d699b5083572a534545f35a91b1 is the first bad commit
commit d9e4075d4e9e4d699b5083572a534545f35a91b1
Author: Peter Krempa <pkrempa>
Date:   Thu Oct 6 13:17:00 2022 +0200

    conf: Store 'origstates' of PCI hostdevs in a bitmap
    
    Refactor the code to use a bitmap with an enum.
    
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>
    Reviewed-by: Martin Kletzander <mkletzan>

 src/conf/domain_conf.c      | 97 ++++++++++++++++++++++-----------------------
 src/conf/domain_conf.h      | 31 +++++----------
 src/conf/virconftypes.h     |  2 -
 src/hypervisor/virhostdev.c | 25 +++++++-----
 4 files changed, 72 insertions(+), 83 deletions(-)


Run the following:
0. Clone the libvirt git tree to ~ and extract the attachment to ~. Prepare the disk image referenced by the domain XML rhel.xml, and update inf.xml with your VF's PCI address.
1. Run virtqemud-onece.sh to trigger the crash. After that, a buggy virtqemud build cannot start while the previous qemu-kvm process is still running.
2. Use virtqemud-abrt.sh as the script for bisection.

Comment 5 Peter Krempa 2023-03-30 13:20:53 UTC
Fixed by:

commit 0bfd11dd852335c1274b6dc1e771bd745d1fd94d 
Author: Peter Krempa <pkrempa>
Date:   Thu Mar 30 11:42:31 2023 +0200

    conf: Clear pointer to freed bitmap holding hostdev's 'origstates'
    
    'virDomainHostdevDefClear' must clear the pointers too as it can be
    invoked multiple times on the same object e.g. inside
    qemuDomainRemoveHostDevice once via virDomainHostdevDefFree which skips
    freeing the object if it's used via <interface> and thus has a 'net'
    definition corresponding to it, and then subsequently via
    virDomainNetDefFree.
    
    Fix it by clearing the pointer along with freeing it.
    
    Fixes: d9e4075d4e9
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2182961
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

v9.2.0-rc2-1-g0bfd11dd85
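
For readers unfamiliar with the pattern, the commit message above describes a "clear" function that can run twice on the same object and therefore must reset the pointer it frees. The following is only a minimal sketch of that idea in plain C, not the actual libvirt patch: Bitmap, HostdevDef, bitmap_new, bitmap_free, hostdev_def_clear and demo_clear_twice are hypothetical stand-ins for illustration, and the real code operates on libvirt's own types.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bitmap type; NOT libvirt's virBitmap. */
typedef struct {
    unsigned long *bits;
} Bitmap;

/* Hypothetical stand-in for a hostdev definition that owns a bitmap. */
typedef struct {
    Bitmap *origstates;
} HostdevDef;

static Bitmap *bitmap_new(void)
{
    Bitmap *b = calloc(1, sizeof(*b));
    if (b)
        b->bits = calloc(1, sizeof(*b->bits));
    return b;
}

/* Like free(), accepts NULL and does nothing in that case. */
static void bitmap_free(Bitmap *b)
{
    if (!b)
        return;
    free(b->bits);
    free(b);
}

/* May be invoked more than once on the same object, so it must leave
 * the object in a state that is safe to clear again. */
static void hostdev_def_clear(HostdevDef *def)
{
    assert(def != NULL);
    bitmap_free(def->origstates);
    /* Without this reset, a second call would free the same memory
     * again and the process would abort, as in the backtrace. */
    def->origstates = NULL;
}

/* Clears the same object twice, mimicking the two teardown paths.
 * Returns 0 when the pointer ends up NULL, -1 on allocation failure. */
static int demo_clear_twice(void)
{
    HostdevDef def = { .origstates = bitmap_new() };
    if (!def.origstates)
        return -1;
    hostdev_def_clear(&def);   /* first teardown path */
    hostdev_def_clear(&def);   /* second path: now a harmless no-op */
    return def.origstates == NULL ? 0 : 1;
}
```

Calling demo_clear_twice() from a test program exercises both calls. GLib also provides g_clear_pointer(), which frees a pointer and sets it to NULL in one step; whether the real fix uses it is not stated in this report.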

Comment 6 Han Han 2023-03-31 01:54:09 UTC
The test from comment 4 passes on v9.2.0-rc2-1-g0bfd11dd85:
+ timeout -s INT 3 /root/libvirt/build/src/virtqemud
2023-03-31 01:47:14.987+0000: 84639: info : libvirt version: 9.2.0
2023-03-31 01:47:14.987+0000: 84639: info : hostname: dell-per740xd-19.lab.eng.pek2.redhat.com
2023-03-31 01:47:14.987+0000: 84639: error : virCgroupDenyDevicePath:2256 : Path '/dev/vfio/137' is not accessible: No such file or directory
2023-03-31 01:47:14.987+0000: 84639: warning : qemuDomainRemoveHostDevice:4525 : Failed to remove host device cgroup ACL
+ '[' 124 -eq 134 ']'
+ exit 0

Comment 7 yalzhang@redhat.com 2023-04-04 16:16:50 UTC
The issue does not occur in the automated functional test job with libvirt-9.2.0-1.el9.x86_64.

Comment 11 yalzhang@redhat.com 2023-05-19 01:53:13 UTC
Tested on libvirt-9.3.0-2.el9.x86_64; the issue is fixed.

Comment 13 errata-xmlrpc 2023-11-07 08:31:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: libvirt security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6409