Bug 1531393

Summary: [Q35] "DEVICE_DELETED" event didn't return after delete the second passthrough vf device
Product: Red Hat Enterprise Linux 7 Reporter: jingzhao <jinzhao>
Component: qemu-kvm-rhevAssignee: Peter Xu <peterx>
Status: CLOSED ERRATA QA Contact: jingzhao <jinzhao>
Severity: high Docs Contact:
Priority: high    
Version: 7.5CC: alex.williamson, chayang, ehabkost, hhuang, jen, jinzhao, juzhang, knoel, lmiksik, marcel, michen, mrezanin, mtessun, peterx, rhodain, salmy, toneata, virt-maint, yafu, yfu
Target Milestone: rcKeywords: Regression, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.12.0-1.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1552224 1552626 1555213 (view as bug list) Environment:
Last Closed: 2018-11-01 11:04:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1552224, 1552626, 1555213, 1558351    
Attachments:
Description Flags
dmesg log none

Description jingzhao 2018-01-05 05:03:29 UTC
Created attachment 1377297 [details]
dmesg log

Description of problem:
 qmp didn't get the "DEVICE_DELETED" event after delete the second passthrough vf device, although the device had been deleted successfully in guest

Version-Release number of selected component (if applicable):
kernel-3.10.0-825.el7.x86_64
qemu-kvm-rhev-2.10.0-14.el7.x86_64
passthrough vf device: XL710

How reproducible:
3/3

Steps to Reproduce:
1. Boot guest with qemu command line [1]
2. Deleted two vf device in qmp
{"execute":"device_del","arguments":{"id":"vf2"}}
{"return": {}}
{"timestamp": {"seconds": 1515122236, "microseconds": 265862}, "event": "DEVICE_DELETED", "data": {"device": "vf2", "path": "/machine/peripheral/vf2"}}
{"execute":"device_del","arguments":{"id":"vf1"}}
{"return": {}}


Actual results:
devices can be deleted successfully
[root@localhost ~]# lspci -vvv -t
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
           +-01.0  Red Hat, Inc. QXL paravirtual graphic card
           +-02.0-[01]--
           +-0a.0-[02]----00.0  Red Hat, Inc. Virtio block device
           +-0a.1-[03]----00.0  Red Hat, Inc. Virtio network device
           +-0a.2-[04]----00.0  Red Hat, Inc. Virtio network device
           +-0a.5-[05]--
           +-1d.0  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
           +-1d.1  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
           +-1d.2  Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
           +-1d.7  Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller

qmp didn't get the "DEVICE_DELETED" event

Expected results:
qmp can get the DEVICE_DELETED event

Additional info:
Didn't reproduce the issue with "virtio-net-pci" devices
guest kernel line doesn't contain "intel_iommu=on"

[1]
/usr/libexec/qemu-kvm \
-M q35,accel=kvm,kernel-irqchip=split \
-device intel-iommu,intremap=on,caching-mode=on,eim=on,device-iotlb=on \
-cpu Haswell-noTSX \
-nodefaults -rtc base=utc \
-m 4G \
-smp 4,sockets=4,cores=1,threads=1 \
-enable-kvm \
-uuid 990ea161-6b67-47b2-b803-19fb01d30d12 \
-k en-us \
-nodefaults \
-chardev file,path=/home/seabios.log,id=seabios -device isa-debugcon,chardev=seabios,iobase=0x402 \
-boot menu=on \
-qmp tcp:0:6667,server,nowait \
-usb \
-device usb-tablet \
-vga qxl \
-device pcie-root-port,bus=pcie.0,id=root0,multifunction=on,chassis=1,addr=0xa.0 \
-drive file=/home/sriov/rhel75-seabios-bk.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-virtio-disk0,id=virtio-disk0,bus=root0 \
-device pcie-root-port,bus=pcie.0,id=root1,chassis=11,addr=0xa.1 \
-device virtio-net-pci,netdev=tap10,mac=00:52:68:26:31:03,bus=root1,id=net0 -netdev tap,id=tap10 \
-device pcie-root-port,bus=pcie.0,id=root2,chassis=12,addr=0xa.2 \
-device virtio-net-pci,netdev=tap11,mac=00:52:68:26:31:00,bus=root2,id=net1 -netdev tap,id=tap11 \
-device pcie-root-port,bus=pcie.0,id=root6,chassis=15,addr=0xa.5 \
-device vfio-pci,host=04:02.0,id=vf1,bus=root6,x-req=off \
-device pcie-root-port,bus=pcie.0,id=root7,chassis=4 \
-device vfio-pci,host=04:02.1,id=vf2,bus=root7,x-req=off \
-monitor stdio \
-vnc :1 \


Host devices info:
[root@dell-per730-29 ~]# lspci -v -s 04:02.0 
04:02.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
	Subsystem: Intel Corporation Device 0000
	Flags: bus master, fast devsel, latency 0, NUMA node 0
	[virtual] Memory at 92400000 (64-bit, prefetchable) [size=64K]
	[virtual] Memory at 92910000 (64-bit, prefetchable) [size=16K]
	Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [1a0] Transaction Processing Hints
	Capabilities: [1d0] Access Control Services
	Kernel driver in use: vfio-pci
	Kernel modules: i40evf

[root@dell-per730-29 ~]# lspci -v -s 04:00.0 
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
	Subsystem: Intel Corporation Ethernet Converged Network Adapter XL710-Q2
	Flags: bus master, fast devsel, latency 0, IRQ 55, NUMA node 0
	Memory at 91000000 (64-bit, prefetchable) [size=16M]
	Memory at 92808000 (64-bit, prefetchable) [size=32K]
	Expansion ROM at 92b00000 [disabled] [size=512K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [e0] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 88-90-15-ff-ff-fe-fd-3c
	Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [1a0] Transaction Processing Hints
	Capabilities: [1b0] Access Control Services
	Capabilities: [1d0] #19
	Kernel driver in use: i40e
	Kernel modules: i40e

[root@dell-per730-29 ~]# ll /sys/bus/pci/devices/0000\:04\:02.0/driver
lrwxrwxrwx. 1 root root 0 Jan  4 21:52 /sys/bus/pci/devices/0000:04:02.0/driver -> ../../../../bus/pci/drivers/vfio-pci

Comment 2 jingzhao 2018-01-05 05:10:14 UTC
Didn't reproduce the issue without -device intel-iommu,intremap=on,caching-mode=on,eim=on,device-iotlb=on 

The qemu command line:
/usr/libexec/qemu-kvm \
-M q35 \
-cpu Haswell-noTSX \
-nodefaults -rtc base=utc \
-m 4G \
-smp 4,sockets=4,cores=1,threads=1 \
-enable-kvm \
-uuid 990ea161-6b67-47b2-b803-19fb01d30d12 \
-k en-us \
-nodefaults \
-global isa-debugcon.iobase=0x402 \
-boot menu=on \
-qmp tcp:0:6667,server,nowait \
-usb \
-device usb-tablet \
-vga qxl \
-device pcie-root-port,bus=pcie.0,id=root0,multifunction=on,chassis=1,addr=0xa.0 \
-drive file=/home/sriov/rhel75-seabios-bk.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-virtio-disk0,id=virtio-disk0,bus=root0 \
-device pcie-root-port,bus=pcie.0,id=root1,chassis=11,addr=0xa.1 \
-device virtio-net-pci,netdev=tap10,mac=00:52:68:26:31:03,bus=root1,id=net0 -netdev tap,id=tap10 \
-device pcie-root-port,bus=pcie.0,id=root2,chassis=12,addr=0xa.2 \
-device virtio-net-pci,netdev=tap11,mac=00:52:68:26:31:00,bus=root2,id=net1 -netdev tap,id=tap11 \
-device pcie-root-port,bus=pcie.0,id=root6,chassis=15,addr=0xa.5 \
-device vfio-pci,host=04:02.0,id=vf1,bus=root6,x-req=off \
-device pcie-root-port,bus=pcie.0,id=root7,chassis=4 \
-device vfio-pci,host=04:02.1,id=vf2,bus=root7,x-req=off \
-monitor stdio \
-vnc :1 \

Comment 3 jingzhao 2018-01-05 05:48:09 UTC
Marked it as regression since didn't reproduce it with qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64

Comment 8 Marcel Apfelbaum 2018-01-16 10:37:12 UTC
Hi Peter,

Following the logs, the Guest OS finishes the hot-unplug sequence
and powers down the device. It seems something on QEMU's IOMMU side
prevents the hot-unplug to finish.

Can you please have a look?
Thanks,
Marcel

Comment 9 Peter Xu 2018-01-16 14:03:33 UTC
(In reply to Marcel Apfelbaum from comment #8)
> Hi Peter,
> 
> Following the logs, the Guest OS finishes the hot-unplug sequence
> and powers down the device. It seems something on QEMU's IOMMU side
> prevents the hot-unplug to finish.
> 
> Can you please have a look?
> Thanks,
> Marcel

Hi, Marcel,

The event is sent in RCU thread:

#0  qapi_event_send_device_deleted (has_device=true, device=0x55c2500b81f0 "vfio0", path=0x55c2513530d0 "/machine/peripheral/vfio0", 
    errp=0x55c24f3f3d10 <error_abort>) at qapi-event.c:412
#1  0x000055c24e3ca86c in device_finalize (obj=0x55c25128b270) at /root/git/qemu/hw/core/qdev.c:1087
#2  0x000055c24e5ad6db in object_deinit (obj=0x55c25128b270, type=0x55c250094700) at /root/git/qemu/qom/object.c:462
#3  0x000055c24e5ad709 in object_deinit (obj=0x55c25128b270, type=0x55c2500a40c0) at /root/git/qemu/qom/object.c:466
#4  0x000055c24e5ad709 in object_deinit (obj=0x55c25128b270, type=0x55c250083030) at /root/git/qemu/qom/object.c:466
#5  0x000055c24e5ad74c in object_finalize (data=0x55c25128b270) at /root/git/qemu/qom/object.c:476
#6  0x000055c24e5ae6db in object_unref (obj=0x55c25128b270) at /root/git/qemu/qom/object.c:911
#7  0x000055c24e206988 in memory_region_unref (mr=0x55c25128b4e0) at /root/git/qemu/memory.c:1747
#8  0x000055c24e209d16 in do_address_space_destroy (as=0x55c25128b488) at /root/git/qemu/memory.c:2787
#9  0x000055c24e6da1c8 in call_rcu_thread (opaque=0x0) at /root/git/qemu/util/rcu.c:278
#10 0x00007f78faef4dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f78fac1e94d in clone () from /lib64/libc.so.6

Luckily I still have the debugging stuff on jinzhao's host due to the other bug, so I did a quick test.  I think it's because some references of vfio-pci device is leaked somehow due to existance of VT-d, since this is the backtrace I got when the error happens (or say, when the event is missed):

(gdb) bt
#0  object_unref (obj=0x5559618ef0c0) at /root/git/qemu/qom/object.c:904
#1  0x000055595d710988 in memory_region_unref (mr=0x5559618ef330) at /root/git/qemu/memory.c:1747
#2  0x000055595d713d16 in do_address_space_destroy (as=0x5559618ef2d8) at /root/git/qemu/memory.c:2787
#3  0x000055595dbe41c8 in call_rcu_thread (opaque=0x0) at /root/git/qemu/util/rcu.c:278
#4  0x00007f85ee1b5dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f85ededf94d in clone () from /lib64/libc.so.6
(gdb) p obj->ref
$2 = 3

Basically here obj->ref should be 1 otherwise device_finalize() for vfio-pci device won't be called, and event won't be sent.

I'll dig more tomorrow.  For now, I'll take over the bug since it's for sure vt-d related.

Thanks,
Peter

Comment 10 Peter Xu 2018-01-18 08:58:57 UTC
This is reproducable upstream too, and even without XL710 - I tested
with X540, which also hit this.

A funny thing is that upstream QEMU doesn't have this problem until
commit f7b879e072 ("qdev: defer DEVICE_DEL event until
instance_finalize()", 2017-10-18).  However I don't think that's the
problematic commit.  That commit just revealed another issue after it
fixed one.

After I built a very customized QEMU (and with some script for parsing) I think
these are the two refcounts that are missing (via MemoryRegion
referencing, since memory_region_ref() will add ref onto its owner):

Remaining refcount holders:            
{                                      
    "0x5639af8f8000": {                
        "count": 1,                    
        "devid": 1,                    
        "name": "0000:05:10.2 BAR 0 mmaps[0]",                                 
        "mr": "0x5639af8f8000"         
    },                                 
    "0x5639af8f86d0": {                
        "count": 1,                    
        "devid": 1,                    
        "name": "0000:05:10.2 BAR 3 mmaps[0]",                                 
        "mr": "0x5639af8f86d0"         
    }                                  
}                                      

It means that we referenced these two regions one more time for the
2nd VF (and the 1st VF have paired unrefs, since we have no problem with it):

  "0000:05:10.2 BAR 0 mmaps[0]"
  "0000:05:10.2 BAR 3 mmaps[0]"
  
They are the two BARs for the X540 VF device.  Though, I am still
investigating on how the VT-d device can affect them to leak that refcount.

Peter

Comment 11 Peter Xu 2018-01-19 08:49:55 UTC
Well... this funny refcount leakage seems to be a more general issue on memory API rather than VT-d itself.  I posted my thoughts and solution upstream as RFC (since it touches memory core and I haven't tested too much):

  http://lists.gnu.org/archive/html/qemu-devel/2018-01/msg04703.html

I'll wait for some feedback on the patches before I decide what to do next.

Peter

Comment 18 Oneata Mircea Teodor 2018-03-07 13:20:38 UTC
Please follow 0day process, check the link
https://mojo.redhat.com/docs/DOC-1022552

Comment 27 errata-xmlrpc 2018-11-01 11:04:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443