Bug 2027208
Summary: | [virtual network][vDPA] qemu crash after hot unplug vdpa device | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Lei Yang <leiyang> | |
Component: | qemu-kvm | Assignee: | Laurent Vivier <lvivier> | |
qemu-kvm sub component: | Networking | QA Contact: | Lei Yang <leiyang> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | aadam, chayang, jasowang, jinzhao, jmaloy, juzhang, lmiksik, lulu, lvivier, mrezanin, pezhang, virt-maint, wquan, yfu | |
Version: | 8.6 | Keywords: | Triaged | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | qemu-kvm-6.2.0-9.module+el8.6.0+14480+c0a3aa0f | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2059786 2060843 (view as bug list) | Environment: | ||
Last Closed: | 2022-05-10 13:24:19 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2059786, 2060843 |
Description
Lei Yang
2021-11-29 06:22:42 UTC
I'm not able to reproduce the problem with the vdpa simulator.

Do you use a real vDPA device?

What is the version of the kernel running in the guest?

Thanks

(In reply to Laurent Vivier from comment #1)
> I'm not able to reproduce the problem with the vdpa simulator.
>
> Do you use a real vDPA device?
>
> What is the version of the kernel running in the guest?
>
> Thanks

Hi Laurent

Yes I use real vdpa:
# lspci |grep ConnectX-6
3b:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
3b:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]

Guest kernel version:
kernel-4.18.0-353.el8.x86_64

Please feel free to tell me if you need me to provide an environment so that you can reproduce the problem.

Best Regards
Lei

(In reply to Lei Yang from comment #2)
> Hi Laurent
>
> Yes I use real vdpa:
> # lspci |grep ConnectX-6
> 3b:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
> 3b:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
>
> Guest kernel version:
> kernel-4.18.0-353.el8.x86_64
>
> Please feel free to tell me if you need me to provide an environment so that
> you can reproduce the problem.

Could you try to reproduce the problem with the simulator?
You can use the following steps to create the device:

# modprobe vhost-vdpa
# modprobe vdpa_sim_net
# vdpa dev add mgmtdev vdpasim_net name vdpasim0
ls /sys/devices/vdpasim0
driver power subsystem uevent vhost-vdpa-0

and you can use /dev/vhost-vdpa-0 with "vhostdev" in netdev_add

(In reply to Laurent Vivier from comment #3)
> Could you try to reproduce the problem with the simulator?
> You can use the following steps to create the device:
>
> # modprobe vhost-vdpa
> # modprobe vdpa_sim_net
> # vdpa dev add mgmtdev vdpasim_net name vdpasim0
> ls /sys/devices/vdpasim0
> driver power subsystem uevent vhost-vdpa-0
>
> and you can use /dev/vhost-vdpa-0 with "vhostdev" in netdev_add

Hi Laurent

I tried to test this scenario with an emulator, did not reproduce current problem.

Test Steps:
1. Create simulator vdpa device
[root@dell-per440-23 ~]# modprobe vhost-vdpa
[root@dell-per440-23 ~]# modprobe vdpa_sim_net
[root@dell-per440-23 ~]# vdpa dev add mgmtdev vdpasim_net name vdpasim0
[root@dell-per440-23 ~]# ls /sys/devices/vdpasim0
driver power subsystem uevent vhost-vdpa-0

2. Boot a guest without vdpa device

3. hotplug this device via qmp
$ telnet 10.73.178.83 5555
Trying 10.73.178.83...
Connected to 10.73.178.83.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 1, "major": 6}, "package": "qemu-kvm-6.1.0-5.module+el8.6.0+13430+8fdd5f85"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{'execute': 'netdev_add', 'arguments': {'type': 'vhost-vdpa', 'id': 'idKcVRcM', 'vhostdev': '/dev/vhost-vdpa-0'}}
{"return": {}}
{"execute": "device_add", "arguments": {"driver":"virtio-net-pci","netdev":"idKcVRcM","mac":"00:1a:4a:42:0b:01","id":"id7ARCys","bus":"pcie_extra_root_port_0","addr":"0x0"}}
{"return": {}}
{"timestamp": {"seconds": 1639486083, "microseconds": 575719}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "id7ARCys", "path": "/machine/peripheral/id7ARCys/virtio-backend"}}

4. stop guest via qmp
{'execute': 'stop'}
{"timestamp": {"seconds": 1638165586, "microseconds": 984624}, "event": "STOP"}
{"return": {}}

5. continue guest via qmp
{'execute': 'cont'}
{"timestamp": {"seconds": 1638165588, "microseconds": 628139}, "event": "RESUME"}
{"return": {}}

6. hot unplug vdpa device
{'execute': 'device_del', 'arguments': {'id': 'id7ARCys'}}
{"return": {}}
{"timestamp": {"seconds": 1639486122, "microseconds": 330863}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/id7ARCys/virtio-backend"}}
{"timestamp": {"seconds": 1639486122, "microseconds": 382079}, "event": "DEVICE_DELETED", "data": {"device": "id7ARCys", "path": "/machine/peripheral/id7ARCys"}}
{'execute': 'netdev_del', 'arguments': {'id': 'idKcVRcM'}}
{"return": {}}

Guest works well.

Best Regards
Lei

Hit same issue

Test Version:
kernel-4.18.0-360.el8.mr1880_220122_0148.x86_64
qemu-kvm-6.2.0-5.module+el8.6.0+14025+ca131e0a.x86_64

(In reply to Lei Yang from comment #5)
> Hit same issue
>
> Test Version:
> kernel-4.18.0-360.el8.mr1880_220122_0148.x86_64
> qemu-kvm-6.2.0-5.module+el8.6.0+14025+ca131e0a.x86_64

Is it possible for me to have access to the coredump of QEMU or to the machine to reproduce the problem myself?
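The hot plug, stop, cont, hot unplug sequence from the test steps above could be scripted instead of typed into telnet. The following is only a rough sketch: `build_repro_commands` and `run_commands` are hypothetical helper names, the ids, MAC address and bus are copied from the QMP session in this report, and the host, port and vhost-vdpa device path have to be supplied by the caller. It stops at `device_del`; a full run would wait for the `DEVICE_DELETED` event before sending `netdev_del`.

```python
import json
import socket


def build_repro_commands(vhostdev, netdev_id="idKcVRcM", device_id="id7ARCys",
                         bus="pcie_extra_root_port_0",
                         mac="00:1a:4a:42:0b:01"):
    """Return the QMP commands for hot plug, stop, cont and hot unplug, in order."""
    return [
        {"execute": "qmp_capabilities"},
        {"execute": "netdev_add",
         "arguments": {"type": "vhost-vdpa", "id": netdev_id,
                       "vhostdev": vhostdev}},
        {"execute": "device_add",
         "arguments": {"driver": "virtio-net-pci", "netdev": netdev_id,
                       "mac": mac, "id": device_id, "bus": bus,
                       "addr": "0x0"}},
        {"execute": "stop"},
        {"execute": "cont"},
        {"execute": "device_del", "arguments": {"id": device_id}},
    ]


def run_commands(host, port, commands, timeout=10.0):
    """Send each command to a QMP TCP socket and collect one reply per command.

    Asynchronous events (messages carrying an "event" key) are skipped, so
    every collected item is the "return" or "error" reply to one command.
    """
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        json.loads(stream.readline())  # consume the QMP greeting banner
        for command in commands:
            stream.write(json.dumps(command) + "\n")
            stream.flush()
            while True:
                line = stream.readline()
                if not line:
                    raise ConnectionError("QMP connection closed by QEMU")
                message = json.loads(line)
                if "event" not in message:
                    replies.append(message)
                    break
    return replies
```

A call such as `run_commands("10.73.178.83", 5555, build_repro_commands("/dev/vhost-vdpa-0"))` would replay the steps against the QMP socket used in this report. Because `device_del` returns before the unplug completes, the script should be followed by a check that the QEMU process is still alive.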
(gdb) bt
#0  0x00007ffff488ba4f in raise () from /lib64/libc.so.6
#1  0x00007ffff485edb5 in abort () from /lib64/libc.so.6
#2  0x00007ffff58f9123 in g_assertion_message.cold () from /lib64/libglib-2.0.so.0
#3  0x00007ffff595220e in g_assertion_message_expr () from /lib64/libglib-2.0.so.0
#4  0x0000555555b41a41 in object_unref (objptr=0x55555699f6b0) at ../qom/object.c:1183
#5  object_unref (objptr=0x55555699f6b0) at ../qom/object.c:1177
#6  0x0000555555b41aaf in object_property_del_all (obj=0x555556970fa0) at ../qom/object.c:626
#7  object_finalize (data=0x555556970fa0) at ../qom/object.c:687
#8  object_unref (objptr=0x555556970fa0) at ../qom/object.c:1187
#9  0x0000555555b3ca51 in bus_free_bus_child (kid=0x55555697b750) at ../hw/core/qdev.c:55
#10 0x0000555555c6660b in call_rcu_thread (opaque=<optimized out>) at ../util/rcu.c:284
#11 0x0000555555c5d2f4 in qemu_thread_start (args=0x55555652d260) at ../util/qemu-thread-posix.c:556
#12 0x00007ffff4c0a17f in start_thread () from /lib64/libpthread.so.0
#13 0x00007ffff4876d83 in clone () from /lib64/libc.so.6

#6  0x0000555555b41aaf in object_property_del_all (obj=0x555556970fa0) at ../qom/object.c:626
626         prop->release(obj, prop->name, prop->opaque);
(gdb) p obj->ref
$2 = 0
(gdb) p *prop
$4 = {name = 0x7ff8d001f4b0 "vhost-vdpa\\x2fhost-notifier@0x55555699f5c0 mmaps\\x5b0\\x5d[0]",
  type = 0x7ff8d001f3e0 "child<memory-region>", description = 0x0,
  get = 0x555555b43ab0 <object_get_child_property>, set = 0x0,
  resolve = 0x555555b3fe50 <object_resolve_child_property>,
  release = 0x555555b41b80 <object_finalize_child_property>, init = 0x0,
  opaque = 0x55555699f6b0, defval = 0x0}

"vhost-vdpa/host-notifier@0x55555699f5c0 mmaps\\x5b0\\x5d[0]" is added by vhost_vdpa_host_notifier_init() using virtio_queue_set_host_notifier_mr().

A first guess would be that vhost_vdpa_host_notifier_uninit() is called on the stop command and decreases obj->ref.

Tested with upstream QEMU (55ef0b702bc2), and it seems fixed.
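The assertion in the backtrace above fires while a child property of the device is being released. The toy model below is not QEMU code and is not a confirmed root-cause analysis; it only illustrates, under stated assumptions, how a child property left behind on a failure path can lead to a release on an object whose count is already zero, and how unparenting on that path (the approach of the fix posted later in this report) avoids it. The assumptions are that a failed notifier init leaves the memory region registered as a child of the device, and that the same embedded region is initialised again after stop/cont. All class and function names are hypothetical.

```python
class ToyObject:
    """Reference-counted object with named child properties (toy model only)."""

    def __init__(self, name):
        self.name = name
        self.ref = 1          # reference created at initialisation
        self.children = {}    # property name -> child ToyObject


def unref(obj):
    # Dropping a reference on an object whose count is already zero is a bug;
    # this stands in for the assertion seen in the backtrace.
    if obj.ref <= 0:
        raise AssertionError(f"{obj.name}: unref with ref == 0")
    obj.ref -= 1


def add_child(parent, prop, child):
    # In this model the property takes over the initial reference.
    parent.children[prop] = child


def unparent(parent, child):
    # Remove every property pointing at `child` and drop its reference.
    for prop in [p for p, c in parent.children.items() if c is child]:
        del parent.children[prop]
        unref(child)


def finalize(parent):
    # Release each remaining child property exactly once.
    for prop in list(parent.children):
        unref(parent.children.pop(prop))


def init_notifier(parent, region, prop, mapping_ok, fixed):
    """Register `region` under `parent`; on failure optionally unparent it."""
    region.ref = 1                      # (re)initialise the embedded region
    add_child(parent, prop, region)
    if not mapping_ok:
        if fixed:
            unparent(parent, region)    # cleanup on the failure path
        return False
    return True


def replay(fixed):
    """Two failed inits (before and after stop/cont), then device teardown."""
    device = ToyObject("virtio-net")
    region = ToyObject("host-notifier")
    init_notifier(device, region, "mmaps[0]", mapping_ok=False, fixed=fixed)
    init_notifier(device, region, "mmaps[1]", mapping_ok=False, fixed=fixed)
    finalize(device)                    # hot unplug
    return region.ref
```

In this model `replay(fixed=False)` raises `AssertionError` during `finalize`, because two stale properties release the same region, while `replay(fixed=True)` completes with no child properties left on the device.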
(In reply to Laurent Vivier from comment #9)
> Tested with upstream QEMU (55ef0b702bc2), and it seems fixed.

It's not fixed: I was unable to reproduce the coredump because of another error:

vhost_set_features failed: Device or resource busy (16)
unable to start vhost net: 16: falling back on userspace virtio

I have sent a fix upstream:
https://patchew.org/QEMU/20220211170259.1388734-1-lvivier@redhat.com/mbox

Author: Laurent Vivier <lvivier>
Date: Fri Feb 11 17:49:36 2022 +0100

    hw/virtio: vdpa: Fix leak of host-notifier memory-region

    If call virtio_queue_set_host_notifier_mr fails, should free
    host-notifier memory-region.

    This problem can trigger a coredump with some vDPA drivers (mlx5,
    but not with the vdpasim), if we unplug the virtio-net card from
    the guest after a stop/start.

    The same fix has been done for vhost-user:
      1f89d3b91e3e ("hw/virtio: Fix leak of host-notifier memory-region")

    Fixes: d0416d487bd5 ("vhost-vdpa: map virtqueue notification area if possible")
    Cc: jasowang
    Resolves: https://bugzilla.redhat.com/2027208
    Signed-off-by: Laurent Vivier <lvivier>

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 04ea43704f5d..11f696468dc1 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -431,6 +431,7 @@ static int vhost_vdpa_host_notifier_init(struct vhost_dev *dev, int queue_index)
     g_free(name);

     if (virtio_queue_set_host_notifier_mr(vdev, queue_index, &n->mr, true)) {
+        object_unparent(OBJECT(&n->mr));
         munmap(addr, page_size);
         goto err;
     }

==> Reproduced on qemu-kvm-6.2.0-5.module+el8.6.0+14025+ca131e0a.x86_64

1. Boot a guest without vdpa device
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine q35,memory-backend=mem-machine_mem \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device VGA,bus=pcie.0,addr=0x2 \
-m 28672 \
-object memory-backend-ram,size=28672M,id=mem-machine_mem \
-smp 32,maxcpus=32,cores=16,threads=1,dies=1,sockets=2 \
-cpu 'Cascadelake-Server',ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,kvm_pv_unhalt=on \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
-blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel860-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-net none \
-enable-kvm \
-device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=4 \
-monitor stdio \
-qmp tcp:0:5555,server,nowait

2. hot plug vdpa device
$ telnet 10.73.178.85 5555
Trying 10.73.178.85...
Connected to 10.73.178.85.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 6}, "package": "qemu-kvm-6.2.0-5.module+el8.6.0+14025+ca131e0a"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{'execute': 'netdev_add', 'arguments': {'type': 'vhost-vdpa', 'id': 'idKcVRcM', 'vhostdev': '/dev/vhost-vdpa-3'}}
{"return": {}}
{"execute": "device_add", "arguments": {"driver":"virtio-net-pci","netdev":"idKcVRcM","mac":"00:1a:4a:42:0b:01","id":"id7ARCys","bus":"pcie_extra_root_port_0","addr":"0x0"}}
{"return": {}}

3. stop guest via qmp
{'execute': 'stop'}
{"timestamp": {"seconds": 1644809353, "microseconds": 951537}, "event": "STOP"}
{"return": {}}

4. continue guest via qmp
{'execute': 'cont'}
{"timestamp": {"seconds": 1644809357, "microseconds": 199254}, "event": "RESUME"}
{"return": {}}

5. hot unplug vdpa device
{'execute': 'device_del', 'arguments': {'id': 'id7ARCys'}}
{"return": {}}
Connection closed by foreign host.

6. qemu crash

==> Repeated the above process; the test passes on qemu-kvm-6.2.0-5.el8.BZ2027208.

So this bug should be fixed in qemu-kvm-6.2.0-5.el8.BZ2027208.

*** Bug 2039210 has been marked as a duplicate of this bug. ***

I tried to test it with qemu-kvm-6.2.0-9.module+el8.6.0+14495+7194fa43.x86_64, there is no issue any more.

Test Version:
qemu-kvm-6.2.0-9.module+el8.6.0+14495+7194fa43.x86_64
kernel-4.18.0-372.el8.x86_64

Test Steps
1. Boot a guest without vdpa device
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-machine q35,memory-backend=mem-machine_mem \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device VGA,bus=pcie.0,addr=0x2 \
-m 28672 \
-object memory-backend-ram,size=28672M,id=mem-machine_mem \
-smp 32,maxcpus=32,cores=16,threads=1,dies=1,sockets=2 \
-cpu 'Cascadelake-Server',ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,kvm_pv_unhalt=on \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
-blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel860-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-net none \
-enable-kvm \
-device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=4 \
-monitor stdio \
-qmp tcp:0:5555,server,nowait

2. hotplug a vdpa device
$ telnet 10.73.178.83 5555
Trying 10.73.178.83...
Connected to 10.73.178.83.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 6}, "package": "qemu-kvm-6.2.0-9.module+el8.6.0+14495+7194fa43"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{'execute': 'netdev_add', 'arguments': {'type': 'vhost-vdpa', 'id': 'idKcVRcM', 'vhostdev': '/dev/vhost-vdpa-7'}}
{"return": {}}
{"execute": "device_add", "arguments": {"driver":"virtio-net-pci","netdev":"idKcVRcM","mac":"00:1a:4a:42:0b:01","id":"id7ARCys","bus":"pcie_extra_root_port_0","addr":"0x0"}}
{"return": {}}

3. Stop guest
{'execute': 'stop'}
{"timestamp": {"seconds": 1647484484, "microseconds": 938175}, "event": "STOP"}
{"return": {}}

4. continue guest
{'execute': 'cont'}
{"timestamp": {"seconds": 1647484490, "microseconds": 139176}, "event": "RESUME"}
{"return": {}}

5. Hotunplug device
{"return": {}}
{"timestamp": {"seconds": 1647484496, "microseconds": 410998}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/id7ARCys/virtio-backend"}}
{"timestamp": {"seconds": 1647484496, "microseconds": 462144}, "event": "DEVICE_DELETED", "data": {"device": "id7ARCys", "path": "/machine/peripheral/id7ARCys"}}
{'execute': 'netdev_del', 'arguments': {'id': 'idKcVRcM'}}
{"return": {}}

6. guest works well

Mirek, do you know why this BZ is stuck in MODIFIED state?

Thanks

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1759