Bug 2128929 - [rhel9.2] hotplug/hotunplug mlx vdpa device to the occupied addr port, then qemu core dump occurs after shutdown guest
Summary: [rhel9.2] hotplug/hotunplug mlx vdpa device to the occupied addr port, then q...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ani Sinha
QA Contact: Yiqian Wei
URL:
Whiteboard:
Depends On:
Blocks: 2227721
 
Reported: 2022-09-22 01:58 UTC by Lei Yang
Modified: 2023-11-07 09:16 UTC
CC List: 13 users

Fixed In Version: qemu-kvm-8.0.0-7.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2227721 (view as bug list)
Environment:
Last Closed: 2023-11-07 08:26:38 UTC
Type: ---
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/centos-stream/src qemu-kvm merge_requests 174 0 None opened vhost-vdpa: do not cleanup the vdpa/vhost-net structures if peer nic is present 2023-06-29 15:40:58 UTC
Red Hat Issue Tracker RHELPLAN-134645 0 None None None 2022-09-22 02:07:33 UTC
Red Hat Product Errata RHSA-2023:6368 0 None None None 2023-11-07 08:27:17 UTC

Description Lei Yang 2022-09-22 01:58:47 UTC
Description of problem:
1. Hotplug a vDPA device to an occupied addr port.
2. Hotunplug this vDPA device.
3. A QEMU core dump occurs after shutting down the guest via "shutdown -h now".
(qemu) qemu-kvm: ../hw/net/vhost_net.c:483: VHostNetState *get_vhost_net(NetClientState *): Assertion `vhost_net' failed.

Version-Release number of selected component (if applicable):
kernel-5.14.0-165.el9.x86_64
qemu-kvm-7.1.0-1.el9.x86_64
libvirt-8.7.0-1.el9.x86_64
edk2-ovmf-20220526git16779ede2d36-3.el9.noarch

How reproducible:
100%

Steps to Reproduce:
1. Boot the guest with a JSON-format command line:
/usr/libexec/qemu-kvm \
-name guest=rhel9.2,debug-threads=on \
-blockdev '{"driver":"file","filename":"/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}' \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/rhel9.2_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"}' \
-machine pc-q35-rhel9.0.0,usb=off,vmport=off,smm=on,kernel_irqchip=split,dump-guest-core=off,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format \
-cpu Cascadelake-Server,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,tsc-deadline=on,pmu=off \
-global driver=cfi.pflash01,property=secure,value=on \
-m 8192 \
-overcommit mem-lock=off \
-smp 6,sockets=6,dies=1,cores=1,threads=1 \
-nodefaults \
-boot strict=on \
-device '{"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"}' \
-device '{"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"}' \
-device '{"driver":"pcie-root-port","port":21,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x2.0x5"}' \
-blockdev '{"driver":"file","filename":"/home/images_nfv-virt-rt-kvm/rhel9.2.qcow2","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device '{"driver":"virtio-blk-pci","bus":"pci.2","addr":"0x0","drive":"libvirt-1-format","id":"virtio-disk0","bootindex":1,"write-cache":"on"}' \
-device '{"driver":"virtio-vga","id":"video0","bus":"pcie.0","addr":"0x1"}' \
-monitor stdio \
-qmp tcp:0:5555,server,nowait \
-vnc :0 \

2. Hotplug and hotunplug the vDPA device via QMP:
$ telnet 10.73.210.76 5555
Trying 10.73.210.76...
Connected to 10.73.210.76.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 1, "major": 7}, "package": "qemu-kvm-7.1.0-1.el9"}, "capabilities": ["oob"]}}
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute":"netdev_add","arguments":{"type":"vhost-vdpa","id":"hostnet1","vhostdev":"/dev/vhost-vdpa-0"}}
{"return": {}}
{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","mac":"00:11:22:33:44:03","id": "net1","bus":"pci.6","addr":"0x2.0x5"}}
{"return": {}}
{"execute":"device_del","arguments":{"id": "net1"}}
{"return": {}}
{"execute":"netdev_del","arguments":{"id":"hostnet1"}}
{"return": {}}

3. Shut down the guest:
# shutdown -h now

==> QEMU core dump info:
gdb /usr/libexec/qemu-kvm core.qemu-kvm.26217
(gdb) bt full
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {94463127234540, 94463127234240, 94463127234540, 0, 0, 0, 0, 0, 549755813888, 8122152881412200192, 139728065789952, 8122152881412200192, 
            139728084199072, 140725314159792, 94463112425238, 139728084199072}}
        ret = <optimized out>
#1  0x00007f14fabd45b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
No locals.
#2  0x00007f14fab87ce6 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x00007f14fab5b7f3 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x55e9e842af16, sa_sigaction = 0x55e9e842af16}, sa_mask = {__val = {483, 94463126874112, 18, 94463127234240, 139728083895520, 
              139728083922896, 0, 38654705681, 8122152881412200192, 140725314159872, 18446744073709550408, 9, 94463112425238, 483, 94463112567773, 94463130477504}}, 
          sa_flags = -88200859, sa_restorer = 0x7f14fad2a5e0 <__GI__IO_file_jumps>}
        sigs = {__val = {32, 94463130477504, 139728065834752, 139728082690853, 483, 94463112567773, 94463130477504, 139728082577098, 206158430256, 94463112425420, 
            139728083942496, 139728082691146, 206158430232, 140725314160096, 140725314159904, 8122152881412200192}}
#4  0x00007f14fab5b71b in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>)
    at assert.c:92
        str = 0x55e9e91f2800 "rǀ\267\354U"
        total = 4096
#5  0x00007f14fab80c66 in __GI___assert_fail (assertion=0x55e9e844dbdd "vhost_net", file=0x55e9e842af16 "../hw/net/vhost_net.c", line=483, 
    function=0x55e9e842afcc "VHostNetState *get_vhost_net(NetClientState *)") at assert.c:101
No locals.
#6  0x000055e9e7ea3506 in get_vhost_net (nc=<optimized out>) at ../hw/net/vhost_net.c:483
        vhost_net = 0x0
#7  0x000055e9e806e587 in virtio_net_set_status (vdev=0x55e9e9376550, status=0 '\000') at ../hw/net/virtio-net.c:251
        n = 0x55e9e9376550
        i = <optimized out>
        q = <optimized out>
        queue_status = <optimized out>
#8  0x000055e9e8097c98 in virtio_set_status (vdev=0x55e9e9376550, val=0 '\000') at ../hw/virtio/virtio.c:1997
        k = 0x55e9e9229990
        ret = <optimized out>
#9  0x000055e9e7f1b5e8 in vm_shutdown () at ../softmmu/runstate.c:334
No locals.
#10 0x000055e9e7f2500c in qemu_cleanup () at ../softmmu/runstate.c:822
No locals.
#11 0x000055e9e7dae836 in qemu_main (argc=<optimized out>, argv=<optimized out>, envp=0x0) at ../softmmu/main.c:39
        status = 0
#12 main (argc=<optimized out>, argv=<optimized out>) at ../softmmu/main.c:47
No locals.

Actual results:
QEMU core dump occurs after shutting down the guest.

Expected results:
There should be an error message when hotplugging a nic to an occupied addr port.

Additional info:
1. A vhost-net device likewise cannot be hotplugged/hotunplugged successfully, but no QEMU core dump occurs after the guest is shut down.
2. This is wrong usage, so the bug priority was changed to low. But QE thinks there should be an error message during hotplug rather than a core dump after shutdown, which is unfriendly behavior from QE's point of view.

Comment 1 Lei Yang 2022-09-22 02:03:54 UTC
Additional info:

I tried the following qemu cmd and repeated the above test steps; it could not reproduce this bug.

qemu cmd:
/usr/libexec/qemu-kvm \
-name guest=rhel9.2,debug-threads=on \
-blockdev '{"driver":"file","filename":"/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}' \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/rhel9.2_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"}' \
-machine pc-q35-rhel9.0.0,usb=off,vmport=off,smm=on,kernel_irqchip=split,dump-guest-core=off,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format \
-accel kvm \
-cpu Cascadelake-Server,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,tsc-deadline=on,pmu=off \
-global driver=cfi.pflash01,property=secure,value=on \
-m 8192 \
-overcommit mem-lock=off \
-smp 6,sockets=6,dies=1,cores=1,threads=1 \
-nodefaults \
-boot strict=on \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device pcie-root-port,id=pcie-root-port-6,port=0x6,addr=0x1.0x6,bus=pcie.0,chassis=7 \
-blockdev '{"driver":"file","filename":"/home/images_nfv-virt-rt-kvm/rhel9.2.qcow2","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device '{"driver":"virtio-blk-pci","bus":"pcie-root-port-2","addr":"0x0","drive":"libvirt-1-format","id":"virtio-disk0","bootindex":1,"write-cache":"on"}' \
-device VGA,bus=pcie.0,addr=0x2 \
-monitor stdio \
-qmp tcp:0:5555,server,nowait \
-vnc :0 \

Comment 2 Lei Yang 2022-09-22 03:19:52 UTC
(In reply to Lei Yang from comment #1)
> Additional info:
> 
> I tried the following qemu cmd and repeated the above test steps; it
> could not reproduce this bug.
> 
Please ignore comment 1.

Comment 3 Laurent Vivier 2022-09-27 17:21:21 UTC
I don't think the card is actually plugged into an occupied addr port.

The port that contains a device is on bus pcie.0:

    -device '{"driver":"pcie-root-port","port":21,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x2.0x5"}'

And you plug the virtio-net device on bus pci.6 (i.e. into the pcie-root-port that you put on bus pcie.0):

    {"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","mac":"00:11:22:33:44:03","id": "net1","bus":"pci.6","addr":"0x2.0x5"}}

Moreover, this behavior is not specific to vdpa or virtio-net; I've been able to reproduce it with driver=e1000 and driver=virtio-scsi.

The real bug is the crash when QEMU exits, and it is specific to vdpa.

This is the assertion that fails:

qemu-system-x86_64: .../hw/net/vhost_net.c:483: get_vhost_net: Assertion `vhost_net' failed.

462 VHostNetState *get_vhost_net(NetClientState *nc)
463 {
464     VHostNetState *vhost_net = 0;
...
480 #ifdef CONFIG_VHOST_NET_VDPA
481     case NET_CLIENT_DRIVER_VHOST_VDPA:
482         vhost_net = vhost_vdpa_get_vhost_net(nc);
483         assert(vhost_net);
484         break;
485 #endif
486     default:
487         break;
488     }
489 
490     return vhost_net;
491 }
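
For reference, the vdpa-side helper that returns NULL here looks roughly like this (paraphrased from net/vhost-vdpa.c; exact code may differ between versions):

VHostNetState *vhost_vdpa_get_vhost_net(NetClientState *nc)
{
    VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
    assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
    /* s->vhost_net is NULL once the backend has been torn down,
       which is what trips the assert in get_vhost_net() above. */
    return s->vhost_net;
}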

Comment 4 Laurent Vivier 2022-09-27 17:28:47 UTC
I've been able to reproduce the problem with upstream QEMU (c48c9c6b33d7)

Comment 5 Laurent Vivier 2022-09-27 19:11:10 UTC
The problem seems to be with netdev_add, which registers two instances:

QMP: {"execute":"netdev_add","arguments":{"type":"vhost-vdpa","id":"hostnet1","vhostdev":"/dev/vhost-vdpa-0"}}

and in the monitor we can then see:

(qemu) info network 
hostnet1: index=0,type=vhost-vdpa,vhost-vdpa
hostnet1: index=0,type=vhost-vdpa,vhost-vdpa
(qemu) 

That looks like the cause of the problem.

Comment 6 Laurent Vivier 2022-09-27 19:21:00 UTC
(In reply to Laurent Vivier from comment #5)
> The problem seems to be with the netdev_add that registers two instances:
> 
> QMP:
> {"execute":"netdev_add","arguments":{"type":"vhost-vdpa","id":"hostnet1",
> "vhostdev":"/dev/vhost-vdpa-0"}}
> 
> and the monitor we can see then:
> 
> (qemu) info network 
> hostnet1: index=0,type=vhost-vdpa,vhost-vdpa
> hostnet1: index=0,type=vhost-vdpa,vhost-vdpa
> (qemu) 

The same occurs without QMP:

(qemu) netdev_add vhost-vdpa,id=hostnet1,vhostdev=/dev/vhost-vdpa-0
(qemu) info network 
hostnet1: index=0,type=vhost-vdpa,vhost-vdpa
hostnet1: index=0,type=vhost-vdpa,vhost-vdpa

Comment 7 Laurent Vivier 2022-09-27 20:59:24 UTC
I reproduce the crash when I exit from the QMP console, and bisecting gives me:

commit a00e37a4be88a043fea3e8be3ee3a85f6c4939cf
Author: Alex Bennée <alex.bennee>
Date:   Tue Oct 26 11:22:24 2021 +0100

    chardev: don't exit() straight away on C-a x
    
    While there are a number of uses in the code-base of the exit(0)
    pattern it gets in the way of clean exit which can do all of it's
    house-keeping. In particular it was reported that you can crash
    plugins this way because TCG can still be running on other threads
    when the atexit callback is called.
    
    Use qmp_quit() instead which takes care of some housekeeping before
    triggering the shutdown.
    
    Signed-off-by: Alex Bennée <alex.bennee>
    Reported-by: Lukas Jünger <lukas.junger>
    Reviewed-by: Marc-André Lureau <marcandre.lureau>
    Reviewed-by: Philippe Mathieu-Daudé <philmd>
    Acked-by: Paolo Bonzini <pbonzini>
    Message-Id: <20211026102234.3961636-19-alex.bennee>

 chardev/char-mux.c | 3 ++-
 stubs/meson.build  | 1 +
 stubs/qmp-quit.c   | 8 ++++++++

The multiple netdev instances are introduced by the multiqueue support:

commit 402378407dbdce79ce745a13f5c84815f929cfdd (refs/bisect/good-402378407dbdce79ce745a13f5c84815f929cfdd)
Author: Jason Wang <jasowang>
Date:   Wed Oct 20 12:56:00 2021 +0800

    vhost-vdpa: multiqueue support
    
    This patch implements the multiqueue support for vhost-vdpa. This is
    done simply by reading the number of queue pairs from the config space
    and initialize the datapath and control path net client.
    
    Signed-off-by: Jason Wang <jasowang>
    Message-Id: <20211020045600.16082-11-jasowang>
    Reviewed-by: Michael S. Tsirkin <mst>
    Signed-off-by: Michael S. Tsirkin <mst>

because it calls net_vhost_vdpa_init() several times.
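
For context, the multiqueue init path has roughly this shape (a paraphrase of net_init_vhost_vdpa() after commit 402378407dbd; names and signatures are approximate). Each net_vhost_vdpa_init() call creates a NetClientState with the same user-visible name, which is why "info network" lists "hostnet1" twice:

    /* one datapath client per queue pair, plus one for the control vq */
    for (i = 0; i < queue_pairs; i++) {
        ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
                                     vdpa_device_fd, i, 2, true);
    }
    if (has_cvq) {
        nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
                                 vdpa_device_fd, i, 1, false);
    }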

Comment 8 Lei Yang 2022-09-28 01:29:00 UTC
Based on comment 7, this bug should get the "Regression" keyword and its priority changed to "High". Please correct me if I'm wrong.

Comment 9 Laurent Vivier 2022-09-29 12:15:49 UTC
The problem happens because the address provided in device_add is not valid:

{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","mac":"00:11:22:33:44:03","id": "net1","bus":"pci.6","addr":"0x2.0x5"}}

thus the card is not really plugged by the kernel, and then QEMU cannot really remove it.

We can remove the netdev instance because, as far as QEMU is concerned, the device is removed; but on exit QEMU does a cleanup, and as the card was never really removed it tries to clean it up.
The netdev instance doesn't exist anymore, so this triggers a crash.

Having multiple netdev instances doesn't seem to be a problem with a correct address.

The following command works fine:

{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","mac":"00:11:22:33:44:03","id": "net1","bus":"pci.6","addr":"0x0"}}

The problem seems to be introduced by the switch to ACPI PCI hotplug by default:

commit 17858a169508609ca9063c544833e5a1adeb7b52
Author: Julia Suvorova <jusual>
Date:   Tue Jul 13 02:42:04 2021 +0200

    hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35
    
    Q35 has three different types of PCI devices hot-plug: PCIe Native,
    SHPC Native and ACPI hot-plug. This patch changes the default choice
    for cold-plugged bridges from PCIe Native to ACPI Hot-plug with
    ability to use SHPC and PCIe Native for hot-plugged bridges.
    
    This is a list of the PCIe Native hot-plug issues that led to this
    change:
        * no racy behavior during boot (see 110c477c2ed)
        * no delay during deleting - after the actual power off software
          must wait at least 1 second before indicating about it. This case
          is quite important for users, it even has its own bug:
              https://bugzilla.redhat.com/show_bug.cgi?id=1594168
        * no timer-based behavior - in addition to the previous example,
          the attention button has a 5-second waiting period, during which
          the operation can be canceled with a second press. While this
          looks fine for manual button control, automation will result in
          the need to queue or drop events, and the software receiving
          events in all sort of unspecified combinations of attention/power
          indicator states, which is racy and uppredictable.
        * fixes:
            * https://bugzilla.redhat.com/show_bug.cgi?id=1752465
            * https://bugzilla.redhat.com/show_bug.cgi?id=1690256
    
    To return to PCIe Native hot-plug:
        -global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off
    
    Known issue: older linux guests need the following flag
    to allow hotplugged pci express devices to use io:
            -device pcie-root-port,io-reserve=4096.
    io is unusual for pci express so this seems minor.
    We'll fix this by a follow up patch.
    
    Signed-off-by: Julia Suvorova <jusual>
    Reviewed-by: Igor Mammedov <imammedo>
    Message-Id: <20210713004205.775386-6-jusual>
    Reviewed-by: Michael S. Tsirkin <mst>
    Signed-off-by: Michael S. Tsirkin <mst>
    Reviewed-by: David Gibson <david.id.au>

 hw/acpi/ich9.c | 2 +-
 hw/i386/pc.c   | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

If I use "-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off" the card can be removed correctly and there is no crash.
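
For example, appended to the reproducer command line from the description (illustrative placement; it is an ordinary -global option):

/usr/libexec/qemu-kvm \
... \
-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off \
...

As the commit message above notes, this switches Q35 back to PCIe Native hot-plug.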

Comment 10 Laurent Vivier 2022-09-29 12:29:58 UTC
The problem is specific neither to virtio nor to vdpa; it can be reproduced with any PCI card:

(qemu) info network 
(qemu) netdev_add user,id=hostnet1
(qemu) info network 
hostnet1: index=0,type=user,net=10.0.2.0,restrict=off
(qemu) device_add e1000e,netdev=hostnet1,mac=00:11:22:33:44:03,id=net1,bus=pci.6,addr=0x2.0x5
(qemu) info network 
net1: index=0,type=nic,model=e1000e,macaddr=00:11:22:33:44:03
 \ hostnet1: index=0,type=user,net=10.0.2.0,restrict=off
(qemu) device_del net1
(qemu) info network 
net1: index=0,type=nic,model=e1000e,macaddr=00:11:22:33:44:03
 \ hostnet1: index=0,type=user,net=10.0.2.0,restrict=off
(qemu) netdev_del hostnet1
(qemu) info network 
net1: index=0,type=nic,model=e1000e,macaddr=00:11:22:33:44:03
 \ hostnet1: index=0,type=user,net=10.0.2.0,restrict=off

It doesn't crash because, contrary to vdpa, there is no assert checking the consistency of the system, but the problem is still there: the card is not really unplugged when it has been plugged into an invalid port address.

We can reproduce it with a SCSI card:

(qemu) info pci
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:29c0
      PCI subsystem 1af4:1100
      id ""
...
  Bus  0, device   2, function 5:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 3.
      subordinate bus 3.
      IO range [0x3000, 0x3fff]
      memory range [0xfde00000, 0xfdffffff]
      prefetchable memory range [0xfe600000, 0xfe7fffff]
      BAR0: 32 bit memory at 0xfe402000 [0xfe402fff].
      id "pci.6"
...
(qemu) device_add virtio-scsi,id=net1,bus=pci.6,addr=0x2.0x5
(qemu) info pci
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:29c0
      PCI subsystem 1af4:1100
      id ""
...
  Bus  0, device   2, function 5:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 3.
      subordinate bus 3.
      IO range [0x3000, 0x3fff]
      memory range [0xfde00000, 0xfdffffff]
      prefetchable memory range [0xfe600000, 0xfe7fffff]
      BAR0: 32 bit memory at 0xfe402000 [0xfe402fff].
      id "pci.6"
  Bus  3, device   2, function 5:
    SCSI controller: PCI device 1af4:1048
      PCI subsystem 1af4:1100
      IRQ 0, pin A
      BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
      BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
      id "net1"
...
(qemu) device_del net1
(qemu) info pci
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:29c0
      PCI subsystem 1af4:1100
      id ""
...
  Bus  0, device   2, function 5:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 3.
      subordinate bus 3.
      IO range [0x3000, 0x3fff]
      memory range [0xfde00000, 0xfdffffff]
      prefetchable memory range [0xfe600000, 0xfe7fffff]
      BAR0: 32 bit memory at 0xfe402000 [0xfe402fff].
      id "pci.6"
  Bus  3, device   2, function 5:
    SCSI controller: PCI device 1af4:1048
      PCI subsystem 1af4:1100
      IRQ 0, pin A
      BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
      BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
      id "net1"
...
The PCI card is not removed after the device_del.

Comment 11 Laurent Vivier 2022-09-29 12:31:47 UTC
Moving to PCI subcomponent

Comment 12 Laurent Vivier 2022-09-29 12:37:55 UTC
To summarize

Reproducer:

/usr/libexec/qemu-kvm \
-machine q35 \
-cpu host -enable-kvm \
-m 8192 \
-smp 6,sockets=6,dies=1,cores=1,threads=1 \
-nodefaults \
-boot strict=on \
-device pcie-root-port,port=16,chassis=1,id=pci.1,bus=pcie.0,multifunction=true,addr=0x2 \
-device pcie-root-port,port=17,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=21,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \
-blockdev file,filename=/var/lib/libvirt/images/rhel9.1.0.qcow2,aio=threads,node-name=libvirt-1-storage \
-blockdev node-name=libvirt-1-format,driver=qcow2,file=libvirt-1-storage \
-device virtio-blk-pci,bus=pci.2,addr=0x0,drive=libvirt-1-format,id=virtio-disk0,bootindex=1,write-cache=on \
-serial mon:stdio \
-nographic

In QEMU monitor:

For a networking card that can be listed by "info network":

(qemu) info network 
(qemu) netdev_add user,id=hostnet1
(qemu) info network 
hostnet1: index=0,type=user,net=10.0.2.0,restrict=off
(qemu) device_add e1000e,netdev=hostnet1,mac=00:11:22:33:44:03,id=net1,bus=pci.6,addr=0x2.0x5
(qemu) info network 
net1: index=0,type=nic,model=e1000e,macaddr=00:11:22:33:44:03
 \ hostnet1: index=0,type=user,net=10.0.2.0,restrict=off
(qemu) device_del net1
(qemu) info network 
net1: index=0,type=nic,model=e1000e,macaddr=00:11:22:33:44:03
 \ hostnet1: index=0,type=user,net=10.0.2.0,restrict=off
(qemu) netdev_del hostnet1
(qemu) info network 
net1: index=0,type=nic,model=e1000e,macaddr=00:11:22:33:44:03
 \ hostnet1: index=0,type=user,net=10.0.2.0,restrict=off

The netdev and the card are not removed.

or for any PCI card:

(qemu) info pci
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:29c0
      PCI subsystem 1af4:1100
      id ""
...
  Bus  0, device   2, function 5:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 3.
      subordinate bus 3.
      IO range [0x3000, 0x3fff]
      memory range [0xfde00000, 0xfdffffff]
      prefetchable memory range [0xfe600000, 0xfe7fffff]
      BAR0: 32 bit memory at 0xfe402000 [0xfe402fff].
      id "pci.6"
...
(qemu) device_add virtio-scsi,id=net1,bus=pci.6,addr=0x2.0x5
(qemu) info pci
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:29c0
      PCI subsystem 1af4:1100
      id ""
...
  Bus  0, device   2, function 5:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 3.
      subordinate bus 3.
      IO range [0x3000, 0x3fff]
      memory range [0xfde00000, 0xfdffffff]
      prefetchable memory range [0xfe600000, 0xfe7fffff]
      BAR0: 32 bit memory at 0xfe402000 [0xfe402fff].
      id "pci.6"
  Bus  3, device   2, function 5:
    SCSI controller: PCI device 1af4:1048
      PCI subsystem 1af4:1100
      IRQ 0, pin A
      BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
      BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
      id "net1"
...
(qemu) device_del net1
(qemu) info pci
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:29c0
      PCI subsystem 1af4:1100
      id ""
...
  Bus  0, device   2, function 5:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 3.
      subordinate bus 3.
      IO range [0x3000, 0x3fff]
      memory range [0xfde00000, 0xfdffffff]
      prefetchable memory range [0xfe600000, 0xfe7fffff]
      BAR0: 32 bit memory at 0xfe402000 [0xfe402fff].
      id "pci.6"
  Bus  3, device   2, function 5:
    SCSI controller: PCI device 1af4:1048
      PCI subsystem 1af4:1100
      IRQ 0, pin A
      BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
      BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
      id "net1"
...

The problem is that the PCI card is not removed after the device_del.
The problem is introduced by 17858a169508 "hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35".
The workaround is to use "-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off".
The problem is triggered by using an invalid port address.
The problem generates a crash with vDPA because vDPA has an assert that checks the state of the netdev.

Comment 13 Lei Yang 2023-02-14 07:03:28 UTC
Hi Laurent

This problem can still be reproduced on the latest qemu-kvm-7.2 version. Are there plans to fix this in the current release? If not, please move it to ITR=9.3.0.

Test Version:
kernel-5.14.0-262.el9.x86_64
qemu-kvm-7.2.0-8.el9.x86_64
libvirt-9.0.0-3.el9.x86_64

Thanks
Lei

Comment 14 Laurent Vivier 2023-02-14 08:01:41 UTC
(In reply to Lei Yang from comment #13)
> Hi Laurent
> 
> This problem can still be reproduced on the latest qemu-kvm-7.2 version. Are
> there plans to fix this in the current release? If not, please move it to
> ITR=9.3.0.
>

Yes, it seems too late to be fixed now.

Comment 15 Laurent Vivier 2023-04-27 09:17:22 UTC
Julia,

any update (see comment #9)?

Should we close as WONTFIX?

Thanks

Comment 16 Julia Suvorova 2023-05-02 12:54:06 UTC
(In reply to Laurent Vivier from comment #15)
> Julia,
> 
> any update (see comment #9)?
> 
> Should we close as WONTFIX?
> 
> Thanks

Thanks for the investigation. There was protection against hotplug at the wrong address on the root ports. I'll take a look at what went wrong.

Comment 17 Ani Sinha 2023-06-08 07:55:42 UTC
Assigning it to myself after talking to Julia. Will take a stab at it.

Comment 18 Ani Sinha 2023-06-14 12:11:12 UTC
Even in the native hotplug case, we see that upon insertion of the ethernet device with a wrong devfn, the device is actually not functional:

(qemu) info pci
...
 Bus  0, device   2, function 5:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 3.
      subordinate bus 3.
      IO range [0x6000, 0x6fff]
      memory range [0xc0800000, 0xc09fffff]
      prefetchable memory range [0x800400000, 0x8005fffff]
      BAR0: 32 bit memory at 0xc0e11000 [0xc0e11fff].
      id "pci.6"
  Bus  3, device   2, function 5:
    Ethernet controller: PCI device 8086:10d3
      PCI subsystem 8086:0000
      IRQ 0, pin A
      BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR2: I/O at 0xffffffffffffffff [0x001e].
      BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id "net1"

The addresses are all wrong (with the kvm accelerator); the all-ones BAR values mean the guest never assigned resources to the device.
When the devfn is correct, we see proper BAR addresses for the inserted device:

(qemu) netdev_add socket,id=hostnet1,listen=:1234
(qemu) device_add e1000e,netdev=hostnet1,mac=00:11:22:33:44:03,id=net1,bus=pci.6,addr=0x0.0
(qemu) info pci
...
  Bus  0, device   2, function 5:
    PCI bridge: PCI device 1b36:000c
      IRQ 11, pin A
      BUS 0.
      secondary bus 3.
      subordinate bus 3.
      IO range [0x6000, 0x6fff]
      memory range [0xc0800000, 0xc09fffff]
      prefetchable memory range [0x800400000, 0x8005fffff]
      BAR0: 32 bit memory at 0xc0e11000 [0xc0e11fff].
      id "pci.6"
  Bus  3, device   0, function 0:
    Ethernet controller: PCI device 8086:10d3
      PCI subsystem 8086:0000
      IRQ 0, pin A
      BAR0: 32 bit memory at 0xc0840000 [0xc085ffff].
      BAR1: 32 bit memory at 0xc0860000 [0xc087ffff].
      BAR2: I/O at 0xffffffffffffffff [0x001e].
      BAR3: 32 bit memory at 0xc0880000 [0xc0883fff].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id "net1"

Comment 19 Igor Mammedov 2023-06-15 11:33:26 UTC
(In reply to Laurent Vivier from comment #12)
> [...]
> The problem is the PCI card is not removed after the device_del.
> The problem is introduced by 17858a169508 "hw/acpi/ich9: Set ACPI PCI
> hot-plug as default on Q35"
> The workaround is to use "global
> ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off"
> The problem is triggered by using an invalid port address.

I don't think the problem is the device being on a non-0 function;
it is still present (device_del is just a request). I'd say
something goes wrong with how we link/account for the backend
and do cleanup. netdev_del should fail if there is a device
that uses it, or at least not leave the association in an
inconsistent state.

> The problem generates a crash with vDPA because vDPA has an assert that
> checks the state of the netdev.

Comment 20 Ani Sinha 2023-06-15 12:02:19 UTC
(In reply to Igor Mammedov from comment #19)
> (In reply to Laurent Vivier from comment #12)
> > [...]
> 
> I don't think the problem is the device being on a non-0 function;
> it is still present (device_del is just a request).

The problem is not the non-zero function. I did some experiments and it seems the issue is the wrong (non-zero) slot. Something like

device_add e1000e,netdev=hostnet1,mac=00:11:22:33:44:03,id=net1,bus=pci.6,addr=0x0.0x5

works fine and the device does get removed.

Comment 21 Ani Sinha 2023-06-16 07:04:35 UTC
(In reply to Igor Mammedov from comment #19)
> (In reply to Laurent Vivier from comment #12)

> > The problem is the PCI card is not removed after the device_del.
> > The problem is introduced by 17858a169508 "hw/acpi/ich9: Set ACPI PCI
> > hot-plug as default on Q35"
> > The workaround is to use "global
> > ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off"
> > The problem is triggered by using an invalid port address.
> 
> I don't think that problem with device on non 0 function,
> it is still present (device_del is just request). I'd say
> something goes wrong with how we link/account for backend
> and do cleanup. netdev_del shall fail if there is device
> that uses it or at least not leave association in inconsistent
> state.
> 

What I have found from my investigation is this:

When we are using native hotplug, unplug request goes through this chain:
pcie_cap_slot_unplug_request_cb() 
  -> pcie_unplug_device() 
      -> pcie_cap_slot_unplug_cb()
           ->qdev_unrealize().
      -> object_unparent()

This eventually cleans up the device. So this is a direct cleanup path. 

For acpi hotplug, unplug_device goes through this:

ich9_pm_device_unplug_request_cb()
  -> acpi_pcihp_device_unplug_request_cb()
       -> acpi_send_event()

Just as Igor mentioned above this sends an unplug event to the guest OS.

What I am observing is that when the slot ID != 0, the guest OS seems to ignore this and we never hit ich9_pm_device_unplug_cb().

So the following path is not triggered:

ich9_pm_device_unplug_cb()
  -> acpi_pcihp_device_unplug_cb()
       -> qdev_unrealize()
 
When the slot ID is 0 for the plugged interface card, this does get hit (like I mentioned in the above comment).

So the cleanup that is supposed to happen from ich9_pm_device_unplug_cb() never happens, the card remains plugged in, and only upon guest shutdown does QEMU try to clean it up; with a vdpa device this causes a crash.

I do believe that the patch I posted on the mailing list, https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg03066.html, is in the right direction, and any slot value other than 0 should be disallowed when a device is plugged into a PCIe root port.
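
A minimal sketch of that direction (hypothetical code, not the actual posted patch; the real check would live in the PCIe slot pre-plug path in hw/pci/pcie.c):

static void pcie_cap_slot_pre_plug_cb(HotplugHandler *hotplug_dev,
                                      DeviceState *dev, Error **errp)
{
    PCIDevice *pci_dev = PCI_DEVICE(dev);

    /* Hypothetical check: a PCIe root/downstream port exposes a single
     * slot, so a device at any slot other than 0 can never be addressed
     * by the guest; reject it at device_add time instead of leaving a
     * half-plugged device behind. */
    if (PCI_SLOT(pci_dev->devfn) != 0) {
        error_setg(errp, "PCIe port only supports slot 0, requested slot %d",
                   PCI_SLOT(pci_dev->devfn));
        return;
    }

    /* ... existing pre-plug checks ... */
}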

Comment 22 Ani Sinha 2023-06-16 07:16:15 UTC
(In reply to Ani Sinha from comment #21)
> [...]
> When the slot ID is 0 for the plugged interface card, this does get hit
> (like I mentioned in the above comment).

Another data point is that acpi_pcihp_eject_slot() also does not get hit when the slot ID is non-zero. When the slot is 0, it gets hit and cleans up the device.

Comment 23 Ani Sinha 2023-06-16 07:27:04 UTC
(In reply to Ani Sinha from comment #21)
> [...]
> So the following path is not triggered:
> 
> ich9_pm_device_unplug_cb()
>   -> acpi_pcihp_device_unplug_cb()
>        -> qdev_unrealize()
>  

The full path is:

memory_region_write_accessor()
  -> pci_write()
       -> acpi_pcihp_eject_slot()
             -> hotplug_handler_unplug()
                  -> ich9_pm_device_unplug_cb()
                       -> acpi_pcihp_device_unplug_cb()
                             -> qdev_unrealize()
             -> object_unparent()

Comment 24 Igor Mammedov 2023-06-16 08:02:05 UTC
acpi_pcihp_eject_slot() triggers device removal on behalf of a guest eject action only.
Think in more generic terms: does it work if we never unplug the nic (and if yes, why)?

So far I'm convinced that the unplug which didn't happen is a trigger that
exposes a cleanup issue elsewhere.

Comment 25 Ani Sinha 2023-06-16 09:09:55 UTC
(In reply to Igor Mammedov from comment #24)
> acpi_pcihp_eject_slot() triggers device removal on behalf of guest eject
> action only.
> Think in more generic terms, does it work if we never unplug nic (if yes,
> then why).

This bug is about hot-unplugging a device that was plugged in with a wrong slot address. The way I see it, the cleanup happens differently for native hotplug vs ACPI.
For native, we directly clean up the device.
For ACPI we depend on the guest triggering the cleanup upon receiving the ACPI unplug notification,
which in our case never happens.

If we never unplugged the nic, the QEMU cleanup code at guest shutdown would clean up both the plugged-in device and the netdev.

> 
> So far I'm convinced that unplug which didn't happen is a trigger that
> exposes cleanup issue elsewhere.

elaborate?

Comment 26 Igor Mammedov 2023-06-16 09:27:37 UTC
(In reply to Ani Sinha from comment #25)
> (In reply to Igor Mammedov from comment #24)
> > acpi_pcihp_eject_slot() triggers device removal on behalf of guest eject
> > action only.
> > Think in more generic terms, does it work if we never unplug nic (if yes,
> > then why).
> 
> This bug is about hot unplugging a device that was plugged in with a wrong
> slot address. The way I see this is that the cleanup happens differently for

Where is it said that it's a wrong address?

> native hotplug vs acpi. 
> For native, we directly cleanup the device.

> For acpi we depend on the guest triggering the cleanup upon receiving the
> acpi unplug notification 
> which in our case never happens.

once again, an unplug request is just a request; the guest doesn't have to
unplug the device (it's fine for the guest to ignore the request).

In this case QEMU shall do the cleanup on its own, when the device is actually
destroyed during /machine tree deconstruction, and in the right order.

> If we never unplugged the nic, the qemu cleanup code at guest shutdown would
> cleanup both the plugged in device and the netdev.

we asked it to be unplugged but it did not happen => so we never unplugged the nic

> > 
> > So far I'm convinced that unplug which didn't happen is a trigger that
> > exposes cleanup issue elsewhere.
> 
> elaborate?
see above

Comment 27 Ani Sinha 2023-06-16 09:37:00 UTC
(In reply to Ani Sinha from comment #25)

> > 
> > So far I'm convinced that unplug which didn't happen is a trigger that
> > exposes cleanup issue elsewhere.
> 
> elaborate?

OK, I think I understand what you mean. There are two issues here:

(a) device_del is not removing the device. We explained that by observing that the slot unplug is not getting triggered by the guest.
(b) the crash on VM shutdown with vdpa/virtio-net. Even if the device was not cleaned up in (a), the shutdown should be able to clean it up without crashes.

Let me know if I am wrong here.

Comment 28 Ani Sinha 2023-06-16 09:38:43 UTC
(In reply to Igor Mammedov from comment #26)
> (In reply to Ani Sinha from comment #25)
> > (In reply to Igor Mammedov from comment #24)
> > > acpi_pcihp_eject_slot() triggers device removal on behalf of guest eject
> > > action only.
> > > Think in more generic terms, does it work if we never unplug nic (if yes,
> > > then why).
> > 
> > This bug is about hot unplugging a device that was plugged in with a wrong
> > slot address. The way I see this is that the cleanup happens differently for
> 
> Where it's said that it's wrong address?

See comment #9 above https://bugzilla.redhat.com/show_bug.cgi?id=2128929#c9  and my experiments validate it.

Comment 29 Igor Mammedov 2023-06-16 10:49:27 UTC
(In reply to Ani Sinha from comment #27)
> (In reply to Ani Sinha from comment #25)
> 
> > > 
> > > So far I'm convinced that unplug which didn't happen is a trigger that
> > > exposes cleanup issue elsewhere.
> > 
> > elaborate?
> 
> OK I think I understand what you mean. There are two issues here:
> 
> (a) device_del is not removing the device. We explained that by observing
> that slot unplug is not getting triggered by the guest.

and it's a normal situation; there is nothing wrong with it in theory.
In practice it might be fine to take your patch, but see (b)

> (b) the crash on vm shutdown with vdpa/virtio-net. Even if the device was
> not cleaned up from (a), the shutdown should be able to clean it up without
> crashes.

yep, the device can be removed when the guest tells QEMU to do so, but if the guest
does not
 (like in this case, or imagine the guest in the process of shutting down
  with the acpiphp module already deactivated; then there would be a race between the SCI signal
  and the guest shutdown process)
then QEMU should do the cleanup on its own, like for any other present PCI device.

So far I haven't seen a clear explanation of where it goes wrong.

> Let me know if I am wrong here.

Comment 30 Igor Mammedov 2023-06-16 11:01:15 UTC
> (In reply to Igor Mammedov from comment #26)
> > (In reply to Ani Sinha from comment #25)
> > > (In reply to Igor Mammedov from comment #24)
> > > > acpi_pcihp_eject_slot() triggers device removal on behalf of guest eject
> > > > action only.
> > > > Think in more generic terms, does it work if we never unplug nic (if yes,
> > > > then why).
> > > 
> > > This bug is about hot unplugging a device that was plugged in with a wrong
> > > slot address. The way I see this is that the cleanup happens differently for
> > 
> > Where it's said that it's wrong address?
> 
> See comment #9 above https://bugzilla.redhat.com/show_bug.cgi?id=2128929#c9 
> and my experiments validate it.

I wouldn't treat comments (including mine) as an absolute source of truth.

quote from comment 9:
"
  The problem happens because the address provided in device_add is not valid:
"

QEMU might not support multifunction hotplug, but that doesn't make the address 'not valid' per se.

"
  {"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","mac":"00:11:22:33:44:03","id": "net1","bus":"pci.6","addr":"0x2.0x5"}}

  thus the card is not really plugged by the the kernel and then QEMU cannot really remove it.
"
                                                                                    ^^^^^^^^^^
"
  We can remove the instance of the netdev because for QEMU the device is removed, but when we exit QEMU does a cleanup and as the card is not really removed it tries to cleanup it.
"
    the above statement contradicts the previous (highlighted) one, so has the device been removed or not?

"
  The instance of the netdev doesn't exist anymore so it triggers a crash.
"

the above shall not happen while the device referencing it still exists.

Comment 31 Ani Sinha 2023-06-16 15:09:52 UTC
(In reply to Igor Mammedov from comment #29)
> (In reply to Ani Sinha from comment #27)
> > (In reply to Ani Sinha from comment #25)
> > 
> > > > 
> > > > So far I'm convinced that unplug which didn't happen is a trigger that
> > > > exposes cleanup issue elsewhere.
> > > 
> > > elaborate?
> > 
> > OK I think I understand what you mean. There are two issues here:
> > 
> > (a) device_del is not removing the device. We explained that by observing
> > that slot unplug is not getting triggered by the guest.
> 
> and it's normal situation, there is nothing wrong with it in theory.
> In practice it might be fine to take your patch but see (b)
> 
> > (b) the crash on vm shutdown with vdpa/virtio-net. Even if the device was
> > not cleaned up from (a), the shutdown should be able to clean it up without
> > crashes.
> 
> yep, the device can be removed when guest tells QEMU to do, but if guest
> does not
>  (like in this case or imagine guest in process of being shutdown
>   and acpiphp module being deactivated already, then it would race between
> SCI signal
>   and guest shutdown process)
> then QEMU should do cleanup on its own like for any other present PCI device.
> 
> So far I haven't seen an a clear explanation about where is goes wrong.
> 

OK let me try again.
The issue happens with ACPI hotplug when we do *both* a device_del on the nic and then a netdev_del on the netdev backend.
We have already established why device_del actually does not remove the device (point (a) above).

So in netdev_del we have this call-chain:

hmp_netdev_del()
 -> qmp_netdev_del()
      -> qemu_del_net_client()
          -> qemu_cleanup_net_client()
              -> vhost_vdpa_cleanup()

vhost_vdpa_cleanup() sets s->vhost_net = NULL; (see line 193 in net/vhost-vdpa.c).

Therefore, when the VM shuts down and QEMU tries to clean up the nic, we have the following call chain:
virtio_set_status() -> virtio_net_vhost_status() -> get_vhost_net()
get_vhost_net() calls vhost_vdpa_get_vhost_net(), which returns s->vhost_net; that is now NULL, so the assertion fails, causing the crash.

I am not sure how to fix this issue. The crash seems to be entirely specific to vhost-net. This is orthogonal to why the nic was not removed by the earlier device_del.
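
To make the failing path concrete, here is a condensed sketch of the two functions involved, paraphrased from the QEMU sources of that era (the exact assertion line number differs between qemu-kvm-7.1 and 8.0, as the two assertion messages in this bug show); this is not a verbatim excerpt:

    /* net/vhost-vdpa.c (sketch) */
    VHostNetState *vhost_vdpa_get_vhost_net(NetClientState *nc)
    {
        VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
        return s->vhost_net;       /* NULL once vhost_vdpa_cleanup() has run */
    }

    /* hw/net/vhost_net.c (sketch) */
    VHostNetState *get_vhost_net(NetClientState *nc)
    {
        VHostNetState *vhost_net = NULL;

        if (!nc) {
            return NULL;
        }
        switch (nc->info->type) {
        case NET_CLIENT_DRIVER_VHOST_VDPA:
            vhost_net = vhost_vdpa_get_vhost_net(nc);
            assert(vhost_net);     /* fires here: "Assertion `vhost_net' failed." */
            break;
        /* ... tap, vhost-user and other backends elided ... */
        default:
            break;
        }
        return vhost_net;
    }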

Comment 32 Ani Sinha 2023-06-17 08:45:36 UTC
Ok one more thing. 

qemu_del_net_client() checks whether a peer NIC is still attached. When the network interface was cleanly removed (slot 0 was passed as its address), nc->peer is NULL, and hence the following loop is skipped:

    /* If there is a peer NIC, delete and cleanup client, but do not free. */
    if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_NIC) {
        NICState *nic = qemu_get_nic(nc->peer);
        if (nic->peer_deleted) {
            return;
        }
        nic->peer_deleted = true;

        for (i = 0; i < queues; i++) {
            ncs[i]->peer->link_down = true;
        }

        if (nc->peer->info->link_status_changed) {
            nc->peer->info->link_status_changed(nc->peer);
        }

        /* cleanup each queue's client (for vhost-vdpa this lands in
         * vhost_vdpa_cleanup()), but do not free it */
        for (i = 0; i < queues; i++) {
            qemu_cleanup_net_client(ncs[i]);
        }

        return;
    }

and it goes straight to freeing the clients, calling qemu_cleanup_net_client() and then qemu_free_net_client().
Thus, at shutdown there is no vdpa backend left.

When the interface is not ejected, nc->peer is non-null, so the above loop is executed, which eventually calls vhost_vdpa_cleanup() through qemu_cleanup_net_client().
vhost_vdpa_cleanup() sets s->vhost_net = NULL;
The vdpa netdev backend itself is not freed.
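
For context, a minimal sketch of the pre-fix vhost_vdpa_cleanup() on this path (paraphrased, not verbatim QEMU source):

    /* net/vhost-vdpa.c (pre-fix sketch), reached via qemu_cleanup_net_client() */
    static void vhost_vdpa_cleanup(NetClientState *nc)
    {
        VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);

        if (s->vhost_net) {
            vhost_net_cleanup(s->vhost_net);
            g_free(s->vhost_net);
            s->vhost_net = NULL;   /* later read back through get_vhost_net() */
        }
        /* ... device fd teardown elided; the NetClientState itself is not
         * freed here, so the peer nic still points at this client ... */
    }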

Maybe if we removed the assertion from get_vhost_net(), it would "just work", since virtio_net_vhost_status() checks for a null return value when calling it.

Another observation: get_vhost_net() does not assert on the return value of tap_get_vhost_net(), so it is not guaranteed to always return non-null or assert.

Comment 33 Ani Sinha 2023-06-18 06:08:54 UTC
> Maybe if we removed the assertion from get_vhost_net(), it would "just work", since virtio_net_vhost_status() checks for a null return value when calling it.

I was completely mistaken here. Lots of code paths in vhost-net and virtio-net simply assume that the pointer returned from get_vhost_net() is non-null, so the assertion there is justified.
I am beginning to think that this needs to be resolved at a different level. Perhaps qemu_del_net_client() is calling qemu_cleanup_net_client() too early?

Comment 34 Julia Suvorova 2023-06-18 13:27:41 UTC
(In reply to Igor Mammedov from comment #30)
[...]
> {"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":
> "hostnet1","mac":"00:11:22:33:44:03","id":
> "net1","bus":"pci.6","addr":"0x2.0x5"}}
> 
>   thus the card is not really plugged by the the kernel and then QEMU cannot
> really remove it.
[...]

In that particular comment #12 example, what do you think the PCI address of
device net1 (and the PCI structure in general) would be in the end, and what
should it be?

> "
>   We can remove the instance of the netdev because for QEMU the device is
> removed, but when we exit QEMU does a cleanup and as the card is not really
> removed it tries to cleanup it.
> "
>     above statement contradicts with previous (highlighted) one, so has
> device been removed or not?
> 
> "
>   The instance of the netdev doesn't exist anymore so it triggers a
> crash.(In reply to Ani Sinha from comment #28)
> "
> 
> above shall not happen until device referencing it still exists.

Comment 35 Ani Sinha 2023-06-18 15:12:20 UTC
(In reply to Julia Suvorova from comment #34)
[...]
> In that particular comment #12 example, what do you think the PCI address of
> device net1 (and the PCI structure in general) would be in the end, and what
> should it be?

As mentioned in comment #9, 

> The following command works fine:
> {"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","mac":"00:11:22:33:44:03","id": "net1","bus":"pci.6","addr":"0x0"}}


"0x0" is the right address. At least the device address should be 0. If the function is non-zero, even if this is technically correct for a multifinction bridge, OSPM won't detect a hotplug.


Comment 36 Ani Sinha 2023-06-19 07:15:14 UTC
I have sent two patches upstream, in addition to the one I sent earlier for the address check (part (a) of the issue):

1. trivial - add a missing assertion - https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg03317.html
2. actual fix to part (b) of the issue - https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg03319.html

Hopefully 2) will be an acceptable solution after some discussions.

Comment 37 Julia Suvorova 2023-06-19 10:47:17 UTC
(In reply to Ani Sinha from comment #35)
[...]
> "0x0" is the right address. At least the device address should be 0. If the
> function is non-zero, even though this is technically correct for a
> multifunction bridge, OSPM won't detect a hotplug.

The question was to Igor, as I did not understand his opposition to the correct port check.
 

Comment 38 Igor Mammedov 2023-06-19 13:58:24 UTC
(In reply to Julia Suvorova from comment #37)
[...]
> The question was to Igor, as I did not understand his opposition to the
> correct port check.
[...]
func != 0 is also a technically correct address (even if the guest won't
re-enumerate it).
Like Michael mentioned, we could potentially do multi-function hotplug
(plug the !0 functions first and then plug function 0 last, which will
trigger bus re-enumeration; though I'll resist any notion that it is
supported if something doesn't work there).

My point was that it is not really a hotplug issue (the hot unplug is just a
test case that triggers the problem).
I'd hazard a guess that it might explode with cold-plugged multifunction as
well, and is likely not limited to PCIe only.
(i.e. does skipping device_del and only doing netdev_del on a !0 function
backend trigger the assert?)

Comment 39 Ani Sinha 2023-06-19 14:37:19 UTC
> func != 0 is also a technically correct address (even if the guest won't
> re-enumerate it).
> Like Michael mentioned, we could potentially do multi-function hotplug
> (plug the !0 functions first and then plug function 0 last, which will
> trigger bus re-enumeration; though I'll resist any notion that it is
> supported if something doesn't work there).
> 
> My point was that it is not really a hotplug issue (the hot unplug is just a
> test case that triggers the problem).
> I'd hazard a guess that it might explode with cold-plugged multifunction as
> well, and is likely not limited to PCIe only.
> (i.e. does skipping device_del and only doing netdev_del on a !0 function
> backend trigger the assert?)

I think we are putting too much emphasis on the function; it's irrelevant. As long as the slot is non-zero, my experiments show that the guest won't eject the device.
That being said, Igor is right: if we skip device_del and just do netdev_del, the issue triggers - I did a quick test. Function and slot are irrelevant here because we are deliberately not ejecting the nic.

Comment 40 Julia Suvorova 2023-06-19 14:59:00 UTC
(In reply to Igor Mammedov from comment #38)
[...]
> func != 0 is also a technically correct address (even if the guest won't
> re-enumerate it).
> Like Michael mentioned, we could potentially do multi-function hotplug
> (plug the !0 functions first and then plug function 0 last, which will
> trigger bus re-enumeration; though I'll resist any notion that it is
> supported if something doesn't work there).

I'm not saying anything about multifunction devices; that side is absolutely valid. The device number is wrong there: it's #2 instead of #0 on the root port, which is weird because it shouldn't exist.
 

Comment 41 Ani Sinha 2023-06-19 15:21:14 UTC
(In reply to Julia Suvorova from comment #40)
[...]
> I'm not saying anything about multifunction devices; that side is absolutely
> valid. The device number is wrong there: it's #2 instead of #0 on the root
> port, which is weird because it shouldn't exist.

Yes! 100% agreed with Julia here.

Comment 42 Ani Sinha 2023-06-22 10:52:12 UTC
(In reply to Ani Sinha from comment #36)
> I have sent two patches upstream in addition to the one I sent earlier for
> address check (part (a) of the issue) :
> 
> 1. trivial - add a missing assertion -
> https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg03317.html
> 2. actual fix to part (b) of the issue -
> https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg03319.html
> 
> Hopefully 2) will be an acceptable solution after some discussions.

Part (a) fix + fixes for tests that were breaking because of the change:

https://lists.gnu.org/archive/html/qemu-devel/2023-06/msg04502.html

Comment 45 Yanan Fu 2023-07-11 09:11:24 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 46 Ani Sinha 2023-07-11 13:49:21 UTC
Raised https://bugzilla.redhat.com/show_bug.cgi?id=2222009 to track the invalid device address use issue.
An upstream patch that emits a warning has been merged:

https://gitlab.com/qemu-project/qemu/-/commit/ca92eb5defcf9d1c2106341744a73a03cf26e824

Comment 49 Lei Yang 2023-07-13 04:30:39 UTC
==> Reproduced this problem on qemu-kvm-8.0.0-6.el9

=>Test Version:
kernel-5.14.0-336.el9.x86_64
qemu-kvm-8.0.0-6.el9.x86_64
edk2-ovmf-20230524-1.el9.noarch

=>Test Steps
1. Create a vdpa device
vdpa dev add name vdpa${i} mgmtdev pci/$pci_addr mac 00:11:22:33:44:${macaddr}
2. Boot a guest without a nic device
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1'  \
-sandbox on  \
-blockdev '{"node-name": "file_ovmf_code", "driver": "file", "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd", "auto-read-only": true, "discard": "unmap"}' \
-blockdev '{"node-name": "drive_ovmf_code", "driver": "raw", "read-only": true, "file": "file_ovmf_code"}' \
-blockdev '{"node-name": "file_ovmf_vars", "driver": "file", "filename": "/root/avocado/data/avocado-vt/avocado-vt-vm1_rhel930-64-virtio-scsi-ovmf_qcow2_filesystem_VARS.raw", "auto-read-only": true, "discard": "unmap"}' \
-blockdev '{"node-name": "drive_ovmf_vars", "driver": "raw", "read-only": false, "file": "file_ovmf_vars"}' \
-machine q35,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars,memory-backend=mem-machine_mem \
-device '{"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"}' \
-device '{"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"}' \
-device '{"driver":"pcie-root-port","port":21,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x2.0x5"}' \
-nodefaults \
-m 8096 \
-object '{"size": 8489271296, "id": "mem-machine_mem", "qom-type": "memory-backend-ram"}'  \
-smp 28,maxcpus=28,cores=14,threads=1,dies=1,sockets=2  \
-cpu 'Icelake-Server-noTSX',+kvm_pv_unhalt \
-device '{"id": "virtio_scsi_pci0", "driver": "virtio-scsi-pci", "bus": "pci.2", "addr": "0x0"}' \
-blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/images/rhel930-64-virtio-scsi-ovmf.qcow2", "cache": {"direct": true, "no-flush": false}}' \
-blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \
-device '{"driver": "scsi-hd", "id": "image1", "drive": "drive_image1", "write-cache": "on"}' \
-device '{"driver":"virtio-vga","id":"video0","bus":"pcie.0","addr":"0x1"}' \
-vnc :0  \
-rtc base=utc,clock=host,driftfix=slew  \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-device '{"id": "pcie_extra_root_port_0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x3", "chassis": 5}' \
-monitor stdio \
-qmp tcp:0:5555,server,nowait \

3. Hotplug & unplug the vdpa nic
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute":"netdev_add","arguments":{"type":"vhost-vdpa","id":"hostnet1","vhostdev":"/dev/vhost-vdpa-0"}}
{"return": {}}
{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet1","mac":"00:11:22:33:44:03","id": "net1","bus":"pci.6","addr":"0x2.0x5"}}
{"return": {}}
{"execute":"device_del","arguments":{"id": "net1"}}
{"return": {}}
{"execute":"netdev_del","arguments":{"id":"hostnet1"}}
{"return": {}}

4. Shutdown guest
# shutdown -h now

5. Guest hits a qemu core dump:
(qemu) qemu-kvm: ../hw/net/vhost_net.c:520: VHostNetState *get_vhost_net(NetClientState *): Assertion `vhost_net' failed.

==> So this problem is reproduced on qemu-kvm-8.0.0-6.el9

==> Verified this problem on qemu-kvm-8.0.0-7.el9

=>Test Version
kernel-5.14.0-336.el9.x86_64
qemu-kvm-8.0.0-7.el9.x86_64
edk2-ovmf-20230524-1.el9.noarch

=>Test Steps
Repeated the above test steps; there are no issues any more, so this problem has been fixed on qemu-kvm-8.0.0-7.el9. Moving it to "VERIFIED".

Comment 50 Laurent Vivier 2023-07-26 09:42:31 UTC
Asking for a Z-Stream on RHEL-9.2.0, primarily because the same commit will fix RHEL-200 (https://issues.redhat.com/browse/RHEL-200) for rhel-9.2.0, and Z-stream is not available with Jira for RHEL-9.2.0.

The fix that needs to be backported is:
commit a0d7215e339b61c7d7a7b3fcf754954d80d93eb8
Author: Ani Sinha <anisinha>
Date:   Mon Jun 19 12:22:09 2023 +0530

    vhost-vdpa: do not cleanup the vdpa/vhost-net structures if peer nic is present
    
    When a peer nic is still attached to the vdpa backend, it is too early to free
    up the vhost-net and vdpa structures. If these structures are freed here, then
    QEMU crashes when the guest is being shut down. The following call chain
    would result in an assertion failure since the pointer returned from
    vhost_vdpa_get_vhost_net() would be NULL:
    
    do_vm_stop() -> vm_state_notify() -> virtio_set_status() ->
    virtio_net_vhost_status() -> get_vhost_net().
    
    Therefore, we defer freeing up the structures until at guest shutdown
    time when qemu_cleanup() calls net_cleanup() which then calls
    qemu_del_net_client() which would eventually call vhost_vdpa_cleanup()
    again to free up the structures. This time, the loop in net_cleanup()
    ensures that vhost_vdpa_cleanup() will be called one last time when
    all the peer nics are detached and freed.
    
    All unit tests pass with this change.
    
    CC: imammedo
    CC: jusual
    CC: mst
    Fixes: CVE-2023-3301
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2128929
    Signed-off-by: Ani Sinha <anisinha>
    Message-Id: <20230619065209.442185-1-anisinha>
    Reviewed-by: Michael S. Tsirkin <mst>
    Signed-off-by: Michael S. Tsirkin <mst>

This patch should fix the QEMU core dump in both cases (BZ 2128929 and RHEL-200).
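
For reference, the core of that commit is an early-exit guard at the top of vhost_vdpa_cleanup(); roughly (a paraphrased sketch, not the verbatim diff):

    static void vhost_vdpa_cleanup(NetClientState *nc)
    {
        VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);

        /*
         * If a peer NIC is still attached, it is too early to free the
         * vhost-net/vdpa structures; defer to the final call made from
         * qemu_cleanup() -> net_cleanup() -> qemu_del_net_client() once
         * the peer has been detached.
         */
        if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_NIC) {
            return;
        }

        if (s->vhost_net) {
            vhost_net_cleanup(s->vhost_net);
            g_free(s->vhost_net);
            s->vhost_net = NULL;
        }
        /* ... rest of the teardown unchanged ... */
    }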

Comment 54 errata-xmlrpc 2023-11-07 08:26:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6368

