Bug 2177620

Summary: [mlx vhost_vdpa][rhel 9.2]qemu core dump when hot unplug then hotplug a vdpa interface with multi-queue setting
Product: Red Hat Enterprise Linux 9
Reporter: Lei Yang <leiyang>
Component: qemu-kvm
Assignee: Laurent Vivier <lvivier>
qemu-kvm sub component: Networking
QA Contact: Lei Yang <leiyang>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: unspecified
CC: aadam, chayang, eperezma, jinzhao, juzhang, lulu, lvivier, virt-maint, wquan, yalzhang, yama, ymankad
Version: 9.2
Keywords: Regression, Triaged, ZStream
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-8.0.0-3.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2213864
Environment:
Last Closed: 2023-11-07 08:27:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2180898
Bug Blocks: 2213864

Comment 9 Laurent Vivier 2023-03-15 09:02:01 UTC
@eperezma 

I don't think we can push this fix ("vdpa: stop all svq on device deletion") while the race condition between virtio-net and vhost is not fixed:
do you agree?

As reported by Lei, it doesn't fix RHEL-200 (https://issues.redhat.com/browse/RHEL-200), and moreover it introduces a regression.

Comment 10 Lei Yang 2023-05-26 07:33:22 UTC
==> Reproduced this problem on the latest rhel 9.2 qemu-kvm version: qemu-kvm-7.2.0-14.el9_2.x86_64

=>Test Version
qemu-kvm-7.2.0-14.el9_2.x86_64
kernel-5.14.0-313.el9.x86_64
iproute-6.2.0-1.el9.x86_64

# flint -d 0000:17:00.0 q
Image type:            FS4
FW Version:            22.37.0154
FW Release Date:       17.3.2023
Product Version:       22.37.0154
Description:           UID                GuidsNumber
Base GUID:             b8cef603000a11f0        4
Base MAC:              b8cef60a11f0            4
Image VSD:             N/A
Device VSD:            N/A
PSID:                  MT_0000000359
Security Attributes:   N/A

=>Test steps
1. Create a multi-queue vdpa device
# vdpa dev add name vdpa0 mgmtdev pci/$pci_addr mac 00:11:22:33:44:00 max_vqp 8

2. Boot a guest with this vdpa device
-device '{"driver": "virtio-net-pci", "mac": "00:11:22:33:44:00", "id": "net0", "netdev": "hostnet0", "mq": true, "vectors": 18, "bus": "pcie-root-port-3", "addr": "0x0"}'  \
-netdev vhost-vdpa,id=hostnet0,vhostdev=/dev/vhost-vdpa-0,queues=8  \

3. Hot unplug the device
{"execute": "device_del", "arguments": {"id": "net0"}}
{"return": {}}
{"timestamp": {"seconds": 1685085387, "microseconds": 922551}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/net0/virtio-backend"}}
{"timestamp": {"seconds": 1685085387, "microseconds": 973046}, "event": "DEVICE_DELETED", "data": {"device": "net0", "path": "/machine/peripheral/net0"}}
{"execute": "netdev_del", "arguments": {"id": "hostnet0"}}
{"return": {}}

4. Hotplug this device again
{"execute":"netdev_add","arguments":{"type":"vhost-vdpa","id":"hostnet0","vhostdev":"/dev/vhost-vdpa-0","queues": 8}}
{"return": {}}
{"execute":"device_add","arguments":{"driver":"virtio-net-pci","netdev":"hostnet0","mac":"00:11:22:33:44:00","id": "net0","bus":"pcie-root-port-3","addr":"0x0","mq":true,"vectors": 18}}
{"return": {}}

5. After a few moments, the guest hits a qemu core dump.

==>So reproduced this problem on qemu-kvm-7.2.0-14.el9_2.x86_64

==>Verified it on qemu-kvm-8.0.0-3.el9.x86_64
=>Repeated the above test steps; the guest works well, so this bug has been fixed on qemu-kvm-8.0.0-3.el9.x86_64.
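
The unplug/hotplug steps above can be sketched as a small Python helper that builds the same QMP commands. This is only an illustration, not the QE test harness: the function names `msix_vectors` and `replug_commands` are hypothetical, the default arguments are copied from the test steps, and the vectors rule (2 * queues + 2: one vector per RX and TX queue, plus config and control queue) is the usual multi-queue virtio-net guideline, which matches the "queues": 8 / "vectors": 18 pair used here. Sending the commands over a QMP socket, and waiting for DEVICE_DELETED before netdev_del, is left out.

```python
import json


def msix_vectors(queue_pairs: int) -> int:
    """Usual guideline for multi-queue virtio-net: 2 * N + 2 vectors."""
    return 2 * queue_pairs + 2


def replug_commands(dev_id="net0", netdev_id="hostnet0",
                    vhostdev="/dev/vhost-vdpa-0", queues=8,
                    mac="00:11:22:33:44:00", bus="pcie-root-port-3"):
    """Return (unplug, hotplug) lists of QMP commands as dicts.

    Hypothetical helper; defaults mirror the values in the test steps.
    """
    unplug = [
        {"execute": "device_del", "arguments": {"id": dev_id}},
        # In a real run, wait for the DEVICE_DELETED event before this one.
        {"execute": "netdev_del", "arguments": {"id": netdev_id}},
    ]
    hotplug = [
        {"execute": "netdev_add",
         "arguments": {"type": "vhost-vdpa", "id": netdev_id,
                       "vhostdev": vhostdev, "queues": queues}},
        {"execute": "device_add",
         "arguments": {"driver": "virtio-net-pci", "netdev": netdev_id,
                       "mac": mac, "id": dev_id, "bus": bus, "addr": "0x0",
                       "mq": True, "vectors": msix_vectors(queues)}},
    ]
    return unplug, hotplug


if __name__ == "__main__":
    unplug, hotplug = replug_commands()
    for cmd in unplug + hotplug:
        print(json.dumps(cmd))
```

Running the script prints one JSON command per line, in the order used in steps 3 and 4.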

Comment 11 Lei Yang 2023-05-26 07:38:28 UTC
Hello Laurent

Based on the above test results, QE would like to confirm two questions. Could you please help review them? Thanks in advance:

1. This bug has been fixed in qemu-kvm-8.0.0-3.el9.x86_64. Can QE close the current bug as "CURRENTRELEASE"?
2. It can also be reproduced on the latest rhel 9.2 qemu-kvm version. Does it need to be backported?

Thanks
Lei

Comment 12 Laurent Vivier 2023-05-26 08:31:20 UTC
(In reply to Lei Yang from comment #11)
> Hello Laurent

Hi Lei,

> Based on the above test results, QE would like to confirm two questions.
> Could you please help review them? Thanks in advance:
> 
> 1. This bug has been fixed in qemu-kvm-8.0.0-3.el9.x86_64. Can QE close
> the current bug as "CURRENTRELEASE"?

Yes

> 2. It can also be reproduced on the latest rhel 9.2 qemu-kvm version. Does
> it need to be backported?

That's a question for @eperezma.
And do you know which commits fix the problem?

Thanks

Comment 13 Lei Yang 2023-05-26 09:25:33 UTC
Hi Laurent

According to the test results in https://issues.redhat.com/browse/RHEL-274, it should be fixed by this patch:

commit 2e1a9de96b487cf818a22d681cad8d3f5d18dcca
Author: Eugenio Pérez <eperezma>
Date:   Thu Feb 9 18:00:04 2023 +0100

    vdpa: stop all svq on device deletion
    
    Not stopping them leave the device in a bad state when virtio-net
    fronted device is unplugged with device_del monitor command.
    
    This is not triggable in regular poweroff or qemu forces shutdown
    because cleanup is called right after vhost_vdpa_dev_start(false).  But
    devices hot unplug does not call vdpa device cleanups.  This lead to all
    the vhost_vdpa devices without stop the SVQ but the last.
    
    Fix it and clean the code, making it symmetric with
    vhost_vdpa_svqs_start.
    
    Fixes: dff4426fa656 ("vhost: Add Shadow VirtQueue kick forwarding capabilities")
    Reported-by: Lei Yang <leiyang>
    Signed-off-by: Eugenio Pérez <eperezma>
    Message-Id: <20230209170004.899472-1-eperezma>
    Tested-by: Laurent Vivier <lvivier>
    Acked-by: Jason Wang <jasowang>

Thanks
Lei
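
The failure mode described in the commit message can be illustrated with a toy Python model. This is not QEMU code: the classes `ShadowVirtQueue` and `ToyVdpaDev`, the function `svqs_left_running`, and the choice of 8 devices with 2 SVQs each are all hypothetical. It only encodes one reading of the message: before the fix, stopping a device stopped its SVQs only on the last vhost_vdpa device and the rest relied on cleanup, which runs on poweroff but not on hot unplug. After the fix, stop mirrors start for every device.

```python
class ShadowVirtQueue:
    """Toy SVQ: only tracks whether it is running."""

    def __init__(self):
        self.running = False


class ToyVdpaDev:
    """Hypothetical stand-in for one vhost_vdpa device (not QEMU code)."""

    def __init__(self, n_svqs, is_last, fixed):
        self.svqs = [ShadowVirtQueue() for _ in range(n_svqs)]
        self.is_last = is_last
        self.fixed = fixed

    def start(self):
        for svq in self.svqs:
            svq.running = True

    def stop(self):
        # Fixed: symmetric with start(), every device stops its SVQs.
        # Buggy: only the last device stops them here.
        if self.fixed or self.is_last:
            for svq in self.svqs:
                svq.running = False

    def cleanup(self):
        # Reached on poweroff, but not on hot unplug.
        for svq in self.svqs:
            svq.running = False


def svqs_left_running(path, fixed, n_devs=8, n_svqs=2):
    """Count SVQs still running after 'poweroff' or 'hot_unplug'."""
    devs = [ToyVdpaDev(n_svqs, i == n_devs - 1, fixed) for i in range(n_devs)]
    for dev in devs:
        dev.start()
    for dev in devs:
        dev.stop()
        if path == "poweroff":
            dev.cleanup()
    return sum(svq.running for dev in devs for svq in dev.svqs)


if __name__ == "__main__":
    for path in ("poweroff", "hot_unplug"):
        for fixed in (False, True):
            print(path, "fixed" if fixed else "buggy",
                  svqs_left_running(path, fixed))
```

In this model only the combination of hot unplug and the pre-fix stop path leaves SVQs running, which is consistent with the bug not being triggerable on a regular poweroff.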

Comment 14 Laurent Vivier 2023-05-26 11:40:22 UTC
(In reply to Lei Yang from comment #13)
But according to comment #9, this fix introduces a regression, so I think it is not enough.

Comment 15 Lei Yang 2023-05-29 02:02:54 UTC
Hi Laurent

According to QE's test results, the problem mentioned in comment 9 has also been fixed, but QE cannot tell which commit fixed it. For more details, please refer to the latest comment in https://issues.redhat.com/browse/RHEL-200.

Thanks
Lei

Comment 19 Laurent Vivier 2023-06-09 08:42:35 UTC
According to comment #11, it has been fixed in QEMU 8.0.0 and comes with the rebase in RHEL 9.3.0.

Moving to MODIFIED, and asking for Z-stream

Comment 24 Lei Yang 2023-06-14 00:35:20 UTC
Based on the test results in comment 10, moving to "VERIFIED".

Comment 26 errata-xmlrpc 2023-11-07 08:27:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6368