Bug 2059809

Summary: [virtual network][rhel9][vdpa]Booting the guest with the vdpa device, hot unplugging it and then hot plugging it again, rebooting the guest causes qemu core dump
Product: Red Hat Enterprise Linux 9 Reporter: Lei Yang <leiyang>
Component: qemu-kvm Assignee: lulu <lulu>
qemu-kvm sub component: Networking QA Contact: Lei Yang <leiyang>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: high    
Priority: unspecified CC: aadam, chayang, coli, jasowang, jinzhao, juzhang, lulu, pezhang, virt-maint, wquan
Version: 9.0 Keywords: Triaged
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-09 01:41:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Lei Yang 2022-03-02 04:54:20 UTC
Description of problem:
Booting the guest with a vdpa device, hot-unplugging the device, hot-plugging it again, and then rebooting the guest causes a qemu core dump.

Version-Release number of selected component (if applicable):
qemu-kvm-6.2.0-10.el9.x86_64
kernel-5.14.0-68.mr552_220223_1400.el9.x86_64
git clone https://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git
vdpa-Add-support-to-configure-max-number-of-VQs.patch
vdpa-Remove-unsupported-command-line-option.patch
virtio-Define-bit-numbers-for-device-independent-fea.patch

# flint -d 0000:3b:00.0 q
Image type:            FS4
FW Version:            22.32.2004
FW Release Date:       13.1.2022
Product Version:       22.32.2004
Rom Info:              type=UEFI version=14.25.18 cpu=AMD64,AARCH64
                       type=PXE version=3.6.502 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             b8cef603000a110c        4
Base MAC:              b8cef60a110c            4
Image VSD:             N/A
Device VSD:            N/A
PSID:                  MT_0000000359
Security Attributes:   N/A

How reproducible:
Only once so far

Unfortunately, I encountered it only once, and no qemu core dump information was captured. While trying to reproduce it, I hit another bug: Bug 2059799. Although the test steps for the two bugs are exactly the same, they produce different results.

When the qemu core dump occurs, qemu outputs:
qemu-kvm: ../hw/virtio/vhost-vdpa.c:560: int vhost_vdpa_get_vq_index(struct vhost_dev *, int): Assertion `idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs' failed.
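
For readers unfamiliar with the failing check, the condition in the assertion message can be sketched as below. This is only an illustration, not QEMU's actual source: the struct `sketch_vhost_dev` and both function names are hypothetical, and only the fields `vq_index` and `nvqs` and the condition itself are taken from the message above. The check requires that the index fall in the half-open range [vq_index, vq_index + nvqs).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the two fields named in the
 * assertion message; the real struct in QEMU has many more members. */
struct sketch_vhost_dev {
    int vq_index; /* first virtqueue index owned by this device */
    int nvqs;     /* number of virtqueues owned by this device */
};

/* Returns true when idx lies in [vq_index, vq_index + nvqs),
 * i.e. the same condition that appears in the assertion message. */
static bool sketch_vq_index_in_range(const struct sketch_vhost_dev *dev,
                                     int idx)
{
    return idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs;
}

/* Assumed shape of the failing call: abort when the index is out of
 * range, otherwise hand the index back unchanged. */
static int sketch_get_vq_index(const struct sketch_vhost_dev *dev, int idx)
{
    assert(sketch_vq_index_in_range(dev, idx));
    return idx;
}
```

With `vq_index = 2` and `nvqs = 2`, indexes 2 and 3 pass the check, while 1 and 4 would trip the assertion and abort the process, which matches the kind of failure reported here.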

Please feel free to contact me if you need any tests to be done.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 lulu@redhat.com 2022-03-07 06:03:27 UTC
Hi Lei,
Would you help try to reproduce this issue on kernel-5.14.0-68.el9.x86_64?
I wonder if this is an issue in the mlx driver.
Here is a similar bug in RHEL 8.6:
https://bugzilla.redhat.com/show_bug.cgi?id=2048060
That bug has the same assert, and it was fixed by the MR to sync the mlx driver:
https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/1974

thanks
Cindy

Comment 3 Lei Yang 2022-03-07 07:58:47 UTC
(In reply to lulu from comment #2)
> Hi lei, 
> would you help try to reproduce this issue in kernel-5.14.0-68.el9.x86_64? 
> I wonder if this is an issue in mlx driver, 
> here is the bug in rhel8.6 
> https://bugzilla.redhat.com/show_bug.cgi?id=2048060
> this bug has the same assert, this issue was fixed by the MR to sync the mlx
> driver
> https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/1974
> 
> thanks
> Cindy

Hello Cindy

I tested on kernel-5.14.0-68.el9.x86_64 and could not reproduce this bug. Therefore, the current bug should be the same issue as bug 2048060.

Test Version:
kernel-5.14.0-68.el9.x86_64
qemu-kvm-6.2.0-10.el9.x86_64

Best Regards
Lei

Comment 4 Lei Yang 2022-03-09 01:41:50 UTC
This bug is a Mellanox firmware issue, and Nvidia's team is syncing the latest driver, so I am setting the bug to "CURRENTRELEASE". Please correct me if I'm wrong.