Bug 1811863

Summary: Boot guest with device assignment, choose PC machine and "pci=nomsi" in kernel line, guest kernel will show "Call Trace" when booting testpmd (rhel8.2)

Product: Red Hat Enterprise Linux 8
Component: qemu-kvm
Sub Component: Networking
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WONTFIX
Severity: low
Priority: low
Keywords: Triaged
Reporter: Pei Zhang <pezhang>
Assignee: Laurent Vivier <lvivier>
QA Contact: Pei Zhang <pezhang>
CC: aadam, alex.williamson, chayang, jinzhao, juzhang, peterx, virt-maint, yama
Target Milestone: rc
Target Release: 8.0
Type: Bug
Clone Of: 1809978
Bug Depends On: 1809978
Bug Blocks: 1811885
Last Closed: 2021-01-20 10:08:23 UTC

Description Pei Zhang 2020-03-10 00:50:50 UTC
+++ This bug was initially created as a clone of Bug #1809978 +++

Description of problem:
Boot a guest with device assignment and the PC machine type, and add "pci=nomsi" to the guest kernel command line. Then start dpdk's testpmd in the guest; a "Call Trace" appears in the guest kernel log.

Version-Release number of selected component (if applicable):
4.18.0-185.el8.x86_64
qemu-kvm-2.12.0-99.module+el8.2.0+5827+8c39933c.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Boot qemu with device assignment, using the PC machine type

/usr/libexec/qemu-kvm -name rhel8.2 \
-M pc \
-cpu host -m 8G \
-smp 4 \
-drive file=/home/images_nfv-virt-rt-kvm/rhel8.2.qcow2,format=qcow2,if=none,id=drive-virtio-blk0,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-virtio-blk0,id=virtio-blk0,bootindex=1 \
-vnc :2 \
-monitor stdio \
-nodefaults \
-device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16 \
-serial unix:/tmp/monitor1,server,nowait \
-netdev tap,id=hostnet0 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=18:66:da:5f:dd:01 \
-device vfio-pci,host=0000:5e:00.0,id=hostdev0 \
-device vfio-pci,host=0000:5e:00.1,id=hostdev1 \
-boot menu=on \


2. Add pci=nomsi to the guest kernel command line (one way to do this is sketched below)

# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-185.el8.x86_64 root=/dev/mapper/rhel_vm--74--225-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_vm--74--225-swap rd.lvm.lv=rhel_vm-74-225/root rd.lvm.lv=rhel_vm-74-225/swap skew_tick=1 nohz=on nohz_full=1,2,3,4,5 rcu_nocbs=1,2,3,4,5 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup iommu=pt intel_iommu=on skew_tick=1 nohz=on nohz_full=1,2,3,4,5 rcu_nocbs=1,2,3,4,5 tuned.non_isolcpus=00000001 ... pci=nomsi
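
One way to add the parameter persistently is with grubby, the stock RHEL 8 tool for editing kernel command lines (a minimal sketch, assuming the flag should apply to all boot entries):

# grubby --update-kernel=ALL --args="pci=nomsi"
# reboot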


3. In the guest, load the VFIO modules, allocate hugepages, and bind the NICs

# modprobe vfio enable_unsafe_noiommu_mode=Y
# modprobe vfio-pci

# echo 1 >  /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages

# dpdk-devbind --bind=vfio-pci 0000:00:05.0
# dpdk-devbind --bind=vfio-pci 0000:00:06.0
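
Before starting testpmd, the setup can be double-checked (a short sketch; the sysfs paths mirror the commands above, and dpdk-devbind --status is the same tool used in comment 5):

# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# cat /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
# dpdk-devbind --status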

4. In the guest, start testpmd. The guest kernel logs a Call Trace.

/usr/bin/testpmd \
        -l 1,2,3 \
        -n 4 \
        -d /usr/lib64/librte_pmd_ixgbe.so \
        -w 0000:00:05.0 -w 0000:00:06.0 \
        -- \
        --nb-cores=2 \
        -i \
        --disable-rss \
        --rxd=512 --txd=512 \
        --rxq=1 --txq=1 \

# dmesg
[   70.658831] ixgbe 0000:00:06.0: complete
[   70.691665] vfio-pci 0000:00:06.0: Adding to iommu group 1
[   70.692553] vfio-pci 0000:00:06.0: Adding kernel taint for vfio-noiommu group on device
[  118.255300] vfio-pci 0000:00:05.0: vfio-noiommu device opened by user (testpmd:1735)
[  118.942906] vfio-pci 0000:00:06.0: vfio-noiommu device opened by user (testpmd:1735)
[  127.889733] irq 10: nobody cared (try booting with the "irqpoll" option)
[  127.890831] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G     U           --------- -  - 4.18.0-185.el8.x86_64 #1
[  127.892519] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
[  127.893830] Call Trace:
[  127.894291]  <IRQ>
[  127.894684]  dump_stack+0x5c/0x80
[  127.895254]  __report_bad_irq+0x37/0xae
[  127.895880]  note_interrupt.cold.9+0xa/0x69
[  127.896554]  handle_irq_event_percpu+0x6a/0x80
[  127.897265]  handle_irq_event+0x36/0x53
[  127.897882]  handle_fasteoi_irq+0x8b/0x130
[  127.898553]  handle_irq+0xbf/0x100
[  127.899119]  do_IRQ+0x49/0xe0
[  127.899616]  common_interrupt+0xf/0xf
[  127.900215] RIP: 0010:__do_softirq+0x76/0x30a
[  127.900911] Code: 81 05 ba 5b c1 69 00 01 00 00 c7 44 24 20 0a 00 00 00 44 89 34 24 48 c7 c0 00 94 02 00 65 66 c7 00 00 00 fb 66 0f 1f 44 00 00 <48> c7 44 24 08 00 51 a0 96 b8 ff ff ff ff 0f bc 04 24 83 c0 01 89
[  127.903763] RSP: 0018:ffff922a77a03f58 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdc
[  127.904934] RAX: 0000000000029400 RBX: 0000000000000024 RCX: 00000000ffffffff
[  127.906041] RDX: 000000000000006b RSI: 0000000000000006 RDI: ffffffff97801000
[  127.907146] RBP: ffffffff96a03dd8 R08: 00000021400930d1 R09: 0000000000000000
[  127.908252] R10: 0000000000000000 R11: 0000000000000000 R12: ffff922947d13400
[  127.909356] R13: 0000000000000024 R14: 0000000000000008 R15: 0000000000000000
[  127.910464]  ? common_interrupt+0xa/0xf
[  127.911089]  ? __do_softirq+0x4b/0x30a
[  127.911701]  irq_exit+0x100/0x110
[  127.912253]  do_IRQ+0x7f/0xe0
[  127.914039]  common_interrupt+0xf/0xf
[  127.915297]  </IRQ>
[  127.916314] RIP: 0010:native_safe_halt+0xe/0x10
[  127.917689] Code: ff ff 7f c3 65 48 8b 04 25 80 5c 01 00 f0 80 48 02 20 48 8b 00 a8 08 75 c4 eb 80 90 e9 07 00 00 00 0f 00 2d 76 49 57 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 66 49 57 00 f4 c3 90 90 0f 1f 44 00
[  127.921872] RSP: 0018:ffffffff96a03e88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdb
[  127.923714] RAX: ffffffff96094410 RBX: 0000000000000000 RCX: 0000000000000001
[  127.925483] RDX: 0000000000000001 RSI: 0000000000000083 RDI: 0000000000000000
[  127.927243] RBP: 0000000000000000 R08: 000000213d1a1e0b R09: 0000000000000001
[  127.928993] R10: 0000000094f4efa8 R11: 0000000000000007 R12: 0000000000000000
[  127.930726] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  127.932463]  ? __sched_text_end+0x6/0x6
[  127.933719]  default_idle+0x1c/0x130
[  127.934932]  do_idle+0x1f1/0x280
[  127.936103]  cpu_startup_entry+0x6f/0x80
[  127.937366]  start_kernel+0x53b/0x55b
[  127.938596]  secondary_startup_64+0xb7/0xc0
[  127.939882] handlers:
[  127.940891] [<00000000713efd45>] vp_interrupt
[  127.942191] Disabling IRQ #10


Actual results:
The guest kernel logs a Call Trace.

Expected results:
The guest kernel should not log a Call Trace.

Additional info:
1. With the Q35 machine type, this issue does not occur.

2. Without "pci=nomsi" on the guest kernel command line, this issue does not occur.

3. This bug was found by handling https://bugzilla.redhat.com/show_bug.cgi?id=1786404#c26.

--- Additional comment from RHEL Program Management on 2020-03-04 18:34:29 HKT ---

Since this bug report was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from John Ferlan on 2020-03-05 20:18:04 HKT ---

NB:  Moved the bug to rhel-7.9.0 since it's too late for 7.8.0. If a z-stream is needed, then that can be added.  Note well that qemu-kvm-rhev will not be shipped with RHV for 7.9.0, so if this must be fixed for RHV 4.3, a 7.8.z stream will be necessary. 

Ariel - searching on previous RHEL7 owners with qemu-kvm, testpmd, and dpdk turns up Jason and Maxime - so I'll start with you on this. I do note this was discovered while working on a Peter Xu bug.

Comment 1 Laurent Vivier 2020-03-10 16:48:36 UTC
> -device vfio-pci,host=0000:5e:00.0,id=hostdev0 \
> -device vfio-pci,host=0000:5e:00.1,id=hostdev1 \

Which PCI cards are plugged into these ports on the host?

> [  127.889733] irq 10: nobody cared (try booting with the "irqpoll" option) 

This means the kernel received an interrupt that none of its registered handlers claimed.

There are some related fixes in the v5.1 kernel, but they were already merged into kernel-4.18.0-147.9:

e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification")
3e82a7f9031f ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked")
15d2aba7c602 ("PCI/portdrv: Use shared MSI/MSI-X vector for Bandwidth Management")
2078e1e7f7e0 ("[pci] PCI/LINK: Add Kconfig option (default off)")

Comment 2 Laurent Vivier 2020-03-10 16:53:54 UTC
Alex,

as this seems related to VFIO,

do you have any idea why we get "irq 10: nobody cared" with the "pci=nomsi" parameter?

Comment 3 Alex Williamson 2020-03-10 19:22:04 UTC
I think this is probably a combination of things: vfio-pci advertising MSI/X interrupts that it cannot configure when used with pci=nomsi, and perhaps the dpdk driver ignoring the error from MSI/X setup and allowing the device to continue to generate interrupts.  Perhaps there's something we can harden in either vfio-pci or the dpdk drivers to prevent this, but this is a pretty low-priority issue given that the VM (or host - this is reproducible with DPDK on bare metal) needs to be configured with pci=nomsi, which is not a configuration we'd advise for anything other than conducting specific tests.
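
A quick way to see this state from inside the guest (a sketch, assuming the reproducer's device at 00:05.0): with pci=nomsi the MSI-X capability should stay at "Enable-" even after testpmd's setup attempt, while the legacy "Interrupt: pin A routed to IRQ ..." routing remains active:

# lspci -vv -s 00:05.0 | grep -E 'MSI-X|Interrupt:'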

Comment 4 Pei Zhang 2020-03-11 06:23:46 UTC
(In reply to Laurent Vivier from comment #1)
> > -device vfio-pci,host=0000:5e:00.0,id=hostdev0 \
> > -device vfio-pci,host=0000:5e:00.1,id=hostdev1 \
> 
> Which PCI card are plugged in these ports on the host?
> 

Hi Laurent,

They're X540-AT2 (10G, ixgbe) cards.

# lspci | grep Eth

5e:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
5e:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

Comment 5 Laurent Vivier 2020-04-08 14:27:20 UTC
Tested on bare metal with:

kernel-4.18.0-193.el8.x86_64
dpdk-19.11-4.el8.x86_64

# dpdk-devbind --status

Network devices using DPDK-compatible driver
============================================
0000:05:00.0 'Ethernet Controller 10-Gigabit X540-AT2 1528' drv=vfio-pci unused=ixgbe
0000:05:00.1 'Ethernet Controller 10-Gigabit X540-AT2 1528' drv=vfio-pci unused=ixgbe

When I run testpmd directly on a host booted with pci=nomsi, testpmd correctly reports MSI-X errors:

# /usr/bin/testpmd         -l 1,2,3         -n 4         -d /usr/lib64/librte_pmd_ixgbe.so.20.0 -w 0000:05:00.0 -w 0000:05:00.1 --         --nb-cores=2         -i         --disable-rss         --rxd=512 --txd=512         --rxq=1 --txq=1
EAL: Detected 16 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No free hugepages reported in hugepages-2048kB
EAL: No free hugepages reported in hugepages-2048kB
EAL: No available hugepages reported in hugepages-2048kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:05:00.0 on NUMA socket 0
EAL:   probe driver: 8086:1528 net_ixgbe
EAL:   using IOMMU type 1 (Type 1)
EAL: Error enabling MSI-X interrupts for fd 39
EAL: PCI device 0000:05:00.1 on NUMA socket 0
EAL:   probe driver: 8086:1528 net_ixgbe
EAL: Error enabling MSI-X interrupts for fd 44
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=163456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=163456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 0)
EAL: Error disabling MSI-X interrupts for fd 39
EAL: Error enabling MSI-X interrupts for fd 39
Port 0: A0:36:9F:65:88:74
Configuring Port 1 (socket 0)
EAL: Error disabling MSI-X interrupts for fd 44
EAL: Error enabling MSI-X interrupts for fd 44
Port 1: A0:36:9F:65:88:76
Checking link statuses...
Done
testpmd> quit

Stopping port 0...
Stopping ports...
Done

Stopping port 1...
Stopping ports...
Done

Shutting down port 0...
Closing ports...
EAL: Error disabling MSI-X interrupts for fd 39
Done

Shutting down port 1...
Closing ports...
EAL: Error disabling MSI-X interrupts for fd 44
Done

Bye...

Comment 6 Laurent Vivier 2020-04-08 17:13:54 UTC
I've started a guest with the command line of comment #0, with pci=nomsi for the _guest_ kernel, and I get exactly the same errors:

# /usr/bin/testpmd         -l 1,2,3         -n 4         -d /usr/lib64/librte_pmd_ixgbe.so.20.0         -w 0000:00:05.0 -w 0000:00:06.0         --         --nb-cores=2         -i         --disable-rss         --rxd=512 --txd=512         --rxq=1 --txq=1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No free hugepages reported in hugepages-2048kB
EAL: No available hugepages reported in hugepages-2048kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:00:05.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 8086:1528 net_ixgbe
EAL:   using IOMMU type 8 (No-IOMMU)
EAL: Error enabling MSI-X interrupts for fd 31
EAL: PCI device 0000:00:06.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 8086:1528 net_ixgbe
EAL: Error enabling MSI-X interrupts for fd 35
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=163456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 0)
EAL: Error disabling MSI-X interrupts for fd 31
EAL: Error enabling MSI-X interrupts for fd 31
Port 0: A0:36:9F:65:88:74
Configuring Port 1 (socket 0)
EAL: Error disabling MSI-X interrupts for fd 35
EAL: Error enabling MSI-X interrupts for fd 35
Port 1: A0:36:9F:65:88:76
Checking link statuses...
Done
testpmd> 
Stopping port 0...
Stopping ports...
Done

Stopping port 1...
Stopping ports...
Done

Shutting down port 0...
Closing ports...
EAL: Error disabling MSI-X interrupts for fd 31
Done

Shutting down port 1...
Closing ports...
EAL: Error disabling MSI-X interrupts for fd 35
Done

Bye...

Comment 7 Alex Williamson 2020-04-08 17:27:03 UTC
Laurent, in order to see the original issue I believe you'll need to be on a system where the legacy IRQ for the device is shared with other devices.  This will leave the APIC through which the device would signal an INTx enabled, thus if the device generates interrupts after MSI-X configuration fails, it will do so via the INTx pin of the device, where no handler is installed.  I think the resolution is that vfio-pci needs to have a backup IRQ handler installed for this situation to actively mask the device when it generates an interrupt with no user configured consumer.  Thanks
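
One way to observe the situation described above (a sketch, assuming the reproducer's guest device at 00:05.0): read the device's PCI Command register; bit 10 (0x400, Interrupt Disable) set means INTx is masked at the device, clear means the device can still assert its INTx pin:

# setpci -s 00:05.0 COMMAND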

Comment 8 Laurent Vivier 2020-04-08 17:37:20 UTC
If I start the guest without pci=nomsi on the guest kernel line but with pci=nomsi on the host kernel line, I get some warnings from qemu:

qemu-kvm: vfio: failed to enable vectors, -1
qemu-kvm: vfio: failed to enable vectors, -1
qemu-kvm: vfio: failed to modify vector, -1
qemu-kvm: vfio: failed to enable vectors, -1
qemu-kvm: vfio: failed to enable vectors, -1
qemu-kvm: vfio: failed to enable vectors, -1
qemu-kvm: vfio: failed to enable vectors, -1
qemu-kvm: vfio: failed to modify vector, -1
qemu-kvm: vfio: failed to enable vectors, -1
qemu-kvm: vfio: failed to enable vectors, -1
qemu-kvm: vfio: failed to enable vectors, -1
qemu-kvm: vfio: failed to enable vectors, -1

But I can start testpmd without error:

# /usr/bin/testpmd         -l 1,2,3         -n 4         -d /usr/lib64/librte_pmd_ixgbe.so.20.0         -w 0000:00:05.0 -w 0000:00:06.0         --         --nb-cores=2         -i         --disable-rss         --rxd=512 --txd=512         --rxq=1 --txq=1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No free hugepages reported in hugepages-2048kB
EAL: No available hugepages reported in hugepages-2048kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:00:05.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 8086:1528 net_ixgbe
EAL:   using IOMMU type 8 (No-IOMMU)
EAL: PCI device 0000:00:06.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 8086:1528 net_ixgbe
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=163456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 0)
Port 0: A0:36:9F:65:88:74
Configuring Port 1 (socket 0)
Port 1: A0:36:9F:65:88:76
Checking link statuses...
Done
testpmd> 

And there are no errors in the guest or host kernel dmesg.

Comment 9 Laurent Vivier 2020-04-08 17:41:23 UTC
(In reply to Alex Williamson from comment #7)
> Laurent, in order to see the original issue I believe you'll need to be on a
> system where the legacy IRQ for the device is shared with other devices. 
> This will leave the APIC through which the device would signal an INTx
> enabled, thus if the device generates interrupts after MSI-X configuration
> fails, it will do so via the INTx pin of the device, where no handler is
> installed.  I think the resolution is that vfio-pci needs to have a backup
> IRQ handler installed for this situation to actively mask the device when it
> generates an interrupt with no user configured consumer.  Thanks

Thank you Alex.

How can I identify such devices?

I also wanted to check whether dpdk correctly handles systems with MSI/X disabled, and it seems OK (judging by the errors in comment #6).
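
One way to identify them (a sketch, not specific to this hardware): each row of /proc/interrupts lists every handler registered on that IRQ, so several comma-separated device names on one line mean the line is shared, and lspci shows which legacy IRQ a given device's INTx pin routes to:

# cat /proc/interrupts
# lspci -vv -s 05:00.0 | grep 'Interrupt:'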

Comment 10 Laurent Vivier 2020-06-22 13:19:39 UTC
Low priority issue, moving to 8.4.0

Comment 13 Pei Zhang 2021-01-20 10:08:23 UTC
Closing this bz as WONTFIX, as discussed in the internal virtio-networking sync meeting. This is a low-priority bz and a corner case.