Description of problem:
Kernel panic when running the bbdev test on an ACC100 card.

Version-Release number of selected component (if applicable):
dpdk-22.11-3.el9_2.x86_64.rpm
pf-bb-config-22.11-3.el9.x86_64.rpm

How reproducible:

Steps to Reproduce:
Run the bbdev test as follows:
[root@dell-per740-61 ~]# modprobe vfio-pci
[root@dell-per740-61 ~]# modprobe pci_pf_stub
[root@dell-per740-61 ~]# modprobe vfio-pci enable_sriov=1
[root@dell-per740-61 ~]# lspci|grep accelerators
af:00.0 Processing accelerators: Intel Corporation Device 0d5c
[root@dell-per740-61 ~]# lspci -Dd 8086:0d5c | cut -d ' ' -f 1
0000:af:00.0
[root@dell-per740-61 ~]# dpdk-devbind.py -b vfio-pci 0000:af:00.0

Actual results:
The kernel panics after running "dpdk-devbind.py -b vfio-pci 0000:af:00.0". It does not always reproduce on this system, but it has occurred about 4 times.

Call trace:
[ 106.723198] Call Trace:
[ 106.723200] <NMI>
[ 106.723202] dump_stack_lvl+0x34/0x48
[ 106.723210] panic+0xea/0x2e4
[ 106.723217] __ghes_panic.cold+0x21/0x21
[ 106.723222] ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0
[ 106.723227] ghes_notify_nmi+0x59/0xd0
[ 106.723229] nmi_handle+0x5b/0x120
[ 106.723236] default_do_nmi+0x40/0x130
[ 106.723240] exc_nmi+0x111/0x140
[ 106.723242] end_repeat_nmi+0x16/0x67
[ 106.723249] RIP: 0010:intel_idle+0x55/0xa0
[ 106.723254] Code: 48 89 d1 65 48 8b 04 25 c0 11 03 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d c2 13 4c 00 b9 01 00 00 00 48 89 f0 0f 01 c9 <65> 48 8b 04 25 c0 11 03 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[ 106.723256] RSP: 0018:ffffffffb8403e50 EFLAGS: 00000046
[ 106.723259] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001
[ 106.723260] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffe1427fa00ca8
[ 106.723261] RBP: ffffe1427fa00ca8 R08: 0000000000000002 R09: 0000000000000008
[ 106.723262] R10: 00000000000003da R11: 00000000000003d8 R12: ffffffffb88b8d40
[ 106.723263] R13: ffffffffb88b8e28 R14: 0000000000000002 R15: 0000000000000000
[ 106.723265] ? intel_idle+0x55/0xa0
[ 106.723268] ? intel_idle+0x55/0xa0
[ 106.723270] </NMI>
[ 106.723271] <TASK>
[ 106.723271] cpuidle_enter_state+0x81/0x42a
[ 106.723274] cpuidle_enter+0x29/0x40
[ 106.723279] cpuidle_idle_call+0xfa/0x160
[ 106.723284] do_idle+0x78/0xe0
[ 106.723286] cpu_startup_entry+0x19/0x20
[ 106.723288] rest_init+0xca/0xd0
[ 106.723291] arch_call_rest_init+0xa/0x24
[ 106.723298] start_kernel+0x4a3/0x4c2
[ 106.723300] secondary_startup_64_no_verify+0xe5/0xeb
[ 106.723307] </TASK>
[ 0.000000] Linux version 5.14.0-348.el9.x86_64 (mockbuild.eng.bos.redhat.com) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Mon Jul 31 18:52:45 EDT 2023

Expected results:
No kernel panic.

Additional info:
kernel panic job: https://beaker.engineering.redhat.com/jobs/8159304
console log: https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/08/81593/8159304/14376283/console.log

It occurred on the following kernels:
5.14.0-348.el9.x86_64
5.14.0-284.17.1.el9.x86_64
5.14.0-284.26.1.el9.x86_64
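
When retrying the steps above, it can help to record which kernel driver the device is bound to before and after the dpdk-devbind.py call. The following is a minimal sketch, assuming the standard Linux sysfs layout in which /sys/bus/pci/devices/<BDF>/driver is a symlink to the bound driver. The function name bound_driver and the sysfs_root parameter are illustrative and are not part of DPDK or pf-bb-config.

```python
import os


def bound_driver(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    `bdf` is the full domain:bus:device.function address, for example
    "0000:af:00.0". `sysfs_root` is a hypothetical override so the
    function can be pointed at a fake tree when testing.
    """
    link = os.path.join(sysfs_root, bdf, "driver")
    if not os.path.islink(link):
        # No driver is bound, or the device is not present at all.
        return None
    # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Address taken from the lspci output in the report.
    print(bound_driver("0000:af:00.0"))
```

Logging this value just before the bind step would show whether the panic happens while the device still has its previous driver or after vfio-pci has taken over.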
The vmcore file is large. It can be downloaded with the following command:
wget http://netqe-bj.usersys.redhat.com/share/tli/vm_core/dell740_61/kernel_5_14_0_384_el9/127.0.0.1-2023-08-07-06:58:24/vmcore.tar.gz
These are the messages right before the panic:

[ 105.574907] VFIO - User Level meta-driver version: 0.3
[ 106.723168] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 106.723172] {1}[Hardware Error]: event severity: fatal
[ 106.723174] {1}[Hardware Error]: Error 0, type: fatal
[ 106.723176] {1}[Hardware Error]: section_type: PCIe error
[ 106.723177] {1}[Hardware Error]: port_type: 4, root port
[ 106.723178] {1}[Hardware Error]: version: 3.0
[ 106.723179] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ 106.723181] {1}[Hardware Error]: device_id: 0000:ae:00.0
[ 106.723183] {1}[Hardware Error]: slot: 4
[ 106.723183] {1}[Hardware Error]: secondary_bus: 0xaf
[ 106.723185] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2030
[ 106.723186] {1}[Hardware Error]: class_code: 060400
[ 106.723187] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 106.723188] {1}[Hardware Error]: aer_uncor_status: 0x00004020, aer_uncor_mask: 0x00310000
[ 106.723189] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 106.723190] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ 106.723193] Kernel panic - not syncing: Fatal hardware error!
[ 106.723194] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.14.0-348.el9.x86_64 #1
[ 106.723197] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.8.2 08/27/2020
[ 106.723198] Call Trace:
[ 106.723200] <NMI>
[ 106.723202] dump_stack_lvl+0x34/0x48
[ 106.723210] panic+0xea/0x2e4
[ 106.723217] __ghes_panic.cold+0x21/0x21
[ 106.723222] ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0

That indicates a hardware issue. See this KCS:
https://access.redhat.com/solutions/2200441

I would suggest moving the ACC100 card to another server to see whether the problem moves with it.

Thanks,
fbl
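
The aer_uncor_status, aer_uncor_mask and aer_uncor_severity values in the error record above are bit fields, so they can be decoded to see which uncorrectable errors the root port reported. The following is a minimal sketch, assuming the bit layout of the PCIe AER Uncorrectable Error Status register as mirrored by the PCI_ERR_UNC_* constants in the Linux pci_regs.h header. The table name AER_UNCOR_BITS and the function decode_aer_uncor are illustrative names, not kernel APIs.

```python
# Bit positions shared by the AER Uncorrectable Error Status, Mask and
# Severity registers (assumed to match PCI_ERR_UNC_* in Linux pci_regs.h).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_aer_uncor(status, mask, severity):
    """Return (name, masked, fatal) for every known bit set in `status`.

    `masked` is True when the same bit is set in the mask register and
    `fatal` is True when it is set in the severity register.
    """
    rows = []
    for bit, name in sorted(AER_UNCOR_BITS.items()):
        flag = 1 << bit
        if status & flag:
            rows.append((name, bool(mask & flag), bool(severity & flag)))
    return rows


if __name__ == "__main__":
    # Values copied from the hardware error record in the console log.
    for name, masked, fatal in decode_aer_uncor(0x00004020, 0x00310000, 0x000EF030):
        print(f"{name}: masked={masked}, fatal={fatal}")
```

With the values from the log this reports Surprise Down Error and Completion Timeout, both unmasked and both marked fatal in the severity register. That would fit the device behind root port 0000:ae:00.0 dropping off the link, although this reading is an interpretation and not something the log states directly.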
I also ran the older kernel 5.14.0-339.el9, and it panics as well.
https://beaker.engineering.redhat.com/jobs/8161593

Keeping kernel 5.14.0-348.el9 and changing dpdk to dpdk-22.11-4.el9.x86_64.rpm, there is no panic.
https://beaker.engineering.redhat.com/jobs/8161622
https://beaker.engineering.redhat.com/jobs/8161624