Bug 2229745
| Summary: | Kernel panic when running the bbdev test on ACC100 card | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | liting <tli> |
| Component: | DPDK | Assignee: | Maxime Coquelin <maxime.coquelin> |
| DPDK sub component: | other | QA Contact: | liting <tli> |
| Status: | CLOSED EOL | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | ctrautma, fleitner, jhsiao, ktraynor |
| Version: | FDP 23.C | ||
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-10-08 17:49:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
The vmcore file is large; it can be downloaded with the following command:
wget http://netqe-bj.usersys.redhat.com/share/tli/vm_core/dell740_61/kernel_5_14_0_384_el9/127.0.0.1-2023-08-07-06:58:24/vmcore.tar.gz

These are the messages right before the panic:
[ 105.574907] VFIO - User Level meta-driver version: 0.3
[ 106.723168] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 106.723172] {1}[Hardware Error]: event severity: fatal
[ 106.723174] {1}[Hardware Error]: Error 0, type: fatal
[ 106.723176] {1}[Hardware Error]: section_type: PCIe error
[ 106.723177] {1}[Hardware Error]: port_type: 4, root port
[ 106.723178] {1}[Hardware Error]: version: 3.0
[ 106.723179] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ 106.723181] {1}[Hardware Error]: device_id: 0000:ae:00.0
[ 106.723183] {1}[Hardware Error]: slot: 4
[ 106.723183] {1}[Hardware Error]: secondary_bus: 0xaf
[ 106.723185] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2030
[ 106.723186] {1}[Hardware Error]: class_code: 060400
[ 106.723187] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 106.723188] {1}[Hardware Error]: aer_uncor_status: 0x00004020, aer_uncor_mask: 0x00310000
[ 106.723189] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 106.723190] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ 106.723193] Kernel panic - not syncing: Fatal hardware error!
[ 106.723194] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.14.0-348.el9.x86_64 #1
[ 106.723197] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.8.2 08/27/2020
[ 106.723198] Call Trace:
[ 106.723200] <NMI>
[ 106.723202] dump_stack_lvl+0x34/0x48
[ 106.723210] panic+0xea/0x2e4
[ 106.723217] __ghes_panic.cold+0x21/0x21
[ 106.723222] ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0
And that indicates a hardware issue. See this KCS:
https://access.redhat.com/solutions/2200441
I would suggest moving the ACC100 card to another server to see whether the problem moves with it.
Thanks,
fbl
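For reference, the aer_uncor_status value in the log above can be decoded bit by bit. The sketch below assumes the bit layout of the PCIe AER Uncorrectable Error Status register (the same layout the PCI_ERR_UNC_* constants in the Linux kernel headers follow). The function name decode_aer_uncor is illustrative only and is not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the bits set in an
# AER Uncorrectable Error Status value (hex or decimal).
decode_aer_uncor() {
    status=$(( $1 ))
    # Each line of the table below is "<bit number> <error name>".
    while read -r bit name; do
        if [ $(( status & (1 << bit) )) -ne 0 ]; then
            printf 'bit %s: %s\n' "$bit" "$name"
        fi
    done <<'EOF'
4 Data Link Protocol Error
5 Surprise Down Error
12 Poisoned TLP
13 Flow Control Protocol Error
14 Completion Timeout
15 Completer Abort
16 Unexpected Completion
17 Receiver Overflow
18 Malformed TLP
19 ECRC Error
20 Unsupported Request
21 ACS Violation
EOF
}

# Value reported by the root port 0000:ae:00.0 in the log above.
decode_aer_uncor 0x00004020
```

For 0x00004020 this reports bit 5 (Surprise Down Error) and bit 14 (Completion Timeout). That reading would be consistent with the device behind the root port dropping off the link, but it is only a decoding of the register, not a root-cause analysis.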
I also ran the older kernel 5.14.0-339.el9, and it also panics.
https://beaker.engineering.redhat.com/jobs/8161593

Keeping kernel 5.14.0-348.el9 and changing dpdk to dpdk-22.11-4.el9.x86_64.rpm, there is no panic.
https://beaker.engineering.redhat.com/jobs/8161622
https://beaker.engineering.redhat.com/jobs/8161624

Hi,

I have tried to reproduce this several times, but the panic did not occur for me.
It looks like a HW error; can you reproduce it on another system?

Also, not related to this crash, but:

(In reply to liting from comment #0)
> Steps to Reproduce:
> Run the bbdev test as following
> [root@dell-per740-61 ~]# modprobe vfio-pci
> [root@dell-per740-61 ~]# modprobe pci_pf_stub
> [root@dell-per740-61 ~]# modprobe vfio-pci enable_sriov=1

You modprobe vfio-pci twice, the second time passing a parameter.
Note that the parameter won't be taken into account, because the module is already loaded:

[root@wsfd-advnetlab03 ~]# modprobe vfio-pci
[root@wsfd-advnetlab03 ~]# modprobe vfio-pci enable_sriov=1
[root@wsfd-advnetlab03 ~]# cat /sys/module/vfio_pci/parameters/enable_sriov
N

Finally, I don't think we support pci_pf_stub anymore, and it conflicts with probing vfio-pci with SRIOV support.

Regards,
Maxime

(In reply to Maxime Coquelin from comment #7)
> Hi,
>
> I have tried to reproduce several times, but did not occur to me.
> It looks like a HW error, do you reproduce it on another system?
>
> Also, not related to this crash, but:
>
> (In reply to liting from comment #0)
> > Steps to Reproduce:
> > Run the bbdev test as following
> > [root@dell-per740-61 ~]# modprobe vfio-pci
> > [root@dell-per740-61 ~]# modprobe pci_pf_stub
> > [root@dell-per740-61 ~]# modprobe vfio-pci enable_sriov=1
>
> You try to modprobe vfio-pci twice, and the second time with passing a
> parameter.
> Note that it won't be taken into account:
>
> [root@wsfd-advnetlab03 ~]# modprobe vfio-pci
> [root@wsfd-advnetlab03 ~]# modprobe vfio-pci enable_sriov=1
> [root@wsfd-advnetlab03 ~]# cat /sys/module/vfio_pci/parameters/enable_sriov
> N
>
> Finally, I don't think we support pci_pf_stub anymore, and it contradict
> with probing vfio-pci with SRIOV support.
>
> Regards,
> Maxime

Hi Maxime,
I moved the ACC100 card to another server (740-55), and it still panics.
Thanks

https://beaker.engineering.redhat.com/jobs/8237985
console log:
https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/08/82379/8237985/14503609/console.log
[ 172.793363] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 172.793366] {1}[Hardware Error]: event severity: fatal
[ 172.793368] {1}[Hardware Error]: Error 0, type: fatal
[ 172.793369] {1}[Hardware Error]: section_type: PCIe error
[ 172.793371] {1}[Hardware Error]: port_type: 4, root port
[ 172.793372] {1}[Hardware Error]: version: 3.0
[ 172.793373] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[ 172.793374] {1}[Hardware Error]: device_id: 0000:3a:00.0
[ 172.793376] {1}[Hardware Error]: slot: 1
[ 172.793376] {1}[Hardware Error]: secondary_bus: 0x3b
[ 172.793377] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2030
[ 172.793378] {1}[Hardware Error]: class_code: 060400
[ 172.793379] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 172.793381] {1}[Hardware Error]: aer_uncor_status: 0x00004020, aer_uncor_mask: 0x00310000
[ 172.793382] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 172.793383] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ 172.793386] Kernel panic - not syncing: Fatal hardware error!
[ 172.793387] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.14.0-361.el9.x86_64 #1
[ 172.793389] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.8.2 08/27/2020
[ 172.793390] Call Trace:
[ 172.793391] <NMI>
[ 172.793393] dump_stack_lvl+0x34/0x48
[ 172.793401] panic+0xea/0x2e4
[ 172.793408] __ghes_panic.cold+0x21/0x21
[ 172.793416] ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0
[ 172.793422] ghes_notify_nmi+0x59/0xd0
[ 172.793424] nmi_handle+0x5b/0x120
[ 172.793429] default_do_nmi+0x40/0x130
[ 172.793434] exc_nmi+0x111/0x140
[ 172.793437] end_repeat_nmi+0x16/0x67
[ 172.793440] RIP: 0010:intel_idle+0x55/0xa0
[ 172.793444] Code: 48 89 d1 65 48 8b 04 25 c0 11 03 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 32 f7 4b 00 b9 01 00 00 00 48 89 f0 0f 01 c9 <65> 48 8b 04 25 c0 11 03 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[ 172.793446] RSP: 0018:ffffffffb8403e50 EFLAGS: 00000046
[ 172.793448] RAX: 0000000000000020 RBX: 0000000000000003 RCX: 0000000000000001
[ 172.793449] RDX: 0000000000000000 RSI: 0000000000000020 RDI: ffffd1117fc00a98
[ 172.793450] RBP: ffffd1117fc00a98 R08: 0000000000000003 R09: 0000000000000001
[ 172.793451] R10: 000000000000baa2 R11: 0000000000008e67 R12: ffffffffb88b8f00
[ 172.793452] R13: ffffffffb88b9050 R14: 0000000000000003 R15: 0000000000000000
[ 172.793455] ? intel_idle+0x55/0xa0
[ 172.793456] ? intel_idle+0x55/0xa0
[ 172.793458] </NMI>
[ 172.793458] <TASK>
[ 172.793459] cpuidle_enter_state+0x81/0x42a
[ 172.793461] cpuidle_enter+0x29/0x40
[ 172.793465] cpuidle_idle_call+0xfa/0x160
[ 172.793470] do_idle+0x78/0xe0
[ 172.793472] cpu_startup_entry+0x19/0x20
[ 172.793475] rest_init+0xca/0xd0
[ 172.793477] arch_call_rest_init+0xa/0x24
[ 172.793482] start_kernel+0x4a3/0x4c2
[ 172.793483] secondary_startup_64_no_verify+0xe5/0xeb
[ 172.793488] </TASK>

The panic still occurs on RHEL 9.4 beta.
rhel9.4 beta dpdk23.11
https://beaker.engineering.redhat.com/jobs/8996447
rhel9.4 beta dpdk22.11
https://beaker.engineering.redhat.com/jobs/9001034

This bug did not meet the criteria for automatic migration and is being closed.
If the issue remains, please open a new ticket in https://issues.redhat.com/browse/FDP
Description of problem:
Kernel panic when running the bbdev test on ACC100 card

Version-Release number of selected component (if applicable):
dpdk-22.11-3.el9_2.x86_64.rpm
pf-bb-config-22.11-3.el9.x86_64.rpm

How reproducible:

Steps to Reproduce:
Run the bbdev test as follows:
[root@dell-per740-61 ~]# modprobe vfio-pci
[root@dell-per740-61 ~]# modprobe pci_pf_stub
[root@dell-per740-61 ~]# modprobe vfio-pci enable_sriov=1
[root@dell-per740-61 ~]# lspci|grep accelerators
af:00.0 Processing accelerators: Intel Corporation Device 0d5c
[root@dell-per740-61 ~]# lspci -Dd 8086:0d5c | cut -d ' ' -f 1
0000:af:00.0
[root@dell-per740-61 ~]# dpdk-devbind.py -b vfio-pci 0000:af:00.0

Actual results:
The kernel panics after running "dpdk-devbind.py -b vfio-pci 0000:af:00.0". It does not always reproduce on this system, but it has occurred about 4 times.
call trace:
[ 106.723198] Call Trace:
[ 106.723200] <NMI>
[ 106.723202] dump_stack_lvl+0x34/0x48
[ 106.723210] panic+0xea/0x2e4
[ 106.723217] __ghes_panic.cold+0x21/0x21
[ 106.723222] ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0
[ 106.723227] ghes_notify_nmi+0x59/0xd0
[ 106.723229] nmi_handle+0x5b/0x120
[ 106.723236] default_do_nmi+0x40/0x130
[ 106.723240] exc_nmi+0x111/0x140
[ 106.723242] end_repeat_nmi+0x16/0x67
[ 106.723249] RIP: 0010:intel_idle+0x55/0xa0
[ 106.723254] Code: 48 89 d1 65 48 8b 04 25 c0 11 03 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d c2 13 4c 00 b9 01 00 00 00 48 89 f0 0f 01 c9 <65> 48 8b 04 25 c0 11 03 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[ 106.723256] RSP: 0018:ffffffffb8403e50 EFLAGS: 00000046
[ 106.723259] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001
[ 106.723260] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffe1427fa00ca8
[ 106.723261] RBP: ffffe1427fa00ca8 R08: 0000000000000002 R09: 0000000000000008
[ 106.723262] R10: 00000000000003da R11: 00000000000003d8 R12: ffffffffb88b8d40
[ 106.723263] R13: ffffffffb88b8e28 R14: 0000000000000002 R15: 0000000000000000
[ 106.723265] ? intel_idle+0x55/0xa0
[ 106.723268] ? intel_idle+0x55/0xa0
[ 106.723270] </NMI>
[ 106.723271] <TASK>
[ 106.723271] cpuidle_enter_state+0x81/0x42a
[ 106.723274] cpuidle_enter+0x29/0x40
[ 106.723279] cpuidle_idle_call+0xfa/0x160
[ 106.723284] do_idle+0x78/0xe0
[ 106.723286] cpu_startup_entry+0x19/0x20
[ 106.723288] rest_init+0xca/0xd0
[ 106.723291] arch_call_rest_init+0xa/0x24
[ 106.723298] start_kernel+0x4a3/0x4c2
[ 106.723300] secondary_startup_64_no_verify+0xe5/0xeb
[ 106.723307] </TASK>
[ 0.000000] Linux version 5.14.0-348.el9.x86_64 (mockbuild.eng.bos.redhat.com) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Mon Jul 31 18:52:45 EDT 2023

Expected results:
No kernel panic.

Additional info:
kernel panic job:
https://beaker.engineering.redhat.com/jobs/8159304
console log:
https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/08/81593/8159304/14376283/console.log
It occurred on the following kernels:
5.14.0-348.el9.x86_64
5.14.0-284.17.1.el9.x86_64
5.14.0-284.26.1.el9.x86_64
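On the module-parameter point raised in the comments: running modprobe on a module that is already loaded does not apply new parameters, so the live value has to be read back from sysfs. The following is a minimal sketch of such a check, assuming the standard /sys/module/<module>/parameters/<name> layout. The function name module_param and the SYSFS_MODULE_DIR override are hypothetical and exist only for illustration and testing.

```shell
#!/bin/sh
# Hypothetical helper: print the live value of a kernel module parameter.
# Returns non-zero if the module is not loaded or the parameter is unreadable.
# SYSFS_MODULE_DIR is an illustrative override; it defaults to /sys/module.
module_param() {
    # sysfs directory names use underscores: vfio-pci -> vfio_pci
    mod=$(printf '%s' "$1" | tr '-' '_')
    file="${SYSFS_MODULE_DIR:-/sys/module}/$mod/parameters/$2"
    [ -r "$file" ] || return 1
    cat "$file"
}

# Warn if SR-IOV support is not actually enabled in the loaded vfio-pci module.
if [ "$(module_param vfio-pci enable_sriov)" != "Y" ]; then
    echo "vfio-pci enable_sriov is not active"
    echo "try: modprobe -r vfio-pci && modprobe vfio-pci enable_sriov=1"
fi
```

Unloading with modprobe -r only works when no device is bound to vfio-pci, so a check like this is best run before dpdk-devbind.py binds the ACC100.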