Bug 2229745

Summary: Kernel panic when running the bbdev test on ACC100 card
Product: Red Hat Enterprise Linux Fast Datapath Reporter: liting <tli>
Component: DPDK Assignee: Maxime Coquelin <maxime.coquelin>
DPDK sub component: other QA Contact: liting <tli>
Status: NEW --- Docs Contact:
Severity: unspecified    
Priority: unspecified CC: ctrautma, fleitner, jhsiao, ktraynor
Version: FDP 23.C   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description liting 2023-08-07 14:10:20 UTC
Description of problem:
Kernel panic when running the bbdev test on ACC100 card

Version-Release number of selected component (if applicable):
dpdk-22.11-3.el9_2.x86_64.rpm
pf-bb-config-22.11-3.el9.x86_64.rpm

How reproducible:


Steps to Reproduce:
Run the bbdev test as follows:
[root@dell-per740-61 ~]# modprobe vfio-pci
[root@dell-per740-61 ~]# modprobe pci_pf_stub
[root@dell-per740-61 ~]# modprobe vfio-pci enable_sriov=1
[root@dell-per740-61 ~]# lspci|grep accelerators
af:00.0 Processing accelerators: Intel Corporation Device 0d5c
[root@dell-per740-61 ~]# lspci -Dd 8086:0d5c | cut -d ' ' -f 1
0000:af:00.0
[root@dell-per740-61 ~]# dpdk-devbind.py -b vfio-pci 0000:af:00.0
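
Since the panic follows the bind step, it may help to record which driver owns the device immediately before binding. A minimal sketch in Python, assuming the standard Linux sysfs layout where /sys/bus/pci/devices/<BDF>/driver is a symlink to the bound driver. The function name current_pci_driver and its sysfs_root parameter are hypothetical, introduced only for illustration:

```python
import os


def current_pci_driver(bdf, sysfs_root="/sys"):
    """Return the name of the driver bound to a PCI device, or None if unbound.

    bdf is the full address as printed by `lspci -D`, e.g. "0000:af:00.0".
    sysfs_root is a hypothetical parameter so the lookup can be pointed at a
    copy of the sysfs tree; on a live system leave it at "/sys".
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "driver")
    # The "driver" entry only exists (as a symlink) while a driver is bound.
    if not os.path.islink(link):
        return None
    # The symlink target ends with the driver name, e.g. ".../drivers/vfio-pci".
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Address of the ACC100 device from the steps above.
    print(current_pci_driver("0000:af:00.0"))
```

Running this before and after dpdk-devbind.py would show whether the device was unbound or attached to another driver when the bind was issued.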

Actual results:
The kernel panics after running "dpdk-devbind.py -b vfio-pci 0000:af:00.0". It does not always reproduce on this system, but it has occurred about 4 times.
Call trace:
[  106.723198] Call Trace: 
[  106.723200]  <NMI> 
[  106.723202]  dump_stack_lvl+0x34/0x48 
[  106.723210]  panic+0xea/0x2e4 
[  106.723217]  __ghes_panic.cold+0x21/0x21 
[  106.723222]  ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0 
[  106.723227]  ghes_notify_nmi+0x59/0xd0 
[  106.723229]  nmi_handle+0x5b/0x120 
[  106.723236]  default_do_nmi+0x40/0x130 
[  106.723240]  exc_nmi+0x111/0x140 
[  106.723242]  end_repeat_nmi+0x16/0x67 
[  106.723249] RIP: 0010:intel_idle+0x55/0xa0 
[  106.723254] Code: 48 89 d1 65 48 8b 04 25 c0 11 03 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d c2 13 4c 00 b9 01 00 00 00 48 89 f0 0f 01 c9 <65> 48 8b 04 25 c0 11 03 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b 
[  106.723256] RSP: 0018:ffffffffb8403e50 EFLAGS: 00000046 
[  106.723259] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001 
[  106.723260] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffe1427fa00ca8 
[  106.723261] RBP: ffffe1427fa00ca8 R08: 0000000000000002 R09: 0000000000000008 
[  106.723262] R10: 00000000000003da R11: 00000000000003d8 R12: ffffffffb88b8d40 
[  106.723263] R13: ffffffffb88b8e28 R14: 0000000000000002 R15: 0000000000000000 
[  106.723265]  ? intel_idle+0x55/0xa0 
[  106.723268]  ? intel_idle+0x55/0xa0 
[  106.723270]  </NMI> 
[  106.723271]  <TASK> 
[  106.723271]  cpuidle_enter_state+0x81/0x42a 
[  106.723274]  cpuidle_enter+0x29/0x40 
[  106.723279]  cpuidle_idle_call+0xfa/0x160 
[  106.723284]  do_idle+0x78/0xe0 
[  106.723286]  cpu_startup_entry+0x19/0x20 
[  106.723288]  rest_init+0xca/0xd0 
[  106.723291]  arch_call_rest_init+0xa/0x24 
[  106.723298]  start_kernel+0x4a3/0x4c2 
[  106.723300]  secondary_startup_64_no_verify+0xe5/0xeb 
[  106.723307]  </TASK> 
[    0.000000] Linux version 5.14.0-348.el9.x86_64 (mockbuild.eng.bos.redhat.com) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Mon Jul 31 18:52:45 EDT 2023 


Expected results:
No kernel panic.

Additional info:
kernel panic job:
https://beaker.engineering.redhat.com/jobs/8159304
console log:
https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/08/81593/8159304/14376283/console.log

It occurred on the following kernels:
5.14.0-348.el9.x86_64
5.14.0-284.17.1.el9.x86_64
5.14.0-284.26.1.el9.x86_64

Comment 3 liting 2023-08-07 14:21:10 UTC
The vmcore file is large; it can be downloaded with the following command:
wget http://netqe-bj.usersys.redhat.com/share/tli/vm_core/dell740_61/kernel_5_14_0_384_el9/127.0.0.1-2023-08-07-06:58:24/vmcore.tar.gz

Comment 4 Flavio Leitner 2023-08-07 17:23:14 UTC
These are the messages right before the panic:
[  105.574907] VFIO - User Level meta-driver version: 0.3 
[  106.723168] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5 
[  106.723172] {1}[Hardware Error]: event severity: fatal 
[  106.723174] {1}[Hardware Error]:  Error 0, type: fatal 
[  106.723176] {1}[Hardware Error]:   section_type: PCIe error 
[  106.723177] {1}[Hardware Error]:   port_type: 4, root port 
[  106.723178] {1}[Hardware Error]:   version: 3.0 
[  106.723179] {1}[Hardware Error]:   command: 0x0547, status: 0x4010 
[  106.723181] {1}[Hardware Error]:   device_id: 0000:ae:00.0 
[  106.723183] {1}[Hardware Error]:   slot: 4 
[  106.723183] {1}[Hardware Error]:   secondary_bus: 0xaf 
[  106.723185] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2030 
[  106.723186] {1}[Hardware Error]:   class_code: 060400 
[  106.723187] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003 
[  106.723188] {1}[Hardware Error]:   aer_uncor_status: 0x00004020, aer_uncor_mask: 0x00310000 
[  106.723189] {1}[Hardware Error]:   aer_uncor_severity: 0x000ef030 
[  106.723190] {1}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000 
[  106.723193] Kernel panic - not syncing: Fatal hardware error! 
[  106.723194] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.14.0-348.el9.x86_64 #1 
[  106.723197] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.8.2 08/27/2020 
[  106.723198] Call Trace: 
[  106.723200]  <NMI> 
[  106.723202]  dump_stack_lvl+0x34/0x48 
[  106.723210]  panic+0xea/0x2e4 
[  106.723217]  __ghes_panic.cold+0x21/0x21 
[  106.723222]  ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0 



And that indicates a hardware issue. See this KCS:
https://access.redhat.com/solutions/2200441
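
The AER fields in the log above can be decoded bit by bit to see which errors the root port reported. A minimal sketch in Python, assuming the bit layout of the Uncorrectable Error Status, Mask and Severity registers from the PCI Express specification. The table below covers only bits 4, 5 and 12 to 22, and the helper names are hypothetical:

```python
# Bit positions in the PCIe AER Uncorrectable Error Status/Mask/Severity
# registers (assumed from the PCI Express specification; partial table).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP Received",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncor(value):
    """Return the names of all known bits set in an AER uncorrectable register."""
    return [name for bit, name in sorted(AER_UNCOR_BITS.items())
            if value & (1 << bit)]


def classify(status, mask, severity):
    """For each error set in status, return (name, is_masked, is_fatal)."""
    result = []
    for bit, name in sorted(AER_UNCOR_BITS.items()):
        if status & (1 << bit):
            result.append((name,
                           bool(mask & (1 << bit)),
                           bool(severity & (1 << bit))))
    return result


# Values copied from the [Hardware Error] lines in the log.
status = 0x00004020
mask = 0x00310000
severity = 0x000EF030

for name, masked, fatal in classify(status, mask, severity):
    print(f"{name}: masked={masked} fatal={fatal}")
```

Under that assumed layout, the status value 0x00004020 has bits 5 and 14 set (Surprise Down Error and Completion Timeout). Neither bit is set in the mask, and both are set in the severity register, which matches the "fatal" severity in the log. The reporting root port 0000:ae:00.0 has secondary_bus 0xaf, which is the bus the ACC100 (0000:af:00.0) sits on.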

I would suggest moving the ACC100 card to another server to see whether the problem moves with it.

Thanks,
fbl

Comment 5 liting 2023-08-08 01:29:55 UTC
I also ran the older kernel 5.14.0-339.el9; it also panics.
https://beaker.engineering.redhat.com/jobs/8161593
Keeping kernel 5.14.0-348.el9 and changing dpdk to dpdk-22.11-4.el9.x86_64.rpm, there is no panic.
https://beaker.engineering.redhat.com/jobs/8161622
https://beaker.engineering.redhat.com/jobs/8161624