Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under FDP project in Jira. Thanks.

Bug 2229745

Summary: Kernel panic when running the bbdev test on ACC100 card
Product: Red Hat Enterprise Linux Fast Datapath
Component: DPDK
DPDK sub component: other
Version: FDP 23.C
Status: CLOSED EOL
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: liting <tli>
Assignee: Maxime Coquelin <maxime.coquelin>
QA Contact: liting <tli>
CC: ctrautma, fleitner, jhsiao, ktraynor
Type: Bug
Last Closed: 2024-10-08 17:49:14 UTC

Description liting 2023-08-07 14:10:20 UTC
Description of problem:
Kernel panic when running the bbdev test on ACC100 card

Version-Release number of selected component (if applicable):
dpdk-22.11-3.el9_2.x86_64.rpm
pf-bb-config-22.11-3.el9.x86_64.rpm

How reproducible:


Steps to Reproduce:
Run the bbdev test as follows:
[root@dell-per740-61 ~]# modprobe vfio-pci
[root@dell-per740-61 ~]# modprobe pci_pf_stub
[root@dell-per740-61 ~]# modprobe vfio-pci enable_sriov=1
[root@dell-per740-61 ~]# lspci|grep accelerators
af:00.0 Processing accelerators: Intel Corporation Device 0d5c
[root@dell-per740-61 ~]# lspci -Dd 8086:0d5c | cut -d ' ' -f 1
0000:af:00.0
[root@dell-per740-61 ~]# dpdk-devbind.py -b vfio-pci 0000:af:00.0

Actual results:
The kernel panics after running "dpdk-devbind.py -b vfio-pci 0000:af:00.0". It does not always reproduce on this system, but it has occurred about 4 times.
Call trace:
[  106.723198] Call Trace: 
[  106.723200]  <NMI> 
[  106.723202]  dump_stack_lvl+0x34/0x48 
[  106.723210]  panic+0xea/0x2e4 
[  106.723217]  __ghes_panic.cold+0x21/0x21 
[  106.723222]  ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0 
[  106.723227]  ghes_notify_nmi+0x59/0xd0 
[  106.723229]  nmi_handle+0x5b/0x120 
[  106.723236]  default_do_nmi+0x40/0x130 
[  106.723240]  exc_nmi+0x111/0x140 
[  106.723242]  end_repeat_nmi+0x16/0x67 
[  106.723249] RIP: 0010:intel_idle+0x55/0xa0 
[  106.723254] Code: 48 89 d1 65 48 8b 04 25 c0 11 03 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d c2 13 4c 00 b9 01 00 00 00 48 89 f0 0f 01 c9 <65> 48 8b 04 25 c0 11 03 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b 
[  106.723256] RSP: 0018:ffffffffb8403e50 EFLAGS: 00000046 
[  106.723259] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001 
[  106.723260] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffe1427fa00ca8 
[  106.723261] RBP: ffffe1427fa00ca8 R08: 0000000000000002 R09: 0000000000000008 
[  106.723262] R10: 00000000000003da R11: 00000000000003d8 R12: ffffffffb88b8d40 
[  106.723263] R13: ffffffffb88b8e28 R14: 0000000000000002 R15: 0000000000000000 
[  106.723265]  ? intel_idle+0x55/0xa0 
[  106.723268]  ? intel_idle+0x55/0xa0 
[  106.723270]  </NMI> 
[  106.723271]  <TASK> 
[  106.723271]  cpuidle_enter_state+0x81/0x42a 
[  106.723274]  cpuidle_enter+0x29/0x40 
[  106.723279]  cpuidle_idle_call+0xfa/0x160 
[  106.723284]  do_idle+0x78/0xe0 
[  106.723286]  cpu_startup_entry+0x19/0x20 
[  106.723288]  rest_init+0xca/0xd0 
[  106.723291]  arch_call_rest_init+0xa/0x24 
[  106.723298]  start_kernel+0x4a3/0x4c2 
[  106.723300]  secondary_startup_64_no_verify+0xe5/0xeb 
[  106.723307]  </TASK> 
[    0.000000] Linux version 5.14.0-348.el9.x86_64 (mockbuild.eng.bos.redhat.com) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Mon Jul 31 18:52:45 EDT 2023 


Expected results:
No kernel panic.

Additional info:
Kernel panic job:
https://beaker.engineering.redhat.com/jobs/8159304
Console log:
https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/08/81593/8159304/14376283/console.log

It occurred on the following kernels:
5.14.0-348.el9.x86_64
5.14.0-284.17.1.el9.x86_64
5.14.0-284.26.1.el9.x86_64

Comment 3 liting 2023-08-07 14:21:10 UTC
The vmcore file is large; it can be downloaded with the following command:
wget http://netqe-bj.usersys.redhat.com/share/tli/vm_core/dell740_61/kernel_5_14_0_384_el9/127.0.0.1-2023-08-07-06:58:24/vmcore.tar.gz

Comment 4 Flavio Leitner 2023-08-07 17:23:14 UTC
These are the messages right before the panic:
[  105.574907] VFIO - User Level meta-driver version: 0.3 
[  106.723168] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5 
[  106.723172] {1}[Hardware Error]: event severity: fatal 
[  106.723174] {1}[Hardware Error]:  Error 0, type: fatal 
[  106.723176] {1}[Hardware Error]:   section_type: PCIe error 
[  106.723177] {1}[Hardware Error]:   port_type: 4, root port 
[  106.723178] {1}[Hardware Error]:   version: 3.0 
[  106.723179] {1}[Hardware Error]:   command: 0x0547, status: 0x4010 
[  106.723181] {1}[Hardware Error]:   device_id: 0000:ae:00.0 
[  106.723183] {1}[Hardware Error]:   slot: 4 
[  106.723183] {1}[Hardware Error]:   secondary_bus: 0xaf 
[  106.723185] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2030 
[  106.723186] {1}[Hardware Error]:   class_code: 060400 
[  106.723187] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003 
[  106.723188] {1}[Hardware Error]:   aer_uncor_status: 0x00004020, aer_uncor_mask: 0x00310000 
[  106.723189] {1}[Hardware Error]:   aer_uncor_severity: 0x000ef030 
[  106.723190] {1}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000 
[  106.723193] Kernel panic - not syncing: Fatal hardware error! 
[  106.723194] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.14.0-348.el9.x86_64 #1 
[  106.723197] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.8.2 08/27/2020 
[  106.723198] Call Trace: 
[  106.723200]  <NMI> 
[  106.723202]  dump_stack_lvl+0x34/0x48 
[  106.723210]  panic+0xea/0x2e4 
[  106.723217]  __ghes_panic.cold+0x21/0x21 
[  106.723222]  ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0 

That indicates a hardware issue. See this KCS:
https://access.redhat.com/solutions/2200441

I suggest moving the ACC100 card to another server to see whether the problem moves with it.

Thanks,
fbl
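
The aer_uncor_status and aer_uncor_mask values in the log above can be broken into named bits, which may help when comparing panics across servers. The following is a minimal sketch in POSIX shell, assuming the standard PCIe AER uncorrectable error status bit layout (the same layout as the PCI_ERR_UNC_* masks in the Linux kernel headers). The function name decode_aer_uncor is hypothetical, and only the commonly used bits are listed.

```shell
#!/bin/sh
# Decode a PCIe AER Uncorrectable Error Status (or Mask/Severity) value
# into bit names. decode_aer_uncor is a hypothetical helper name; the bit
# table assumes the standard AER layout and lists only the common bits.
decode_aer_uncor() {
    value=$(( $1 ))
    names=""
    for entry in \
        "4:Data Link Protocol Error" \
        "5:Surprise Down Error" \
        "12:Poisoned TLP" \
        "13:Flow Control Protocol Error" \
        "14:Completion Timeout" \
        "15:Completer Abort" \
        "16:Unexpected Completion" \
        "17:Receiver Overflow" \
        "18:Malformed TLP" \
        "19:ECRC Error" \
        "20:Unsupported Request" \
        "21:ACS Violation"
    do
        bit=${entry%%:*}     # text before the first colon
        name=${entry#*:}     # text after the first colon
        if [ $(( value & (1 << bit) )) -ne 0 ]; then
            names="${names:+$names, }$name"
        fi
    done
    printf '%s\n' "${names:-none}"
}

decode_aer_uncor 0x00004020   # aer_uncor_status from the log
decode_aer_uncor 0x00310000   # aer_uncor_mask from the log
```

Under that assumption, the logged status 0x00004020 decodes to Surprise Down Error and Completion Timeout. Both bits are also set in aer_uncor_severity (0x000ef030), which is consistent with the "fatal" severity reported by GHES.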

Comment 5 liting 2023-08-08 01:29:55 UTC
I also ran the older kernel 5.14.0-339.el9, and it panics as well.
https://beaker.engineering.redhat.com/jobs/8161593
Keeping kernel 5.14.0-348.el9 and changing DPDK to dpdk-22.11-4.el9.x86_64.rpm, there was no panic.
https://beaker.engineering.redhat.com/jobs/8161622
https://beaker.engineering.redhat.com/jobs/8161624

Comment 7 Maxime Coquelin 2023-08-25 14:34:41 UTC
Hi,

I have tried to reproduce it several times, but it did not occur for me.
It looks like a HW error; can you reproduce it on another system?

Also, not related to this crash, but:

(In reply to liting from comment #0)
> Steps to Reproduce:
> Run the bbdev test as following
> [root@dell-per740-61 ~]# modprobe vfio-pci
>  [root@dell-per740-61 ~]# modprobe  pci_pf_stub
>  [root@dell-per740-61 ~]# modprobe vfio-pci enable_sriov=1

You modprobe vfio-pci twice, the second time passing a parameter.
Note that the parameter won't be taken into account, because the module is already loaded:

[root@wsfd-advnetlab03 ~]# modprobe vfio-pci
[root@wsfd-advnetlab03 ~]# modprobe vfio-pci enable_sriov=1
[root@wsfd-advnetlab03 ~]# cat /sys/module/vfio_pci/parameters/enable_sriov
N

Finally, I don't think we support pci_pf_stub anymore, and it conflicts with probing vfio-pci with SR-IOV support.

Regards,
Maxime
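
The check shown in this comment can be wrapped in a small helper that is run before binding the device. This is a minimal sketch in POSIX shell: the function name vfio_sriov_state and its optional path argument are hypothetical, and the sysfs path is the one used in the comment above.

```shell
#!/bin/sh
# Report whether vfio-pci's enable_sriov parameter is actually in effect.
# vfio_sriov_state is a hypothetical helper; the optional argument lets the
# sysfs parameter path be overridden (for example, when testing).
vfio_sriov_state() {
    param=${1:-/sys/module/vfio_pci/parameters/enable_sriov}
    if [ ! -r "$param" ]; then
        # Module not loaded, or the parameter is not exposed on this kernel.
        echo "vfio-pci not loaded"
        return 0
    fi
    case "$(cat "$param")" in
        Y|1) echo "enable_sriov is on" ;;
        *)   echo "enable_sriov is off" ;;
    esac
}

vfio_sriov_state
# If this reports "off", one option is to unload the module and load it
# again with the parameter (this fails while devices are still bound):
#   modprobe -r vfio-pci && modprobe vfio-pci enable_sriov=1
```

Loading the module once, with the parameter on the first modprobe, avoids the situation described above where the second modprobe is silently ignored.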

Comment 8 liting 2023-08-29 02:55:55 UTC
(In reply to Maxime Coquelin from comment #7)

Hi Maxime,

I moved the ACC100 card to another server (740-55), and it still panics. Thanks.
https://beaker.engineering.redhat.com/jobs/8237985
Console log:
https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/08/82379/8237985/14503609/console.log
[  172.793363] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5 
[  172.793366] {1}[Hardware Error]: event severity: fatal 
[  172.793368] {1}[Hardware Error]:  Error 0, type: fatal 
[  172.793369] {1}[Hardware Error]:   section_type: PCIe error 
[  172.793371] {1}[Hardware Error]:   port_type: 4, root port 
[  172.793372] {1}[Hardware Error]:   version: 3.0 
[  172.793373] {1}[Hardware Error]:   command: 0x0547, status: 0x4010 
[  172.793374] {1}[Hardware Error]:   device_id: 0000:3a:00.0 
[  172.793376] {1}[Hardware Error]:   slot: 1 
[  172.793376] {1}[Hardware Error]:   secondary_bus: 0x3b 
[  172.793377] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2030 
[  172.793378] {1}[Hardware Error]:   class_code: 060400 
[  172.793379] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003 
[  172.793381] {1}[Hardware Error]:   aer_uncor_status: 0x00004020, aer_uncor_mask: 0x00310000 
[  172.793382] {1}[Hardware Error]:   aer_uncor_severity: 0x000ef030 
[  172.793383] {1}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000 
[  172.793386] Kernel panic - not syncing: Fatal hardware error! 
[  172.793387] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.14.0-361.el9.x86_64 #1 
[  172.793389] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.8.2 08/27/2020 
[  172.793390] Call Trace: 
[  172.793391]  <NMI> 
[  172.793393]  dump_stack_lvl+0x34/0x48 
[  172.793401]  panic+0xea/0x2e4 
[  172.793408]  __ghes_panic.cold+0x21/0x21 
[  172.793416]  ghes_in_nmi_queue_one_entry.constprop.0+0x1d9/0x2a0 
[  172.793422]  ghes_notify_nmi+0x59/0xd0 
[  172.793424]  nmi_handle+0x5b/0x120 
[  172.793429]  default_do_nmi+0x40/0x130 
[  172.793434]  exc_nmi+0x111/0x140 
[  172.793437]  end_repeat_nmi+0x16/0x67 
[  172.793440] RIP: 0010:intel_idle+0x55/0xa0 
[  172.793444] Code: 48 89 d1 65 48 8b 04 25 c0 11 03 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 32 f7 4b 00 b9 01 00 00 00 48 89 f0 0f 01 c9 <65> 48 8b 04 25 c0 11 03 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b 
[  172.793446] RSP: 0018:ffffffffb8403e50 EFLAGS: 00000046 
[  172.793448] RAX: 0000000000000020 RBX: 0000000000000003 RCX: 0000000000000001 
[  172.793449] RDX: 0000000000000000 RSI: 0000000000000020 RDI: ffffd1117fc00a98 
[  172.793450] RBP: ffffd1117fc00a98 R08: 0000000000000003 R09: 0000000000000001 
[  172.793451] R10: 000000000000baa2 R11: 0000000000008e67 R12: ffffffffb88b8f00 
[  172.793452] R13: ffffffffb88b9050 R14: 0000000000000003 R15: 0000000000000000 
[  172.793455]  ? intel_idle+0x55/0xa0 
[  172.793456]  ? intel_idle+0x55/0xa0 
[  172.793458]  </NMI> 
[  172.793458]  <TASK> 
[  172.793459]  cpuidle_enter_state+0x81/0x42a 
[  172.793461]  cpuidle_enter+0x29/0x40 
[  172.793465]  cpuidle_idle_call+0xfa/0x160 
[  172.793470]  do_idle+0x78/0xe0 
[  172.793472]  cpu_startup_entry+0x19/0x20 
[  172.793475]  rest_init+0xca/0xd0 
[  172.793477]  arch_call_rest_init+0xa/0x24 
[  172.793482]  start_kernel+0x4a3/0x4c2 
[  172.793483]  secondary_startup_64_no_verify+0xe5/0xeb 
[  172.793488]  </TASK>

Comment 11 liting 2024-03-07 03:00:02 UTC
It still panics on RHEL 9.4 beta.
RHEL 9.4 beta, DPDK 23.11:
https://beaker.engineering.redhat.com/jobs/8996447
RHEL 9.4 beta, DPDK 22.11:
https://beaker.engineering.redhat.com/jobs/9001034

Comment 12 ovs-bot 2024-10-08 17:49:14 UTC
This bug did not meet the criteria for automatic migration and is being closed.
If the issue remains, please open a new ticket in https://issues.redhat.com/browse/FDP