
Bug 1971174

Summary: [RHEL8.5] fabtests on BNXT ROCE device of BCM57508 produce core files
Product: Red Hat Enterprise Linux 8
Reporter: Brian Chae <bchae>
Component: rdma-core
Assignee: Nobody <nobody>
Status: CLOSED WONTFIX
QA Contact: Brian Chae <bchae>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.5
CC: hwkernel-mgr, rdma-dev-team, selvin.xavier, sxavier, tmichael
Target Milestone: beta
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2014054 (view as bug list)
Environment:
Last Closed: 2022-12-12 07:27:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1969483, 2014054

Description Brian Chae 2021-06-12 17:07:00 UTC
Description of problem:

Running the "fabtests" suite on a BCM57508 BNXT ROCE device produces core files.

However, another BNXT ROCE device, the BCM57414, DOES NOT produce core files when running the same "fabtests" suite.

Also, on RHEL8.4, the same "fabtests" suite run on the same BCM57508 BNXT ROCE device DOES NOT produce core files. So this appears to be a REGRESSION.


Version-Release number of selected component (if applicable):


DISTRO=RHEL-8.5.0-20210610.n.0

+ [21-06-10 16:57:57] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.5 Beta (Ootpa)

+ [21-06-10 16:57:57] uname -a
Linux rdma-dev-25.lab.bos.redhat.com 4.18.0-312.el8.x86_64 #1 SMP Wed Jun 2 16:30:46 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [21-06-10 16:57:57] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-312.el8.x86_64 root=/dev/mapper/rhel_rdma--dev--25-root ro intel_idle.max_cstate=0 intremap=no_x2apic_optout processor.max_cstate=0 console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=/dev/mapper/rhel_rdma--dev--25-swap rd.lvm.lv=rhel_rdma-dev-25/root rd.lvm.lv=rhel_rdma-dev-25/swap console=ttyS1,115200n81

+ [21-06-10 16:57:57] rpm -q rdma-core linux-firmware
rdma-core-35.0-1.el8.x86_64
linux-firmware-20201218-102.git05789708.el8.noarch

+ [21-06-10 16:57:57] tail /sys/class/infiniband/bnxt_re0/fw_ver /sys/class/infiniband/bnxt_re1/fw_ver
==> /sys/class/infiniband/bnxt_re0/fw_ver <==
216.4.59.0

==> /sys/class/infiniband/bnxt_re1/fw_ver <==
216.4.59.0
+ [21-06-10 16:57:57] lspci
+ [21-06-10 16:57:57] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
05:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
05:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)



+ [21-06-10 16:58:00] rpm -q fabtests
fabtests-1.12.1-1.el8.x86_64



BNXT ROCE hosts with this issue:

Clients: rdma-dev-26
Servers: rdma-dev-25


How reproducible:

100%


Steps to Reproduce:
1. With the above build installed, run the following fabtests command on both hosts.
2. On the server, start fabtests first:

/usr/bin/runfabtests.sh -T 60 -vvv -t quick psm3 172.31.45.125 172.31.45.126 | tee -a fabtests_psm3_quick.log


3. On the client, run fabtests afterwards:

/usr/bin/runfabtests.sh -T 60 -vvv -t quick psm3 172.31.45.125 172.31.45.126 | tee -a fabtests_psm3_quick.log
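
For repeated runs, the invocation used in steps 2 and 3 can be assembled by a small helper. The following is a minimal Python sketch, not part of the original test harness: the function name `build_runfabtests_cmd` and its parameter names are hypothetical, and only the flags and values already shown in this report (`-T 60`, `-vvv`, `-t quick`, the `psm3` provider, and the two addresses) are used.

```python
import shlex


def build_runfabtests_cmd(provider, first_addr, second_addr,
                          timeout=60, test_set="quick",
                          script="/usr/bin/runfabtests.sh"):
    """Return the argv list for the runfabtests.sh call used in this report.

    The parameter names are illustrative; the argument order simply mirrors
    the command line quoted in the reproduction steps.
    """
    return [script, "-T", str(timeout), "-vvv", "-t", test_set,
            provider, first_addr, second_addr]


cmd = build_runfabtests_cmd("psm3", "172.31.45.125", "172.31.45.126")
# Render the list as a shell command line for logging or copy/paste.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on each host; appending the output to a log file, as the `tee -a` in the steps does, is left to the caller.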


Actual results:

After "journal -a", the following messages show on both hosts:

--------------------

Running python3 As root: 
TIME                            PID   UID   GID SIG COREFILE  EXE
Thu 2021-06-10 16:59:11 EDT   50987     0     0   6 present   /usr/bin/fi_av_xfer
Thu 2021-06-10 16:59:21 EDT   51031     0     0   6 present   /usr/bin/fi_av_xfer
Thu 2021-06-10 17:03:58 EDT   52017     0     0   6 present   /usr/bin/fi_rdm_shared_av
Thu 2021-06-10 17:05:35 EDT   52254     0     0   6 present   /usr/bin/fi_unexpected_msg
Thu 2021-06-10 17:10:31 EDT   53201     0     0   6 present   /usr/bin/fi_multi_recv
Thu 2021-06-10 19:05:32 EDT   84782     0     0   6 present   /usr/bin/fi_av_xfer
Thu 2021-06-10 19:05:42 EDT   84838     0     0   6 present   /usr/bin/fi_av_xfer
Thu 2021-06-10 19:10:19 EDT   86069     0     0   6 present   /usr/bin/fi_rdm_shared_av
Thu 2021-06-10 19:11:56 EDT   86355     0     0   6 present   /usr/bin/fi_unexpected_msg
Thu 2021-06-10 19:16:52 EDT   87554     0     0   6 present   /usr/bin/fi_multi_recv
total 9732
-rw-r-----. 1 root root 991635 Jun 10 16:59 core.fi_av_xfer.0.ced50e75bb394f62afaad264d9a096fd.50987.1623358750000000.lz4
-rw-r-----. 1 root root 991502 Jun 10 16:59 core.fi_av_xfer.0.ced50e75bb394f62afaad264d9a096fd.51031.1623358760000000.lz4
-rw-r-----. 1 root root 991723 Jun 10 19:05 core.fi_av_xfer.0.ced50e75bb394f62afaad264d9a096fd.84782.1623366331000000.lz4
-rw-r-----. 1 root root 991821 Jun 10 19:05 core.fi_av_xfer.0.ced50e75bb394f62afaad264d9a096fd.84838.1623366341000000.lz4
-rw-r-----. 1 root root 991241 Jun 10 17:10 core.fi_multi_recv.0.ced50e75bb394f62afaad264d9a096fd.53201.1623359431000000.lz4
-rw-r-----. 1 root root 990901 Jun 10 19:16 core.fi_multi_recv.0.ced50e75bb394f62afaad264d9a096fd.87554.1623367012000000.lz4
-rw-r-----. 1 root root 989521 Jun 10 17:03 core.fi_rdm_shared_a.0.ced50e75bb394f62afaad264d9a096fd.52017.1623359038000000.lz4
-rw-r-----. 1 root root 979146 Jun 10 19:10 core.fi_rdm_shared_a.0.ced50e75bb394f62afaad264d9a096fd.86069.1623366619000000.lz4
-rw-r-----. 1 root root 990147 Jun 10 17:05 core.fi_unexpected_m.0.ced50e75bb394f62afaad264d9a096fd.52254.1623359135000000.lz4
-rw-r-----. 1 root root 990060 Jun 10 19:11 core.fi_unexpected_m.0.ced50e75bb394f62afaad264d9a096fd.86355.1623366715000000.lz4
Red Hat Enterprise Linux release 8.5 Beta (Ootpa)


Firmware Bug, please contact your hardware vendor.
Firmware Bug, please contact your hardware vendor.
Jun 10 16:54:22 rdma-dev-26.lab.bos.redhat.com kernel: ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored

Jun 10 16:54:22 rdma-dev-26.lab.bos.redhat.com kernel: DMAR: [Firmware Bug]: RMRR entry for device 01:00.0 is broken - applying workaround

Errors found during boot. Please check if it's a bug. [message repeated 26 times]
Jun 10 16:54:22 rdma-dev-26.lab.bos.redhat.com kernel: ERST: Error Record Serialization Table (ERST) support is initialized.

Jun 10 20:54:33 rdma-dev-26.lab.bos.redhat.com kernel: ACPI Error: No handler for Region [SYSI] (00000000319b546e) [IPMI] (20201113/evregion-133)

Jun 10 20:54:33 rdma-dev-26.lab.bos.redhat.com kernel: ACPI Error: Region IPMI (ID=7) has no handler (20201113/exfldio-265)

Jun 10 20:54:33 rdma-dev-26.lab.bos.redhat.com kernel: ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20201113/psparse-531)

Jun 10 20:54:33 rdma-dev-26.lab.bos.redhat.com kernel: ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20201113/psparse-531)

Jun 10 20:54:33 rdma-dev-26.lab.bos.redhat.com kernel: ACPI Error: AE_NOT_EXIST, Evaluating _PMC (20201113/power_meter-756)

Jun 10 20:54:46 rdma-dev-26.lab.bos.redhat.com nm-dispatcher[1283]: req:6 'up' [bnxt_roce.43], "/etc/NetworkManager/dispatcher.d/98-bnxt_roce.43-egress.conf": complete: failed with Script '/etc/NetworkManager/dispatcher.d/98-bnxt_roce.43-egress.conf' exited with error status 127.

Jun 10 20:54:46 rdma-dev-26.lab.bos.redhat.com NetworkManager[1184]: <warn>  [1623372886.0344] dispatcher: (6) /etc/NetworkManager/dispatcher.d/98-bnxt_roce.43-egress.conf failed (failed): Script '/etc/NetworkManager/dispatcher.d/98-bnxt_roce.43-egress.conf' exited with error status 127.

Jun 10 20:54:46 rdma-dev-26.lab.bos.redhat.com nm-dispatcher[1283]: req:8 'up' [bnxt_roce.45], "/etc/NetworkManager/dispatcher.d/98-bnxt_roce.45-egress.conf": complete: failed with Script '/etc/NetworkManager/dispatcher.d/98-bnxt_roce.45-egress.conf' exited with error status 127.

Jun 10 20:54:46 rdma-dev-26.lab.bos.redhat.com NetworkManager[1184]: <warn>  [1623372886.0609] dispatcher: (8) /etc/NetworkManager/dispatcher.d/98-bnxt_roce.45-egress.conf failed (failed): Script '/etc/NetworkManager/dispatcher.d/98-bnxt_roce.45-egress.conf' exited with error status 127.

Jun 10 16:59:07 rdma-dev-26.lab.bos.redhat.com fi_av_xfer[50987]: (nic/PSM)[50987]: Process connect/disconnect error: 8, opcode 206

                                                                        #3  0x00001532f8753a9a psmi_handle_error (libfabric.so.1)

Jun 10 16:59:17 rdma-dev-26.lab.bos.redhat.com fi_av_xfer[51031]: (nic/PSM)[51031]: Process connect/disconnect error: 8, opcode 206

                                                                        #3  0x000014b0463d6a9a psmi_handle_error (libfabric.so.1)

Jun 10 17:05:32 rdma-dev-26.lab.bos.redhat.com fi_unexpected_msg[52254]: (nic/PSM)[52254]: Process connect/disconnect error: 8, opcode 206

                                                                        #3  0x0000148104eb4a9a psmi_handle_error (libfabric.so.1)

Jun 10 17:10:28 rdma-dev-26.lab.bos.redhat.com fi_multi_recv[53201]: (nic/PSM)[53201]: Process connect/disconnect error: 8, opcode 206

                                                                        #3  0x0000147a0c577a9a psmi_handle_error (libfabric.so.1)

Jun 10 19:05:28 rdma-dev-26.lab.bos.redhat.com fi_av_xfer[84782]: (nic/PSM)[84782]: Process connect/disconnect error: 8, opcode 206

                                                                        #3  0x0000145eba316a9a psmi_handle_error (libfabric.so.1)

Jun 10 19:05:38 rdma-dev-26.lab.bos.redhat.com fi_av_xfer[84838]: (nic/PSM)[84838]: Process connect/disconnect error: 8, opcode 206

                                                                        #3  0x000014de5b524a9a psmi_handle_error (libfabric.so.1)

Jun 10 19:11:52 rdma-dev-26.lab.bos.redhat.com fi_unexpected_msg[86355]: (nic/PSM)[86355]: Process connect/disconnect error: 8, opcode 206

                                                                        #3  0x000014a9ac0cda9a psmi_handle_error (libfabric.so.1)

Jun 10 19:16:49 rdma-dev-26.lab.bos.redhat.com fi_multi_recv[87554]: (nic/PSM)[87554]: Process connect/disconnect error: 8, opcode 206

                                                                        #3  0x00001463262c1a9a psmi_handle_error (libfabric.so.1)
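
The crash table and the core file names above can be summarized programmatically. The sketch below is illustrative only. It assumes the table rows have the column layout shown in this report (time, PID, UID, GID, SIG, COREFILE, EXE) and that the core files follow the `core.<comm>.<uid>.<boot-id>.<pid>.<timestamp-in-microseconds>` naming used by systemd-coredump, which is consistent with the listing but not confirmed here. The names `tally_crashes` and `decode_core_name` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Rows copied verbatim from the crash table in this report.
LISTING = """\
Thu 2021-06-10 16:59:11 EDT   50987     0     0   6 present   /usr/bin/fi_av_xfer
Thu 2021-06-10 16:59:21 EDT   51031     0     0   6 present   /usr/bin/fi_av_xfer
Thu 2021-06-10 17:03:58 EDT   52017     0     0   6 present   /usr/bin/fi_rdm_shared_av
Thu 2021-06-10 17:05:35 EDT   52254     0     0   6 present   /usr/bin/fi_unexpected_msg
Thu 2021-06-10 17:10:31 EDT   53201     0     0   6 present   /usr/bin/fi_multi_recv
Thu 2021-06-10 19:05:32 EDT   84782     0     0   6 present   /usr/bin/fi_av_xfer
Thu 2021-06-10 19:05:42 EDT   84838     0     0   6 present   /usr/bin/fi_av_xfer
Thu 2021-06-10 19:10:19 EDT   86069     0     0   6 present   /usr/bin/fi_rdm_shared_av
Thu 2021-06-10 19:11:56 EDT   86355     0     0   6 present   /usr/bin/fi_unexpected_msg
Thu 2021-06-10 19:16:52 EDT   87554     0     0   6 present   /usr/bin/fi_multi_recv
"""

ROW = re.compile(
    r"^\w{3} (?P<date>\S+) (?P<time>\S+) (?P<tz>\S+)\s+"
    r"(?P<pid>\d+)\s+(?P<uid>\d+)\s+(?P<gid>\d+)\s+(?P<sig>\d+)\s+"
    r"(?P<corefile>\S+)\s+(?P<exe>\S+)$"
)

CORE_NAME = re.compile(
    r"^core\.(?P<comm>.+)\.(?P<uid>\d+)\.(?P<boot_id>[0-9a-f]{32})"
    r"\.(?P<pid>\d+)\.(?P<usec>\d+)(?:\.\w+)?$"
)


def tally_crashes(listing):
    """Count crashes per (executable, signal) pair; non-matching lines are skipped."""
    counts = Counter()
    for line in listing.splitlines():
        m = ROW.match(line.strip())
        if m:
            counts[(m["exe"], int(m["sig"]))] += 1
    return counts


def decode_core_name(name):
    """Split an assumed systemd-coredump file name into (comm, pid, UTC time)."""
    m = CORE_NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognized core file name: {name!r}")
    when = datetime.fromtimestamp(int(m["usec"]) / 1_000_000, tz=timezone.utc)
    return m["comm"], int(m["pid"]), when


for (exe, sig), n in sorted(tally_crashes(LISTING).items()):
    print(f"{exe}: {n} crash(es), signal {sig}")

print(decode_core_name(
    "core.fi_av_xfer.0.ced50e75bb394f62afaad264d9a096fd.50987.1623358750000000.lz4"
))
```

Every row in the table carries signal 6, which is SIGABRT on Linux. The command names embedded in the core file names (for example `fi_rdm_shared_a`) are truncated relative to the executable names in the table, so matching the two is more reliable by PID than by name.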


Expected results:

No core files should be found after the "fabtests" run.
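
This expectation can be turned into a post-run check. The following is a minimal sketch under stated assumptions: the helper name `find_core_files` is hypothetical, the `fi_` prefix is taken from the executable names in this report, and the directory to scan must be supplied by the caller (on hosts using systemd-coredump the default storage location is `/var/lib/systemd/coredump`, which is assumed rather than confirmed for these machines). The demonstration runs against a temporary directory so that it does not depend on the host.

```python
import tempfile
from pathlib import Path


def find_core_files(directory, prefixes=("fi_",)):
    """Return names of core.* files whose command name starts with a prefix."""
    found = []
    for path in sorted(Path(directory).glob("core.*")):
        # File names look like core.<comm>.<rest>; take the <comm> field.
        comm = path.name.split(".", 2)[1]
        if comm.startswith(tuple(prefixes)):
            found.append(path.name)
    return found


# Demonstration in a scratch directory with made-up file names.
with tempfile.TemporaryDirectory() as scratch:
    Path(scratch, "core.fi_av_xfer.0.abc.1.2.lz4").touch()
    Path(scratch, "core.bash.0.abc.1.2.lz4").touch()
    leftover = find_core_files(scratch)
    print(leftover)
```

A test wrapper could fail the run whenever the returned list is non-empty, which matches the expected result stated above.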


Additional info:

Comment 2 Brian Chae 2021-07-09 13:17:23 UTC
Hi, Honggang, I will retest after the firmware upgrade and will post the result afterwards.

Comment 3 Afom T. Michael 2021-07-13 17:14:51 UTC
Hello Selvin/Broadcom,

Do you know if the FW on the cards in rdma-dev-25/26 needs to be updated? Any idea about this bug?

Thanks,
Afom

Comment 4 Selvin Xavier (Broadcom) 2021-07-18 18:40:24 UTC
Hi Afom,
I am not sure about the root cause of the issue. I have updated the FW on these two systems to the latest GA FW. Can you please try again?
Thanks,
Selvin

Comment 7 RHEL Program Management 2022-12-12 07:27:42 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.