Bug 1897576

Summary:	SAN Switch rebooted and caused (?) OpenStack compute node to reboot
Product:	Red Hat Enterprise Linux 7	Reporter:	ggrimaux
Component:	kernel	Assignee:	Dick Kennedy (Broadcom ECD) <dkennedy>
kernel sub component:	Storage Drivers	QA Contact:	Lin Li <lilin>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	acaringi, agk, bmarzins, bubrown, cwei, dkennedy, emilne, gconsalv, lilin, loberman, mircea.vutcovici, msnitzer, nmurray, nyewale, revers, toneata
Version:	7.7	Keywords:	Triaged, ZStream
Target Milestone:	rc
Target Release:	7.9
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	kernel-3.10.0-1160.39.1.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1984118		Environment:
Last Closed:	2021-08-31 09:09:32 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1982096, 1984118

Description ggrimaux 2020-11-13 13:56:08 UTC

Description of problem:
Twice in two weeks this OpenStack compute node was rebooted automatically.
Not sure about the first time, but the second time (yesterday) was linked to a SAN switch reboot (a known issue on the client side that will be fixed soon).

The SAN is not used for the operating system of the hypervisor (compute node), only for the instances running on it.

So we don't know why the host crashes.

We have crash dumps (both from yesterday and Nov 6th) on supportshell.

We need your help to understand how to prevent these crashes.
Thank you!

Below are some logs I got from /var/crash/127.0.0.1-2020-11-12-20\:37\:02/vmcore-dmesg.txt:
[571553.711255] device-mapper: multipath: Failing path 128:176.
[571553.711265] device-mapper: multipath: Failing path 70:160.
[571553.711274] device-mapper: multipath: Failing path 71:64.
[571553.711283] device-mapper: multipath: Failing path 71:32.
[571553.711292] device-mapper: multipath: Failing path 71:208.
[571553.711300] device-mapper: multipath: Failing path 71:240.
[571553.711309] device-mapper: multipath: Failing path 128:144.
[571553.711317] device-mapper: multipath: Failing path 128:112.
[571553.711331] device-mapper: multipath: Failing path 128:96.
[571553.711342] device-mapper: multipath: Failing path 128:0.
[571553.719778] sd 15:0:9:0: alua: port group 01 state A preferred supports tolusnA
[571553.719939] sd 15:0:9:0: alua: port group 01 state A preferred supports tolusnA
[571553.721913] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
[571553.730230] IP: [<ffffffffc069ce13>] __lpfc_sli_release_iocbq_s4+0x63/0x260 [lpfc]
[571553.738094] PGD 800000af8eb13067 PUD bdd0685067 PMD 0
[571553.743514] Oops: 0000 [#1] SMP
...
[571553.893851]  wmi drm_panel_orientation_quirks dm_multipath nfit libnvdimm sunrpc dm_mirror dm_region_hash dm_log dm_mod
[571553.904439] CPU: 15 PID: 2674 Comm: lpfc_worker_2 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.12.1.el7.x86_64 #1
[571553.917568] Hardware name: Dell Inc. PowerEdge R740/0JMK61, BIOS 2.6.4 04/09/2020
[571553.925642] task: ffff9fe7f067b150 ti: ffff9fe7e6fe4000 task.ti: ffff9fe7e6fe4000
[571553.933732] RIP: 0010:[<ffffffffc069ce13>]  [<ffffffffc069ce13>] __lpfc_sli_release_iocbq_s4+0x63/0x260 [lpfc]
[571553.944407] RSP: 0018:ffff9fe7e6fe7b20  EFLAGS: 00010046
[571553.950373] RAX: 0000000000100000 RBX: ffffa047eab88e00 RCX: 000000010040002e
[571553.958166] RDX: 0000000000000000 RSI: ffffa047eab88e00 RDI: ffffa047f4c1c000
[571553.965958] RBP: ffff9fe7e6fe7b50 R08: ffffa0478ffc7c40 R09: 000000010040002e
[571553.974153] R10: 000000008ffc7f01 R11: ffffa0478ffc7c40 R12: ffffa047f4c1c000
[571553.981969] R13: ffff9fe7f8874060 R14: ffffa047eab88e00 R15: ffffa047f4c1c000
[571553.989788] FS:  0000000000000000(0000) GS:ffffa0487d1c0000(0000) knlGS:0000000000000000
[571553.999033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[571554.005500] CR2: 0000000000000090 CR3: 0000009a9fab4000 CR4: 00000000007627e0
[571554.013454] PKRU: 00000000
[571554.016910] Call Trace:
[571554.020139]  [<ffffffffc069f407>] lpfc_sli_release_iocbq+0x37/0x60 [lpfc]
[571554.027696]  [<ffffffffc06bd83e>] lpfc_els_free_iocb+0x14e/0x1d0 [lpfc]
[571554.035084]  [<ffffffffc06c1b03>] lpfc_cmpl_els_prli+0xe3/0x210 [lpfc]
[571554.042392]  [<ffffffffc06a60bd>] lpfc_sli_sp_handle_rspiocb+0x3fd/0x780 [lpfc]
[571554.050491]  [<ffffffffc06cea36>] ? lpfc_mbx_cmpl_reg_login+0xe6/0x160 [lpfc]
[571554.058423]  [<ffffffff9c2af1f5>] ? mod_timer+0x1b5/0x230
[571554.064628]  [<ffffffffc06b0172>] lpfc_sli_handle_slow_ring_event_s4+0x192/0x260 [lpfc]
[571554.073466]  [<ffffffffc06a03e2>] lpfc_sli_handle_slow_ring_event+0x12/0x20 [lpfc]
[571554.081862]  [<ffffffffc06d3afc>] lpfc_work_done+0x94c/0x14a0 [lpfc]
[571554.089046]  [<ffffffff9c9805c2>] ? __schedule+0x402/0x840
[571554.095378]  [<ffffffffc06d46c0>] lpfc_do_work+0x70/0x1e0 [lpfc]
[571554.102236]  [<ffffffff9c2c72e0>] ? wake_up_atomic_t+0x30/0x30
[571554.108927]  [<ffffffffc06d4650>] ? lpfc_work_done+0x14a0/0x14a0 [lpfc]
[571554.116398]  [<ffffffff9c2c61f1>] kthread+0xd1/0xe0
[571554.122246]  [<ffffffff9c2c6120>] ? insert_kthread_work+0x40/0x40
[571554.129181]  [<ffffffff9c98dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[571554.136453]  [<ffffffff9c2c6120>] ? insert_kthread_work+0x40/0x40
[571554.143383] Code: 28 48 c7 00 00 00 00 00 4d 85 ed 0f 84 a6 00 00 00 8b 86 74 01 00 00 a9 00 00 80 00 0f 85 76 01 00 00 48 8b 97 98 02 00 00 a8 40 <4c> 8b b2 90 00 00 00 74 0b 41 83 7d 24 02 0f 85 19 01 00 00 4d
[571554.165427] RIP  [<ffffffffc069ce13>] __lpfc_sli_release_iocbq_s4+0x63/0x260 [lpfc]
[571554.173936]  RSP <ffff9fe7e6fe7b20>
[571554.178233] CR2: 0000000000000090

Version-Release number of selected component (if applicable):
3.10.0-1062.12.1.el7.x86_64

How reproducible:
Random but twice in 1 week.

Steps to Reproduce:
1. SAN switch reboot (crash)
2. Some random compute nodes will crash
3.

Actual results:
Server crashes

Expected results:
No crashes

Additional info:
We have dumps from both crashes on supportshell.

Comment 2 Ben Marzinski 2020-11-17 15:39:08 UTC

There's nothing in that stack trace that points to multipath. That code path is for the LPFC Fibre Channel driver. Reassigning to the storage drivers team.

Comment 3 Dick Kennedy (Broadcom ECD) 2020-11-18 21:05:38 UTC

Do you still have the logs from that machine or the vmcore-dmesg?

Comment 5 ggrimaux 2020-11-18 21:47:38 UTC

Hi Dick,

Yes I have that info.

I am adding the last part of it which I think is what you need/want:

[571553.704834] sd 15:0:9:0: alua: port group 01 state A preferred supports tolusnA
[571553.705012] sd 15:0:9:0: alua: port group 01 state A preferred supports tolusnA
[571553.711131] scsi 17:0:22:0: alua: Detached
[571553.711255] device-mapper: multipath: Failing path 128:176.
[571553.711265] device-mapper: multipath: Failing path 70:160.
[571553.711274] device-mapper: multipath: Failing path 71:64.
[571553.711283] device-mapper: multipath: Failing path 71:32.
[571553.711292] device-mapper: multipath: Failing path 71:208.
[571553.711300] device-mapper: multipath: Failing path 71:240.
[571553.711309] device-mapper: multipath: Failing path 128:144.
[571553.711317] device-mapper: multipath: Failing path 128:112.
[571553.711331] device-mapper: multipath: Failing path 128:96.
[571553.711342] device-mapper: multipath: Failing path 128:0.
[571553.719778] sd 15:0:9:0: alua: port group 01 state A preferred supports tolusnA
[571553.719939] sd 15:0:9:0: alua: port group 01 state A preferred supports tolusnA
[571553.721913] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
[571553.730230] IP: [<ffffffffc069ce13>] __lpfc_sli_release_iocbq_s4+0x63/0x260 [lpfc]
[571553.738094] PGD 800000af8eb13067 PUD bdd0685067 PMD 0
[571553.743514] Oops: 0000 [#1] SMP
[571553.747017] Modules linked in: veth macsec binfmt_misc vhost_net vhost macvtap macvlan tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag fuse ebtable_filter ebtables tun ip6table_security ip6table_raw ip6table_mangle iptable_raw iptable_mangle overlay(T) vrouter(OE) 8021q garp mrp sch_ingress bonding openvswitch nf_nat_ipv6 nls_utf8 isofs nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_comment xt_multiport xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_umad rpcrdma rdma_ucm ib_uverbs ib_iser rdma_cm iw_cm ib_cm libiscsi iTCO_wdt iTCO_vendor_support
[571553.820894]  bnxt_re ib_core skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass dell_smbios dcdbas dell_wmi_descriptor pcspkr pcc_cpufreq i2c_i801 mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler tpm_crb acpi_power_meter acpi_pad ses enclosure scsi_transport_sas nf_conntrack sg br_netfilter bridge stp llc ip_tables xfs libcrc32c dm_service_time sd_mod lpfc mgag200 i2c_algo_bit drm_kms_helper crc32_pclmul crc32c_intel ghash_clmulni_intel syscopyarea aesni_intel sysfillrect lrw sysimgblt gf128mul fb_sys_fops glue_helper nvmet_fc ablk_helper ttm cryptd nvmet tg3 scsi_transport_iscsi crc_t10dif crct10dif_generic ahci crct10dif_pclmul nvme_fc drm nvme_fabrics libahci ptp bnxt_en megaraid_sas nvme_core pps_core scsi_transport_fc libata scsi_tgt crct10dif_common devlink
[571553.893851]  wmi drm_panel_orientation_quirks dm_multipath nfit libnvdimm sunrpc dm_mirror dm_region_hash dm_log dm_mod
[571553.904439] CPU: 15 PID: 2674 Comm: lpfc_worker_2 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.12.1.el7.x86_64 #1
[571553.917568] Hardware name: Dell Inc. PowerEdge R740/0JMK61, BIOS 2.6.4 04/09/2020
[571553.925642] task: ffff9fe7f067b150 ti: ffff9fe7e6fe4000 task.ti: ffff9fe7e6fe4000
[571553.933732] RIP: 0010:[<ffffffffc069ce13>]  [<ffffffffc069ce13>] __lpfc_sli_release_iocbq_s4+0x63/0x260 [lpfc]
[571553.944407] RSP: 0018:ffff9fe7e6fe7b20  EFLAGS: 00010046
[571553.950373] RAX: 0000000000100000 RBX: ffffa047eab88e00 RCX: 000000010040002e
[571553.958166] RDX: 0000000000000000 RSI: ffffa047eab88e00 RDI: ffffa047f4c1c000
[571553.965958] RBP: ffff9fe7e6fe7b50 R08: ffffa0478ffc7c40 R09: 000000010040002e
[571553.974153] R10: 000000008ffc7f01 R11: ffffa0478ffc7c40 R12: ffffa047f4c1c000
[571553.981969] R13: ffff9fe7f8874060 R14: ffffa047eab88e00 R15: ffffa047f4c1c000
[571553.989788] FS:  0000000000000000(0000) GS:ffffa0487d1c0000(0000) knlGS:0000000000000000
[571553.999033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[571554.005500] CR2: 0000000000000090 CR3: 0000009a9fab4000 CR4: 00000000007627e0
[571554.013454] PKRU: 00000000
[571554.016910] Call Trace:
[571554.020139]  [<ffffffffc069f407>] lpfc_sli_release_iocbq+0x37/0x60 [lpfc]
[571554.027696]  [<ffffffffc06bd83e>] lpfc_els_free_iocb+0x14e/0x1d0 [lpfc]
[571554.035084]  [<ffffffffc06c1b03>] lpfc_cmpl_els_prli+0xe3/0x210 [lpfc]
[571554.042392]  [<ffffffffc06a60bd>] lpfc_sli_sp_handle_rspiocb+0x3fd/0x780 [lpfc]
[571554.050491]  [<ffffffffc06cea36>] ? lpfc_mbx_cmpl_reg_login+0xe6/0x160 [lpfc]
[571554.058423]  [<ffffffff9c2af1f5>] ? mod_timer+0x1b5/0x230
[571554.064628]  [<ffffffffc06b0172>] lpfc_sli_handle_slow_ring_event_s4+0x192/0x260 [lpfc]
[571554.073466]  [<ffffffffc06a03e2>] lpfc_sli_handle_slow_ring_event+0x12/0x20 [lpfc]
[571554.081862]  [<ffffffffc06d3afc>] lpfc_work_done+0x94c/0x14a0 [lpfc]
[571554.089046]  [<ffffffff9c9805c2>] ? __schedule+0x402/0x840
[571554.095378]  [<ffffffffc06d46c0>] lpfc_do_work+0x70/0x1e0 [lpfc]
[571554.102236]  [<ffffffff9c2c72e0>] ? wake_up_atomic_t+0x30/0x30
[571554.108927]  [<ffffffffc06d4650>] ? lpfc_work_done+0x14a0/0x14a0 [lpfc]
[571554.116398]  [<ffffffff9c2c61f1>] kthread+0xd1/0xe0
[571554.122246]  [<ffffffff9c2c6120>] ? insert_kthread_work+0x40/0x40
[571554.129181]  [<ffffffff9c98dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[571554.136453]  [<ffffffff9c2c6120>] ? insert_kthread_work+0x40/0x40
[571554.143383] Code: 28 48 c7 00 00 00 00 00 4d 85 ed 0f 84 a6 00 00 00 8b 86 74 01 00 00 a9 00 00 80 00 0f 85 76 01 00 00 48 8b 97 98 02 00 00 a8 40 <4c> 8b b2 90 00 00 00 74 0b 41 83 7d 24 02 0f 85 19 01 00 00 4d
[571554.165427] RIP  [<ffffffffc069ce13>] __lpfc_sli_release_iocbq_s4+0x63/0x260 [lpfc]
[571554.173936]  RSP <ffff9fe7e6fe7b20>
[571554.178233] CR2: 0000000000000090

Above that part there are only more multipath failures, which I don't think you want.
Let me know if that's enough or not.

Thanks a lot!

Comment 6 ggrimaux 2020-12-07 09:21:05 UTC

Hi Dick.

Any update on this?

Thanks.

Comment 7 Dick Kennedy (Broadcom ECD) 2020-12-07 18:42:20 UTC

Can you attach the whole vmcore-dmesg file?

Comment 10 ggrimaux 2020-12-08 10:58:00 UTC

Hi Dick,

I added the two vmcore-dmesg about two distinct crashes that seem to point to the same issue.

If you need anything else please let me know.

Thank you.

Comment 12 ggrimaux 2021-01-05 13:00:05 UTC

Hi Dick,

Could we have an update on this please?

Thank you.

Comment 13 ggrimaux 2021-01-13 10:22:30 UTC

Hi Dick,

Could we have an update on this please?

Thank you.

Comment 14 Dick Kennedy (Broadcom ECD) 2021-01-19 14:19:40 UTC

I was hoping that you would provide the whole vmcore-dmesg file. 

The output you have put in the bz is a start, but I wanted to see whether, when the switch is rebooted, it sends RSCNs first or the link goes down first.

I have not seen this on RHEL 7 before, which is why I want to understand the actual steps that the switch takes.

The NULL dereference in the release path might be an overwrite, because all the functions in the call stack had to use that same pointer.

I will look closer at the trace and update the bz.
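
For context on the oops in the description: a faulting address of 0x90 is what a NULL base pointer plus a member offset looks like. The Code: line marks the faulting instruction as `<4c> 8b b2 90 00 00 00`, which decodes to a 64-bit load from 0x90(%rdx), and RDX is 0 in the register dump. A minimal userspace sketch of that arithmetic follows. The struct name, the padding, and the helper are hypothetical stand-ins chosen only to place a member at offset 0x90; they are not the real lpfc layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the object RDX was expected to point at.
 * Only the 0x90 offset comes from the oops; the padding is invented
 * purely to reproduce that offset. */
struct model_queue {
    unsigned char pad[0x90];
    void *pring;              /* member the faulting load tried to read */
};

/* Compile-time check that the model really puts the member at 0x90. */
static_assert(offsetof(struct model_queue, pring) == 0x90,
              "model must match the offset seen in the oops");

/* Address the CPU would touch when reading ->pring from a given base.
 * Nothing is dereferenced, so a NULL (0) base is safe to pass here. */
static uintptr_t pring_load_address(uintptr_t queue_base)
{
    return queue_base + offsetof(struct model_queue, pring);
}
```

With a base of 0, the helper returns 0x90, which matches the CR2 value in both crash dumps. That is consistent with a NULL queue pointer rather than a corrupted one, though the sketch alone cannot rule out an overwrite.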

Comment 15 Mircea Vutcovici 2021-01-19 21:24:12 UTC

Hi Dick,

The vmcore-dmesg is complete. The problem is that dmesg is stored in a circular buffer of 1MB, which matches the file size.
The ring size is configured by the CONFIG_LOG_BUF_SHIFT kernel option, and it is set to 20 (2^20 bytes = 1MB).
You can see this config var with: grep CONFIG_LOG_BUF_SHIFT /boot/config-$(uname -r)

Thank you, Mircea
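
The shift-to-size arithmetic in the comment above can be checked with a few lines of C. This is only an illustrative sketch: `log_buf_bytes` is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a log buffer whose length is expressed as a
 * power-of-two shift, the way CONFIG_LOG_BUF_SHIFT expresses it.
 * Hypothetical helper for illustration only. */
static size_t log_buf_bytes(unsigned int shift)
{
    assert(shift < sizeof(size_t) * 8);   /* keep the shift well defined */
    return (size_t)1 << shift;
}
```

For a shift of 20 the helper returns 1048576 bytes, that is 1MB, so any dmesg output beyond that amount has already been overwritten by the time the vmcore is captured.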

Comment 16 Dick Kennedy (Broadcom ECD) 2021-01-26 19:25:00 UTC

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=34523484

Can you try this brew build?

I added a check for els_wq before the dereference.

Comment 17 ggrimaux 2021-01-27 10:20:04 UTC

Hi Dick,

Sorry I should have put the flag for backport to RHEL 7.7. 

We need to keep the kernel the same for the OpenStack version.

For this particular setup it is using a RHEL7.7 kernel.

Also, is this safe to run in production?

Thank you.

Comment 18 Ewan D. Milne 2021-02-02 19:49:29 UTC

DO NOT RUN TEST KERNELS FROM ENGINEERING IN A PRODUCTION ENVIRONMENT

We need GSS involvement in case there are other support issues.

cc: Laurence

Dick, please attach the patch you used.

Your kernel 3.10.0-1062.12.1.el7 is pretty old, you are missing a bunch of
CVE fixes, can you run a later version of 7.7.z ?

Comment 20 loberman 2021-02-02 20:18:09 UTC

@ggrimaux

Support will usually hand out test kernels but we always use the BZ #. 
We also always save the src.rpm
Just a best practice please if Eng builds these.
This is done so that we can track the kernel back to a BZ if a customer installs a test kernel and just leaves it there even though they are not supposed to.

If we build it not from a BZ, we use the sfdc case #

We also always add this.

NOTE:

This RPM has been provided by Red Hat for testing purposes only and is NOT supported for any other use. 
This RPM may contain changes that are necessary for debugging but that are not appropriate for other uses, or that are not compatible with third-party hardware or software. This RPM should NOT be deployed for purposes other than testing and debugging.

Thanks a lot
Laurence Oberman

Comment 21 loberman 2021-02-02 20:21:05 UTC

Also note. 
A best practice is to add log_buf_len=64M to the kernel grub line to avoid the ring buffer wrap around.
I am hoping to get that to be the default as we seldom have small memory servers in support now.

Regards
Laurence

Comment 22 Dick Kennedy (Broadcom ECD) 2021-02-03 15:27:50 UTC

From a8548c7fac49bf8bfe456215a55219983488ba06 Mon Sep 17 00:00:00 2001
From: Dick Kennedy <dkennedy>
Date: Tue, 26 Jan 2021 11:32:16 -0500
Subject: [rhel-7.7.z PATCH e-stor] Fix for crash in
 __lpfc_sli_release_iocbq_s4 Moved the pring assignment inside a if of the
 els_wq.

---
 drivers/scsi/lpfc/lpfc_sli.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
index e73b5866d062..b3c35fb36047 100644
--- a/drivers/scsi/lpfc/lpfc_sli.c
+++ b/drivers/scsi/lpfc/lpfc_sli.c
@@ -1254,8 +1254,6 @@ __lpfc_sli_release_iocbq_s4(struct lpfc_hba *phba, struct lpfc_iocbq *iocbq)

        lockdep_assert_held(&phba->hbalock);

-       lockdep_assert_held(&phba->hbalock); << THis needs to be removed in a different patch.
-
        if (iocbq->sli4_xritag == NO_XRI)
                sglq = NULL;
       else
@@ -1275,7 +1273,6 @@ __lpfc_sli_release_iocbq_s4(struct lpfc_hba *phba, struct lpfc_iocbq *iocbq)
                        goto out;
                }

-               pring = phba->sli4_hba.els_wq->pring;
                if ((iocbq->iocb_flag & LPFC_EXCHANGE_BUSY) &&
                        (sglq->state != SGL_XRI_ABORTED)) {
                        spin_lock_irqsave(&phba->sli4_hba.sgl_list_lock,
@@ -1295,8 +1292,11 @@ __lpfc_sli_release_iocbq_s4(struct lpfc_hba *phba, struct lpfc_iocbq *iocbq)
                                &phba->sli4_hba.sgl_list_lock, iflag);

                        /* Check if TXQ queue needs to be serviced */
-                       if (!list_empty(&pring->txq))
-                               lpfc_worker_wake_up(phba);
+                       if (phba->sli4_hba.els_wq) {
+                               pring = phba->sli4_hba.els_wq->pring;
+                               if (!list_empty(&pring->txq))
+                                       lpfc_worker_wake_up(phba);
+                       }
                }
        }

-- 
2.18.1
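
The shape of the change in the patch above can be modelled outside the kernel. The sketch below is a hypothetical userspace reduction: every struct and function name is invented for illustration, and it only mirrors the idea of testing els_wq before reading pring through it. It is not the driver code and says nothing about why els_wq is NULL at that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the driver structures. */
struct model_ring  { int txq_len; };              /* pending transmit entries */
struct model_queue { struct model_ring *pring; };
struct model_hba   { struct model_queue *els_wq; int wakeups; };

/* Mirrors the patched logic: the ring is only looked at when els_wq
 * is present, so a NULL els_wq no longer leads to a dereference. */
static bool maybe_wake_worker(struct model_hba *hba)
{
    assert(hba != NULL);

    if (hba->els_wq == NULL)
        return false;             /* els_wq absent: skip the TXQ check */

    struct model_ring *pring = hba->els_wq->pring;
    if (pring->txq_len > 0) {
        hba->wakeups++;           /* stands in for lpfc_worker_wake_up(phba) */
        return true;
    }
    return false;
}
```

In this model a NULL `els_wq` simply returns false, while a present queue with pending entries increments `wakeups`. The pre-patch ordering corresponds to reading `hba->els_wq->pring` before any check, which is the access that faulted.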

Comment 23 Dick Kennedy (Broadcom ECD) 2021-02-03 15:35:34 UTC

That patch is not upstream yet; it is in our queue to go upstream.

Comment 24 Dick Kennedy (Broadcom ECD) 2021-02-23 20:00:37 UTC

https://marc.info/?l=linux-scsi&m=161308728219646&w=2

Posted but not accepted yet.

Comment 27 Ewan D. Milne 2021-02-24 17:52:04 UTC

Dick Kennedy is a partner engineer from Broadcom and cannot see private comments.
You need to make your comments and attachments public for him to read them.
Or, we can add the Broadcom ECD group to this BZ so they can see more things (I think
you can do this on a per-comment basis in BZ now).

Having said that, the stack trace does look like the same problem.

[170618.556788] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
[170618.564978] IP: [<ffffffffc0647e13>] __lpfc_sli_release_iocbq_s4+0x63/0x260 [lpfc]
[170618.572896] PGD 0
[170618.575206] Oops: 0000 [#1] SMP
[170618.578754] Modules linked in: vhost_net vhost macvtap macvlan dm_service_time udp_diag unix_diag af_packet_diag netlink_diag tcp_diag inet_diag fuse ebtable_filter ebtables ip6table_security ip6table_raw ip6table_mangle iptable_raw iptable_mangle tun overlay(T) vrouter(OE) 8021q garp mrp sch_ingress bonding openvswitch nf_nat_ipv6 nls_utf8 isofs nf_log_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_comment xt_multiport xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib libiscsi scsi_transport_iscsi ib_cm iw_cm dm_multipath
[170618.651968]  skx_edac nfit libnvdimm intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel dm_mod lrw gf128mul glue_helper ablk_helper cryptd pcspkr ses enclosure mlx5_ib sg mei_me mei ib_uverbs hpwdt hpilo lpc_ich wmi ib_core tpm_crb ipmi_si pcc_cpufreq ipmi_devintf ipmi_msghandler acpi_power_meter nf_conntrack br_netfilter bridge stp llc ip_tables xfs libcrc32c sd_mod lpfc mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core ttm drm nvmet_fc nvmet crc32c_intel crc_t10dif crct10dif_generic crct10dif_pclmul nvme_fc smartpqi nvme_fabrics tg3 nvme_core mlxfw scsi_transport_fc devlink scsi_transport_sas scsi_tgt ptp crct10dif_common drm_panel_orientation_quirks pps_core uas usb_storage
[170618.723887] CPU: 14 PID: 15800 Comm: lpfc_worker_1 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
[170618.737113] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 11/13/2019
[170618.746288] task: ffffa144ef06d230 ti: ffffa144e6620000 task.ti: ffffa144e6620000
[170618.754426] RIP: 0010:[<ffffffffc0647e13>]  [<ffffffffc0647e13>] __lpfc_sli_release_iocbq_s4+0x63/0x260 [lpfc]
[170618.765164] RSP: 0018:ffffa144e6623b20  EFLAGS: 00010046
[170618.771156] RAX: 0000000000100000 RBX: ffffa084f12d4800 RCX: 0000000100400035
[170618.778982] RDX: 0000000000000000 RSI: ffffa084f12d4800 RDI: ffffa084ccf58000
[170618.786803] RBP: ffffa144e6623b50 R08: ffffa144ddb48900 R09: 0000000100400035
[170618.794636] R10: 00000000ddb48e01 R11: ffffa144ddb48900 R12: ffffa084ccf58000
[170618.802472] R13: ffffa084ca85ede0 R14: ffffa084f12d4800 R15: ffffa084ccf58000
[170618.810320] FS:  0000000000000000(0000) GS:ffffa0857fb80000(0000) knlGS:0000000000000000
[170618.819158] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[170618.825632] CR2: 0000000000000090 CR3: 000000a5c1e56000 CR4: 00000000007627e0
[170618.833514] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[170618.841392] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[170618.849272] PKRU: 00000000
[170618.852703] Call Trace:
[170618.855904]  [<ffffffffc064a407>] lpfc_sli_release_iocbq+0x37/0x60 [lpfc]
[170618.863466]  [<ffffffffc066883e>] lpfc_els_free_iocb+0x14e/0x1d0 [lpfc]
[170618.870858]  [<ffffffffc066cb03>] lpfc_cmpl_els_prli+0xe3/0x210 [lpfc]
[170618.878165]  [<ffffffffc06510bd>] lpfc_sli_sp_handle_rspiocb+0x3fd/0x780 [lpfc]
[170618.886272]  [<ffffffffc0679a36>] ? lpfc_mbx_cmpl_reg_login+0xe6/0x160 [lpfc]
[170618.894210]  [<ffffffffb2eaf1f5>] ? mod_timer+0x1b5/0x230
[170618.900410]  [<ffffffffc065b172>] lpfc_sli_handle_slow_ring_event_s4+0x192/0x260 [lpfc]
[170618.909240]  [<ffffffffc064b3e2>] lpfc_sli_handle_slow_ring_event+0x12/0x20 [lpfc]
[170618.917641]  [<ffffffffc067eafc>] lpfc_work_done+0x94c/0x14a0 [lpfc]
[170618.924821]  [<ffffffffb35805a2>] ? __schedule+0x402/0x840
[170618.931139]  [<ffffffffc067f6c0>] lpfc_do_work+0x70/0x1e0 [lpfc]
[170618.937983]  [<ffffffffb2ec72e0>] ? wake_up_atomic_t+0x30/0x30
[170618.944637]  [<ffffffffc067f650>] ? lpfc_work_done+0x14a0/0x14a0 [lpfc]
[170618.952063]  [<ffffffffb2ec61f1>] kthread+0xd1/0xe0
[170618.957736]  [<ffffffffb2ec6120>] ? insert_kthread_work+0x40/0x40
[170618.964626]  [<ffffffffb358dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[170618.971871]  [<ffffffffb2ec6120>] ? insert_kthread_work+0x40/0x40
[170618.978750] Code: 28 48 c7 00 00 00 00 00 4d 85 ed 0f 84 a6 00 00 00 8b 86 74 01 00 00 a9 00 00 80 00 0f 85 76 01 00 00 48 8b 97 98 02 00 00 a8 40 <4c> 8b b2 90 00 00 00 74 0b 41 83 7d 24 02 0f 85 19 01 00 00 4d
[170618.999774] RIP  [<ffffffffc0647e13>] __lpfc_sli_release_iocbq_s4+0x63/0x260 [lpfc]
[170619.008228]  RSP <ffffa144e6623b20>
[170619.012467] CR2: 0000000000000090

Comment 29 Dick Kennedy (Broadcom ECD) 2021-03-09 15:00:44 UTC

The stack trace that Ewan is talking about is the same problem. It is the same kernel as before. The kernel that I built had a different rev. 

Anyway, I have not posted the patch yet; I am still trying to get my GitLab config up.
Ewan, how do I proceed?

Comment 30 Ewan D. Milne 2021-03-11 13:45:57 UTC

The BZ has to be cloned for RHEL8, and the problem fixed in RHEL8 before
it is fixed in RHEL7, to avoid a regression if an upgrade occurs to RHEL8.

When the patch is accepted upstream, you need to submit an MR to 8.5 and 7.9.z.

If a fix is desired in an earlier zStream, then we need to request that
it be fixed, e.g. in 7.7.z and 7.8.z, unless the system can upgrade to 7.9.z.

Gregoire, please advise if the customer is able to upgrade to 7.9.z,
it could take a while for a zStream fix to be available once the patch
is accepted upstream.

Comment 31 ggrimaux 2021-03-11 14:44:51 UTC

Hi Ewan,

Client can't update to 7.9 Kernel.

We will need this to be backported to the 7.7 branch.

I added a hotfix flag on this BZ.

Thank you.

Comment 36 Ewan D. Milne 2021-05-18 18:43:38 UTC

Dick, what is the status of the fix for this?  It will need an MR for 7.9.z
after being fixed in RHEL8.

Comment 37 Gianluca Consalvi 2021-06-15 12:12:48 UTC

Hi,
Could we have the status of this backport to RHEL 7.7?
thanks
Gianluca

Comment 41 Rob Evers 2021-07-20 15:59:17 UTC

Dick, do you know whether the patch is appropriate for backport to RHEL 7.7.z?

Comment 48 Lin Li 2021-08-06 07:54:07 UTC

Hi Dick/Rob/Ewan,
I hit a firmware bug with kernel-3.10.0-1160.39.1.el7.
Could you check if it is related to your patches?
Thanks in advance!

beaker job: https://beaker.engineering.redhat.com/recipes/10442345#task130056041
console log: https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/08/56724/5672435/10442345/console.log



[19074.943674] WARNING: CPU: 0 PID: 44837 at drivers/base/firmware_class.c:1035 _request_firmware.isra.9+0x686/0x6d0 
[19074.997999] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel dm_service_time lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif sp5100_tco joydev pcspkr sg i2c_piix4 k10temp fam15h_power hpwdt hpilo ipmi_si ipmi_devintf ipmi_msghandler dm_multipath acpi_power_meter ip_tables xfs libcrc32c sd_mod radeon lpfc i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm nvmet_fc nvmet ahci crc_t10dif crct10dif_generic ata_generic nvme_fc pata_acpi nvme_fabrics drm nvme_core libahci pata_atiixp scsi_transport_fc be2net libata crct10dif_pclmul crc32c_intel hpsa scsi_tgt serio_raw crct10dif_common netxen_nic drm_panel_orientation_quirks scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod 
[19075.387473] CPU: 0 PID: 44837 Comm: kworker/0:1 Kdump: loaded Not tainted 3.10.0-1160.39.1.el7.x86_64 #1 
[19075.432443] Hardware name: HP ProLiant DL585 G7, BIOS A16 06/04/2013 
[19075.465445] Workqueue: events netxen_fwinit_work [netxen_nic] 
[19075.495491] Call Trace: 
[19075.508598]  [<ffffffffb8783539>] dump_stack+0x19/0x1b 
[19075.536225]  [<ffffffffb809b278>] __warn+0xd8/0x100 
[19075.562766]  [<ffffffffb809b3bd>] warn_slowpath_null+0x1d/0x20 
[19075.593099]  [<ffffffffb84cdb96>] _request_firmware.isra.9+0x686/0x6d0 
[19075.624609]  [<ffffffffb80e8ea3>] ? load_balance+0x1a3/0xa10 
[19075.651368]  [<ffffffffb84cdc0e>] request_firmware+0x2e/0x40 
[19075.680176]  [<ffffffffc02b3d4c>] netxen_request_firmware+0xec/0x640 [netxen_nic] 
[19075.715214]  [<ffffffffc02aab7a>] ? netxen_nic_hw_read_wx_2M+0x3a/0x100 [netxen_nic] 
[19075.753809]  [<ffffffffc02aab7a>] ? netxen_nic_hw_read_wx_2M+0x3a/0x100 [netxen_nic] 
[19075.790748]  [<ffffffffc02ae829>] netxen_start_firmware+0x1e9/0xbd0 [netxen_nic] 
[19075.826922]  [<ffffffffb8035c19>] ? sched_clock+0x9/0x10 
[19075.851777]  [<ffffffffb80de305>] ? sched_clock_cpu+0x85/0xc0 
[19075.880966]  [<ffffffffc02aab7a>] ? netxen_nic_hw_read_wx_2M+0x3a/0x100 [netxen_nic] 
[19075.917192]  [<ffffffffc02aab7a>] ? netxen_nic_hw_read_wx_2M+0x3a/0x100 [netxen_nic] 
[19075.955616]  [<ffffffffc02af2f6>] netxen_fwinit_work+0xe6/0x230 [netxen_nic] 
[19075.989153]  [<ffffffffb80bde8f>] process_one_work+0x17f/0x440 
[19076.020666]  [<ffffffffb80befa6>] worker_thread+0x126/0x3c0 
[19076.050745]  [<ffffffffb80bee80>] ? manage_workers.isra.26+0x2a0/0x2a0 
[19076.084552]  [<ffffffffb80c5e61>] kthread+0xd1/0xe0 
[19076.109377]  [<ffffffffb80c5d90>] ? insert_kthread_work+0x40/0x40 
[19076.140165]  [<ffffffffb8795de4>] ret_from_fork_nospec_begin+0xe/0x21 
[19076.174197]  [<ffffffffb80c5d90>] ? insert_kthread_work+0x40/0x40 
[19076.203364] ---[ end trace a779021c7b160b5c ]--- 
[19076.226207] netxen_nic 0000:04:00.0: firmware: phanfw.bin will not be loaded 
[19076.261036] ------------[ cut here ]------------ 
[19076.282669] WARNING: CPU: 0 PID: 44837 at drivers/base/firmware_class.c:1035 _request_firmware.isra.9+0x686/0x6d0 
[19076.333154] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel dm_service_time lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif sp5100_tco joydev pcspkr sg i2c_piix4 k10temp fam15h_power hpwdt hpilo ipmi_si ipmi_devintf ipmi_msghandler dm_multipath acpi_power_meter ip_tables xfs libcrc32c sd_mod radeon lpfc i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm nvmet_fc nvmet ahci crc_t10dif crct10dif_generic ata_generic nvme_fc pata_acpi nvme_fabrics drm nvme_core libahci pata_atiixp scsi_transport_fc be2net libata crct10dif_pclmul crc32c_intel hpsa scsi_tgt serio_raw crct10dif_common netxen_nic drm_panel_orientation_quirks scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod 
[19076.717631] CPU: 0 PID: 44837 Comm: kworker/0:1 Kdump: loaded Tainted: G        W      ------------   3.10.0-1160.39.1.el7.x86_64 #1 
[19076.774951] Hardware name: HP ProLiant DL585 G7, BIOS A16 06/04/2013 
[19076.806867] Workqueue: events netxen_fwinit_work [netxen_nic] 
[19076.833392] Call Trace: 
[19076.845041]  [<ffffffffb8783539>] dump_stack+0x19/0x1b 
[19076.871202]  [<ffffffffb809b278>] __warn+0xd8/0x100 
[19076.894064]  [<ffffffffb809b3bd>] warn_slowpath_null+0x1d/0x20 
[19076.922778]  [<ffffffffb84cdb96>] _request_firmware.isra.9+0x686/0x6d0 
[19076.954232]  [<ffffffffc02aab7a>] ? netxen_nic_hw_read_wx_2M+0x3a/0x100 [netxen_nic] 
[19076.992218]  [<ffffffffb84cdc0e>] request_firmware+0x2e/0x40 
[19077.020032]  [<ffffffffc02b3d4c>] netxen_request_firmware+0xec/0x640 [netxen_nic] 
[19077.058137]  [<ffffffffc02aab7a>] ? netxen_nic_hw_read_wx_2M+0x3a/0x100 [netxen_nic] 
[19077.098123]  [<ffffffffc02aab7a>] ? netxen_nic_hw_read_wx_2M+0x3a/0x100 [netxen_nic] 
[19077.140465]  [<ffffffffc02ae829>] netxen_start_firmware+0x1e9/0xbd0 [netxen_nic] 
[19077.179219]  [<ffffffffb8035c19>] ? sched_clock+0x9/0x10 
[19077.207699]  [<ffffffffb80de305>] ? sched_clock_cpu+0x85/0xc0 
[19077.243718]  [<ffffffffc02aab7a>] ? netxen_nic_hw_read_wx_2M+0x3a/0x100 [netxen_nic] 
[19077.282066]  [<ffffffffc02aab7a>] ? netxen_nic_hw_read_wx_2M+0x3a/0x100 [netxen_nic] 
[19077.318868]  [<ffffffffc02af2f6>] netxen_fwinit_work+0xe6/0x230 [netxen_nic] 
[19077.354382]  [<ffffffffb80bde8f>] process_one_work+0x17f/0x440 
[19077.383260]  [<ffffffffb80befa6>] worker_thread+0x126/0x3c0 
[19077.411161]  [<ffffffffb80bee80>] ? manage_workers.isra.26+0x2a0/0x2a0 
[19077.443724]  [<ffffffffb80c5e61>] kthread+0xd1/0xe0 
[19077.466607]  [<ffffffffb80c5d90>] ? insert_kthread_work+0x40/0x40 
[19077.497795]  [<ffffffffb8795de4>] ret_from_fork_nospec_begin+0xe/0x21 
[19077.527818]  [<ffffffffb80c5d90>] ? insert_kthread_work+0x40/0x40 
[19077.557837] ---[ end trace a779021c7b160b5d ]--- 
[19077.580682] netxen_nic 0000:04:00.0: firmware: nx3fwct.bin will not be loaded 
[19087.293997] netxen_nic: failed card response code:0x10 
[19087.321820] netxen_nic 0000:04:00.0: Failed to setup minidump rcode = -5 
[19087.738398] Restarting system.

Comment 49 Ewan D. Milne 2021-08-09 16:24:46 UTC

That looks like a longstanding problem, see e.g. bug 1425130 which was CLOSED WONTFIX.

Search BZ for RHEL7 kernel "Comment contains the string 'drivers/base/firmware_class.c'".

In any case the netxen nic driver does not have anything to do with the lpfc driver.
So it does not appear to be related.

Comment 50 Lin Li 2021-08-09 23:07:29 UTC

Move to verified according to comment 48 and comment 49.

Comment 55 errata-xmlrpc 2021-08-31 09:09:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3327