Description of problem:
My lab box is intermittently crashing with a kernel NULL pointer dereference. The box is mainly used to host rhel8.4+ kvm guests. It's available in the bos lab: fs-i40c-15.fs.lab.eng.bos.redhat.com. I have 4 vmcores now that you can certainly log in and look at. Find me on irc (nick bob) for login credentials.

Version-Release number of selected component (if applicable):
4.18.0-311.el8.kpq1.x86_64

How reproducible:
Unknown - happens about once a week

Steps to Reproduce:
1. All I was doing was running the "sync" command from a kvm guest hosted by the box that crashed, after I had rsynced a bunch of dirty data to it.

Actual results:
[85956.815390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[85956.823223] PGD 0 P4D 0
[85956.825761] Oops: 0000 [#1] SMP NOPTI
[85956.829436] CPU: 32 PID: 0 Comm: swapper/32 Kdump: loaded Tainted: G I --------- - - 4.18.0-311.el8.kpq1.x86_64 #1
[85956.841075] Hardware name: Dell Inc. PowerEdge R740xd/0DY2X0, BIOS 2.10.0 11/12/2020
[85956.848855] RIP: 0010:dma_direct_unmap_sg+0x87/0x100
[85956.853818] Code: 08 e8 fd 0f 00 00 48 8b 44 24 08 41 83 c7 01 48 89 c7 e8 9c d2 31 00 45 39 fc 74 6f 48 8b 9d 90 02 00 00 48 8b 15 89 c7 4d 01 <44> 8b 58 18 48 f7 d2 48 c1 e3 0c 48 03 58 10 48 21 d3 48 8b 15 e8
[85956.872579] RSP: 0018:ffffb4324d208b20 EFLAGS: 00010246
[85956.877804] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[85956.884939] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9c6948fea0b0
[85956.892105] RBP: ffff9c6948fea0b0 R08: 0000000000000000 R09: 0000000000029780
[85956.899238] R10: 000c54c3ef17a068 R11: ffff9c69e0e505d1 R12: 000000000000000d
[85956.906406] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[85956.913573] FS: 0000000000000000(0000) GS:ffff9c97c0c00000(0000) knlGS:0000000000000000
[85956.921666] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[85956.927423] CR2: 0000000000000018 CR3: 000000107f810002 CR4: 00000000007726e0
[85956.934577] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[85956.941705] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[85956.948870] PKRU: 55555554
[85956.951575] Call Trace:
[85956.954038]  <IRQ>
[85956.956072]  lpfc_scsi_unprep_dma_buf+0x69/0x70 [lpfc]
[85956.961212]  lpfc_fcp_io_cmd_wqe_cmpl+0x1b6/0x1460 [lpfc]
[85956.966618]  lpfc_sli4_fp_handle_fcp_wcqe.isra.26+0x129/0x3a0 [lpfc]
[85956.972970]  ? lpfc_sli4_fp_handle_fcp_wcqe.isra.26+0x129/0x3a0 [lpfc]
[85956.979514]  ? enqueue_task_fair+0x93/0x700
[85956.983698]  ? check_preempt_curr+0x7a/0x90
[85956.987900]  ? ttwu_do_wakeup+0x19/0x140
[85956.991834]  ? try_to_wake_up+0x1cd/0x550
[85956.995849]  ? lpfc_sli4_process_eq+0x50/0x4b0 [lpfc]
[85957.000909]  ? update_load_avg+0x7e/0x630
[85957.004925]  ? lpfc_sli4_fp_handle_cqe+0x19c/0x4c0 [lpfc]
[85957.010324]  lpfc_sli4_fp_handle_cqe+0x19c/0x4c0 [lpfc]
[85957.015550]  __lpfc_sli4_process_cq+0x105/0x250 [lpfc]
[85957.020690]  ? lpfc_sli4_fp_handle_fcp_wcqe.isra.26+0x3a0/0x3a0 [lpfc]
[85957.027214]  __lpfc_sli4_hba_process_cq+0x3c/0x110 [lpfc]
[85957.032613]  lpfc_cq_poll_hdler+0x16/0x20 [lpfc]
[85957.037234]  irq_poll_softirq+0x76/0x110
[85957.041160]  __do_softirq+0xd7/0x2d6
[85957.044740]  irq_exit+0xf7/0x100
[85957.047971]  do_IRQ+0x7f/0xd0
[85957.050941]  common_interrupt+0xf/0xf
[85957.054609]  </IRQ>

Expected results:
No crash

Additional info:
[root@fs-i40c-15 ~]# lspci | grep -i fib
18:00.0 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
18:00.1 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
Note that the crashes are roughly one week apart:

[root@fs-i40c-15 /var/crash]# ls -l
total 0
drwxr-xr-x. 2 root root 67 May 27 12:37 127.0.0.1-2021-05-27-12:37:11
drwxr-xr-x. 2 root root 67 Jun  8 16:00 127.0.0.1-2021-06-08-16:00:15
drwxr-xr-x. 2 root root 67 Jun 15 10:49 127.0.0.1-2021-06-15-10:49:22
drwxr-xr-x. 2 root root 67 Jun 23 12:08 127.0.0.1-2021-06-23-12:07:47

[ 14.404460] mlx5_core 0000:3b:00.1: firmware version: 16.27.6120
[ 14.410509] mlx5_core 0000:3b:00.1: 126.016 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x16 link at 0000:3a:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 14.463254] lpfc 0000:18:00.0: 0:3176 Port Name 0 Physical Link is functional
[ 14.662266] lpfc 0000:18:00.1: 1:2574 IO channels: hdwQ 40 IRQ 40 MRQ: 0
[ 14.682458] scsi host16: Emulex LPe32000 16Gb PCIe Fibre Channel Adapter on PCI bus 18 device 01 irq 775 PCI resettable
[ 14.701239] mlx5_core 0000:3b:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 14.710199] mlx5_core 0000:3b:00.1: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)
[ 14.750578] mlx5_core 0000:3b:00.1: Port module event: module 1, Cable plugged
[ 14.758086] mlx5_core 0000:3b:00.1: mlx5_pcie_event:296:(pid 1268): PCIe slot advertised sufficient power (75W).
[ 14.775491] mlx5_core 0000:3b:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 14.950356] lpfc 0000:18:00.1: 1:6448 Dual Dump is enabled
[ 14.978969] mlx5_core 0000:3b:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
[ 14.999010] mlx5_core 0000:3b:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 15.207090] mlx5_core 0000:3b:00.1: Supported tc offload range - chains: 4294967294, prios: 4294967295
[ 15.225678] mlx5_core 0000:3b:00.1 ens1f1: renamed from eth1
[ 15.237503] mlx5_core 0000:3b:00.0 ens1f0: renamed from eth0
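In case it helps whoever picks this up, here is a minimal sketch of how the faulting RIP could be mapped back to source from one of those vmcores with the crash utility. The debuginfo path is the standard RHEL location and the vmcore directory is the most recent one from the listing above; adjust as needed. The faulting instruction (the bytes after <44> in the Code: line) reads 0x18(%rax) with RAX == 0, which matches the CR2 value of 0000000000000018.

[root@fs-i40c-15 ~]# crash /usr/lib/debug/lib/modules/4.18.0-311.el8.kpq1.x86_64/vmlinux \
    /var/crash/127.0.0.1-2021-06-23-12:07:47/vmcore
crash> bt                                  # confirm the dma_direct_unmap_sg backtrace above
crash> mod -s lpfc                         # load lpfc debuginfo so the [lpfc] frames resolve to source
crash> dis -l dma_direct_unmap_sg+0x87     # source line for the faulting instruction
crash> struct scatterlist -o               # example: check which member sits at offset 0x18 of whatever struct the disassembly points at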
Are the VMs doing PCI passthrough for the FC ports?

Is this kernel something that Red Hat ships to customers, or is it only in the lab? 4.18.0-311.el8.kpq1.x86_64

Did it dump? If it did, can you attach the vmcore-dmesg.txt to the bz?

If it did kdump, were you saving the console log output? If so, attach it to the bz.
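If it's easier, something like the following from the host would answer the passthrough question; the guest name here is illustrative:

[root@fs-i40c-15 ~]# virsh dumpxml rhel84-guest | grep -A4 '<hostdev'   # PCI passthrough devices would show up here
[root@fs-i40c-15 ~]# virsh dumpxml rhel84-guest | grep -A4 '<disk'      # host LUNs passed in as SCSI/virtio disks show up here instead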
My kvm guests were simply using the host lpfc devices as SCSI devices. I updated my host to a newer kernel and have not seen the problem since. My guests now typically use the same devices, but as virtio rather than SCSI. I also had eng-ops swap the fibre cables to the EMC storage array. I have noticed that lpfc performance still glitches and pauses for long periods in a similar way, but it no longer times out and gives me lpfc errors from the kernel. Perhaps the lpfc timeouts were increased or the error paths improved? I no longer have any vmcore-dmesg files on that system. I'll keep using the same hardware and watch for the problem, but I'm not sure we can do anything more on this one.
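For reference, the change on my side was roughly the following in the guest disk definitions; the guest name and device path are illustrative, not the exact ones from this box:

[root@fs-i40c-15 ~]# virsh dumpxml rhel84-guest | grep -A4 '<disk'
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/mapper/mpatha'/>
      <target dev='vdb' bus='virtio'/>
    </disk>

Previously the target line was bus='scsi'; the source dev is just an example of a multipath device backed by the lpfc LUNs.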