Bug 1975441 - lpfc: NULL pointer dereference from lpfc_scsi_unprep_dma_buf
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: beta
Target Release: ---
Assignee: Dick Kennedy (Broadcom ECD)
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-06-23 16:44 UTC by Robert Peterson
Modified: 2021-09-15 12:50 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-15 12:50:10 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Robert Peterson 2021-06-23 16:44:37 UTC
Description of problem:
My lab box is intermittently crashing with kernel NULL pointer dereference.
The box is mainly used to host rhel8.4+ kvm guests.
It's available in the bos lab: fs-i40c-15.fs.lab.eng.bos.redhat.com
I now have 4 vmcores; you are welcome to log in and look at them.
Find me on irc (nick bob) for log on credentials.

Version-Release number of selected component (if applicable):
4.18.0-311.el8.kpq1.x86_64

How reproducible:
Unknown - happens about once a week

Steps to Reproduce:
1. All I was doing was running the "sync" command from a kvm guest hosted by the box that crashed, after I had rsynced a bunch of dirty data to it.

Actual results:
[85956.815390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[85956.823223] PGD 0 P4D 0 
[85956.825761] Oops: 0000 [#1] SMP NOPTI
[85956.829436] CPU: 32 PID: 0 Comm: swapper/32 Kdump: loaded Tainted: G          I      --------- -  - 4.18.0-311.el8.kpq1.x86_64 #1
[85956.841075] Hardware name: Dell Inc. PowerEdge R740xd/0DY2X0, BIOS 2.10.0 11/12/2020
[85956.848855] RIP: 0010:dma_direct_unmap_sg+0x87/0x100
[85956.853818] Code: 08 e8 fd 0f 00 00 48 8b 44 24 08 41 83 c7 01 48 89 c7 e8 9c d2 31 00 45 39 fc 74 6f 48 8b 9d 90 02 00 00 48 8b 15 89 c7 4d 01 <44> 8b 58 18 48 f7 d2 48 c1 e3 0c 48 03 58 10 48 21 d3 48 8b 15 e8
[85956.872579] RSP: 0018:ffffb4324d208b20 EFLAGS: 00010246
[85956.877804] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[85956.884939] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9c6948fea0b0
[85956.892105] RBP: ffff9c6948fea0b0 R08: 0000000000000000 R09: 0000000000029780
[85956.899238] R10: 000c54c3ef17a068 R11: ffff9c69e0e505d1 R12: 000000000000000d
[85956.906406] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[85956.913573] FS:  0000000000000000(0000) GS:ffff9c97c0c00000(0000) knlGS:0000000000000000
[85956.921666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[85956.927423] CR2: 0000000000000018 CR3: 000000107f810002 CR4: 00000000007726e0
[85956.934577] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[85956.941705] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[85956.948870] PKRU: 55555554
[85956.951575] Call Trace:
[85956.954038]  <IRQ>
[85956.956072]  lpfc_scsi_unprep_dma_buf+0x69/0x70 [lpfc]
[85956.961212]  lpfc_fcp_io_cmd_wqe_cmpl+0x1b6/0x1460 [lpfc]
[85956.966618]  lpfc_sli4_fp_handle_fcp_wcqe.isra.26+0x129/0x3a0 [lpfc]
[85956.972970]  ? lpfc_sli4_fp_handle_fcp_wcqe.isra.26+0x129/0x3a0 [lpfc]
[85956.979514]  ? enqueue_task_fair+0x93/0x700
[85956.983698]  ? check_preempt_curr+0x7a/0x90
[85956.987900]  ? ttwu_do_wakeup+0x19/0x140
[85956.991834]  ? try_to_wake_up+0x1cd/0x550
[85956.995849]  ? lpfc_sli4_process_eq+0x50/0x4b0 [lpfc]
[85957.000909]  ? update_load_avg+0x7e/0x630
[85957.004925]  ? lpfc_sli4_fp_handle_cqe+0x19c/0x4c0 [lpfc]
[85957.010324]  lpfc_sli4_fp_handle_cqe+0x19c/0x4c0 [lpfc]
[85957.015550]  __lpfc_sli4_process_cq+0x105/0x250 [lpfc]
[85957.020690]  ? lpfc_sli4_fp_handle_fcp_wcqe.isra.26+0x3a0/0x3a0 [lpfc]
[85957.027214]  __lpfc_sli4_hba_process_cq+0x3c/0x110 [lpfc]
[85957.032613]  lpfc_cq_poll_hdler+0x16/0x20 [lpfc]
[85957.037234]  irq_poll_softirq+0x76/0x110
[85957.041160]  __do_softirq+0xd7/0x2d6
[85957.044740]  irq_exit+0xf7/0x100
[85957.047971]  do_IRQ+0x7f/0xd0
[85957.050941]  common_interrupt+0xf/0xf
[85957.054609]  </IRQ>
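
One hedged reading of the dump above: CR2 is 0x18, RAX is 0, and the bytes at the faulting RIP (`<44> 8b 58 18` in the Code: line) look like a 32-bit load from `0x18(%rax)`. That would be a field at offset 0x18 read through a NULL pointer. On x86_64 kernels built with CONFIG_NEED_SG_DMA_LENGTH, offset 0x18 of struct scatterlist is dma_length, so a NULL scatterlist entry reaching dma_direct_unmap_sg is a plausible candidate, but only the vmcore could confirm that. The Python sketch below checks the arithmetic. The helper names and the deliberately narrow decoder are illustrative assumptions, not part of any kernel or crash tooling.

```python
import re

# Register lines copied from the oops above (timestamps dropped).
OOPS_REGS = """\
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
CR2: 0000000000000018 CR3: 000000107f810002 CR4: 00000000007726e0
"""

# Bytes at the faulting RIP: the ones starting at <44> in the Code: line.
FAULT_BYTES = bytes.fromhex("448b5818")

# x86-64 register numbering, spelled the way the oops prints the names.
REGS = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI",
        "R08", "R09", "R10", "R11", "R12", "R13", "R14", "R15"]


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def decode_load_disp8(code):
    """Decode only '[REX] 8B /r' with an 8-bit displacement and no SIB byte."""
    rex = code[0] if 0x40 <= code[0] <= 0x4F else 0
    opcode, modrm, disp = code[1:4] if rex else code[0:3]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 1 or rm == 4:
        raise ValueError("form not handled by this sketch")
    reg |= (rex & 0x4) << 1   # REX.R extends the destination register
    rm |= (rex & 0x1) << 3    # REX.B extends the base register
    if disp >= 0x80:          # disp8 is a signed byte
        disp -= 0x100
    return REGS[reg], REGS[rm], disp


regs = parse_registers(OOPS_REGS)
dest, base, disp = decode_load_disp8(FAULT_BYTES)
print(f"load into {dest} from {disp:#x}({base}); {base}={regs[base]:#x}")
print("matches CR2:", regs[base] + disp == regs["CR2"])
```

For the values pasted here, the decoded base register is RAX, and RAX plus the displacement equals CR2. The decoder handles only this one instruction form; a real disassembler (or the crash utility on the vmcore) is the right tool for anything beyond this check.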

Expected results:
No crash

Additional info:
[root@fs-i40c-15 ~]# lspci | grep -i fib
18:00.0 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
18:00.1 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)

Comment 1 Robert Peterson 2021-06-23 16:52:05 UTC
Note that the crashes are roughly one week apart:

[root@fs-i40c-15 /var/crash]# ls -l
total 0
drwxr-xr-x. 2 root root 67 May 27 12:37 127.0.0.1-2021-05-27-12:37:11
drwxr-xr-x. 2 root root 67 Jun  8 16:00 127.0.0.1-2021-06-08-16:00:15
drwxr-xr-x. 2 root root 67 Jun 15 10:49 127.0.0.1-2021-06-15-10:49:22
drwxr-xr-x. 2 root root 67 Jun 23 12:08 127.0.0.1-2021-06-23-12:07:47
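
To put a number on "roughly one week", here is a minimal Python sketch that computes the gaps between the four crash directories listed above. It assumes the directory names follow the `<host>-YYYY-MM-DD-HH:MM:SS` pattern seen here and that the host part contains no hyphen; the function names are made up for illustration.

```python
from datetime import datetime

# Directory names copied from the /var/crash listing above.
CRASH_DIRS = [
    "127.0.0.1-2021-05-27-12:37:11",
    "127.0.0.1-2021-06-08-16:00:15",
    "127.0.0.1-2021-06-15-10:49:22",
    "127.0.0.1-2021-06-23-12:07:47",
]


def crash_time(dirname):
    """Parse the timestamp that follows the first '-' in the directory name."""
    stamp = dirname.split("-", 1)[1]
    return datetime.strptime(stamp, "%Y-%m-%d-%H:%M:%S")


def gaps_in_days(dirnames):
    """Return the gap between consecutive crashes, in days, to one decimal."""
    times = sorted(crash_time(name) for name in dirnames)
    return [round((later - earlier).total_seconds() / 86400, 1)
            for earlier, later in zip(times, times[1:])]


print(gaps_in_days(CRASH_DIRS))  # [12.1, 6.8, 8.1]
```

So the gaps are about 12, 7 and 8 days.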

[   14.404460] mlx5_core 0000:3b:00.1: firmware version: 16.27.6120
[   14.410509] mlx5_core 0000:3b:00.1: 126.016 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x16 link at 0000:3a:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[   14.463254] lpfc 0000:18:00.0: 0:3176 Port Name 0 Physical Link is functional
[   14.662266] lpfc 0000:18:00.1: 1:2574 IO channels: hdwQ 40 IRQ 40 MRQ: 0
[   14.682458] scsi host16: Emulex LPe32000 16Gb PCIe Fibre Channel Adapter on PCI bus 18 device 01 irq 775 PCI resettable
[   14.701239] mlx5_core 0000:3b:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[   14.710199] mlx5_core 0000:3b:00.1: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)
[   14.750578] mlx5_core 0000:3b:00.1: Port module event: module 1, Cable plugged
[   14.758086] mlx5_core 0000:3b:00.1: mlx5_pcie_event:296:(pid 1268): PCIe slot advertised sufficient power (75W).
[   14.775491] mlx5_core 0000:3b:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[   14.950356] lpfc 0000:18:00.1: 1:6448 Dual Dump is enabled
[   14.978969] mlx5_core 0000:3b:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
[   14.999010] mlx5_core 0000:3b:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[   15.207090] mlx5_core 0000:3b:00.1: Supported tc offload range - chains: 4294967294, prios: 4294967295
[   15.225678] mlx5_core 0000:3b:00.1 ens1f1: renamed from eth1
[   15.237503] mlx5_core 0000:3b:00.0 ens1f0: renamed from eth0

Comment 4 Dick Kennedy (Broadcom ECD) 2021-09-14 20:48:20 UTC
Are the VMs doing PCI passthrough for the FC ports?
Is this kernel (4.18.0-311.el8.kpq1.x86_64) something that Red Hat ships to customers, or is it only in the lab?

Did it dump? If it did, can you attach the vmcore-dmesg.txt to the bz?
If it did kdump, were you saving the console log output? If so, attach it to the bz.
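
For anyone collecting the same data, a small Python sketch that lists any vmcore-dmesg.txt files kdump left behind. It assumes the default layout seen earlier in this bug, with one directory per crash under /var/crash; the function name is illustrative only.

```python
from pathlib import Path


def find_vmcore_dmesg(crash_root="/var/crash"):
    """Return every vmcore-dmesg.txt one level below crash_root, oldest first."""
    root = Path(crash_root)
    if not root.is_dir():
        return []
    return sorted(root.glob("*/vmcore-dmesg.txt"),
                  key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    for path in find_vmcore_dmesg():
        print(path)
```

An empty result means either no dump was saved or the files have since been cleaned up.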

Comment 5 Robert Peterson 2021-09-15 12:15:23 UTC
My kvm guests were simply using the host lpfc devices as SCSI devices.

I updated my host to a newer kernel and have not seen the problem since.

My guests are now typically using the same devices, but as virtio, not SCSI.

I also had eng-ops swap the fibre cables to the EMC storage array.

I have noticed that the performance of lpfc seems to similarly glitch and
pause for long periods of time, but it no longer times out and gives me lpfc
errors from the kernel. Perhaps the lpfc timeouts were increased or error
paths improved?

I no longer have any vmcore dmesg files on that system.
I'll keep using the same hardware and watch for the problem, but I'm not sure
we can do anything more on this problem.

