Bug 1613543 - RHEL 7.6 Alpha - Doesn't see any volumes during installation process
Summary: RHEL 7.6 Alpha - Doesn't see any volumes during installation process
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.6
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Assignee: Himanshu Madhani (Marvell)
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks: 1668824
 
Reported: 2018-08-07 19:50 UTC by jennifer.duong
Modified: 2022-03-13 15:21 UTC
22 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1668824
Environment:
Last Closed: 2019-04-05 17:31:41 UTC
Target Upstream Version:
Embargoed:


Attachments
anaconda logs (843 bytes, text/plain) - 2018-08-09 20:57 UTC, jennifer.duong
program.log (1.60 KB, text/plain) - 2018-08-24 14:10 UTC, jennifer.duong
storage.log (255 bytes, text/plain) - 2018-08-24 14:11 UTC, jennifer.duong
syslog (13.24 MB, text/plain) - 2018-08-24 14:11 UTC, jennifer.duong
Server crash (113.28 KB, image/jpeg) - 2018-09-21 21:01 UTC, jennifer.duong
emergency mode (131.95 KB, image/png) - 2018-11-21 17:08 UTC, jennifer.duong
serial console kernel-3.10.0-957.el7.bz1613543.n2n.x86_64.rpm (169.47 KB, text/plain) - 2018-12-17 23:27 UTC, jennifer.duong
Shell Executable for UEFI shell (249.36 KB, application/zip) - 2018-12-18 22:58 UTC, Himanshu Madhani (Marvell)
map (22.68 KB, image/png) - 2019-01-02 22:09 UTC, jennifer.duong
FC trace (20.84 KB, application/zip) - 2019-01-18 16:49 UTC, jennifer.duong
overview of storage when all devices are seen (9.07 KB, text/plain) - 2019-02-25 21:10 UTC, Dwight (Bud) Brown
overview of 7.6 with inbox driver, no disk luns seen (7.09 KB, text/plain) - 2019-02-25 21:13 UTC, Dwight (Bud) Brown
messages from 957-5.1, all devices present afterwards (1.29 MB, text/plain) - 2019-02-28 13:42 UTC, Dwight (Bud) Brown
messages 957-5.1 w/lip & debug after devices disappeared (3.85 MB, text/plain) - 2019-02-28 13:44 UTC, Dwight (Bud) Brown
messages - 957.5.1.el7.QLAV10_7.1.2 (302.80 KB, application/gzip) - 2019-03-06 16:20 UTC, Robert Palco
ICTM1608S02H4-3-29-19 7.7 Nightly build (192.96 KB, text/plain) - 2019-03-29 19:39 UTC, jennifer.duong

Description jennifer.duong 2018-08-07 19:50:56 UTC
Description of problem:
When I try to install RHEL 7.6 Alpha on my server, it either gets stuck at the "discovering multipath devices" step or it cannot find any of my volumes.

Version-Release number of selected component (if applicable):
RHEL 7.6 Alpha


Steps to Reproduce:
1. Try to install RHEL 7.6 Alpha to a SANboot LUN

Actual results:
Fails to install because it can't find any volumes

Expected results:
Can find volumes and installs to SANboot LUN

Additional info:
The issue seems to occur only on servers that have QLE2692s and QLE2742s. I am running with FW version 8.08.03 and driver version 9.00.00.00.40.0-k1.

Comment 2 Ben Marzinski 2018-08-09 14:59:05 UTC
Are you able to get any of the anaconda logs?

Comment 3 jennifer.duong 2018-08-09 20:57:54 UTC
Created attachment 1474816 [details]
anaconda logs

Comment 4 jennifer.duong 2018-08-14 16:24:10 UTC
Are there any updates on this?

Thanks,

Jennifer Duong

Comment 5 Ben Marzinski 2018-08-23 22:47:22 UTC
The log attached doesn't really provide any information. I was looking specifically for the anaconda storage.log, program.log, and syslog output. You might have to set up remote logging during the install to get these, since your installation is failing.
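
(A minimal sketch of one way to set up remote logging, assuming the RHEL 7 installer's inst.syslog boot option and a reachable rsyslog server; the address, port, and receiver directives below are illustrative:)

    # Appended to the installer's kernel command line:
    inst.syslog=192.168.1.10:514

    # On the receiving log server, enable remote listeners in
    # /etc/rsyslog.conf (covering either transport), then restart rsyslog:
    $ModLoad imudp
    $UDPServerRun 514
    $ModLoad imtcp
    $InputTCPServerRun 514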

Comment 6 Ben Marzinski 2018-08-23 22:49:13 UTC
If you can provide the anaconda-tb-* log file, that should also be enough, since it contains all the other logs.
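
(A hedged sketch of grabbing these by hand during a failed install, assuming the standard installer virtual consoles; the destination host is a placeholder:)

    # Switch to a shell on tty2 with Ctrl+Alt+F2, then:
    ls /tmp/*.log /tmp/anaconda-tb-*
    scp /tmp/*.log /tmp/anaconda-tb-* user@remote-host:/some/dir/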

Comment 7 jennifer.duong 2018-08-24 14:10:50 UTC
Created attachment 1478545 [details]
program.log

Comment 8 jennifer.duong 2018-08-24 14:11:06 UTC
Created attachment 1478546 [details]
storage.log

Comment 9 jennifer.duong 2018-08-24 14:11:35 UTC
Created attachment 1478547 [details]
syslog

Comment 10 jennifer.duong 2018-08-24 14:13:18 UTC
Ben,

I was able to find storage.log, program.log, and the syslog output in the /tmp directory and have attached those. However, I could not find an anaconda-tb* log file in the /tmp directory.

Thanks,

Jennifer

Comment 11 jennifer.duong 2018-08-27 20:15:18 UTC
Ben,

I went ahead and tried installing RHEL 7.6 Beta in hopes that this might have been fixed by a different bug, but I'm still getting stuck at the discovering multipath devices step, or it still can't find any of my volumes. The issue still seems to occur only on servers that have QLE2692s and QLE2742s.

Thanks,

Jennifer

Comment 12 Ben Marzinski 2018-08-28 21:07:43 UTC
I assume that your volumes aren't visible because they are not working and were removed from the system.

Looking at the syslog output:

17:12:05,262 INFO kernel:scsi 12:0:0:0: Device offlined - not ready after error recovery
17:12:06,789 INFO kernel:scsi 13:0:0:0: Device offlined - not ready after error recovery
17:12:08,772 INFO kernel:scsi 14:0:0:0: Device offlined - not ready after error recovery
17:12:10,764 INFO kernel:scsi 15:0:0:0: Device offlined - not ready after error recovery
17:12:15,611 INFO systemd:Started LVM2 metadata daemon.
17:12:19,808 ERR kernel: rport-12:0-1: blocked FC remote port time out: removing target and saving binding
17:12:19,808 ERR kernel: rport-12:0-0: blocked FC remote port time out: removing target and saving binding
17:12:21,856 ERR kernel: rport-13:0-1: blocked FC remote port time out: removing target and saving binding
17:12:23,904 ERR kernel: rport-14:0-0: blocked FC remote port time out: removing target and saving binding
17:12:23,904 ERR kernel: rport-13:0-0: blocked FC remote port time out: removing target and saving binding
17:12:23,904 ERR kernel: rport-14:0-1: blocked FC remote port time out: removing target and saving binding
17:12:25,824 ERR kernel: rport-15:0-1: blocked FC remote port time out: removing target and saving binding
17:12:27,872 ERR kernel: rport-15:0-0: blocked FC remote port time out: removing target and saving binding

multipathd is running, but the only device it sees is sda, which isn't a SAN device

17:11:21,990 NOTICE kernel:sd 1:0:0:1: [sda] Attached SCSI removable disk

It looks like a USB device.

I'm reassigning this as a storage driver bug, since the syslog messages make it look like this is happening before multipath even has a chance to do anything with the storage.

Comment 13 Ewan D. Milne 2018-08-29 13:25:34 UTC
Looks like a timeout followed by an abort when probing on all 4 qla2xxx HBAs,
followed by SCSI EH escalations culminating in adapter resets.  It would
seem that the HBAs could log in to target 0 at least, but couldn't access it.


17:11:21,219 WARNING kernel:qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 10.00.00.06.07.6-k.
17:11:21,219 WARNING kernel:qla2xxx [0000:04:00.0]-011c: : MSI-X vector count: 16.
17:11:21,219 WARNING kernel:qla2xxx [0000:04:00.0]-001d: : Found an ISP2261 irq 43 iobase 0xffff9bdd83ae6000.
17:11:21,221 DEBUG kernel:qla2xxx 0000:04:00.0: irq 44 for MSI/MSI-X
17:11:21,221 DEBUG kernel:qla2xxx 0000:04:00.0: irq 45 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 46 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 47 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 48 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 49 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 50 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 51 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 52 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 53 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 54 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 55 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 56 for MSI/MSI-X
17:11:21,222 DEBUG kernel:qla2xxx 0000:04:00.0: irq 57 for MSI/MSI-X
17:11:21,222 ERR kernel:qla2xxx [0000:04:00.0]-00c6:12: MSI-X: Failed to enable support with 16 vectors, using 14 vectors
17:11:21,407 WARNING kernel:qla2xxx [0000:04:00.0]-0075:12: ZIO mode 6 enabled; timer delay (200 us).
17:11:23,057 WARNING kernel:qla2xxx [0000:04:00.0]-d302:12: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)
17:11:23,211 INFO kernel:scsi host12: qla2xxx
17:11:23,211 WARNING kernel:qla2xxx [0000:04:00.0]-00fb:12: QLogic QLE2742 - QLogic 32Gb 2-port FC to PCIe Gen3 x8 Adapter.
17:11:23,211 WARNING kernel:qla2xxx [0000:04:00.0]-00fc:12: ISP2261: PCIe (8.0GT/s x8) @ 0000:04:00.0 hdma+ host#=12 fw=8.08.03 (d0d5).
17:11:23,217 WARNING kernel:qla2xxx [0000:04:00.1]-011c: : MSI-X vector count: 16.
17:11:23,217 WARNING kernel:qla2xxx [0000:04:00.1]-001d: : Found an ISP2261 irq 63 iobase 0xffff9bdd83b3e000.
17:11:23,218 DEBUG kernel:qla2xxx 0000:04:00.1: irq 64 for MSI/MSI-X
17:11:23,219 DEBUG kernel:qla2xxx 0000:04:00.1: irq 65 for MSI/MSI-X
17:11:23,219 DEBUG kernel:qla2xxx 0000:04:00.1: irq 66 for MSI/MSI-X
17:11:23,219 DEBUG kernel:qla2xxx 0000:04:00.1: irq 67 for MSI/MSI-X
17:11:23,219 DEBUG kernel:qla2xxx 0000:04:00.1: irq 68 for MSI/MSI-X
17:11:23,219 DEBUG kernel:qla2xxx 0000:04:00.1: irq 69 for MSI/MSI-X
17:11:23,219 DEBUG kernel:qla2xxx 0000:04:00.1: irq 70 for MSI/MSI-X
17:11:23,219 DEBUG kernel:qla2xxx 0000:04:00.1: irq 71 for MSI/MSI-X
17:11:23,219 DEBUG kernel:qla2xxx 0000:04:00.1: irq 74 for MSI/MSI-X
17:11:23,219 DEBUG kernel:qla2xxx 0000:04:00.1: irq 75 for MSI/MSI-X
17:11:23,220 DEBUG kernel:qla2xxx 0000:04:00.1: irq 76 for MSI/MSI-X
17:11:23,220 DEBUG kernel:qla2xxx 0000:04:00.1: irq 77 for MSI/MSI-X
17:11:23,220 DEBUG kernel:qla2xxx 0000:04:00.1: irq 78 for MSI/MSI-X
17:11:23,220 DEBUG kernel:qla2xxx 0000:04:00.1: irq 79 for MSI/MSI-X
17:11:23,220 ERR kernel:qla2xxx [0000:04:00.1]-00c6:13: MSI-X: Failed to enable support with 16 vectors, using 14 vectors
17:11:23,268 WARNING kernel:qla2xxx [0000:04:00.1]-0075:13: ZIO mode 6 enabled; timer delay (200 us).
17:11:24,213 WARNING kernel:qla2xxx [0000:04:00.0]-500a:12: LOOP UP detected (16 Gbps).
17:11:24,719 WARNING kernel:qla2xxx [0000:04:00.0]-ffff:12: register_localport: host-traddr=nn-0x20000024ff7ef9f4:pn-0x21000024ff7ef9f4 on portID:3d1300
17:11:25,056 WARNING kernel:qla2xxx [0000:04:00.1]-d302:13: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)
17:11:25,211 INFO kernel:scsi host13: qla2xxx
17:11:25,216 WARNING kernel:qla2xxx [0000:04:00.1]-00fb:13: QLogic QLE2742 - QLogic 32Gb 2-port FC to PCIe Gen3 x8 Adapter.
17:11:25,216 WARNING kernel:qla2xxx [0000:04:00.1]-00fc:13: ISP2261: PCIe (8.0GT/s x8) @ 0000:04:00.1 hdma+ host#=13 fw=8.08.03 (d0d5).
17:11:25,218 WARNING kernel:qla2xxx [0000:05:00.0]-011c: : MSI-X vector count: 16.
17:11:25,218 WARNING kernel:qla2xxx [0000:05:00.0]-001d: : Found an ISP2261 irq 80 iobase 0xffff9bdd83b72000.
17:11:25,221 DEBUG kernel:qla2xxx 0000:05:00.0: irq 81 for MSI/MSI-X
17:11:25,222 DEBUG kernel:qla2xxx 0000:05:00.0: irq 82 for MSI/MSI-X
17:11:25,222 DEBUG kernel:qla2xxx 0000:05:00.0: irq 83 for MSI/MSI-X
17:11:25,222 DEBUG kernel:qla2xxx 0000:05:00.0: irq 84 for MSI/MSI-X
17:11:25,222 DEBUG kernel:qla2xxx 0000:05:00.0: irq 85 for MSI/MSI-X
17:11:25,222 DEBUG kernel:qla2xxx 0000:05:00.0: irq 86 for MSI/MSI-X
17:11:25,222 DEBUG kernel:qla2xxx 0000:05:00.0: irq 87 for MSI/MSI-X
17:11:25,222 DEBUG kernel:qla2xxx 0000:05:00.0: irq 88 for MSI/MSI-X
17:11:25,222 DEBUG kernel:qla2xxx 0000:05:00.0: irq 89 for MSI/MSI-X
17:11:25,223 DEBUG kernel:qla2xxx 0000:05:00.0: irq 90 for MSI/MSI-X
17:11:25,223 DEBUG kernel:qla2xxx 0000:05:00.0: irq 91 for MSI/MSI-X
17:11:25,223 DEBUG kernel:qla2xxx 0000:05:00.0: irq 92 for MSI/MSI-X
17:11:25,223 DEBUG kernel:qla2xxx 0000:05:00.0: irq 93 for MSI/MSI-X
17:11:25,223 DEBUG kernel:qla2xxx 0000:05:00.0: irq 94 for MSI/MSI-X
17:11:25,223 ERR kernel:qla2xxx [0000:05:00.0]-00c6:14: MSI-X: Failed to enable support with 16 vectors, using 14 vectors
17:11:25,271 WARNING kernel:qla2xxx [0000:05:00.0]-0075:14: ZIO mode 6 enabled; timer delay (200 us).
17:11:25,961 WARNING kernel:qla2xxx [0000:04:00.1]-500a:13: LOOP UP detected (32 Gbps).
17:11:26,423 WARNING kernel:qla2xxx [0000:04:00.1]-ffff:13: register_localport: host-traddr=nn-0x20000024ff7ef9f5:pn-0x21000024ff7ef9f5 on portID:10300
17:11:27,032 WARNING kernel:qla2xxx [0000:05:00.0]-d302:14: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)
17:11:27,187 INFO kernel:scsi host14: qla2xxx
17:11:27,192 WARNING kernel:qla2xxx [0000:05:00.0]-00fb:14: QLogic QLE2692 - QLogic 16Gb 2-port FC to PCIe Gen3 x8 Adapter.
17:11:27,192 WARNING kernel:qla2xxx [0000:05:00.0]-00fc:14: ISP2261: PCIe (8.0GT/s x8) @ 0000:05:00.0 hdma+ host#=14 fw=8.08.03 (d0d5).
17:11:27,194 WARNING kernel:qla2xxx [0000:05:00.1]-011c: : MSI-X vector count: 16.
17:11:27,194 WARNING kernel:qla2xxx [0000:05:00.1]-001d: : Found an ISP2261 irq 38 iobase 0xffff9bdd83bb0000.
17:11:27,197 DEBUG kernel:qla2xxx 0000:05:00.1: irq 95 for MSI/MSI-X
17:11:27,198 DEBUG kernel:qla2xxx 0000:05:00.1: irq 96 for MSI/MSI-X
17:11:27,198 DEBUG kernel:qla2xxx 0000:05:00.1: irq 97 for MSI/MSI-X
17:11:27,198 DEBUG kernel:qla2xxx 0000:05:00.1: irq 98 for MSI/MSI-X
17:11:27,198 DEBUG kernel:qla2xxx 0000:05:00.1: irq 99 for MSI/MSI-X
17:11:27,198 DEBUG kernel:qla2xxx 0000:05:00.1: irq 100 for MSI/MSI-X
17:11:27,198 DEBUG kernel:qla2xxx 0000:05:00.1: irq 101 for MSI/MSI-X
17:11:27,198 DEBUG kernel:qla2xxx 0000:05:00.1: irq 102 for MSI/MSI-X
17:11:27,198 DEBUG kernel:qla2xxx 0000:05:00.1: irq 103 for MSI/MSI-X
17:11:27,199 DEBUG kernel:qla2xxx 0000:05:00.1: irq 104 for MSI/MSI-X
17:11:27,199 DEBUG kernel:qla2xxx 0000:05:00.1: irq 105 for MSI/MSI-X
17:11:27,199 DEBUG kernel:qla2xxx 0000:05:00.1: irq 106 for MSI/MSI-X
17:11:27,199 DEBUG kernel:qla2xxx 0000:05:00.1: irq 107 for MSI/MSI-X
17:11:27,199 DEBUG kernel:qla2xxx 0000:05:00.1: irq 108 for MSI/MSI-X
17:11:27,199 ERR kernel:qla2xxx [0000:05:00.1]-00c6:15: MSI-X: Failed to enable support with 16 vectors, using 14 vectors
17:11:27,247 WARNING kernel:qla2xxx [0000:05:00.1]-0075:15: ZIO mode 6 enabled; timer delay (200 us).
17:11:27,817 WARNING kernel:qla2xxx [0000:05:00.0]-500a:14: LOOP UP detected (16 Gbps).
17:11:28,322 WARNING kernel:qla2xxx [0000:05:00.0]-ffff:14: register_localport: host-traddr=nn-0x20000024ff1bfd96:pn-0x21000024ff1bfd96 on portID:3d1200
17:11:29,022 WARNING kernel:qla2xxx [0000:05:00.1]-d302:15: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)
17:11:29,178 INFO kernel:scsi host15: qla2xxx
17:11:29,183 WARNING kernel:qla2xxx [0000:05:00.1]-00fb:15: QLogic QLE2692 - QLogic 16Gb 2-port FC to PCIe Gen3 x8 Adapter.
17:11:29,183 WARNING kernel:qla2xxx [0000:05:00.1]-00fc:15: ISP2261: PCIe (8.0GT/s x8) @ 0000:05:00.1 hdma+ host#=15 fw=8.08.03 (d0d5).
17:11:29,960 WARNING kernel:qla2xxx [0000:05:00.1]-500a:15: LOOP UP detected (16 Gbps).
17:11:30,467 WARNING kernel:qla2xxx [0000:05:00.1]-ffff:15: register_localport: host-traddr=nn-0x20000024ff1bfd97:pn-0x21000024ff1bfd97 on portID:10200
17:11:45,770 WARNING kernel:qla2xxx [0000:04:00.0]-801c:12: Abort command issued nexus=12:0:0 --  0 2003.
17:11:45,770 WARNING kernel:qla2xxx [0000:04:00.0]-8009:12: DEVICE RESET ISSUED nexus=12:0:0 cmd=ffff8e18e692c380.
17:11:45,770 ERR kernel:qla2xxx [0000:04:00.0]-5039:12: Async-tmf error - hdl=2b completion status(28).
17:11:47,774 ERR kernel:qla2xxx [0000:04:00.0]-800d:12: wait for pending cmds failed for cmd=ffff8e18e692c380.
17:11:47,774 WARNING kernel:qla2xxx [0000:04:00.0]-800f:12: DEVICE RESET FAILED: Waiting for command completions nexus=12:0:0 cmd=ffff8e18e692c380.
17:11:47,774 WARNING kernel:qla2xxx [0000:04:00.0]-8009:12: TARGET RESET ISSUED nexus=12:0:0 cmd=ffff8e18e692c380.
17:11:47,774 ERR kernel:qla2xxx [0000:04:00.0]-5039:12: Async-tmf error - hdl=2c completion status(28).
17:11:47,820 WARNING kernel:qla2xxx [0000:04:00.1]-801c:13: Abort command issued nexus=13:0:0 --  0 2003.
17:11:47,820 WARNING kernel:qla2xxx [0000:04:00.1]-8009:13: DEVICE RESET ISSUED nexus=13:0:0 cmd=ffff8e1ce69b6840.
17:11:47,820 ERR kernel:qla2xxx [0000:04:00.1]-5039:13: Async-tmf error - hdl=22 completion status(28).
17:11:49,775 ERR kernel:qla2xxx [0000:04:00.0]-800d:12: wait for pending cmds failed for cmd=ffff8e18e692c380.
17:11:49,775 WARNING kernel:qla2xxx [0000:04:00.0]-800f:12: TARGET RESET FAILED: Waiting for command completions nexus=12:0:0 cmd=ffff8e18e692c380.
17:11:49,775 WARNING kernel:qla2xxx [0000:04:00.0]-8012:12: BUS RESET ISSUED nexus=12:0:0.
17:11:49,775 ERR kernel:qla2xxx [0000:04:00.0]-5039:12: Async-tmf error - hdl=2e completion status(28).
17:11:49,802 WARNING kernel:qla2xxx [0000:05:00.0]-801c:14: Abort command issued nexus=14:0:0 --  0 2003.
17:11:49,802 WARNING kernel:qla2xxx [0000:05:00.0]-8009:14: DEVICE RESET ISSUED nexus=14:0:0 cmd=ffff8e1ce69b7640.
17:11:49,802 ERR kernel:qla2xxx [0000:05:00.0]-5039:14: Async-tmf error - hdl=2f completion status(28).
17:11:49,822 ERR kernel:qla2xxx [0000:04:00.1]-800d:13: wait for pending cmds failed for cmd=ffff8e1ce69b6840.
17:11:49,822 WARNING kernel:qla2xxx [0000:04:00.1]-800f:13: DEVICE RESET FAILED: Waiting for command completions nexus=13:0:0 cmd=ffff8e1ce69b6840.
17:11:49,822 WARNING kernel:qla2xxx [0000:04:00.1]-8009:13: TARGET RESET ISSUED nexus=13:0:0 cmd=ffff8e1ce69b6840.
17:11:49,822 ERR kernel:qla2xxx [0000:04:00.1]-5039:13: Async-tmf error - hdl=23 completion status(28).
17:11:49,976 WARNING kernel:qla2xxx [0000:04:00.0]-500b:12: LOOP DOWN detected (4 7 0 0).
17:11:50,718 WARNING kernel:qla2xxx [0000:04:00.0]-500a:12: LOOP UP detected (16 Gbps).
17:11:51,778 ERR kernel:qla2xxx [0000:04:00.0]-8014:12: Wait for pending commands failed.
17:11:51,778 ERR kernel:qla2xxx [0000:04:00.0]-802b:12: BUS RESET FAILED nexus=12:0:0.
17:11:51,778 WARNING kernel:qla2xxx [0000:04:00.0]-8018:12: ADAPTER RESET ISSUED nexus=12:0:0.
17:11:51,778 WARNING kernel:qla2xxx [0000:04:00.0]-00af:12: Performing ISP error recovery - ha=ffff8e1979878000.
17:11:51,786 WARNING kernel:qla2xxx [0000:05:00.1]-801c:15: Abort command issued nexus=15:0:0 --  0 2003.
17:11:51,786 WARNING kernel:qla2xxx [0000:05:00.1]-8009:15: DEVICE RESET ISSUED nexus=15:0:0 cmd=ffff8e1ce682b2c0.
17:11:51,786 ERR kernel:qla2xxx [0000:05:00.1]-5039:15: Async-tmf error - hdl=22 completion status(28).
17:11:51,803 WARNING kernel:qla2xxx [0000:04:00.0]-0075:12: ZIO mode 6 enabled; timer delay (200 us).
17:11:51,803 ERR kernel:qla2xxx [0000:05:00.0]-800d:14: wait for pending cmds failed for cmd=ffff8e1ce69b7640.
17:11:51,803 WARNING kernel:qla2xxx [0000:05:00.0]-800f:14: DEVICE RESET FAILED: Waiting for command completions nexus=14:0:0 cmd=ffff8e1ce69b7640.
17:11:51,803 WARNING kernel:qla2xxx [0000:05:00.0]-8009:14: TARGET RESET ISSUED nexus=14:0:0 cmd=ffff8e1ce69b7640.
17:11:51,804 ERR kernel:qla2xxx [0000:05:00.0]-5039:14: Async-tmf error - hdl=30 completion status(28).
17:11:51,825 ERR kernel:qla2xxx [0000:04:00.1]-800d:13: wait for pending cmds failed for cmd=ffff8e1ce69b6840.
17:11:51,825 WARNING kernel:qla2xxx [0000:04:00.1]-800f:13: TARGET RESET FAILED: Waiting for command completions nexus=13:0:0 cmd=ffff8e1ce69b6840.

Comment 14 jennifer.duong 2018-09-06 19:53:39 UTC
Ewan,

So what exactly does this mean? Should we contact QLogic about this?

Thanks,

Jennifer Duong

Comment 15 Ewan D. Milne 2018-09-06 20:43:44 UTC
It looks like a problem communicating between the HBA and the target (array).
The host is not able to issue commands to probe for devices, it appears.

First thing:
  - If you install an earlier version of the software (e.g. 7.5 GA) on the same
    machine, does it connect to the array without any issues?

Comment 16 jennifer.duong 2018-09-06 21:59:01 UTC
Yes, it does

Comment 17 Ewan D. Milne 2018-09-07 14:18:46 UTC
OK, thanks, we're looking at it.

Comment 18 jennifer.duong 2018-09-14 03:34:21 UTC
Has anyone gotten a chance to look further into this?

Thanks,

Jennifer Duong

Comment 19 Ewan D. Milne 2018-09-14 17:34:57 UTC
Can you please try setting qla2xxx.ql2xnvmeenable = 0 and see if that makes
any difference?
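
(A minimal sketch of applying this during installation, assuming the stock RHEL 7 installer boot menu; "..." stands for whatever options are already on the line:)

    # Press Tab (BIOS) or 'e' (UEFI) at the boot menu and append the
    # module option to the kernel line:
    vmlinuz initrd=initrd.img ... quiet qla2xxx.ql2xnvmeenable=0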

Also can you please report what model storage array you are attempting to
connect to and the f/w revision level (i.e. version of ONTAP if it is
a NetApp array) ?

I am not seeing a problem on any of the systems I have here and there are
no failure reports of this nature from our QE testing either.  I suspect
the problem may be related to the NVMe changes in the firmware or the driver
but am not certain.

Works OK with QLE2562 and QLE2662 as far as I can see.

There is one bug fix to qla2xxx in -938.el7 but I suspect you would not be
hitting the underlying problem to cause the issue during the device probe.

    [scsi] qla2xxx: Fix memory leak for allocating abort IOCB

Himanshu, any other ideas?

Comment 20 Himanshu Madhani (Marvell) 2018-09-14 18:02:01 UTC
Hi Ewan, 

(In reply to Ewan D. Milne from comment #19)
> Can you please try setting qla2xxx.ql2xnvmeenable = 0 and see if that makes
> any difference?
> 
> Also can you please report what model storage array you are attempting to
> connect to and the f/w revision level (i.e. version of ONTAP if it is
> a NetApp array) ?
> 
> I am not seeing a problem on any of the systems I have here and there are
> no failure reports of this nature from our QE testing either.  I suspect
> the problem may be related to the NVMe changes in the firmware or the driver
> but am not certain.
> 
> Works OK with QLE2562 and QLE2662 as far as I can see.
> 
> There is one bug fix to qla2xxx in -938.el7 but I suspect you would not be
> hitting the underlying problem to cause the issue during the device probe.
> 
>     [scsi] qla2xxx: Fix memory leak for allocating abort IOCB
> 
> Himanshu, any other ideas?

The logs are not detailed enough to point to an issue in the NVMe code, and we have not seen this in any of our environments.

Can you capture logs with ql2xextended_error_logging=0x5200b000?
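
(A hedged sketch of two ways to set this, assuming a standard RHEL 7 environment; the runtime sysfs write only works on kernels where the parameter is writable:)

    # As a boot-time / installer module option on the kernel command line:
    qla2xxx.ql2xextended_error_logging=0x5200b000

    # On a running system, without reloading the driver:
    echo 0x5200b000 > /sys/module/qla2xxx/parameters/ql2xextended_error_logging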

Just to confirm, is your setup direct-attached or switched fabric mode?

Thanks,
Himanshu

Comment 21 jennifer.duong 2018-09-21 21:01:18 UTC
Ewan,

The install went further this time around when setting qla2xxx.ql2xnvmeenable=0. However, when booting up, one of my servers seems to crash and the other boots into emergency mode. I am attempting to connect to an E5600 that is running 8.40 FW and an E2800 that is running 11.50 FW.

Himanshu,

I am running a combination of direct-attached and fabric-attached. The server that is crashing is fabric-attached, while the server that is booting into emergency mode is direct-attached. I will be attaching a screenshot of the crash shortly. As for the server that is in emergency mode, there don't appear to be any message logs or syslog output. During my installation I went ahead and set qla2xxx.ql2xextended_error_logging=0x5200b000, but I'm not entirely sure what logs you would want if there is no message log or syslog output.

Thanks,

Jennifer

Comment 22 jennifer.duong 2018-09-21 21:01:42 UTC
Created attachment 1485818 [details]
Server crash

Comment 23 Ewan D. Milne 2018-09-24 16:04:44 UTC
Is it possible for you to attach more of the information from the crash, such
as the stack trace?  i.e. is the machine connected to a serial console where
you can capture the output?

Crash appears to be in kmem_cache_alloc() but the RIP address looks beyond _end.
Was a crash dump generated?  If so can you make it available?
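
(For reference, a minimal sketch of serial capture, assuming a standard ttyS0 port; the device name, baud rate, and capture tool are all setup-specific:)

    # On the target, route console messages to the serial port by
    # appending to the kernel command line:
    console=tty0 console=ttyS0,115200n8

    # On the capturing machine, log the port, e.g. with screen:
    screen -L /dev/ttyUSB0 115200
    # or, if the server's BMC offers Serial-over-LAN:
    ipmitool -I lanplus -H <bmc-address> -U <user> sol activate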

Comment 24 jennifer.duong 2018-10-03 15:36:06 UTC
Ewan,

I apologize for taking so long to get back. I tried setting up a serial console a while back, but I wasn't able to get it to work.

Thanks,

Jennifer Duong

Comment 25 jennifer.duong 2018-10-10 15:18:10 UTC
Ewan,

Is there an alternate method of capturing the output?

Thanks,

Jennifer Duong

Comment 26 Himanshu Madhani (Marvell) 2018-10-10 17:50:36 UTC
Hi Jennifer, 

(In reply to jennifer.duong from comment #25)
> Ewan,
> 
> Is there an alternate method of capturing the output?
> 
> Thanks,
> 
> Jennifer Duong

Can you try with snapshot 5 and see if you are able to make progress?

Thanks,
Himanshu

Comment 27 jennifer.duong 2018-10-11 16:00:58 UTC
Himanshu,

I tried installing SS5 both with and without the two qla2xxx parameters, but both of my installations (1 x fabric-connect, 1 x direct-connect) resulted in the servers booting into emergency mode.

Thanks,

Jennifer Duong

Comment 28 jennifer.duong 2018-10-15 18:37:02 UTC
Himanshu,

It looks like I got the same results with RHEL 7.6 RC (both direct-connect and fabric-connect boot into emergency mode). Are there any next steps for debugging this issue if I'm unable to get the serial console working?

Thanks,

Jennifer Duong

Comment 29 Himanshu Madhani (Marvell) 2018-10-15 19:28:30 UTC
Hi Jennifer,

Just so that I am clear on the issue here:

With the RHEL 7.6 RC build, even with ql2xnvmeenable=0, you are seeing issues with both direct-connect and fabric-connect mode discovering your SANboot LUNs.

You are not able to capture serial console logs of this behavior.

Please let me know if the above is correct. I will discuss with the extended team at Marvell and get back to you on next steps.

Thanks,
Himanshu

Comment 30 jennifer.duong 2018-10-16 15:57:32 UTC
Himanshu,

Yes, that is correct.

Thanks,

Jennifer Duong

Comment 31 jennifer.duong 2018-10-19 14:57:47 UTC
Himanshu,

Do you by chance have any updates on the next step?

Thanks,

Jennifer Duong

Comment 32 jennifer.duong 2018-10-26 16:36:33 UTC
Himanshu,

Are there any next steps that I should take?

Thanks,

Jennifer Duong

Comment 33 Himanshu Madhani (Marvell) 2018-11-01 06:20:18 UTC
Hi Jennifer, 

Sorry about the long delay. I wanted to get some more information from you so that we can understand your setup better:

1. Do you see the issue with a local boot on the same host with the same target?
2. Can we try reducing the connections, perhaps using just fabric connect, and see if you are able to capture logs?
3. Just to clarify, are you using an FCP LUN for SAN boot or an FC-NVMe LUN? We do not currently support FC-NVMe LUNs for SAN boot, so I just wanted to confirm your configuration.
4. Also, can you provide your target model number and the firmware revision level on the storage adapters?

(In reply to jennifer.duong from comment #32)
> Himanshu,
> 
> Are there any next steps that I should take?
> 
> Thanks,
> 
> Jennifer Duong

Thanks,
Himanshu

Comment 34 jennifer.duong 2018-11-08 17:58:58 UTC
Hi Himanshu,

1. No, I do not see an issue with a local install. However, I am not able to see any volumes on either of my arrays after the installation.
2. I'm not entirely sure what you mean by reducing the connections and using fabric connect.
3. FCP LUN
4. 1 x QLE2692 (FW:v8.08.03 DVR:v10.00.00.06.07.6-k) and 1 x QLE2742 (FW:v8.08.03 DVR:v10.00.00.06.07.6-k)

I've also noticed that if I downgrade my FW for my HBAs to v8.07.80, I'm able to install to my SANboot LUN just fine.

Comment 35 Himanshu Madhani (Marvell) 2018-11-13 23:28:31 UTC
Hi Jennifer, 

Apologies for delay in response. I was out sick last week. 

I am in the process of porting our boot-from-SAN patches onto the RH76 GA kernel.

Once that kernel is ready to try, I would suggest the following steps (see the sketch after the list for the initramfs update):

1. First, use the 8.07.80 firmware to install and boot from the SANboot LUN.
2. Install the new kernel with the fixed qla2xxx driver.
3. Update the firmware to 8.08.03, update your initramfs, and see if you are able to boot from the SANboot LUN.
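
(A minimal sketch of the initramfs update in step 3; on RHEL 7 this is done with dracut, not Debian's update-initramfs. The kernel version shown assumes the test kernel from this bug, so substitute your installed version:)

    dracut -f /boot/initramfs-3.10.0-957.el7.bz1613543.x86_64.img 3.10.0-957.el7.bz1613543.x86_64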

Thanks,
Himanshu

Comment 36 jennifer.duong 2018-11-14 17:59:31 UTC
Himanshu,

Do you happen to have a link of where that kernel can be found?

Thanks,

Jennifer Duong

Comment 37 Himanshu Madhani (Marvell) 2018-11-15 04:38:27 UTC
Hi Jennifer, 

Sorry, I have not yet finished porting the patches. I will have the kernel built by Friday 11/16 and will ask Ewan to post it for you to download.

(In reply to jennifer.duong from comment #36)
> Himanshu,
> 
> Do you happen to have a link of where that kernel can be found?
> 
> Thanks,
> 
> Jennifer Duong

Thanks,
Himanshu

Comment 38 Himanshu Madhani (Marvell) 2018-11-17 00:36:13 UTC
Jennifer, 

My build server is very slow, so I could not make any progress on the kernel build.

I'll have this ready over the weekend, but you won't see it until Monday. Sorry for the delay.

Thanks,
Himanshu

(In reply to Himanshu Madhani (Cavium) from comment #37)
> Hi Jennifer, 
> 
> Sorry have not yet finished porting patches. I will have kernel built by
> Friday 11/16 and will ask Ewan to post it for you to download. 
> 
> (In reply to jennifer.duong from comment #36)
> > Himanshu,
> > 
> > Do you happen to have a link of where that kernel can be found?
> > 
> > Thanks,
> > 
> > Jennifer Duong
> 
> Thanks,
> Himanshu

Comment 39 Himanshu Madhani (Marvell) 2018-11-17 05:15:07 UTC
Hi Jennifer, 

I was able to kick off a build with the following patches:

* 619fe86 (HEAD, bz1613543) qla2xxx: Update driver version
* 0968b4e scsi: qla2xxx: Fix driver hang when FC-NVMe LUNs are configured
* dddf3a9 scsi: qla2xxx: Fix re-using LoopID when handle is in use
* 8934491 scsi: qla2xxx: Fix duplicate switch database entries
* 4678615 scsi: qla2xxx: Fix NVMe session hang on unload
* ef4ee55 scsi: qla2xxx: Fix stalled relogin
* c6ae09e scsi: qla2xxx: Fix unintended Logout

This build will be ready in a couple of hours, and by Monday Ewan should be able to post it for you to try out.

In the meantime, can you provide me the details of your configuration one more time?

1. What target array are you using?
    - I need the specific model/version
    - I need the software version on the target array

2. What adapter is on the target system?
   - I need the exact ISP number (a way to read this on a Linux host is sketched after this list)
   - I need the firmware running on the adapter

3. Can you provide details of your switch?
   - I need the switch model
   - I need the firmware/OS of your switch.
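
(For reference, a hedged way to read an adapter's ISP number and firmware on a Linux system, assuming standard lspci/dmesg output like that shown later in comment 61:)

    lspci -nn | grep -i 'fibre channel'
    # e.g. "QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter [1077:2261]"
    dmesg | grep -iE 'found an isp|fw='
    # e.g. "qla2xxx [0000:04:00.0]-001d: : Found an ISP2261 irq 43 ..."
    #      "... ISP2261: PCIe (8.0GT/s x8) @ 0000:04:00.0 ... fw=8.08.03 (d0d5)."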

Thanks,
Himanshu

Comment 40 Himanshu Madhani (Marvell) 2018-11-17 05:19:30 UTC
Hi Ewan, 

Can you please provide this build to Jennifer when it's ready?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19205774

Thanks,
Himanshu

Comment 41 Ewan D. Milne 2018-11-19 18:48:24 UTC
RPMS for test kernel referred to in comment # 40 available for download at:

http://people.redhat.com/emilne/RPMS/.bz1613543_for_netapp/

The -debuginfo RPM is only present in case crash analysis becomes necessary;
you do not need to install it initially for testing.

Please advise when download is complete so we can free up the space.

Comment 42 jennifer.duong 2018-11-19 20:22:12 UTC
Himanshu,

Here are the details of my config:

1) I have two arrays in my config. 1 x E5600 that is running 8.40 FW and 1 x E2800 that is running 11.50 FW. The array with my SANboot LUN is the E2800 running 11.50 FW.

2) Here is the FW version when I am not able to see any volumes: 1 x QLE2692 (FW:v8.08.03 DVR:v10.00.00.06.07.6-k) and 1 x QLE2742 (FW:v8.08.03 DVR:v10.00.00.06.07.6-k). Here is the FW version when I am able to see and install to my SANboot LUN: 1 x QLE2692 (FW:v8.07.80 DVR:v10.00.00.06.07.6-k) and 1 x QLE2742 (FW:v8.07.80 DVR:v10.00.00.06.07.6-k). I'm not quite sure what you mean by ISP number.

3) I have two switches in my config. 1 x Brocade G620 running v8.2.1a and 1 x Cisco 9148 running v8.3(1)

I have finished downloading all four RPMs. Also, would I install the three RPMs (excluding the -debuginfo RPM) before updating my FW to 8.08.03, updating my initramfs, and seeing whether I am able to boot from the SANboot LUN?
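
(A minimal sketch of installing the test kernel RPMs, assuming the package names from comment 41; rpm -ivh installs the test kernel alongside the existing one rather than replacing it, and the kernel package's install scripts normally generate the initramfs:)

    # Install the non-debuginfo RPMs in one transaction:
    rpm -ivh kernel-3.10.0-957.el7.bz1613543.x86_64.rpm [the other non-debuginfo RPMs]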

Comment 43 jennifer.duong 2018-11-21 15:59:52 UTC
I installed the three RPMs (all but the -debuginfo RPM) and updated my FW to 8.08.03. I tried updating my initramfs by running update-initramfs, but it said that the command wasn't found. I thought I had a package missing, so I attempted to install the initramfs-tools package, but it said that package was not found. Since I couldn't seem to update my initramfs, I went ahead with a reboot and tried to boot into the new kernel, but it entered emergency mode.

What does updating my initramfs do? Is this why my host booted into emergency mode? What should my next steps be?

Thanks,

Jennifer Duong

Comment 44 Ewan D. Milne 2018-11-21 16:54:50 UTC
It should not have been necessary to rebuild the initramfs; the installation
of the RPMs should have done that. (update-initramfs is a Debian/Ubuntu tool;
the RHEL equivalent is dracut.) Can you attach the console output of the
boot when it entered emergency mode?

Comment 45 jennifer.duong 2018-11-21 17:01:40 UTC
It looks like I'm able to boot into kernel 3.10.0-957.el7.bz1613543.x86_64 with FW:v8.08.03 DVR:v10.00.99.06.07.6-k on my fabric-connect server, but not my direct-connect server. My direct-connect system boots into emergency mode as stated in my previous comment.

Thanks,

Jennifer Duong

Comment 46 jennifer.duong 2018-11-21 17:08:16 UTC
Created attachment 1507740 [details]
emergency mode

Comment 47 Himanshu Madhani (Marvell) 2018-11-22 06:28:51 UTC
Hi Jennifer, 

From the screenshot of the emergency mode, it does not look like the qla2xxx driver is causing the boot to fail. We would have to look at an SOS report to find out why the direct-connect system went into emergency mode.

Also, for your fabric-connect server, can you confirm that there are no issues with the provided kernel?

There is a known issue with direct connection, and for that I will need to pull some additional patches. We have queued patches for RH77 submission which address direct connect.

If you confirm your fabric connection is okay with this kernel, I can provide another kernel with the direct-connect fixes.

(In reply to jennifer.duong from comment #45)
> It looks like I'm able to boot into kernel 3.10.0-957.el7.bz1613543.x86_64
> with FW:v8.08.03 DVR:v10.00.99.06.07.6-k on my fabric-connect server, but
> not my direct-connect server. My direct-connect system boots into emergency
> mode as stated in my previous comment.
> 
> Thanks,
> 
> Jennifer Duong

Thanks,
Himanshu

Comment 48 jennifer.duong 2018-12-07 17:23:41 UTC
Himanshu,

I'm able to boot into the host with kernel 3.10.0-957.el7.bz1613543.x86_64 and  FW:v8.08.03 DVR:v10.00.99.06.07.6-k, but haven't done any testing outside of that.

Thanks,

Jennifer Duong

Comment 49 jennifer.duong 2018-12-12 20:27:34 UTC
Himanshu,

I ran through my automated tests and it looks like kernel 3.10.0-957.el7.bz1613543.x86_64 running FW:v8.08.03 DVR:v10.00.99.06.07.6-k on my Qlogic fabric-connect system works properly. Do you have a status on the kernel with the Qlogic direct-connect fixes?

Thanks,

Jennifer Duong

Comment 50 Himanshu Madhani (Marvell) 2018-12-13 15:30:08 UTC
Hi Jennifer,

I've built a kernel with the N2N fixes. Please let me know the results once you have received and tested the kernel.

Note: These patches are going to be part of the RH77 inbox driver.


Hi Ewan, 

Can you please make this build available to Jennifer for validating the N2N (direct-connect) configuration?

https://brewweb.devel.redhat.com/taskinfo?taskID=19486813

Thanks,
Himanshu

Comment 51 Ewan D. Milne 2018-12-13 20:48:59 UTC
RPMS for test kernel referred to in comment # 50 available for download at:

http://people.redhat.com/emilne/RPMS/.bz1613543_for_netapp/

Please note that this is the kernel with the N2N changes, and has a
different name, i.e. the RPMs are named like:

kernel-3.10.0-957.el7.bz1613543.n2n.x86_64.rpm

etc.  I've left the earlier RPMs there for the time being; let me know
if you no longer need them.

The -debuginfo RPM is only present in case crash analysis becomes necessary;
you do not need to install it initially for testing.

Please advise when download is complete so we can free up the space.

Comment 52 jennifer.duong 2018-12-17 18:18:07 UTC
Ewan,

I tried booting into kernel-3.10.0-957.el7.bz1613543.n2n.x86_64 with FW:v8.08.03 DVR:v10.00.22.06.07.6-k, and it booted into emergency mode.

Thanks,

Jennifer

Comment 53 Himanshu Madhani (Marvell) 2018-12-17 19:53:09 UTC
Hi Ewan, 

FYI, 

Here's the list of patches that are part of the N2N kernel:

* scsi: qla2xxx: Save frame payload size from ICB
* scsi: qla2xxx: Fix race between switch cmd completion and timeout
* scsi: qla2xxx: Fix Management Server NPort handle reservation logic
* scsi: qla2xxx: Flush mailbox commands on chip reset
* scsi: qla2xxx: Fix session state stuck in Get Port DB
* scsi: qla2xxx: Fix redundant fc_rport registration
* scsi: qla2xxx: Silent erroneous message
* scsi: qla2xxx: Prevent sysfs access when chip is down
* scsi: qla2xxx: Add longer window for chip reset
* scsi: qla2xxx: Fix N2N link re-connect
* scsi: qla2xxx: Cleanup for N2N code

Thanks,
Himanshu

Comment 54 Ewan D. Milne 2018-12-17 20:24:35 UTC
So, this is with regular qla2xxx FC SCSI, correct?  This is not an NVMe target?

The screenshot from the emergency mode looks like the boot device was not found.
I assume from the earlier comments that the boot device was on the SAN.

Is it possible to provide output from a serial console attached to the system,
as opposed to just a screenshot of the video?  We can't debug this from the
boot screenshot alone.

Comment 55 jennifer.duong 2018-12-17 23:26:18 UTC
Yes, this is with regular qla2xxx FC SCSI and not an NVMe target.

Is this what you mean by the output of the serial console? I will be uploading it shortly.

Comment 56 jennifer.duong 2018-12-17 23:27:59 UTC
Created attachment 1515180 [details]
serial console kernel-3.10.0-957.el7.bz1613543.n2n.x86_64.rpm

Comment 57 Ewan D. Milne 2018-12-18 13:31:23 UTC
Yes, that is the kind of serial output we need.  Unfortunately the
file you attached does not appear to contain the output from the
time period of the actual boot failure.  It ends with:

12/17/18 17:08:39: [  281.493586] ata1: exception Emask 0x50 SAct 0x0 SErr 0x40d0800 action 0xe frozen
12/17/18 17:08:39: [  281.501862] ata1: irq_stat 0x00400040, connection status changed
12/17/18 17:08:39: [  281.508577] ata1: SError: { HostInt PHYRdyChg CommWake 10B8B DevExch }
12/17/18 17:08:39: [  281.515865] ata1: hard resetting link
12/17/18 17:08:40: [  282.243029] ata1: SATA link down (SStatus 0 SControl 300)
12/17/18 17:08:45: [  287.248993] ata1: hard resetting link
12/17/18 17:08:45: [  287.557995] ata1: SATA link down (SStatus 0 SControl 300)
12/17/18 17:08:50: [  292.563966] ata1: hard resetting link
12/17/18 17:08:51: [  292.872958] ata1: SATA link down (SStatus 0 SControl 300)
12/17/18 17:08:51: [  292.878994] ata1.00: disabled
12/17/18 17:08:51: [  292.882323] ata1: EH complete
12/17/18 17:08:51: [  292.884951] sd 0:0:0:0: rejecting I/O to offline device
12/17/18 17:08:51: [  292.884955] sd 0:0:0:0: [sda] killing request
12/17/18 17:08:51: [  292.884976] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
12/17/18 17:08:51: [  292.884981] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 57 2c 2a e8 00 01 c0 00
12/17/18 17:08:51: [  292.884985] blk_update_request: I/O error, dev sda, sector 1462512360
12/17/18 17:08:51: [  292.885014] Buffer I/O error on dev dm-0, logical block 182685533, lost async page write
12/17/18 17:08:51: [  292.885019] Buffer I/O error on dev dm-0, logical block 182685534, lost async page write
12/17/18 17:08:51: [  292.885022] Buffer I/O error on dev dm-0, logical block 182685535, lost async page write
12/17/18 17:08:51: [  292.885024] Buffer I/O error on dev dm-0, logical block 182685536, lost async page write
12/17/18 17:08:51: [  292.885026] Buffer I/O error on dev dm-0, logical block 182685537, lost async page write
12/17/18 17:08:51: [  292.885029] Buffer I/O error on dev dm-0, logical block 182685538, lost async page write
12/17/18 17:08:51: [  292.885031] Buffer I/O error on dev dm-0, logical block 182685539, lost async page write
12/17/18 17:08:51: [  292.885034] Buffer I/O error on dev dm-0, logical block 182685540, lost async page write
12/17/18 17:08:51: [  292.885036] Buffer I/O error on dev dm-0, logical block 182685541, lost async page write
12/17/18 17:08:51: [  292.885038] Buffer I/O error on dev dm-0, logical block 182685542, lost async page write
12/17/18 17:08:51: [  293.010599] ata1.00: detaching (SCSI 0:0:0:0)
12/17/18 17:08:51: [  293.026870] sd 0:0:0:0: [sda] Stopping disk
12/17/18 17:08:51: [  293.031789] sd 0:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
12/17/18 17:08:52: [  294.056032] XFS (dm-0): metadata I/O error: block 0x39f371c4 ("xlog_iodone") error 5 numblks 64
12/17/18 17:08:52: [  294.056140] XFS (dm-0): metadata I/O error: block 0x3a3df420 ("xfs_buf_iodone_callback_error") error 5 numblks 32
12/17/18 17:08:52: [  294.077216] XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1221 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffc0420c30
12/17/18 17:08:52: [  294.090908] XFS (dm-0): Log I/O Error Detected.  Shutting down filesystem
12/17/18 17:08:52: [  294.098487] XFS (dm-0): Please umount the filesystem and rectify the problem(s)
12/17/18 17:08:52: [  294.105948] XFS (dm-0): Failing async write on buffer block 0x56f7fd20. Retrying async write.
12/17/18 17:08:56: [  298.746849] Core dump to |/usr/libexec/abrt-hook-ccpp 6 0 10991 0 0 1545088136 e 10991 11033 ictm1608s02h4.ict.englab.netapp.com pipe failed
12/17/18 17:08:56: [  298.750581] Core dump to |/usr/libexec/abrt-hook-ccpp 6 0 8011 0 0 1545088136 e 8011 8236 ictm1608s02h4.ict.englab.netapp.com pipe failed

and then there are some BIOS boot messages, so it looks like the machine is
being either power cycled or rebooted.  The last output is:

12/17/18 17:11:56: ^[[8;1HBooting^[[8;9Hfrom^[[8;14HHard^[[8;19Hdrive^[[8;25HC:^[[9;1H.^[[10;1H^[[?25h
12/17/18 17:11:56: ^M
12/17/18 17:11:56:

What we would like to see are the messages leading up to the output on the earlier
screen capture you attached that might show us why it was able to load the Linux
kernel, but was unable to find the root and swap device later in the boot process.

---

It seems, based on your earlier information, that you could install and boot from SAN
successfully with the earlier 8.07.80 HBA firmware but not with the 8.08.03 firmware.
This would seem to be either a firmware issue, or a case where the newer firmware needs
driver changes as well. What I am trying to understand is how you were able to
connect to the array in the first place if you were booting from SAN, but not later.

Both the arrays are E-series arrays, correct?  (We have one arriving here soon for testing.)

Comment 58 jennifer.duong 2018-12-18 16:49:26 UTC
Ewan,

I initially had the OS installed on the local hard drive and was booting from there, with all my connections to the controllers disconnected so that it wouldn't try to SANboot. I had to upgrade the firmware back to 08.08.03, and once that was complete I connected all of the controller connections and disconnected the hard drive. From there I had to reboot the server so that it could try to SANboot with the 08.08.03 FW and kernel-3.10.0-957.el7.bz1613543.n2n.x86_64. Shouldn't messages that look like the machine is being power cycled or rebooted be expected? How would I be able to test kernel-3.10.0-957.el7.bz1613543.n2n without power cycling or rebooting my server? I checked my serial logs, and what I provided you is, from what I can tell, the instance where my host booted into emergency mode. The only log output after that is this:

12/17/18 17:20:50: [?25l[1;1H       [1;9H    [1;14H     [1;20H       [2;1H        [2;10H       [2;18H        [2;27H   [2;31H    [0m[30;47m[4;1H      Red Hat Enterprise Linux Server (3.10.0-957.el7.bz1613543.n2n.x86_64) 7. [0m[37;40m[5;7HRed[5;11HHat[5;15HEnterprise[5;26HLinux[5;32HServer[5;39H(3.10.0-957.el7.bz1613543.x86_64)[5;73H7.6[5;77H(M [6;1H      Red Hat Enterprise Linux Server[6;39H(3.10.0-957.el7.x86_64)[6;63H7.6[6;67H(Maipo)[7;7HRed[7;11HHat[7;15HEnterprise[7;26HLinux[7;32HServer[7;39H(0-rescue-fb8addbcbf17439786d9ecdce2d202 [8;1H       [8;9H    [8;14H    [8;19H     [8;25H  [9;1H [21;7HUse[21;11Hthe[21;15H [21;17Hand[21;21H [21;23Hkeys[21;28Hto[21;31Hchange[21;38Hthe[21;42Hselection.[22;7HPress[22;13H'e'[22;17Hto[22;20Hedit[22;25Hthe[22;29Hselected[22;38Hitem,[22;44Hor[22;47H'c'[22;51Hfor[22;55Ha[22;57Hcommand[22;65Hprompt.[23;4HThe[23;8Hselected[23;17Hentry[23;23Hwill[23;28Hbe[23;31Hstarted[23;39Hautomatically[23;53Hin[23;56H5s.[23;56H4[23;56H3[23;56H2[23;56H1[4;1H                                                                               [5;7H   [5;11H   [5;15H          [5;26H     [5;32H      [5;39H                                 [5;73H   [5;77H   [6;7H   [6;11H   [6;15H          [6;26H     [6;32H      [6;39H                       [6;63H   [6;67H       [7;7H   [7;11H   [7;15H          [7;26H     [7;32H      [7;39H                                         [21;7H   [21;11H   [21;15H [21;17H   [21;21H [21;23H    [21;28H  [21;31H      [21;38H   [21;42H          [22;7H     [22;13H   [22;17H  [22;20H    [22;25H   [22;29H        [22;38H     [22;44H  [22;47H   [22;51H   [22;55H [22;57H       [22;65H       [23;4H   [23;8H        [23;17H     [23;23H    [23;28H  [23;31H       [23;39H             [23;53H  [23;56H   [1;1H[?25h[0;37;40m[2J[H

When I tried to boot into the n2n kernel, it listed all the kernels I can boot into; when I selected that particular one, it tried to load and looked like it was about to finish, but then entered emergency mode. And yes, the arrays are E-Series arrays.

Thanks,

Jennifer

Comment 59 Himanshu Madhani (Marvell) 2018-12-18 22:57:45 UTC
Hi Jennifer, 

I want to rule out an issue with qla2xxx in your direct-connect setup.

I want you to confirm whether the UEFI driver sees the FC LUNs or not. Here are the instructions I received from our UEFI developer on how to verify that the FC LUNs are seen.

--------- <snip> ---------

The UEFI Shell will let you see what LUNs were discovered by the UEFI driver.  HPE servers have a built-in UEFI shell.  Attached is a shell executable that will work on other servers. Go to the server setup screens and look for an option to run a UEFI application.  This option will let you run the attached shell.

Once the shell is running, use the "map" command to see if the FC LUNs were mapped.  FC LUNs will have Fibre(WWPN, LUN) in their path name.

--------- </snip> ----------
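
(A hedged illustration of the check described above; the map command is standard in the UEFI Shell, but the device path below is a placeholder assembled from this system's WWPN, not actual output:)

    Shell> map -r
    FS0: Alias(s): ...
         PciRoot(0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)/Fibre(0x21000024FF7EF9F4,0x0)/HD(1,GPT,...)

An entry whose device path contains Fibre(WWPN,LUN) means the UEFI driver discovered that FC LUN; if no Fibre(...) entries appear at all, the LUNs are not visible at the firmware level.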

I am attaching the shell executable in case you are using something other than an HPE server.

Can you capture this information?

Thanks,
Himanshu

Comment 60 Himanshu Madhani (Marvell) 2018-12-18 22:58:44 UTC
Created attachment 1515425 [details]
Shell Executable for UEFI shell

Comment 61 Vishal Agrawal 2018-12-22 05:32:08 UTC
Hello Himanshu and Ewan,

I have a customer who seems to be facing the same issue as described
in this bugzilla. Below I am providing all the information.

Support Case : 02276650
=======================

Issue : 
=======

After updating the kernel to 3.10.0-957.1.3.el7.x86_64, the server is no longer able to
see the LUNs presented through the QLogic HBAs.

However, when booted with the older kernel 3.10.0-862.el7.x86_64, the server is able to detect
all the LUNs.

Qlogic Adapter Details from sosreport
=====================================

==============
fenacosrv92151   <====> [Hostname of server]
==============

All 4 QLogic cards are exactly the same.

$ grep QLE sos_commands/logs/journalctl_--no-pager_--catalog_--boot
Dec 18 10:32:43 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.0]-00fb:15: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 18 10:32:45 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-00fb:16: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 18 10:32:47 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-00fb:17: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 18 10:32:49 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.1]-00fb:18: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.

***Even the subsystem vendor and device IDs are the same.

$ grep Fibre -A1 lspci 
2f:00.0 Fibre Channel [0c04]: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter [1077:2261] (rev 01)
	Subsystem: QLogic Corp. Device [1077:02af]          <<<<<<====------<-----
--
2f:00.1 Fibre Channel [0c04]: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter [1077:2261] (rev 01)
	Subsystem: QLogic Corp. Device [1077:02af]          <<<<<<====------<-----
--
58:00.0 Fibre Channel [0c04]: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter [1077:2261] (rev 01)
	Subsystem: QLogic Corp. Device [1077:02af]          <<<<<<====------<-----
--
58:00.1 Fibre Channel [0c04]: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter [1077:2261] (rev 01)
	Subsystem: QLogic Corp. Device [1077:02af]          <<<<<<====------<-----

**** All 4 are dual-port HBAs.

$ grep QLogic lspci|grep HBA
		Product Name: QLogic 16Gb FC Dual-port HBA
		Product Name: QLogic 16Gb FC Dual-port HBA
		Product Name: QLogic 16Gb FC Dual-port HBA
		Product Name: QLogic 16Gb FC Dual-port HBA

*** Firmware versions:

fw=8.08.05    [Actual Firmware version 1.90.53] as per XClarity Controller's firmware web page

$ grep QLE -A1 sos_commands/logs/journalctl_--no-pager_--catalog_--boot
Dec 18 10:32:43 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.0]-00fb:15: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 18 10:32:43 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.0]-00fc:15: ISP2261: PCIe (8.0GT/s x8) @ 0000:2f:00.0 hdma+ host#=15 fw=8.08.05 (d0d5).
--
Dec 18 10:32:45 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-00fb:16: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 18 10:32:45 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-00fc:16: ISP2261: PCIe (8.0GT/s x8) @ 0000:2f:00.1 hdma+ host#=16 fw=8.08.05 (d0d5).
--
Dec 18 10:32:47 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-00fb:17: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 18 10:32:47 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-00fc:17: ISP2261: PCIe (8.0GT/s x8) @ 0000:58:00.0 hdma+ host#=17 fw=8.08.05 (d0d5).
--
Dec 18 10:32:49 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.1]-00fb:18: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 18 10:32:49 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.1]-00fc:18: ISP2261: PCIe (8.0GT/s x8) @ 0000:58:00.1 hdma+ host#=18 fw=8.08.05 (d0d5).

-------------------------
3.10.0-957.1.3.el7.x86_64
-------------------------

No disks are detected, and we can see QLogic 'Abort' messages along with a 'TECH PREVIEW' message.

$ grep 'tech preview' -i sos_commands/logs/journalctl_--no-pager_--catalog_--boot
Dec 18 09:20:24 fenacosrv92151.main.corp.fenaco.com kernel: TECH PREVIEW: NVMe over FC may not be fully supported.

$ grep Abort sos_commands/logs/journalctl_--no-pager_--catalog_--boot
Dec 18 09:20:47 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-801c:16: Abort command issued nexus=16:0:0 --  0 2003.
Dec 18 09:20:49 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-801c:17: Abort command issued nexus=17:0:0 --  0 2003.
Dec 18 09:21:02 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-801c:16: Abort command issued nexus=16:0:0 --  1 2002.

$ grep fc-nvme -i sos_commands/logs/journalctl_--no-pager_--catalog_--boot
Dec 18 09:20:23 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.0]-d302:15: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)
Dec 18 09:20:25 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-d302:16: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)
Dec 18 09:20:27 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-d302:17: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)
Dec 18 09:20:29 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.1]-d302:18: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)
Dec 18 09:20:51 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-d302:16: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)
Dec 18 09:20:53 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-d302:17: qla2x00_get_fw_version: FC-NVMe is Enabled (0x3c58)

---------------------
3.10.0-862.el7.x86_64
---------------------

When the same system is booted with the older RHEL 7.5 GA kernel, it detects all the disks coming via the QLogic HBAs.

$ grep 'scsi host' sos_commands/logs/journalctl_--no-pager_--catalog_--boot|grep qla
Dec 18 10:32:43 fenacosrv92151.main.corp.fenaco.com kernel: scsi host15: qla2xxx
Dec 18 10:32:45 fenacosrv92151.main.corp.fenaco.com kernel: scsi host16: qla2xxx
Dec 18 10:32:47 fenacosrv92151.main.corp.fenaco.com kernel: scsi host17: qla2xxx
Dec 18 10:32:49 fenacosrv92151.main.corp.fenaco.com kernel: scsi host18: qla2xxx

Disks are coming only from scsi host 15 and scsi host 18.

$ cat sos_commands/multipath/multipath_-l 
mpathc (360060e80221598005041159800000447) dm-4 HITACHI ,OPEN-V          
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 15:0:0:1 sdc 8:32 active undef running
  `- 18:0:0:1 sdf 8:80 active undef running
mpathb (360060e80221598005041159800000448) dm-5 HITACHI ,OPEN-V          
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 15:0:0:2 sdd 8:48 active undef running
  `- 18:0:0:2 sdg 8:96 active undef running
mpatha (360060e80221598005041159800000446) dm-3 HITACHI ,OPEN-V          
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 15:0:0:0 sdb 8:16 active undef running
  `- 18:0:0:0 sde 8:64 active undef running

In Addition, we do not see any 'TECH PREVIEW' message for FC over NVME nor any 
'Abort' message.

======================
WORKAROUND IDENTIFIED:
======================

-> Applied the QLogic parameter 'ql2xnvmeenable' set to 0:

# cat > /etc/modprobe.d/qla2xxx.conf
options qla2xxx ql2xnvmeenable=0

-> Rebuilt the initramfs for 3.10.0-957.1.3.el7.x86_64 and rebooted the server; it was then able to
   detect all the disks coming from the QLogic HBAs (a sketch of this sequence follows).
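
(A minimal sketch of that rebuild-and-verify sequence, assuming the kernel version above; the sysfs path is the same one shown later in this comment:)

    # Rebuild the initramfs so the modprobe.d option takes effect in early boot:
    dracut -f /boot/initramfs-3.10.0-957.1.3.el7.x86_64.img 3.10.0-957.1.3.el7.x86_64
    reboot

    # After reboot, confirm the option took effect (0 = NVMe disabled):
    cat /sys/module/qla2xxx/parameters/ql2xnvmeenable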

Observation:
-----------

-> After disabling NVMe support for QLogic with the above parameter, I did not see an 'Abort'
   message or a 'TECH PREVIEW' message.

===================
ANOTHER WORKAROUND:
===================

-> The customer was using one more system with 3.10.0-957.1.3.el7.x86_64 with the exact same model of QLogic adapter [ QLogic QLE2692 ],
   where disks were detected without any modification to the 'qla2xxx' parameters.

-> That system showed neither an 'Abort' message nor a 'TECH PREVIEW' message.

-> I figured out that the QLogic HBA firmware version was lower on that server:

fw=8.05.63    [Actual Firmware version 1.90.43] as per XClarity Controller's firmware web page

$ grep QLE sos_commands/logs/journalctl_--no-pager_--catalog_--boot -A1
Dec 11 08:39:47 fenacosrv92071.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.0]-00fb:14: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 11 08:39:47 fenacosrv92071.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.0]-00fc:14: ISP2261: PCIe (8.0GT/s x8) @ 0000:2f:00.0 hdma+ host#=14 fw=8.05.63 (d0d5).
--
Dec 11 08:39:49 fenacosrv92071.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-00fb:16: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 11 08:39:49 fenacosrv92071.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-00fc:16: ISP2261: PCIe (8.0GT/s x8) @ 0000:2f:00.1 hdma+ host#=16 fw=8.05.63 (d0d5).
--
Dec 11 08:39:51 fenacosrv92071.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-00fb:17: QLogic QLE2690 - QLogic 16Gb FC Single-port HBA.
Dec 11 08:39:51 fenacosrv92071.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-00fc:17: ISP2261: PCIe (8.0GT/s x8) @ 0000:58:00.0 hdma+ host#=17 fw=8.05.63 (d0d5).
--
Dec 11 08:39:53 fenacosrv92071.main.corp.fenaco.com kernel: qla2xxx [0000:af:00.0]-00fb:18: QLogic QLE2690 - QLogic 16Gb FC Single-port HBA.
Dec 11 08:39:53 fenacosrv92071.main.corp.fenaco.com kernel: qla2xxx [0000:af:00.0]-00fc:18: ISP2261: PCIe (8.0GT/s x8) @ 0000:af:00.0 hdma+ host#=18 fw=8.05.63 (d0d5).

-> I asked the customer to downgrade the firmware on the problematic server to 8.05.63 ('1.90.43' as per the Lenovo ThinkSystem SR650 drivers page), and it worked.

https://datacentersupport.lenovo.com/in/en/downloads/DS501286

-> All disks are detected with fw=8.05.63, without applying any parameter modification to the qla2xxx module.

==============
fenacosrv92151
==============

AFTER FIRMWARE DOWNGRADE
------------------------

fw=8.05.63

$ grep QLE sos_commands/logs/journalctl_--no-pager_--catalog_--boot -A1
Dec 21 15:38:22 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.0]-00fb:15: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 21 15:38:22 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.0]-00fc:15: ISP2261: PCIe (8.0GT/s x8) @ 0000:2f:00.0 hdma+ host#=15 fw=8.05.63 (d0d5).
--
Dec 21 15:38:24 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-00fb:16: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 21 15:38:24 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:2f:00.1]-00fc:16: ISP2261: PCIe (8.0GT/s x8) @ 0000:2f:00.1 hdma+ host#=16 fw=8.05.63 (d0d5).
--
Dec 21 15:38:27 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-00fb:17: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 21 15:38:27 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.0]-00fc:17: ISP2261: PCIe (8.0GT/s x8) @ 0000:58:00.0 hdma+ host#=17 fw=8.05.63 (d0d5).
--
Dec 21 15:38:28 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.1]-00fb:18: QLogic QLE2692 - QLogic 16Gb FC Dual-port HBA.
Dec 21 15:38:28 fenacosrv92151.main.corp.fenaco.com kernel: qla2xxx [0000:58:00.1]-00fc:18: ISP2261: PCIe (8.0GT/s x8) @ 0000:58:00.1 hdma+ host#=18 fw=8.05.63 (d0d5).

$ cat sys/module/qla2xxx/parameters/ql2xnvmeenable 
1

$ cat sos_commands/multipath/multipath_-l
mpathc (360060e80221598005041159800000447) dm-3 HITACHI ,OPEN-V          
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 18:0:0:1 sdf 8:80 active undef running
  `- 15:0:0:1 sdc 8:32 active undef running
mpathb (360060e80221598005041159800000448) dm-5 HITACHI ,OPEN-V          
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 15:0:0:2 sdd 8:48 active undef running
  `- 18:0:0:2 sdg 8:96 active undef running
mpatha (360060e80221598005041159800000446) dm-4 HITACHI ,OPEN-V          
size=2.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 15:0:0:0 sdb 8:16 active undef running
  `- 18:0:0:0 sde 8:64 active undef running

Also, I do not observe any 'Abort' or 'TECH PREVIEW' messages.

QUESTION:
=========

-> I see 2 workarounds here:

o Downgrade the firmware of the QLogic HBA
o Disable NVMe support by setting the qla2xxx parameter 'ql2xnvmeenable=0'

-> Does this need a fix in the qla2xxx kernel module, or does it need a fix in the QLogic firmware?
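
(Note: for the installation-time symptom in this BZ, /etc/modprobe.d is not available yet; the same option can instead be passed on the installer's kernel command line, since parameters given as module.param=value on the command line are applied when the module loads. A sketch:)

  qla2xxx.ql2xnvmeenable=0      # appended to the anaconda/installer boot line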

Thanks,

Comment 62 Milan P. Gandhi 2018-12-27 10:10:14 UTC
Another customer (SFDC#02281410) has reported similar LUN discovery issues while using the following QLogic adapters with kernel-3.10.0-957:

[50-sosreport-loraakbp11-02281410-2018-12-26-xzctxsu]$ less var/log/dmesg |grep -i qla2|grep fw
[    7.200130] qla2xxx [0000:86:00.0]-00fc:15: ISP2261: PCIe (8.0GT/s x8) @ 0000:86:00.0 hdma+ host#=15 fw=8.08.03 (d0d5).
[    9.160122] qla2xxx [0000:86:00.1]-00fc:16: ISP2261: PCIe (8.0GT/s x8) @ 0000:86:00.1 hdma+ host#=16 fw=8.08.03 (d0d5).
[   11.280063] qla2xxx [0000:87:00.0]-00fc:17: ISP2261: PCIe (8.0GT/s x8) @ 0000:87:00.0 hdma+ host#=17 fw=8.08.03 (d0d5).
[   13.235063] qla2xxx [0000:87:00.1]-00fc:18: ISP2261: PCIe (8.0GT/s x8) @ 0000:87:00.1 hdma+ host#=18 fw=8.08.03 (d0d5).

Previously the customer was using kernel-3.10.0-862.14.4.el7.x86_64 and the HITACHI OPEN-V LUNs were visible through all 4 HBAs. After the upgrade to the 957 kernel, the LUNs were visible through only one of the above HBAs.

We then disabled the ql2xnvmeenable option for the qla2xxx module and rebuilt the initial ramdisk image for the 957 kernel, and the LUN discovery issues were fixed.
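
(To confirm the option actually made it into the rebuilt image, lsinitrd can print the file back out of the initramfs; a sketch, assuming the default image path:)

  lsinitrd -f etc/modprobe.d/qla2xxx.conf /boot/initramfs-$(uname -r).img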

Comment 63 Ewan D. Milne 2019-01-02 17:42:48 UTC
Himanshu is a partner engineer from Cavium and cannot see private comments.

Please verify with your customers that the information in comment # 61 and
comment # 62 can be shared with the HBA vendor, and then un-check the
private comment field.

Comment 64 Ewan D. Milne 2019-01-02 17:46:30 UTC
Resetting needinfo from comment # 59.

Comment 65 Vishal Agrawal 2019-01-02 17:52:18 UTC
Hi Himanshu,

Kindly check comment #61 and comment #62 for detailed information from the 2 reported cases.

Thanks,

Comment 66 Himanshu Madhani (Marvell) 2019-01-02 18:15:46 UTC
Hello Vishal, 

Are these customers using Fabric mode or direct attach? 

(In reply to vishal agrawal from comment #61)
> [full text of comment #61, quoted verbatim above; trimmed]
We have identified patches that fix this issue; they have been submitted as part of the RHEL 7.7 inbox driver. 

See comment #39 for the patches that were identified for the Fabric mode connection. 

The reporter of this bugzilla confirmed in comment #49 that the issue with fabric connections was resolved.

Thanks,
Himanshu

Comment 67 Vishal Agrawal 2019-01-02 19:13:41 UTC
Hi Himanshu,

>> Are these customer using Fabric mode or direct attached? 

Is this something I can identify from the sosreport, or should
I get this detail from the customer directly?

Do you also want me to share the test kernels from comment #51?

Thanks,

Comment 68 Himanshu Madhani (Marvell) 2019-01-02 19:33:04 UTC
Hi Vishal, 

(In reply to vishal agrawal from comment #67)
> Hi Himanshu,
> 
> >> Are these customer using Fabric mode or direct attached? 
> 
> Is this something which I can identify from sosreport or should
> I get this detail from customer directly.
> 

I would want you to get detailed topology information from the customer before sharing any test kernel.

I want to confirm their configuration before we share any test code. 
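
(For reference, the topology can often be read from an existing sosreport via the fc_host port_type attribute, if the sosreport captured the fc_host tree; a sketch. A value like 'NPort (fabric via point-to-point)' indicates fabric mode, while point-to-point or loop values indicate direct attach.)

  grep . sys/class/fc_host/host*/port_type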

> Do you also want me to share the test kernel's from comment #51?
> 
> Thanks,

Thanks,
Himanshu

Comment 69 jennifer.duong 2019-01-02 22:09:54 UTC
Created attachment 1518039 [details]
map

Comment 70 jennifer.duong 2019-01-02 22:16:26 UTC
Himanshu,

I have attached a screenshot of part of the output when running "map". Does that look correct to you?

Thanks,

Jennifer

Comment 71 Himanshu Madhani (Marvell) 2019-01-05 03:33:57 UTC
Hi Jennifer, 

(In reply to jennifer.duong from comment #70)
> Himanshu,
> 
> I have attached a screenshot of part of the output when running "map". Does
> that look correct to you?
> 
> Thanks,
> 
> Jennifer

We were able to confirm that the information is good. We do see LUNs discovered by the UEFI driver. 

Is it possible for you to capture an FC trace so we can see why you are not able to boot from the SAN boot LUN after installation? 

Also, note that these patches have been merged into the RHEL 7.7 kernel. I'll find out if there is a possibility of getting an ISO image that you can try, to see if that makes any difference. 

Thanks,
Himanshu

Comment 72 Vishal Agrawal 2019-01-07 12:44:27 UTC
(In reply to Himanshu Madhani (Cavium) from comment #66)
> Hello Vishal, 
> 
> Are these customer using Fabric mode or direct attached? 
> 
> > [full text of comment #61 and the reply from comment #66, quoted verbatim above; trimmed]

Hi Himanshu, my customer has confirmed that the storage LUNs are accessed in 'FABRIC MODE'.

Do you want me to share the test kernel with him now?

Thanks,

- Vishal Agrawal.

Comment 73 jennifer.duong 2019-01-07 17:27:16 UTC
Himanshu,

How do I capture the FC trace?

Thanks,

Jennifer

Comment 74 jennifer.duong 2019-01-09 22:03:13 UTC
Himanshu,

I am in the process of requesting an analyzer. Once that becomes available to me, I'll try and grab that FC trace for you.

Comment 75 jennifer.duong 2019-01-18 16:49:15 UTC
Created attachment 1521610 [details]
FC trace

Comment 76 Sirius Rayner-Karlsson 2019-01-24 09:16:23 UTC
Hi there,

If there is anything else required from Jennifer, now that the FC trace is available, let us know and we will work to provide it. If there are test kernels, I can help provide them via my people page.

Kind regards,

/S

Comment 77 jennifer.duong 2019-01-31 18:18:50 UTC
Himanshu, what should the next steps be?

Comment 78 jennifer.duong 2019-02-08 17:09:56 UTC
Himanshu, what should the next steps be?

Comment 79 Himanshu Madhani (Marvell) 2019-02-08 17:41:47 UTC
Hi Jennifer, 

Looks like I did not notice that the FC trace was available. (I was on PTO the week of Jan 14-18 when it was uploaded, so I must have missed the notification.)

Let me take a look at the trace and provide next steps. 

Thanks,
Himanshu

Comment 80 Himanshu Madhani (Marvell) 2019-02-14 23:15:51 UTC
Hi Jennifer, 

Still going through the FC trace; nothing seems out of the ordinary. I need to take another look and see if there is anything else we can try on your setup. 

Thanks,
Himanshu

Comment 81 jennifer.duong 2019-02-22 14:51:30 UTC
Himanshu, have you had a chance to take another look at the FC trace?

Comment 82 Dwight (Bud) Brown 2019-02-22 19:05:02 UTC
I have a customer with the following hardware who is hitting this or a similar issue (though the ql2xnvmeenable workaround doesn't help there).

Hardware:

ProLiant DL380 Gen9 with 3 x

Fibre Channel [0c04]: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter [1077:2261] (rev 01)


#            --------- PCI -------------
#                          subsystem      model        model
#scsi_addr   vendor device vendor device  name         description
#----------- ------ ------ ------ ------  ------------ --------------------------------------------------
 1:*:*:*     0x1077 0x2261 0x1590 0x00fa  SN1100Q      HPE SN1100Q 16Gb 2p FC HBA  << tapes/changer
 2:*:*:*     0x1077 0x2261 0x1590 0x00fa  SN1100Q      HPE SN1100Q 16Gb 2p FC HBA  << tapes/changer
 3:*:*:*     0x1077 0x2261 0x1590 0x00fa  SN1100Q      HPE SN1100Q 16Gb 2p FC HBA  << disks
 4:*:*:*     0x1077 0x2261 0x1590 0x00fa  SN1100Q      HPE SN1100Q 16Gb 2p FC HBA
 5:*:*:*     0x1077 0x2261 0x1590 0x00fa  SN1100Q      HPE SN1100Q 16Gb 2p FC HBA  << disks
 6:*:*:*     0x1077 0x2261 0x1590 0x00fa  SN1100Q      HPE SN1100Q 16Gb 2p FC HBA

firmware level: 8.07.18 (d0d5)



Running 7.5, no issues with seeing disks.
Running 7.6 (3.10.0-957.5.1.el7.x86_64) 
   . 10.00.00.06.07.6-k (in-box driver)
     - doesn't see the IBM 2145 disks, but does see tapes and changers.
     - setting ql2xnvmeenable=0 didn't help
   . 8.08.00.08.07.5-k9 (from the QLogic website) 
     - disks are seen again.

With the in-box driver they see only the tapes/changers:


#scsi_addr       Type      Vendor   Model            Rev    sdN           
#--------------- --------- ------- ----------------- ------ ------------- 
[0:0:0:0]        raid      HP       P440ar           6.60   /dev/sg0 
[0:0:1:0]        disk      HP       EH0600JDXBC      HPD5   /dev/sda
[0:0:2:0]        disk      HP       EH0600JDXBC      HPD5   /dev/sdb
[0:0:3:0]        enclosure HP       P440ar           6.60   /dev/sg3
[1:0:0:0]        tape      IBM      03592E08         481A   /dev/st0 
[2:0:0:0]        tape      IBM      03592E08         481A   /dev/st1 
[1:0:1:0]        tape      IBM      03592E08         481A   /dev/st2
[2:0:1:0]        tape      IBM      03592E08         481A   /dev/st3
[1:0:1:1]        changer   IBM      03584L22         F330   /dev/ch0 
[2:0:1:1]        changer   IBM      03584L22         F330   /dev/ch1
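
(For reference, the driver and firmware a port is actually running can be read back from sysfs without digging through boot logs; a sketch, host number assumed:)

  modinfo -F version qla2xxx                   # qla2xxx driver version on disk
  cat /sys/class/fc_host/host1/symbolic_name   # qla2xxx embeds the FW:v... and DVR:v... strings here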


Is this a different issue or the same as this one?  Should I ask the customer to upgrade to 8.08.xx firmware
and then try setting ql2xnvmeenable off, or should I open a new bugzilla, since the ql2xnvmeenable workaround
didn't help?

Please advise.

Comment 84 Dwight (Bud) Brown 2019-02-25 21:10:49 UTC
Created attachment 1538584 [details]
overview of storage when all devices are seen

Storage view of the 7.6 system with the v8 (out-of-box) driver; all storage ports and storage are seen.

This is more info on case where workaround did not help.

Comment 85 Dwight (Bud) Brown 2019-02-25 21:13:13 UTC
Created attachment 1538585 [details]
overview of 7.6 with inbox driver, no disk luns seen

Storage view of the 7.6 system with the v10 (in-box) driver: the tapes/changers attached to the Cisco switch(es) are seen, but no IBM storage is seen off the IBM(?) switch.  That is, the HBA has an assigned port ID and a fabric WWN, but no IBM 2145 storage ports are listed as logged in. 

This is more info on case where workaround did not help.

Comment 86 Himanshu Madhani (Marvell) 2019-02-26 17:41:31 UTC
Hi David, 
(In reply to Dwight (Bud) Brown from comment #85)
> Created attachment 1538585 [details]
> overview of 7.6 with inbox driver, no disk luns seen
> 
> storage view of 7.6 system with v10 (in-box) driver and tapes/changers seen
> attached to Cisco switch(s), but no IBM storage seen off of IBM(?) switch. 
> Aka there is an assigned portid to the HBA, a Fabric wwn but no IBM 2145
> storage ports listed as being logged into. 
> 
> This is more info on case where workaround did not help.

Can you try with the RHEL 7.6 z-stream kernel and see if the same issue is seen? 

Thanks,
Himanshu

Comment 87 Himanshu Madhani (Marvell) 2019-02-26 17:42:37 UTC
Hi Jennifer, 

(In reply to jennifer.duong from comment #81)
> Himanshu, have you had a chance to take another look at the FC trace?

I will have an update this week. I was on PTO last week, so I could not respond.

Thanks,
Himanshu

Comment 88 Dwight (Bud) Brown 2019-02-26 18:07:07 UTC
> Hi David, 
> Can you try with RHEL7.6 Z stream kernel and see if the same issue is seen? 


Himanshu, 

I'm not sure I understand; who's David?  Did you mean Dwight?  Anyway, as noted in comment #82:

"
Running 7.6 (3.10.0-957.5.1.el7.x86_64)
   . 10.00.00.06.07.6-k (in-box driver)
     - doesn't see the IBM 2145 disks, but does see tapes and changers.
     - using ql2xnvmeenable didn't help
   . 8.08.00.08.07.5-k9 (from QLogic website) 
     - disks being seen again.
"

So as you can see from the original comment, the customer is running the latest shipped 7.6 (z-stream) kernel.  Are you aware of specific driver commits between the latest shipped 7.6 kernel and a future, still unreleased one?  Is that what you want the customer to test against?  For example, since the last released kernel, there have been several brew builds of later kernels:

[ 1] RHEL7.6.z: ( )3.10.0-957.8.1.el7   09-Jan-2019
[ 2] RHEL7.6.z: ( )3.10.0-957.7.1.el7   08-Jan-2019
[ 3] RHEL7.6.z: ( )3.10.0-957.6.1.el7   25-Dec-2018 << these and later are potential hotfix kernels, but not shipped.
[ 4] RHEL7.6.z: (!)3.10.0-957.5.1.el7   19-Dec-2018 << last shipped

Pulling the qla2xxx driver from 862-27.1 (latest 7.5z) into 957-5.1 results in the customer again being able to see his disks under 7.6.


Please advise.

Comment 89 Himanshu Madhani (Marvell) 2019-02-26 18:28:49 UTC
(In reply to Dwight (Bud) Brown from comment #88)
> > Hi David, 
> > Can you try with RHEL7.6 Z stream kernel and see if the same issue is seen? 
> 
> 
> Himanshu, 
> 
> I'm not sure I understand, who's David?.  Did you mean Dwight?  Anyway, as
> noted in comment #82
> 

Sorry about the mix-up. I did mean Dwight. 

> "
> Running 7.6 (3.10.0-957.5.1.el7.x86_64)
>    . 10.00.00.06.07.6-k (in-box driver)
>      - doesn't see the IBM 2145 disks, but does see tapes and changers.
>      - using ql2xnvmeenable didn't help
>    . 8.08.00.08.07.5-k9 (from QLogic website) 
>      - disks being seen again.
> "
> 
> So as you can see from the original comment, the customer is running latest
> shipped 7.6 (zstream) kernel.  Are you aware of specific driver commits
> between the latest shipped 7.6 kernel and a future one still unreleased?  Is
> that what you want the customer to test against?  For example, since the
> last released kernel, there have been several brew builds on later kernels:
> 
> [ 1] RHEL7.6.z: ( )3.10.0-957.8.1.el7   09-Jan-2019
> [ 2] RHEL7.6.z: ( )3.10.0-957.7.1.el7   08-Jan-2019
> [ 3] RHEL7.6.z: ( )3.10.0-957.6.1.el7   25-Dec-2018 << these and later are
> potential hotfix kernels, but not shipped.
> [ 4] RHEL7.6.z: (!)3.10.0-957.5.1.el7   19-Dec-2018 << last shipped
> 
We added a patch in RHEL 7.6.z kernel version 3.10.0-957.7.1.el7 which helped recover paths in a multipath environment. We should try the driver from that kernel to see if it helps.

> Pulling the qla2xxx driver from 862-27.1 (latest 7.5z) into 957-5.1 results
> in the customer again being able to see his disks again under 7.6.
> 
> 
> Please advise.

Also, can you provide debug logs using ql2xextended_error_logging=1 for both the working and non-working cases (i.e. the RHEL 7.5.z driver in RHEL 7.6, and the RHEL 7.6 inbox driver)?
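
(A sketch of the two usual ways to set it, assuming the parameter is runtime-writable on this driver build:)

  # persistent, via modprobe.d (rebuild the initramfs to cover early boot)
  echo "options qla2xxx ql2xextended_error_logging=1" > /etc/modprobe.d/qla2xxx-debug.conf
  # or toggled on the running system
  echo 1 > /sys/module/qla2xxx/parameters/ql2xextended_error_logging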

Thanks,
Himanshu

Comment 90 Dwight (Bud) Brown 2019-02-28 13:42:45 UTC
Created attachment 1539493 [details]
messages from 957-5.1, all devices present afterwards

case 02324311 - messages from 957-5.1, all devices present afterwards

Symptom in the case: after a tape library restart, devices do not return as they did in 7.5.

Comment 91 Dwight (Bud) Brown 2019-02-28 13:44:38 UTC
Created attachment 1539494 [details]
messages 957-5.1 w/lip & debug after devices disappeared

case 02324311 - messages 957-5.1 w/lip & debug after devices disappeared

An attempt to manually rediscover the missing devices via LIP didn't work.  The tapes/changers were still missing after the LIP.
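
(For reference, the LIP goes through the standard fc_host knob; a wildcard SCSI rescan is the other usual manual rediscovery step. A sketch, host number assumed:)

  echo 1 > /sys/class/fc_host/host1/issue_lip       # force a LIP / rediscovery on the port
  echo "- - -" > /sys/class/scsi_host/host1/scan    # wildcard rescan of all channels/targets/LUNs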

Comment 92 Dwight (Bud) Brown 2019-02-28 13:47:43 UTC
A new 957-5.1 test kernel with the qla2xxx patch from 957-7.1 was provided to two customers:

customer #1 - loses tapes/changers after a tape library restart and they never return in 7.6*; works fine in 7.5*, where the tapes/changers come back within 1 minute.
customer #2 - upon boot of 7.6*, tapes/changers show up but no disks.  7.5 works fine; 7.6 with the driver from 7.5 works fine.

Comment 93 Dwight (Bud) Brown 2019-03-01 11:53:09 UTC
customer #1 - loses tapes/changers after a tape library restart and they never return in 7.6*; works fine in 7.5*, where the tapes/changers come back within 1 minute.

[No Change] : The test kernel with the identified patch from 957-7.1 did not address the issue.  

When the tape library is restarted, it still never returns to the configuration.  I have a new messages file for this kernel, which I will upload after I pull it down and unpack it.  There is no workaround identified for this issue other than rebooting the affected production servers.

Comment 94 Dwight (Bud) Brown 2019-03-01 12:00:46 UTC
customer #2 - upon boot of 7.6*, tapes/changers show up but no disks.  7.5 works fine; 7.6 with the driver from 7.5 works fine.

[No Change] : The test kernel with the identified patch from 957-7.1 did not address the issue. 

The IBM disks are still not visible after boot with 7.6 kernels, but they are with 7.5 kernels.  The customer provided new messages files for this kernel, and I will upload them after I pull the files down and unpack them.  There is no workaround identified for this issue other than downgrading to 7.5.  Turning the NVMe option off does not change the issue.  Running the v9 driver pulled from 7.5 in the 7.6 kernels results in the issue not being seen.

Comment 95 jennifer.duong 2019-03-05 15:35:00 UTC
Himanshu, do you happen to have an update on what my next steps should be?

Comment 96 Robert Palco 2019-03-06 16:20:18 UTC
Created attachment 1541498 [details]
messages - 957.5.1.el7.QLAV10_7.1.2

customer #1 - loses tapes/changers after a tape library restart and they never return in 7.6*; works fine in 7.5*, where the tapes/changers come back within 1 minute.
Testing with kernel 3.10.0-957.5.1.el7.QLAV10_7.1.2.

case 02324311 - with the system still running and the library in the DEAD state, we ran lip.bsh again and uploaded the resulting messages log.

The library returns about 80 seconds after being restarted, as it did with 3.10.0-862.14.4.el7.

Comment 97 jennifer.duong 2019-03-13 14:12:25 UTC
Himanshu, do you have an update on this?

Comment 98 Himanshu Madhani (Marvell) 2019-03-13 23:21:58 UTC
Hi Jennifer, 

(In reply to jennifer.duong from comment #95)
> Himanshu, do you happen to have an update on what my next steps should be?

Looks like this comment got lost in the subsequent updates in this BZ.

From the FC trace I don't see anything out of the ordinary; we do get responses to PLOGI/PRLI. 

Given that you are not seeing the issue in Fabric mode, we need to identify how we can verify this
in N2N mode.

I'll update tomorrow if there is any way we can confirm this. 

I will also want you to verify this once we have a RHEL 7.7 ISO build. 

Thanks,
Himanshu

Comment 99 Dwight (Bud) Brown 2019-03-14 14:22:40 UTC
I've got another customer with issues since updating to 7.6 with 16Gb QLogic adapters; it appears that port discovery stutters, and that's the only way I know how to describe it.

#scsi_addr   name                   version                f/w                       device
#----------- ---------------------- ---------------------- ------------------------- ----------------------------------------------
 0:*:*:*     megaraid_sas                                                            /sys/devices/pci0000:00/0000:00:02.2/0000:07:00.0/host0
 1:*:*:*     qla2xxx                                       7.05.04 (d0d5)            /sys/devices/pci0000:00/0000:00:03.0/0000:11:00.0/host1
 2:*:*:*     qla2xxx                                       7.05.04 (d0d5)            /sys/devices/pci0000:00/0000:00:03.0/0000:11:00.1/host2
 :

#            --------- PCI -------------
#                          subsystem      model        model
#scsi_addr   vendor device vendor device  name         description
#----------- ------ ------ ------ ------  ------------ --------------------------------------------------
 0:*:*:*     0x1000 0x005d 0x1014 0x0454               <D:MegaRAID SAS-3 3108 [Invader]>
 1:*:*:*     0x1077 0x2031 0x1077 0x0263  QLE2662      QLogic 16Gb FC Dual-port HBA for System x
 2:*:*:*     0x1077 0x2031 0x1077 0x0263  QLE2662      QLogic 16Gb FC Dual-port HBA for System x
 :

#SCSI               HBA                                             Fabric              Storage                                        Port
#Addr          Luns wwnn               wwpn               portid    wwn                 wwnn               wwpn                portid  info
#------------- ---- ------------------/------------------/--------  ------------------  ------------------/------------------/--------/---------
1:0:0:-         282 0x2000000e1ee8ba56 0x2100000e1ee8ba56 0x6a95c0  0x100050eb1afa122c  0x50060e8007e60963 0x50060e8007e60963 0x6a1900 FCP Target
1:0:1:-         225 0x2000000e1ee8ba56 0x2100000e1ee8ba56 0x6a95c0  0x100050eb1afa122c  0x50060e8007e61763 0x50060e8007e61763 0x693500 FCP Target
              
2:0:0:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:1:-         225 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e61773 0x50060e8007e61773 0x692500 FCP Target

>> These are the expected two storage targets: we see hba->switch->storage targets with SAN DIDs of 0x6a1900 and 0x692500.
>> Below, however, the same storage target is repeated through the same switch/SAN to the same end port -- there should be only one target discovery,
>> but in this case the same storage target (0x6a1900) was discovered and added to the kernel storage interconnect topology multiple times.
>> It's possible that some last-stage discovery error results in this behavior, but I have not seen it before.  There have been regression issues
>> within the qla2xxx driver in 7.6, especially with 16Gb adapters, and this problem may be related to those issues.
   
                               same hba (duh)                        same switch by id  same storage port by ids wwpn,wwnn, san did
                    vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv  vvvvvvvvvvvvvvvvvv  vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
2:0:2:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:3:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:4:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:5:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:6:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:7:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:8:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:9:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:10:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:11:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:12:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:13:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:14:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:15:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
2:0:16:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0  0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target

So the QLogic HBA discovers the same port over and over again, resulting in the same port being registered under a new SCSI target index; at least, that is what it appears from the above data.  So instead of the expected 4 paths to a device, the system ends up with 17 paths, 15 being repeats of hba->switch->SAN DID 0x6a1900.  The customer is running 7.05.xx firmware on the 16Gb adapters, but I am not sure that has anything to do with it.
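
(One way to show the duplicate registrations directly is to enumerate the FC remote ports the transport layer holds; a sketch for host 2:)

  # each rport entry carries its WWPN and fabric port id; duplicates show up as
  # many rport-2:0-* entries with the same port_name/port_id
  for r in /sys/class/fc_remote_ports/rport-2:0-*; do
      echo "$r $(cat $r/port_name) $(cat $r/port_id)"
  done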

Comment 100 Himanshu Madhani (Marvell) 2019-03-20 23:58:03 UTC
Hi Dwight, 

Can you please create a new BZ for this issue? I do not see this issue as related to this particular BZ, and I would like to keep the issues separate. 

(In reply to Dwight (Bud) Brown from comment #99)
> I've another customer with issues since updating to 7.6 with 16G QLogic
> adapters, it appears that port discover stutters, that's the only way I know
> how to describe it.
> 
> #scsi_addr   name                   version                f/w              
> device
> #----------- ---------------------- ----------------------
> ------------------------- ----------------------------------------------
>  0:*:*:*     megaraid_sas                                                   
> /sys/devices/pci0000:00/0000:00:02.2/0000:07:00.0/host0
>  1:*:*:*     qla2xxx                                       7.05.04 (d0d5)   
> /sys/devices/pci0000:00/0000:00:03.0/0000:11:00.0/host1
>  2:*:*:*     qla2xxx                                       7.05.04 (d0d5)   
> /sys/devices/pci0000:00/0000:00:03.0/0000:11:00.1/host2
>  :
> 
> #            --------- PCI -------------
> #                          subsystem      model        model
> #scsi_addr   vendor device vendor device  name         description
> #----------- ------ ------ ------ ------  ------------
> --------------------------------------------------
>  0:*:*:*     0x1000 0x005d 0x1014 0x0454               <D:MegaRAID SAS-3
> 3108 [Invader]>
>  1:*:*:*     0x1077 0x2031 0x1077 0x0263  QLE2662      QLogic 16Gb FC
> Dual-port HBA for System x
>  2:*:*:*     0x1077 0x2031 0x1077 0x0263  QLE2662      QLogic 16Gb FC
> Dual-port HBA for System x
>  :
> 
> #SCSI               HBA                                             Fabric  
> Storage                                        Port
> #Addr          Luns wwnn               wwpn               portid    wwn     
> wwnn               wwpn                portid  info
> #------------- ---- ------------------/------------------/-------- 
> ------------------  ------------------/------------------/--------/---------
> 1:0:0:-         282 0x2000000e1ee8ba56 0x2100000e1ee8ba56 0x6a95c0 
> 0x100050eb1afa122c  0x50060e8007e60963 0x50060e8007e60963 0x6a1900 FCP Target
> 1:0:1:-         225 0x2000000e1ee8ba56 0x2100000e1ee8ba56 0x6a95c0 
> 0x100050eb1afa122c  0x50060e8007e61763 0x50060e8007e61763 0x693500 FCP Target
>               
> 2:0:0:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 
> 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:1:-         225 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 
> 0x100050eb1affdcd8  0x50060e8007e61773 0x50060e8007e61773 0x692500 FCP Target
> 
> >> these are the expected two storage targets, we see hba->switch->storage targets with san did of 0x6a1900, 0x692500
> >> but below is repeated storage targets to thru the same switch/san to the same end port -- there should only be one target discovery,
> >> but in this case the same storage target (0x6a1900) was discovered and added to the kernel storage interconnect topology multiple times.
> >> its possible that some last stage discover error results in this behavior, but have not seen it before.  There have been regression issues
> >> within the qla2xxx driver within 7.6, especially with 16G adapters, and this problem may be related to those issues.
>    
>                                same hba (duh)                        same
> switch by id  same storage port by ids wwpn,wwnn, san did
>                     vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 
> vvvvvvvvvvvvvvvvvv  vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> 2:0:2:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:3:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:4:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:5:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:6:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:7:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:8:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:9:-         282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:10:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:11:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:12:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:13:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:14:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:15:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 2:0:16:-        282 0x2000000e1ee8ba57 0x2100000e1ee8ba57 0x6abac0 0x100050eb1affdcd8  0x50060e8007e60973 0x50060e8007e60973 0x6a1900 FCP Target
> 
So the QLogic HBA discovers the same remote port over and over again, and each discovery registers that port under a new SCSI target index; at least that is what the above data suggests. Instead of the expected 4 paths to a device, the system ends up with 17 paths, 15 of which are repeats of hba->switch->san did 0x6a1900. The customer is running 7.05.xx firmware on the 16G adapters, but it is not clear whether that is a factor.

Do you happen to have a log file showing the multiple port registrations? Please ask the customer to capture a log file with ql2xextended_error_logging=1.

Please provide logs from the failure case so I can determine whether this is a known issue.
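
For reference, a minimal sketch of how that logging is usually enabled (the sysfs path and modprobe.d approach are standard for qla2xxx, but the exact values the parameter accepts can vary between driver versions):

    # Runtime: enable extended error logging on a live system (lost at reboot).
    echo 1 > /sys/module/qla2xxx/parameters/ql2xextended_error_logging

    # Persistent: set the option at module load time and rebuild the initramfs
    # so it also takes effect during early boot / installation.
    echo "options qla2xxx ql2xextended_error_logging=1" > /etc/modprobe.d/qla2xxx.conf
    dracut -f

    # The extra messages land in the kernel log:
    dmesg | grep -i qla2xxx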

Thanks,
Himanshu

Comment 101 jennifer.duong 2019-03-27 14:24:54 UTC
Himanshu, what should my next steps be?

Comment 102 jennifer.duong 2019-03-29 19:35:46 UTC
Himanshu, I did a network install of the latest RHEL 7.7 nightly build onto my SANboot LUN with FW:v8.07.80 DVR:v10.00.00.12.07.7-k and it booted just fine. It was also able to see the remainder of my volumes. I then upgraded to FW:v8.08.03, and my host booted into emergency mode with the following warnings:

/dev/mapper/rhel_ictm1608s02h4-root does not exist
/dev/rhel_ictm1608s02h4/root does not exist
/dev/rhel_ictm1608s02h4/swap does not exist

I'm guessing that I hit that message because this issue still exists in RHEL 7.7, specifically when FW:v8.08.03 is loaded onto the HBAs. I rebooted the host with FW:v8.07.80 loaded onto the HBAs, and it was able to reboot multiple times without losing sight of the SANboot LUN. I will be attaching the serial logs shortly.
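
For what it's worth, one way to double-check which firmware and driver the kernel actually sees after a flash is via sysfs; a rough sketch (fw_version is the attribute qla2xxx normally exposes under scsi_host, and host numbers will differ per system):

    # Driver version as loaded
    modinfo -F version qla2xxx

    # Firmware version reported by each HBA port (qla2xxx hosts only)
    for h in /sys/class/scsi_host/host*; do
        [ -r "$h/fw_version" ] && echo "$h: $(cat "$h/fw_version")"
    done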

Comment 103 jennifer.duong 2019-03-29 19:39:22 UTC
Created attachment 1549588 [details]
ICTM1608S02H4-3-29-19 7.7 Nightly build

Comment 104 Himanshu Madhani (Marvell) 2019-03-29 19:56:57 UTC
Hi Jennifer, 

Can you try the 8.08.204 firmware posted on our download site to check whether the issue is still reproducible?

http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/SearchByProduct.aspx?ProductCategory=39&Product=1261&Os=2

Thanks,
Himanshu

Comment 105 jennifer.duong 2019-03-29 22:35:39 UTC
Himanshu, it doesn't look like I'm able to reproduce this with FW:v8.08.204.

Comment 106 Himanshu Madhani (Marvell) 2019-04-01 22:31:01 UTC
Hi Jennifer, 

(In reply to jennifer.duong from comment #105)
> Himanshu, it doesn't look like I'm able to reproduce this with FW:v8.08.204.

Thanks for the update. 

Thanks,
Himanshu

Comment 107 jennifer.duong 2019-04-03 14:21:15 UTC
Himanshu, since it looks like this issue is fixed in the latest QLogic FW 8.08.204, go ahead and close this bug.

Comment 108 Himanshu Madhani (Marvell) 2019-04-03 16:29:00 UTC
Thanks, Jennifer, for the confirmation. We'll close this Bugzilla.

(In reply to jennifer.duong from comment #107)
> Himanshu, since it looks like this issue is fixed in the latest QLogic FW
> 8.08.204, go ahead and close this bug.

