Bug 735600

Summary: Regression in 2.6.32-131.12.1 regarding LUN discovery
Product: Red Hat Enterprise Linux 6 Reporter: Troels Arvin <troels>
Component: kernel Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED WONTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1   
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2017-12-06 10:15:48 UTC
Attachments:
 - Output of "multipath -ll" when there is no problem (flags: none)
 - Output of "lsscsi" when there is no problem (flags: none)
 - Output of "multipath -ll" when there IS a problem (flags: none)
 - Output of "lsscsi" when there IS a problem (flags: none)
 - Screenshot of the server's console during boot (flags: none)

Description Troels Arvin 2011-09-03 22:03:49 UTC
Description of problem:
After upgrading the kernel from kernel-2.6.32-131.6.1.el6.x86_64 to kernel-2.6.32-131.12.1.el6.x86_64, the following problem occurs in around 50% of system boots:
An FC SAN LUN is not discovered at boot time. As a result, a volume group is not available.

Version-Release number of selected component (if applicable):
2.6.32-131.12.1.el6.x86_64

How reproducible:
The problem occurs randomly; after around 10 reboots, my estimate is that it occurs in around 50% of the boot sequences.

Steps to Reproduce:
1. Reboot.
2. multipath -ll | fgrep XIV
3. When the LUN has not been discovered, the grep yields nothing; when the LUN has been discovered, the grep yields one line (expected in this setup):
2001738000565024e dm-2 IBM,2810XIV
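
The check in step 2 could be wrapped in a small helper so that it can be scripted after each boot. This is only a sketch: the function name lun_present is made up for illustration, and it simply does the same fixed-string match as the fgrep above.

```shell
#!/bin/sh
# lun_present PATTERN: read "multipath -ll" output on stdin and succeed
# (exit status 0) if any line contains PATTERN as a fixed string.
# The function name is hypothetical; it is not part of multipath-tools.
lun_present() {
    pattern=$1
    # -F: fixed-string match (same as fgrep); -q: no output, status only.
    grep -F -q -- "$pattern"
}

# Intended use on the affected server (not run here):
#   if multipath -ll | lun_present XIV; then
#       echo "XIV LUN present"
#   else
#       echo "XIV LUN MISSING"
#   fi
```

The exit status makes it easy to branch on the result from rc.local or a similar boot-time script.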
  
Actual results:
In about half of the cases, the LUN is not known to multipath.

Expected results:
The LUN is always present.

Additional info:
I tried reverting to kernel-2.6.32-131.6.1.el6.x86_64, and then rebooted 8 times; with this configuration, the LUN showed up every time.
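
To compare the two kernels over more reboots than can be counted by hand, the outcome of each boot could be logged and tallied per kernel. The following is a rough sketch; the function names and the log file path in the comments are assumptions, not something that exists on the system.

```shell
#!/bin/sh
# record_boot LOG KERNEL STATUS: append one line "<kernel> <status>" to LOG.
# STATUS is expected to be "ok" or "missing". Names and log path are hypothetical.
record_boot() {
    log=$1 kernel=$2 status=$3
    printf '%s %s\n' "$kernel" "$status" >> "$log"
}

# summarize_boots LOG: print "<kernel> ok=<n> missing=<n>" for each kernel in LOG.
summarize_boots() {
    awk '{ seen[$1] = 1; count[$1 " " $2]++ }
         END {
             for (k in seen)
                 printf "%s ok=%d missing=%d\n", k, count[k " ok"], count[k " missing"]
         }' "$1" | sort
}

# Intended use late in the boot sequence, e.g. from rc.local (not run here):
#   if multipath -ll | grep -F -q XIV; then s=ok; else s=missing; fi
#   record_boot /var/log/xiv-lun-boots.log "$(uname -r)" "$s"
```

Running summarize_boots on the log after a series of reboots gives one line per kernel, which makes the difference between the two versions easier to state precisely.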

When running the 2.6.32-131.12.1.el6.x86_64 kernel, I tried adjusting rc.sysinit after finding the following page through some googling: http://www.firetooth.net/confluence/display/public/Linux+-+Multipath
My adjustment:
Instead of
   modprobe dm-multipath > /dev/null 2>&1
   /sbin/multipath -v 0
I put in:
    echo "About to modprobe dm-multipath; sleeping a bit"
    sleep 30
    modprobe dm-multipath
    echo "modprobe done; sleeping again"
    sleep 10
    /sbin/multipath
    echo "multipath was run; sleeping again"
    sleep 10

The adjustment to rc.sysinit doesn't make the problem go away, but the likelihood of the problem occurring seems to decrease slightly.
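
Instead of fixed sleeps, one variation would be to poll until the XIV paths are visible (or a retry limit is hit) before running multipath. This is an untested idea, not a known fix for this bug; retry_until and xiv_visible are made-up names, and matching "2810XIV" in lsscsi output is an assumption based on the product string shown by multipath.

```shell
#!/bin/sh
# retry_until TRIES DELAY CMD...: run CMD until it succeeds.
# CMD is run at least once and at most TRIES times, with DELAY seconds
# between attempts. Returns 0 on success, 1 if every attempt failed.
retry_until() {
    tries=$1 delay=$2
    shift 2
    n=0
    while :; do
        "$@" && return 0
        n=$((n + 1))
        [ "$n" -ge "$tries" ] && return 1
        sleep "$delay"
    done
}

# Intended use in place of the fixed sleeps (not run here; assumes that
# lsscsi lists the XIV paths with the model string 2810XIV):
#   xiv_visible() { lsscsi | grep -q 2810XIV; }
#   modprobe dm-multipath
#   retry_until 30 1 xiv_visible || echo "XIV paths not seen after 30 tries"
#   /sbin/multipath -v 0
```

Polling bounds the extra boot delay to what is actually needed and also records, through the echo, the boots where the paths never showed up at all.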

Comment 1 Troels Arvin 2011-09-03 22:05:51 UTC
FYI, the version of some possibly related packages:
 - device-mapper-multipath-0.4.9-41.el6
 - kpartx-0.4.9-41.el6.x86_64

Comment 2 Troels Arvin 2011-09-03 22:08:46 UTC
Created attachment 521344
Output of "multipath -ll" when there is no problem

Comment 3 Troels Arvin 2011-09-03 22:09:33 UTC
Created attachment 521345
Output of "lsscsi" when there is no problem

Comment 4 Troels Arvin 2011-09-03 22:10:43 UTC
Created attachment 521346
Output of "multipath -ll" when there IS a problem

Comment 5 Troels Arvin 2011-09-03 22:11:18 UTC
Created attachment 521347
Output of "lsscsi" when there IS a problem

Comment 7 Troels Arvin 2011-09-03 22:32:18 UTC
The server (which is a Dell R710) has some local storage (served by a PERC H700 on-board RAID-controller) which hosts the operating system and swap.

SAN connectivity is provided by two 4 Gbit/s QLogic HBAs, connected to two Brocade FC switches.

Behind the switches are three different storage systems:
 - An IBM DS4800 system
 - A Hitachi AMS2100 system
 - An IBM XIV (generation 2) system

The LUN which is not discovered half the time (after upgrading the kernel) is on the IBM XIV system. But I suspect that the problem is really related to the handling of the DS4800 LUNs. This suspicion is based on some ugly messages seen on the console when booting. The messages show up before the stage where the dm-multipath kernel module is loaded by rc.sysinit, and they are seen no matter which of the two kernels is booted:

sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:1: [sds] Sense Key : Illegal Request [current] 
sd 2:0:0:1: [sds] <<vendor>> ASC=0x94 ASCQ=0x1ASC=0x94 ASCQ=0x1
sd 2:0:0:1: [sds] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
end_request: I/O error, dev sds, sector 0
Buffer I/O error on device sds, logical block 0

(sds is a path to a LUN on the DS4800 storage system.)
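
To see which devices produce these errors without reading the console by eye, the kernel log could be filtered for the "end_request: I/O error" lines. A minimal sketch follows; the function name io_error_devices is invented here, and it only understands the message format shown above.

```shell
#!/bin/sh
# io_error_devices: read kernel log text on stdin and print each device that
# appears in a line of the form
#   end_request: I/O error, dev <name>, sector <n>
# once, in sorted order. The function name is hypothetical.
io_error_devices() {
    sed -n 's/.*end_request: I\/O error, dev \([^,]*\), sector.*/\1/p' | sort -u
}

# Intended use on the affected server (not run here):
#   dmesg | io_error_devices
```

The resulting device names can then be matched against the lsscsi output to check whether they all belong to the DS4800.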

A screenshot illustrating the messages will be uploaded as "console-shot1.png".


By the way: this might be related to RH support case 484711, which concerns a swap partition on a local RAID volume (/dev/sdb1) that is not discovered at boot time unless the following is inserted in rc.local:
partprobe /dev/sdb
swapon /dev/sdb1
This problem happens no matter which of the two kernels is being used, though.
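
The rc.local workaround could be made conditional, so that partprobe and swapon only run on boots where the swap area is actually missing. This is just a sketch; swap_active is a made-up helper that looks the device up in /proc/swaps.

```shell
#!/bin/sh
# swap_active DEVICE [SWAPS_FILE]: succeed if DEVICE is listed as an active
# swap area. SWAPS_FILE defaults to /proc/swaps, which has one header line
# followed by one line per active area with the device name in column 1.
# The function name is hypothetical.
swap_active() {
    dev=$1 swaps=${2:-/proc/swaps}
    awk -v d="$dev" 'NR > 1 && $1 == d { found = 1 } END { exit !found }' "$swaps"
}

# Intended use in rc.local (not run here):
#   if ! swap_active /dev/sdb1; then
#       partprobe /dev/sdb
#       swapon /dev/sdb1
#   fi
```

Logging the boots where the condition fires would also show how often the swap partition goes undetected.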

Comment 8 Troels Arvin 2011-09-03 22:33:07 UTC
Created attachment 521348
Screenshot of the server's console during boot

Comment 9 RHEL Program Management 2011-10-07 15:47:13 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 10 Jan Kurik 2017-12-06 10:15:48 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/