Description of problem:
After upgrading the kernel from kernel-2.6.32-131.6.1.el6.x86_64 to kernel-2.6.32-131.12.1.el6.x86_64, the following problem occurs in roughly 50% of system boots:
A FC SAN LUN is not discovered at boot-time. This leads to a volume group not being available.
Version-Release number of selected component (if applicable):
kernel-2.6.32-131.12.1.el6.x86_64
How reproducible:
The problem occurs randomly; after around 10 reboots, my estimate is that it occurs in roughly 50% of the boot sequences.
Steps to Reproduce:
1. Reboot the server.
2. multipath -ll | fgrep XIV
3. When the LUN has not been discovered, the grep yields nothing; when the LUN has been discovered, the grep yields one line (expected in this setup):
2001738000565024e dm-2 IBM,2810XIV
Actual results:
In about half the cases, the LUN is not known by multipath.
Expected results:
The LUN is always present.
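The check in steps 2-3 can be sketched as a small shell helper (the vendor string and the sample line are taken from the output above; the function just wraps the grep):

```shell
#!/bin/sh
# Succeed (exit 0) when the given `multipath -ll` output mentions the XIV LUN.
xiv_lun_present() {
    printf '%s\n' "$1" | grep -q 'IBM,2810XIV'
}

# Sample line as seen when the LUN has been discovered (expected in this setup).
good_output='2001738000565024e dm-2 IBM,2810XIV'

if xiv_lun_present "$good_output"; then
    echo "XIV LUN discovered"
else
    echo "XIV LUN missing"
fi
```

On the live system the equivalent check is simply `multipath -ll | grep -q XIV`.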
I tried reverting to kernel-2.6.32-131.6.1.el6.x86_64, and then rebooted 8 times; with this configuration, the LUN showed up every time.
When running with 2.6.32-131.12.1.el6.x86_64, I tried adjusting rc.sysinit after finding the following page: http://www.firetooth.net/confluence/display/public/Linux+-+Multipath
rc.sysinit contains these lines:
modprobe dm-multipath > /dev/null 2>&1
/sbin/multipath -v 0
Around them, I put in echo statements (each followed by a short sleep):
echo "About to modprobe dm-multipath; sleeping a bit"
echo "modprobe done; sleeping again"
echo "multipath was run; sleeping again"
The adjustment to rc.sysinit doesn't make the problem go away, but likelihood of the problem occurring seems to decrease a little bit.
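Putting the pieces above together, the instrumented rc.sysinit fragment presumably looked roughly like this (a sketch: the sleep durations are placeholders, not the exact values used):

```shell
# Sketch of the instrumented rc.sysinit fragment.
# NOTE: the sleep durations below are placeholders, not the values actually used.
echo "About to modprobe dm-multipath; sleeping a bit"
sleep 5                                   # placeholder duration
modprobe dm-multipath > /dev/null 2>&1
echo "modprobe done; sleeping again"
sleep 5                                   # placeholder duration
/sbin/multipath -v 0
echo "multipath was run; sleeping again"
sleep 5                                   # placeholder duration
```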
FYI, the versions of some possibly related packages:
Created attachment 521344 [details]
Output of "multipath -ll" when there is no problem
Created attachment 521345 [details]
Output of "lsscsi" when there is no problem
Created attachment 521346 [details]
Output of "multipath -ll" when there IS a problem
Created attachment 521347 [details]
Output of "lsscsi" when there IS a problem
The server (a Dell R710) has some local storage (served by an on-board PERC H700 RAID controller) which hosts the operating system and swap.
Its SAN connectivity runs through two 4 Gbit/s QLogic HBAs, connected to two Brocade FC switches.
Behind the switches are three different storage systems:
- An IBM DS4800 system
- A Hitachi AMS2100 system
- An IBM XIV (generation 2) system
The LUN which is not discovered half the time (after upgrading the kernel) is on the IBM XIV system. However, I suspect that the problem is really related to the handling of the DS4800 LUNs. This suspicion is based on some ugly messages seen on the console during boot. The messages show up before the stage where the dm-multipath kernel module is loaded by rc.sysinit, and they appear no matter which of the two kernels is booted:
sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:1: [sds] Sense Key : Illegal Request [current]
sd 2:0:0:1: [sds] <<vendor>> ASC=0x94 ASCQ=0x1
sd 2:0:0:1: [sds] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
end_request: I/O error, dev sds, sector 0
Buffer I/O error on device sds, logical block 0
(sds is a path to a LUN on the DS4800 storage system.)
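To compare the sense data across boots, the ASC/ASCQ codes can be pulled out of such kernel lines with sed (a sketch; the sample line is copied from the messages above):

```shell
#!/bin/sh
# Extract the ASC and ASCQ codes from a kernel sense-data line.
extract_asc()  { printf '%s\n' "$1" | sed -n 's/.*ASC=\(0x[0-9a-fA-F]*\).*/\1/p'; }
extract_ascq() { printf '%s\n' "$1" | sed -n 's/.*ASCQ=\(0x[0-9a-fA-F]*\).*/\1/p'; }

# Sample line from the console messages above.
line='sd 2:0:0:1: [sds] <<vendor>> ASC=0x94 ASCQ=0x1'
echo "ASC=$(extract_asc "$line") ASCQ=$(extract_ascq "$line")"
```

On the live system the same extraction can be run over `dmesg` output.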
A screenshot illustrating the messages will be uploaded as "console-shot1.png".
By the way: this might somehow be related to RH support case 484711, which concerns a situation where a swap partition on a local RAID volume (/dev/sdb1) is not discovered at boot-time unless the following is inserted in rc.local:
This problem happens no matter which of the two kernels is being used, though.
Created attachment 521348 [details]
Screenshot of the server's console during boot
Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker.
Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.
The official life cycle policy can be reviewed here:
This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: