Bug 209160 - [RHEL5 Beta2] kernel(qla2xxx): parallel scanning of SCSI devices causes name changes
Summary: [RHEL5 Beta2] kernel(qla2xxx): parallel scanning of SCSI devices causes name ...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Chip Coldwell
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 216989 227613 228988 230627 243319
 
Reported: 2006-10-03 16:51 UTC by Kiyoshi Ueda
Modified: 2009-06-19 09:16 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-07-26 13:22:07 UTC
Target Upstream Version:
Embargoed:



Description Kiyoshi Ueda 2006-10-03 16:51:13 UTC
+++ This bug was initially created as a clone of Bug #208332 +++

Description of problem:
A system panic often occurs at boot time when the root filesystem
is on a multipath device and the multipath storage has many LUNs.

When the storage has many LUNs and multiple ports connected to
QLogic FC cards, the device name of each LUN varies from one reboot
to the next in the current RHEL5, because the QLogic FC cards are
initialized almost in parallel.

mkinitrd, however, assumes that the device name (actually the major
and minor numbers) stays the same.
As a result, wrong multipath maps can be created at boot time in the
initrd environment, and the root filesystem cannot be found.


Version-Release number of selected component:
mkinitrd-5.1.15-1


How reproducible:
Often, in an environment with 4 LUNs and 2 ports.
(Failed 9 times out of 10 trials.)


Steps to Reproduce:
 0. Prepare multipath storage that has many LUNs, and
    some QLogic FC cards.
 1. Connect the cards to the ports of the storage to create
    a multipath environment.
 2. Install the OS on the multipath storage using the multipath root
    support of the installer.
    (The boot filesystem (such as /boot or /boot/efi) does not need to
     be installed on a multipath device.)
 3. Boot the installed OS.


Actual results:
The root filesystem cannot be found and a system panic occurs.
-------------------------------------------------------------------
Loading dm-multipath.ko module
device-mapper: multipath: version 1.0.4 loaded
Loading dm-round-robin.ko module
device-mapper: multipath round-robin: version 1.0.0 loaded
Creating root device.
Mounting root filesystem.
mount: could not find filesystem '/dev/root'
Setting up other fKernel panic - not syncing: Attempted to kill init!
-------------------------------------------------------------------


Expected results:
System panic should not occur.


Additional info:

-- Additional comment from pjones on 2006-09-27 17:50 EST --


*** This bug has been marked as a duplicate of 157082 ***

Comment 1 Kiyoshi Ueda 2006-10-03 16:53:27 UTC
I cloned this bug as a bug against the QLogic FC card driver.

In the current RHEL5, FC disks connected to multiple QLogic FC HBAs
are scanned in parallel.
However, there is no persistent device naming scheme, so it is hard
to identify a given device in an environment with multiple QLogic FC
HBAs and multiple FC disks.

Currently, this causes a system panic at boot time with
multipath root support.  (See the original bug report.)
If the driver scanned FC disks serially, as the RHEL4 driver does,
the multipath root support in the current mkinitrd should work
as long as the physical configuration does not change.

Example:
-------------------------------------------------------------------
Environment: 2 HBAs (host0, host1)
             4 LUNs multipath storage (lun0, lun1, lun2, lun3)

Scan order:
      current RHEL4         current RHEL5
      (2.6.9-42.EL)       (2.6.18-1.2702.el5)
    --------------------------------------------
        host0-lun0            host0-lun0
        host0-lun1            host1-lun0
        host0-lun2            host0-lun1
        host0-lun3            host0-lun2
        host1-lun0            host1-lun1
        host1-lun1            host0-lun3
        host1-lun2            host1-lun2
        host1-lun3            host1-lun3
    (Always same order)  (Varies in each boot)
-------------------------------------------------------------------
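
To make the effect on device names concrete, here is a minimal Python sketch.
It assumes that sd names (sda, sdb, ...) are handed out strictly in discovery
order and that no other SCSI disks are present. The function name
assign_sd_names is hypothetical; the two order lists are copied from the
table above.

```python
import string

def assign_sd_names(scan_order):
    """Map each discovered path to an sd name in discovery order (assumption)."""
    return {path: "sd" + string.ascii_lowercase[i]
            for i, path in enumerate(scan_order)}

# Scan orders copied from the table above.
rhel4_order = ["host0-lun0", "host0-lun1", "host0-lun2", "host0-lun3",
               "host1-lun0", "host1-lun1", "host1-lun2", "host1-lun3"]
rhel5_order = ["host0-lun0", "host1-lun0", "host0-lun1", "host0-lun2",
               "host1-lun1", "host0-lun3", "host1-lun2", "host1-lun3"]

rhel4_names = assign_sd_names(rhel4_order)
rhel5_names = assign_sd_names(rhel5_order)

# Paths whose sd name differs between the two scan orders.
changed = {p: (rhel4_names[p], rhel5_names[p])
           for p in rhel4_order if rhel4_names[p] != rhel5_names[p]}
for path, (old, new) in changed.items():
    print(f"{path}: {old} -> {new}")
```

With these two orders, five of the eight paths end up with a different name;
for example, host1-lun0 moves from sde to sdb.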


Comment 2 Tom Coughlan 2006-10-05 16:06:42 UTC
Andrew, 

Reliable persistent device naming is planned for RHEL 5.1. In the meantime, in
RHEL 5.0, it may be helpful to consider a way to reduce the impact by disabling
the parallel scan of SCSI hosts. An option to revert to sequential scanning
would help avoid the most common cause of name changes. It is a partial
solution, to be sure, but I wonder if it would be feasible?

Tom

Comment 3 Andrew Vasquez 2006-10-05 21:23:07 UTC
Given the new FC transport infrastructure, the driver has no role in
the lun-scan detection process.  Instead the driver simply makes an
upcall to the transport indicating that a new FC port has been discovered.
If that port has a 'target' role, then a midlayer 'scan-work' event is
placed on the shost's work-queue.  Given the threaded/scheduled semantics
of work-queue handling, there's no guarantee when a work-event will
be processed.  In his test cases, the customer is seeing parallel
work-queue handling.
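
The lack of ordering can be illustrated with a toy model in Python. This is
not kernel code: it assumes each host scans its own LUNs in order, and it uses
a seeded pseudo-random choice as a stand-in for the scheduler deciding which
host's scan work runs next. The name simulate_scan is hypothetical.

```python
import random
from collections import deque

def simulate_scan(hosts, luns, seed):
    """Interleave per-host scan queues in a pseudo-random order.

    Each host discovers its own LUNs in order, but which host makes
    progress next is chosen at random (an assumption of this model).
    """
    rng = random.Random(seed)
    queues = {h: deque(f"{h}-lun{n}" for n in range(luns)) for h in hosts}
    order = []
    while any(queues.values()):
        ready = [h for h in hosts if queues[h]]
        host = rng.choice(ready)
        order.append(queues[host].popleft())
    return order

# Treat each seed as one simulated boot and collect the distinct orders seen.
orders = {tuple(simulate_scan(["host0", "host1"], 4, seed))
          for seed in range(20)}
print(len(orders), "distinct discovery orders in 20 simulated boots")
```

In this model the order within each host never changes; only the interleaving
between hosts does, which is the pattern described in comment 1.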

Comment 4 Jun'ichi NOMURA 2006-10-13 22:07:39 UTC
Is there any workaround for this?

Comment 5 Tom Coughlan 2006-10-17 21:39:51 UTC
I don't know of any workaround. We are going to have to instruct customers in how
to use persistent names (LVM, labels, udev), rather than "sd" (or major, minor
numbers). This is not going to be easy, or perfect, but it is the direction they
have to move in anyway.
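
As a rough illustration of the udev approach, the Python sketch below resolves
persistent symlinks to whatever kernel device node they currently point at.
It assumes the system's udev rules populate /dev/disk/by-uuid (the directory
may be absent on a given system), and the function name persistent_name_map
is hypothetical.

```python
import os

def persistent_name_map(link_dir="/dev/disk/by-uuid"):
    """Return {persistent name: resolved device node} for symlinks in link_dir."""
    mapping = {}
    if not os.path.isdir(link_dir):
        # Directory not provided on this system; nothing to report.
        return mapping
    for name in sorted(os.listdir(link_dir)):
        link = os.path.join(link_dir, name)
        if os.path.islink(link):
            mapping[name] = os.path.realpath(link)
    return mapping

for name, node in persistent_name_map().items():
    print(f"{name} -> {node}")
```

The names on the left stay the same across boots, while the "sd" node on the
right may differ each time, so scripts should refer to the former.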

Comment 6 RHEL Program Management 2006-10-26 19:33:44 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 7 Chip Coldwell 2006-10-31 16:14:08 UTC
(In reply to comment #5)
> I don't know of any workaround. We are going to have to instruct customers in how
> to use persistent names (LVM, labels, udev), rather than "sd" (or major, minor
> numbers). This is not going to be easy, or perfect, but it is the direction they
> have to move in anyway.

Does this mean a release note, then?  Or perhaps a "persistent naming" whitepaper?

Chip

Comment 8 Andrius Benokraitis 2006-10-31 21:04:50 UTC
This bug could be related to bug 213039.

Comment 9 Larry Troan 2007-02-19 15:30:43 UTC
Bug already in GSS list. Removing from feature list.

Comment 14 RHEL Program Management 2007-07-26 13:22:07 UTC
Quality Engineering Management has reviewed and declined this request.  You may
appeal this decision by reopening this request. 

Comment 15 Tom Coughlan 2007-07-27 21:47:45 UTC
The pressing need for this has been removed in 5.1 by the improved support
for dm-multipath in Anaconda and the initrd. The need for persistent "sd" device
names is mostly gone. What remains is to make customers aware of the fact that
they cannot depend on persistent "sd" device names, and have them remove this
dependency from their applications and procedures.

How about a knowledge base article on this, Chip?



Comment 17 Chip Coldwell 2007-11-08 14:56:29 UTC
What is 209160 for? Was it for multipath bugs that were a result of async
scanning? I thought there were two bugs with multipath boot:

1. async scanning causes a device's name (/dev/sX) and major and minor numbers
to change between boots. This was bad for the initial multipath boot code back
in 5.0 beta, because the multipath boot code was relying on the major and minor
numbers staying the same. I think Peter Jones or someone fixed that by having
multipath assemble devices for boot using the uuid, as is done with the
non-boot multipath setup.

2. Previously, userspace assumed that when a module was done loading the devices
were added and ready to go, but async scanning causes the module loading to
return before devices are found. This causes multipath boot not to find devices.
I thought this was fixed with the wait fix in this bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=213039
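
The general idea behind item 2 (do not assume devices exist the moment module
loading returns) can be sketched in Python as a simple poll with a timeout.
This is not the fix from bug 213039, whose details are not given here; the
function name wait_for_path and the example path are hypothetical.

```python
import os
import time

def wait_for_path(path, timeout=30.0, interval=0.5):
    """Poll until path exists or timeout seconds pass; return True if found."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Hypothetical usage: give a device node a short time to appear.
if not wait_for_path("/dev/example-device", timeout=0.2, interval=0.05):
    print("device did not appear in time")
```

A fixed sleep would either waste time or be too short; polling against a
deadline returns as soon as the device shows up and still fails cleanly.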

Comment 18 Chip Coldwell 2007-11-08 14:57:11 UTC
Oops, the previous comment is Mike Christie's, copy-pasted from bug 198666.
I should have cited him there.

Chip


