+++ This bug was initially created as a clone of Bug #208332 +++

Description of problem:
System panic often occurs during boot when the root filesystem is on a multipath device and the multipath storage has many LUNs.

When a storage array has many LUNs and multiple ports connected to QLogic FC cards, the device name of each LUN varies from one reboot to the next in the current RHEL5, since QLogic FC cards are initialized almost in parallel. On the other hand, mkinitrd assumes the device name (actually the major and minor numbers) stays the same. So wrong multipath maps can be created at boot time in the initrd environment, and the root filesystem can't be found.

Version-Release number of selected component:
mkinitrd-5.1.15-1

How reproducible:
Often in a 4 LUNs and 2 ports environment. (Failed 9 times out of 10 trials.)

Steps to Reproduce:
0. Prepare a multipath storage which has many LUNs and some QLogic FC cards.
1. Connect those cards and ports of the storage to create a multipath environment.
2. Install the OS to the multipath storage by using the multipath root support of the installer.
   (A boot filesystem (like /boot or /boot/efi) doesn't need to be installed on a multipath device.)
3. Boot the installed OS.

Actual results:
Can't find the root filesystem and system panic occurs.
-------------------------------------------------------------------
Loading dm-multipath.ko module
device-mapper: multipath: version 1.0.4 loaded
Loading dm-round-robin.ko module
device-mapper: multipath round-robin: version 1.0.0 loaded
Creating root device.
Mounting root filesystem.
mount: could not find filesystem '/dev/root'
Setting up other fKernel panic - not syncing: Attempted to kill init!
-------------------------------------------------------------------

Expected results:
System panic should not occur.

Additional info:

-- Additional comment from pjones on 2006-09-27 17:50 EST --
*** This bug has been marked as a duplicate of 157082 ***
I cloned this bug as a bug of the QLogic FC card driver.

In the current RHEL5, FC disks which are connected to multiple QLogic FC HBAs are scanned in parallel. On the other hand, there is no persistent device naming scheme. So it is hard to identify a given device in an environment with multiple QLogic FC HBAs and multiple FC disks. Currently, this causes a system panic during boot with the multipath root support. (See the original bug report.)

If the driver scanned FC disks serially like the RHEL4 driver, the multipath root support of the current mkinitrd should work as long as the physical configuration doesn't change.

Example:
-------------------------------------------------------------------
Environment:
  2 HBAs (host0, host1)
  4 LUNs multipath storage (lun0, lun1, lun2, lun3)

Scan order:
  current RHEL4          current RHEL5
  (2.6.9-42.EL)          (2.6.18-1.2702.el5)
  --------------------------------------------
  host0-lun0             host0-lun0
  host0-lun1             host1-lun0
  host0-lun2             host0-lun1
  host0-lun3             host0-lun2
  host1-lun0             host1-lun1
  host1-lun1             host0-lun3
  host1-lun2             host1-lun2
  host1-lun3             host1-lun3
  (Always same order)    (Varies in each boot)
-------------------------------------------------------------------
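The effect of the scan order on device names can be sketched in a few lines of Python. This is a toy model, not kernel code: it assumes "sd" names are handed out strictly in discovery order, and the function name assign_sd_names is made up for illustration. The two scan orders are the ones from the example above.

```python
import string

def assign_sd_names(scan_order):
    """Toy model: hand out sda, sdb, ... in the order paths are discovered."""
    return {path: "sd" + string.ascii_lowercase[i]
            for i, path in enumerate(scan_order)}

# Scan orders taken from the example in this comment.
rhel4_order = ["host0-lun0", "host0-lun1", "host0-lun2", "host0-lun3",
               "host1-lun0", "host1-lun1", "host1-lun2", "host1-lun3"]
rhel5_order = ["host0-lun0", "host1-lun0", "host0-lun1", "host0-lun2",
               "host1-lun1", "host0-lun3", "host1-lun2", "host1-lun3"]

rhel4_names = assign_sd_names(rhel4_order)
rhel5_names = assign_sd_names(rhel5_order)

# Paths whose name differs between the two scan orders.
moved = sorted(p for p in rhel4_names if rhel4_names[p] != rhel5_names[p])
for path in moved:
    print(path, rhel4_names[path], "->", rhel5_names[path])
```

In this model five of the eight paths end up with a different name, which is why a map built from fixed names (or major and minor numbers) can point at the wrong LUN.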
Andrew, Reliable persistent device naming is planned for RHEL 5.1. In the meantime, in RHEL 5.0, it may be helpful to consider a way to reduce the impact by disabling the parallel scan of SCSI hosts. An option to revert to sequential scanning would help avoid the most common cause of name changes. It is a partial solution, to be sure, but I wonder if it would be feasible? Tom
Given the new FC transport infrastructure, the driver has no role in the lun-scan detection process. Instead the driver simply makes an upcall to the transport indicating that a new FC port has been discovered. If that port has a 'target' role, then a midlayer 'scan-work' event is placed on the shost's work-queue. Given the threaded/scheduled semantics of work-queue handling, there's no guarantee when a work-event will be processed. As can be seen in the customer's test cases, he's seeing parallel work-queue handling.
Is there any workaround for this?
I don't know of any workaround. We are going to have to instruct customers in how to use persistent names (LVM, labels, udev), rather than "sd" names (or major, minor numbers). This is not going to be easy, or perfect, but it is the direction they have to move in anyway.
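As a rough illustration of what "persistent names" means in practice, here is a minimal Python sketch that lists persistent symlinks and the device node each one currently points at. It assumes udev populates /dev/disk/by-id on the system in question; the function name resolve_persistent_names is hypothetical.

```python
import os

def resolve_persistent_names(by_id_dir="/dev/disk/by-id"):
    """Map each persistent symlink name to the device node it points at now."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        # Assumption: no udev-managed directory here, so nothing to report.
        return mapping
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if os.path.islink(link):
            mapping[name] = os.path.realpath(link)
    return mapping

if __name__ == "__main__":
    for name, node in resolve_persistent_names().items():
        print(name, "->", node)
```

The symlink name stays the same across boots while the "sd" node on the right-hand side may change, so scripts should refer to the left-hand side.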
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Major release. This request is not yet committed for inclusion.
(In reply to comment #5)
> I don't know of any workaround. We are going to have to instruct customers in how
> to use persistent names (LVM, labels, udev), rather than "sd" (or major, minor
> numbers). This is not going to be easy, or perfect, but it is the direction they
> have to move in anyway.

Does this mean a release note, then? Or perhaps a "persistent naming" whitepaper?

Chip
This bug could be related to bug 213039.
Bug already in GSS list. Removing from feature list.
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.
The pressing need for this has been removed in 5.1 by the improved support for dm-multipath in Anaconda and the initrd. The need for persistent "sd" device names is mostly gone. What remains is to make customers aware of the fact that they cannot depend on persistent "sd" device names, and have them remove this dependency from their applications and procedures. How about a knowledge base article on this, Chip?
What is 209160 for? Was it for multipath bugs that were a result of async scanning? I thought there were two bugs with multipath boot:

1. Async scanning causes a device's name (/dev/sX) and major/minor numbers to change between boots. This was bad for the initial multipath boot code back in 5.0 beta, because the multipath boot code was relying on the major/minor numbers staying the same. I think Peter Jones or someone fixed that by having multipath assemble devices for boot using uuid, as is done with the non-boot multipath setup.

2. Previously, userspace assumed that when a module was done loading, the devices were added and ready to go, but async scanning causes the module loading to return before devices are found. This causes multipath boot not to find devices. I thought this was fixed with the wait fix in this bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=213039
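The general idea behind the second point (do not assume devices exist as soon as the module load returns; wait for them) can be sketched as a simple poll with a timeout. This is only an illustration in Python, not the actual fix from bug 213039; the function name wait_for_path and the default timings are assumptions.

```python
import os
import time

def wait_for_path(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists or `timeout` seconds pass. Returns True if found."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would wait on a persistent name rather than an "sd" name, since the first point above means the "sd" name may differ from boot to boot.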
Oops, the previous comment is Mike Christie's, copy-pasted from bug 198666. I should have cited him there.

Chip