Description of problem:
Replacing the PERC5/E controller attached to an MD1000 storage box results in fsck errors during boot-up, and the root filesystem does not get mounted. After rebooting again, the issue is not found. Removing the /etc/blkid.tab file before changing the PERC5/E controller causes the file to be regenerated freshly, and the issue likewise does not occur. So the bug is in the blkid tool in the e2fsprogs package, which is consulted when fsck is run on the partitions before mounting them.

Version-Release number of selected component (if applicable):
e2fsprogs-1.35-12.4.EL4

How reproducible:
Often.

Steps to Reproduce:
1) Install RHEL4 U4 32-bit (kernel version: 2.6.9-42.ELsmp) on RAID5 on the PERC5/iA.
2) Create two or three VDs on the PERC5/E from OMSS or the controller BIOS.
3) Shut down the system and replace the PERC5/E controller with another one at the same fw level and having no config on it.
4) Switch ON the system and observe the booting process. Booting ends up on the screen "Give root password for maintenance".

Actual results:
An fsck error occurs, resulting in the filesystem not being mounted.

Expected results:
The system boot should be error-free even the first time.

Additional info:
1) Basically, the fsck on / fails and the filesystem is not mounted.
2) Manually mounting the filesystem after running fsck makes the problem go away.
3) Simply rebooting the system at this time makes the problem go away.
4) Deleting /etc/blkid.tab before changing the PERC5/E controller makes the problem go away.
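The workaround in 4) above can be sketched as a small shell helper. This is a minimal sketch based on this report; the cache path and the `.old` backup name are the usual e2fsprogs locations, but verify them on your own system before relying on this:

```shell
#!/bin/sh
# clear_blkid_cache FILE: remove a blkid cache file (and its .old backup,
# if present) so that blkid regenerates it from a fresh probe on the next
# boot, instead of resolving LABEL= entries against stale device names.
clear_blkid_cache() {
    cache="$1"
    rm -f -- "$cache" "${cache}.old"
}

# Intended use before shutting down to swap the PERC5/E controller:
#   clear_blkid_cache /etc/blkid.tab
```

Note the caveat raised later in this bug: clearing the cache is only safe when filesystem labels are unique on the system.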
*** Bug 227819 has been marked as a duplicate of this bug. ***
Dell is requesting this for 4.5. Setting the ACK and 4.5 flags to '?'.
Shyam, I was wondering if you could help clear up some confusion concerning the version of PERC5 that manifests this bug. From reading the first comment, both the PERC5/E and PERC5/iA are listed. Is this a typo, or does it fail on either PERC5? I ask because I was looking for a system that could be used, just to help move things along, and the pe2900 lists a PCI ID of 1028:0015 in its list of PCI IDs. When I looked this ID up (http://pci-ids.ucw.cz/iii/?i=1028), it appears as though a "PERC5/I" has the ID 1028:0015. The PERC5/E has an ID of 1028:1f01. So would this bug manifest on a pe2900 with a PERC5/I?
John, the PERC5/iA is the PERC5 internal adapter, which does not have external connectors (it is a PCI card). The OS is installed on disks attached to this controller. The PERC5/E is an adapter that has external connectors. I guess if you install the OS on the PERC5/I, which is the integrated controller, and create VDs on the PERC5/E, the issue will still recur. The catch is that the VDs need to be presented to the OS after install.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
ignore the above. I hit this bug by accident during testing of bugbot. apologies
There are different ways to solve this problem:
1) Remove the blkid.tab file if it contains outdated data. I see a big problem with this solution, because it will enable filesystem switching if filesystem labels are not unique on a system and there are controller changes. This is a big behaviour change.
2) Do not add single partitions of multilayer filesystems to blkid.tab. This has been done for RHEL-5 and requires backporting the new device-mapper, all changes for device-mapper, and the dependent packages from RHEL-5 to RHEL-4: e.g. mount, blkid, mkfs.XYZ, fsck.XYZ.
3) The user removes the blkid.tab file after changing hardware, as a workaround.
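For context on option 1): /etc/blkid.tab is a flat, XML-like cache mapping labels and UUIDs to device names. An outdated entry looks roughly like the following (all values illustrative), with the LABEL still bound to the pre-swap device node, which is what fsck then probes on the first boot:

```xml
<device DEVNO="0x0803" TIME="1172615320" LABEL="/" TYPE="ext3">/dev/sda3</device>
```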
The changes are too invasive for an update. Can the workaround in 3) help the customer until they migrate over to RHEL5? Regards, Florian La Roche
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Can the blkid.tab file be removed automatically during reboots? Probably in /etc/init.d/halt? Just thinking aloud. This is a valid scenario during maintenance and might result in customer calls.
According to pjones, removing blkid.tab would cause a failure in the event there are duplicate labels. pm-nak.
(In reply to comment #8) > Changes are too invasive for an update, can the workaround in 3) help > the customer until they migrate over to RHEL5? Yes, we would at least like to have this workaround in RHEL 4.5.
Opening up comment #11 in response to Dell's request in comment #12. > according to pjones, removing blkid.tab would cause a failure in the event there > are duplicate labels. Therefore option "3)" in comment #7 is not viable.
We do not have a viable workaround at this time. Can option 2 from comment #7 be a solution (it requires a backport from RHEL5)?
If programmatically or manually removing blkid.tab can lead to problems in specific cases, and backporting RHEL5's device-mapper and all associated utilities is too pervasive, does anyone see issues with what is done after the failure — rebooting the system after fsck is run, or mounting the root filesystem by hand, as mentioned in the first comment? I don't consider these workarounds per se; they are more like "now what do I do?" But I was wondering if they were viable.
It is not possible to fix this problem without getting other major problems. Closing as CANTFIX.
Dell is requesting that engineering provide a workaround that will help avoid this problem. A KB article will help advise customers of this workaround. For the permanent fix per option #2 in comment #7, they agree to wait until 4.6.
My proposed kbase article to help explain this situation. PLEASE provide corrections/feedback. Thanks.

Issue: When I replace my Dell PowerEdge Expandable RAID Controller (PERC 5/I), why does my system not finish booting?

Resolution: Replacing the PERC 5/I can cause the fsck utility to fail when booting. If this occurs, the user should enter maintenance mode from the system console in order to run fsck manually. After the user enters the root password, fsck can be executed. fsck requires the device name of the failed partition, which would be something like /dev/hda1 or /dev/sdb2. Adding the -y option will prevent you from having to type 'y' for each repair. For example:

# fsck -y /dev/hda1

Once the partition is repaired, the user can exit maintenance mode and reboot.
Reopening for fix in next update.
Even though there is a question of whether this can/should be fixed, it is still slated for 4.5 at this moment. But it missed 4.5, so setting 4.6 to '?'.
Dell will verify whether this issue could be caused by hardware.
We attempted to test for a hardware problem (e.g. by removing the battery); the issue does not occur due to any of that, so a hardware cause can be safely eliminated.
Can I get a little more clarification on this, please? (Several questions follow...)

This bug started out with:
---
Steps to Reproduce:
1) Install RHEL4 U4 32bit (kernel version: 2.6.9-42.ELsmp) on RAID5 on PERC5/iA.
2) Create two or three VDs on PERC5/E from OMSS or the controller BIOS.
3) Shutdown the system and replace the PERC5/E controller with another one with the same fw level and having no config on it.
4) Switch ON the system and observe the booting process. Booting ends up on the screen "Give root password for maintenance".
---

So, do I understand this correctly:
* There are (at least) 2 controllers on the box: a PERC5/iA and a PERC5/E
* The OS is installed on disks attached to the (internal) PERC5/iA
* When up & running, external disks are configured on the PERC5/E
* The (external) PERC5/E controller is later replaced with one "having no config"

And my understanding is that if this *new* external PERC5/E controller has no config, it does not allow the OS to see the devices attached to it until it *is* configured properly? And it is these missing devices that cause the boot process to stop? But the first comment later says that it is unable to fsck the *root* (/) filesystem, which is attached not to the PERC5/E but to the PERC5/iA?

I'm a little hung up on this "with no config" part. Based on other comments, you can configure the controller from the controller BIOS, presumably before the OS starts booting. If the new, replaced controller is configured from the BIOS on the first boot after replacement, does this also solve the problem?

Is there a configuration option for the controller to set it as the primary or secondary controller in the system? From twoerner: "the device name is as follows: /dev/cciss/cXdYpZ (controller X, disk Y, partition Z). the controller number is the problem." So perhaps the controllers have switched order, causing it to not find the root device?
At this point I'm not clear on the absolute root cause of this problem. I assume it is different (renamed or missing) devices presented to the OS, but what is it about the controller replacement that causes this? Thanks, -Eric
> I'm a little hung up on this "with no config" part. Based on other comments you
> can configure the controller from the controller BIOS, presumably before the OS
> starts booting. If the new, replaced controller is configured from the BIOS on
> the first boot after replacement, does this also solve the problem?

No, and this is a mandatory step; it is done using the CTRL+R utility in the PERC5/E BIOS so that the disks can be migrated.

> Is there a configuration option for the controller to set it as primary or
> secondary controller in the system? From twoerner: "the device name is as
> follows: /dev/cciss/cXdYpZ (controller X, disk Y, partition Z). the controller
> number is the problem." So perhaps the controllers have switched order, causing
> it to not find the root device?

There is an option to set the primary and the secondary controller, but this has been verified with the correct boot order as well. When this issue was reproduced, the PERC5/Es were of different revisions; that is the only thing different. The PCI IDs were the same. Does this cause different labeling?
I would not expect the firmware revisions to change things, but I'm not certain. I just feel like we haven't gotten to the true root cause on this one - IOW, exactly what does the OS see that is different after the replacement...
KBase article submitted with a suggested user recovery. Tweaked the input from John Feeney.
--------------------------------------------------
Issue: When I replace the PERC 5/E (or PERC 5/I) controller attached to my Dell MD1000 storage box, the system fails to boot. I get fsck (filesystem check) errors during boot-up, and the root filesystem does not get mounted. Reference: Bugzilla 227813.

Resolution: Replacing the PERC 5/E or 5/I controller may cause the fsck utility to fail when booting. If this occurs, the user should enter maintenance mode from the system console in order to run fsck manually (you may be placed in maintenance mode automatically after this type of failure). After the user enters the root password (if required), fsck can be executed. fsck requires the device name of the failed partition, which would be something like /dev/hda1 or /dev/sdb2. Adding the -y option will prevent you from having to type 'y' for each repair. For example:

# fsck -y /dev/hda1

Once the partition is repaired, the user can exit maintenance mode and reboot.

Note that it has been reported that if you remove the /etc/blkid.tab file before changing the PERC5 controller, the problem does not occur, because the file is regenerated automatically.
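As a convenience for the article's fsck step, the device behind / can usually be read out of /etc/fstab. A hedged sketch follows, assuming a conventional fstab with the mount point in the second field; the helper name is invented for illustration:

```shell
#!/bin/sh
# root_fstab_device FILE: print the fs_spec (first field) of the fstab
# entry whose mount point is /, skipping comment lines. The result may be
# a device path (e.g. /dev/sda2) or a LABEL=/UUID= tag, both of which
# fsck accepts as a filesystem specification.
root_fstab_device() {
    awk '$1 !~ /^#/ && $2 == "/" { print $1; exit }' "$1"
}

# Typical use from the maintenance shell:
#   fsck -y "$(root_fstab_device /etc/fstab)"
```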
Here's the kbase entry for this issue: http://kbase.redhat.com/faq/FAQ_46_11257.shtm
Since the kbase article is now available for users, I am going to close this as WontFix.