Bug 227813 - PERC5/E: Controller replacement results in failure of normal OS boot up
Summary: PERC5/E: Controller replacement results in failure of normal OS boot up
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: e2fsprogs
Version: 4.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Eric Sandeen
QA Contact: Jay Turner
URL:
Whiteboard:
Duplicates: 227819
Depends On:
Blocks: 246028 246627
 
Reported: 2007-02-08 12:22 UTC by Shyam kumar Iyer
Modified: 2018-10-19 22:58 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-09-27 15:55:51 UTC
Target Upstream Version:
Embargoed:


Description Shyam kumar Iyer 2007-02-08 12:22:16 UTC
Description of problem:
Replacing the PERC5/E controller attached to an MD1000 storage box results in 
fsck errors during boot up, and the root filesystem does not get mounted. After 
rebooting again, the issue is no longer seen. If the /etc/blkid.tab file is 
removed before changing the PERC5/E controller, the file is generated afresh 
and the issue does not occur. 
So, the bug is in the blkid tool in the e2fsprogs package, which runs fsck 
on the partitions before mounting them.

Version-Release number of selected component (if applicable):
e2fsprogs-1.35-12.4.EL4

How reproducible:
Often.

Steps to Reproduce:
1) Install RHEL4 U4 32bit (kernel version: 2.6.9-42.ELsmp) on RAID5 on 
PERC5iA.
2) Create two or three VDs on PERC5/E from OMSS or the controller BIOS.
3) Shut down the system and replace the PERC5/E controller with another one 
with the same fw level and no config on it.
4) Switch on the system and observe the booting process. Booting ends up on the 
screen "Give root password for maintenance".
  
Actual results:
Fsck error occurs resulting in filesystem not being mounted.

Expected results:
System boot should be error free, even on the first boot after the replacement.

Additional info:
1) Basically the fsck on / fails and filesystem is not mounted. 
2) Manually mounting the filesystem after running fsck makes the problem go 
away.
3) Simply rebooting the system at this time makes the problem go away.
4) Deleting the /etc/blkid.tab before changing the PERC5/E controller makes the 
problem go away.
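
The workaround in item 4 could be sketched roughly as follows. This is only an 
illustration, not a recommended fix: the function name and the backup suffix 
are hypothetical, and the cache is moved aside rather than deleted so that it 
can be restored. Later comments in this report point out that regenerating the 
cache can be a problem when filesystem labels are not unique.

```shell
#!/bin/sh
# Sketch only: move the blkid cache aside so that it is regenerated later.
# The default path is the one named in this report; the function name and the
# ".pre-controller-swap" suffix are hypothetical choices for illustration.
stash_blkid_cache() {
    cache="${1:-/etc/blkid.tab}"
    if [ ! -f "$cache" ]; then
        echo "no cache file at $cache, nothing to do" >&2
        return 0
    fi
    backup="$cache.pre-controller-swap"
    # Print the backup path on success so the caller can restore it if needed.
    mv -- "$cache" "$backup" && echo "$backup"
}
```

It would be run as root before shutting down for the controller replacement, 
for example `stash_blkid_cache /etc/blkid.tab`.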

Comment 1 Shyam kumar Iyer 2007-02-08 13:07:00 UTC
*** Bug 227819 has been marked as a duplicate of this bug. ***

Comment 2 Samuel Benjamin 2007-02-14 23:13:48 UTC
Dell is requesting this for 4.5, Setting ACK and 4.5 flags to ?.

Comment 3 John Feeney 2007-02-15 20:17:20 UTC
Shyam,
I was wondering if you could help clear up some confusion concerning the 
version of PERC5 that manifests this bug. From reading the first comment, 
both the PERC5/E and PERC5/iA are listed. Is this a typo or does it fail on
either PERC5? I ask because I was looking for a system that could be used, just 
to help move things along, and the pe2900 lists a PCI of 1028:0015 in its list 
of PCI IDs. When I looked this id up (http://pci-ids.ucw.cz/iii/?i=1028), it 
appears as though a "PERC5/I" has the ids 1028:0015. The PERC5/E has an 
id of 1028:1f01. So would this bug manifest on a pe2900 with a PERC5/I?


Comment 4 Shyam kumar Iyer 2007-02-16 13:57:37 UTC
John,
      The PERC5/iA is the PERC5 internal adapter, which does not have external 
connectors (it is a PCI card). The OS is installed on disks attached to this 
controller. The PERC5/E is an adapter that has external connectors.
      I guess if you install the OS on the PERC5/I, which is the integrated 
controller, and create VDs on the PERC5/E, the issue will still occur. The 
catch is that the VDs need to be presented to the OS after install.

Comment 5 RHEL Program Management 2007-02-20 21:52:15 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Kevin Baker 2007-02-20 22:04:30 UTC
ignore the above. I hit this bug by accident during testing of bugbot.
apologies

Comment 7 Thomas Woerner 2007-02-26 16:25:20 UTC
There are different ways to solve this problem:

1) Remove blkid.tab file, if it contains outdated data.
I see a big problem arising with this solution, because it will enable
filesystem switching if filesystem labels are not unique on a system and if
there are controller changes. This is a big behaviour change.

2) Do not add single partitions for multilayer filesystems to blkid.tab.
This has been done for RHEL-5 and requires backporting the new device-mapper and
all changes for device-mapper and the dependent packages from RHEL-5 to RHEL-4:
e.g. mount, blkid, mkfs.XYZ, fsck.XYZ.

3) User removes the blkid.tab file after changing hardware as a workaround.


Comment 8 Florian La Roche 2007-02-26 16:30:12 UTC
Changes are too invasive for an update, can the workaround in 3) help
the customer until they migrate over to RHEL5?

regards,

Florian La Roche


Comment 9 RHEL Program Management 2007-02-26 16:41:25 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request. 

Comment 10 Shyam kumar Iyer 2007-02-28 08:24:40 UTC
Can the blkid.tab file be removed automatically during reboots? 
Probably in /etc/init.d/halt? Just thinking aloud. 
This is a valid scenario during maintenance and might result in customer 
calls.

Comment 11 Kevin Krafthefer 2007-03-01 17:07:11 UTC
According to pjones, removing blkid.tab would cause a failure in the event there
are duplicate labels. pm-nak.
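
Whether a given system is exposed to the duplicate-label concern can be checked 
with something like the sketch below. It assumes the usual blkid output form of 
one line per device, such as `/dev/sda1: LABEL="/" UUID="..." TYPE="ext3"`; the 
function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: read blkid-style lines on stdin and print every LABEL value
# that appears on more than one device. Assumes lines of the form
#   /dev/sda1: LABEL="/" UUID="..." TYPE="ext3"
# The leading space in the pattern keeps other *LABEL= fields from matching.
duplicate_labels() {
    sed -n 's/.* LABEL="\([^"]*\)".*/\1/p' | sort | uniq -d
}
```

On a running system this might be used as `blkid | duplicate_labels`; empty 
output would mean no label is shared between devices.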

Comment 12 Charles Rose 2007-03-05 12:25:18 UTC
(In reply to comment #8)
> Changes are too invasive for an update, can the workaround in 3) help
> the customer until they migrate over to RHEL5?

Yes, we would at least like to have this workaround in RHEL 4.5.


Comment 14 Larry Troan 2007-03-05 13:25:59 UTC
Opening up comment #11 in response to Dell's request in comment #12.
> according to pjones, removing blkid.tab would cause a failure in the event there
> are duplicate labels.

Therefore option "3)" in comment #7 is not viable.

Comment 15 Samuel Benjamin 2007-03-05 19:28:22 UTC
We do not have a viable work around at this time. Can option 2 from comment#7 be
a solution (requires backport from rhel5)?

Comment 16 John Feeney 2007-03-05 20:39:14 UTC
If programmatically or manually removing blkid.tab can lead to problems in 
specific cases and the backporting of RHEL5's device mapper and all associated
utilities is too pervasive, does anyone see issues resulting from when the 
system is rebooted after fsck is run or the root filesystem is mounted by hand, 
as mentioned in the first comment? I don't consider these workarounds per se,
they are more like "now what do I do?" but I was wondering if they were viable.

Comment 17 Thomas Woerner 2007-03-21 10:59:53 UTC
It is not possible to fix this problem without getting other major problems.

Closing as CANTFIX.

Comment 18 Samuel Benjamin 2007-03-23 02:18:13 UTC
Dell is requesting that engineering provide a workaround that will help avoid
this problem. A KB article will help advise customers of this workaround. 

For the permanent fix per option#2 in comment#7, they agree to wait until 4.6.

Comment 21 John Feeney 2007-04-09 19:16:02 UTC
My proposed kbase article to help explain this situation....PLEASE provide
corrections/feedback. Thanks.

Issue:
When I replace my Dell PowerEdge Expandable RAID Controller (PERC 5/I) why 
does my system not finish booting?

Resolution:
The replacing of the PERC 5/I can cause the fsck utility to fail when booting.
If this occurs, the user should enter maintenance mode from the system 
console in order to manually run fsck. 

After the user enters the root password, fsck can be executed. fsck requires 
the device name of the failed partition, which would be something like 
/dev/hda1 or /dev/sdb2. Adding the -y option will prevent you from having
to type 'y' for each repair. 

For example,
   # fsck -y /dev/hda1

Once the partition is repaired, the user can exit maintenance mode and 
reboot.


Comment 22 Shyam kumar Iyer 2007-04-11 12:21:38 UTC
Reopening for fix in next update.

Comment 23 John Feeney 2007-04-12 21:01:11 UTC
Even though there is a question of whether this can/should be fixed, it is
still slated for 4.5 at this moment. But it missed 4.5 so setting 4.6 to '?'.

Comment 26 RHEL Program Management 2007-05-09 07:50:47 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 28 Charles Rose 2007-06-06 14:15:28 UTC
Dell will verify if this issue could be caused due to hardware.

Comment 29 Shyam kumar Iyer 2007-06-14 12:31:52 UTC
We tried to rule out a hardware problem, for example by removing the battery. 
The issue is not caused by that, so the hardware angle can be safely 
eliminated.

Comment 30 Eric Sandeen 2007-07-25 14:48:49 UTC
Can I get a little more clarification on this please.  (several questions
follow...)  This bug started out with:

---
Steps to Reproduce:
1) Install RHEL4 U4 32bit (kernel version: 2.6.9-42.ELsmp) on RAID5 on 
PERC5iA.
2) Create two or three VDs on PERC5/E from OMSS or the controller BIOS.
3) Shut down the system and replace the PERC5/E controller with another one 
with the same fw level and no config on it.
4) Switch on the system and observe the booting process. Booting ends up on the 
screen "Give root password for maintenance".
---

So, do I understand this correctly: 

* There are (at least) 2 controllers on the box; a PERC5iA and a PERC5/E
* The OS is installed on disks attached to the (internal) PERC5iA
* When up & running, external disks are configured on the PERC5/E
* The (external) PERC5/E controller is later replaced with one "having no config"

And, my understanding is that if this *new* external PERC5/E controller has no
config, it does not allow the OS to see the devices attached to it until it *is*
configured properly?  And it is these missing devices that cause the boot
process to stop?

But, the first comment later says that it is unable to fsck the *root* (/)
filesystem, which is attached not to the PERC5/E but to the PERC5iA?

I'm a little hung up on this "with no config" part.  Based on other comments you
can configure the controller from the controller bios, presumably before the OS
starts booting.  If the new, replaced controller is configured from the BIOS on
the first boot after replacement, does this also solve the problem?

Is there a configuration option for the controller to set it as primary or
secondary controller in the system?  From twoerner: "the device name is as
follows: /dev/cciss/cXdYpZ (controller X, disk Y, partition Z). the controller
number is the problem."  So perhaps the controllers have switched order, causing
it to not find the root device?

At this point I'm not clear on the absolute root cause of this problem, but I
assume it's different (renamed or missing) devices presented to the OS, but what
is it about the controller replacement that causes this?

Thanks,

-Eric

Comment 32 Shyam kumar Iyer 2007-08-21 06:27:13 UTC
>I'm a little hung up on this "with no config" part.  Based on other comments you
>can configure the controller from the controller bios, presumably before the OS
>starts booting.  If the new, replaced controller is configured from the BIOS on
>the first boot after replacement, does this also solve the problem?

No, and this is a mandatory step; it is done using the CTRL+R utility in the 
PERC5/E BIOS so that the disks can be migrated.

>Is there a configuration option for the controller to set it as primary or
>secondary controller in the system?  From twoerner: "the device name is as
>follows: /dev/cciss/cXdYpZ (controller X, disk Y, partition Z). the controller
>number is the problem."  So perhaps the controllers have switched order, causing
>it to not find the root device?

There is an option to set the primary and the secondary controller. But this 
has been verified with the correct boot as well.
When this issue was reproduced the PERC5/Es were of different revisions. That 
is the only thing different. PCI ids were the same. Does this cause a 
different labeling?


Comment 33 Eric Sandeen 2007-08-21 14:33:13 UTC
I would not expect the firmware revisions to change things, but I'm not certain.

I just feel like we haven't gotten to the true root cause on this one - IOW,
exactly what does the OS see that is different after the replacement...
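
One way to capture what the OS sees before and after the replacement would be 
to save /proc/partitions on each side of the swap and compare the device names. 
The sketch below assumes the standard /proc/partitions layout (a header line, a 
blank line, then four columns per device); the function names and any snapshot 
paths are hypothetical.

```shell
#!/bin/sh
# Sketch only: compare two saved copies of /proc/partitions.

# Print the device names from a /proc/partitions-style file, sorted.
# Skips the header and the blank line that follow the standard layout.
partition_names() {
    awk 'NR > 2 && NF == 4 { print $4 }' "$1" | sort
}

# Print "- name" for devices only in the first snapshot and "+ name" for
# devices only in the second one.
compare_snapshots() {
    before=$(mktemp)
    after=$(mktemp)
    partition_names "$1" > "$before"
    partition_names "$2" > "$after"
    comm -23 "$before" "$after" | sed 's/^/- /'
    comm -13 "$before" "$after" | sed 's/^/+ /'
    rm -f "$before" "$after"
}
```

For example, `cat /proc/partitions > /root/partitions.before` could be run 
before the controller swap, and `compare_snapshots /root/partitions.before 
/proc/partitions` afterwards from maintenance mode.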

Comment 34 Larry Troan 2007-08-29 17:30:46 UTC
KBase article submitted with a suggested user recovery.

Tweaked the input from John Feeney....
--------------------------------------------------

Issue:
When I replace the PERC 5/E (or PERC 5/I) controller attached to my Dell
MD1000 Storage box, the system fails to boot. I get fsck (filesystem
check) errors during boot up and root filesystem does not get mounted. 

Reference: Bugzilla 227813.


Resolution:

Replacing the PERC 5/E or 5/I controller may cause the fsck utility to
fail when booting. If this occurs, the user should enter maintenance
mode from the system console in order to manually run fsck (you may be
placed in maintenance mode automatically after this type of failure). 

After the user enters the root password (if required), fsck can be
executed. fsck requires the device name of the failed partition, which
would be something like /dev/hda1 or /dev/sdb2. Adding the -y option
will prevent you from having to type 'y' for each repair. 

For example,
   # fsck -y /dev/hda1

Once the partition is repaired, the user can exit maintenance mode and
reboot.

Note that it's been reported that if you remove the /etc/blkid.tab file
before changing the PERC5 controller the problem doesn't occur because
the file is regenerated automatically. 
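
To decide which filesystems are candidates for the manual fsck described above, 
the entries in /etc/fstab with a non-zero check order can be listed. The sketch 
below is only an illustration and relies on the sixth fstab field being the 
fsck pass number; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: print "device-spec mount-point" for every fstab entry whose
# sixth field (the fsck pass number) is greater than zero. Comment lines and
# lines with fewer than six fields are skipped.
fsck_candidates() {
    awk '$1 !~ /^#/ && NF >= 6 && $6 > 0 { print $1, $2 }' "${1:-/etc/fstab}"
}
```

Entries printed in the form LABEL=... still have to be resolved to a device 
name such as /dev/sdb2 before being passed to fsck.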

Comment 35 Gary Case 2007-09-20 13:24:42 UTC
Here's the kbase entry for this issue:

http://kbase.redhat.com/faq/FAQ_46_11257.shtm

Comment 36 John Feeney 2007-09-27 15:55:51 UTC
Since the kbase article is now available for users, I am going to close this as
WontFix. 

