Bug 231330

Summary: RHEL4 U5 beta1 installation failure: anaconda unhandled exception
Product: Red Hat Enterprise Linux 4
Reporter: Nick Dokos <nicholas.dokos>
Component: parted
Assignee: David Cantrell <dcantrell>
Status: CLOSED NOTABUG
QA Contact: Brock Organ <borgan>
Severity: urgent
Priority: medium
Version: 4.5
CC: bmarson, dkl, jturner
Keywords: Regression
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-03-15 14:34:56 UTC
Attachments:
  Anaconda traceback (flags: none)
  Patch to probe_partition_for_geom() for DL585g systems (flags: none)
  New parted source RPM (flags: none)
  New parted rpm for x86_64 (flags: none)
  New parted-devel rpm for x86_64 (flags: none)
  New parted-debuginfo rpm for x86_64 (flags: none)

Description Nick Dokos 2007-03-07 19:25:40 UTC
Description of problem: Trying to install RHEL4U5 beta1 on an HP DL585g2 system
(8 Opteron cores, 32 GB of memory, installing on a 72 GB CCISS drive, no RAID,
booting from a boot.iso and installing over NFS, chose Everything under
packages). During the QA period, when partitioning the drive, I got a lot of
assertions:


Assertion (cyl_size <= 255*63) at disk_dos.c:556 in
function probe_partition_for_geom() failed.

Assertion (heads < 256) at disk_dos.c:576 in ...

Assertion ((C*heads + H)*sectors +S == A) at disk_dos.c:582 in ...

I clicked [Ignore] on all of them. (Clicking [Cancel] did not seem to have any
effect the few times I tried it on my first pass through the exercise; on the
second pass I ignored every assertion.)

Finally, after the interactive part of the installation was over, I got
more assertions and finally anaconda died with an unhandled exception.
I'll create an attachment with the traceback.
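
For context, the three assertions quoted above are parted's sanity checks on the
CHS geometry it derives for the disk: the legacy DOS/BIOS scheme cannot address
more than 255 heads or 63 sectors per track, and a partition boundary recorded
both as an absolute LBA and as a (C, H, S) triple must describe the same sector.
The sketch below is only an illustration of that relationship, not the actual
disk_dos.c code; the function name chs_matches_lba is made up for the example.

/*
 * Illustrative sketch only -- not parted's disk_dos.c.  It shows the
 * relationship the three failed assertions test.  One cylinder can hold
 * at most 255 * 63 = 16065 sectors, and S is treated here as a 0-based
 * sector offset so the identity matches the quoted assertion.
 */
#include <assert.h>

static int
chs_matches_lba(unsigned heads, unsigned sectors,
                unsigned C, unsigned H, unsigned S,
                unsigned long long A)
{
    unsigned long cyl_size = (unsigned long) heads * sectors;

    assert(cyl_size <= 255UL * 63);   /* the disk_dos.c:556-style check */
    assert(heads < 256);              /* the disk_dos.c:576-style check */

    /* the disk_dos.c:582-style check: CHS and LBA must agree */
    return ((unsigned long long) C * heads + H) * sectors + S == A;
}

int main(void)
{
    /* Example numbers: with 255 heads and 63 sectors per track,
     * cylinder 1, head 0, sector offset 0 is LBA 16065. */
    return chs_matches_lba(255, 63, 1, 0, 0, 16065) ? 0 : 1;
}

A geometry reported with more heads or sectors per track than the CHS scheme can
express, or CHS values that do not map back to the partition's LBA, will trip
these checks, which fits the bogus-geometry/firmware explanation reached later
in this report.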

Version-Release number of selected component (if applicable):
anaconda-10.1.1.62

How reproducible:


Steps to Reproduce:
1. Described above
Actual results: Installation failed.


Expected results: Success.


Additional info:

Comment 1 Nick Dokos 2007-03-07 19:25:40 UTC
Created attachment 149479 [details]
Anaconda traceback

Comment 2 Nick Dokos 2007-03-09 20:47:26 UTC
I tried the install with RHEL4.5snap1 - I get the exact same failure.

.discinfo says:
1172773896.803713
Red Hat Enterprise Linux 4
x86_64
1,2,3,4,5
RedHat/base
RedHat/RPMS
RedHat/pixmaps

Comment 3 Nick Dokos 2007-03-12 16:05:17 UTC
Some more information:

o I tried installing on an (older) Opteron blade with a CCISS drive: the
installation succeeded, the system is running fine.

o I copied parted and some of the libraries that it depends on from the above
system to the DL585g2 with a non-RHEL4.5snap1 installation and tried looking at
the disks with it. There was no problem (although I didn't push it very far).

o I copied the kernel from the RHEL4.5snap1 installation on the blade to an
older RHEL4U4 installation on the DL585g2, and rebooted it. There was no problem.

o The RHEL4U4 installation was using a kernel parameter "pci=nommconf", so
I tried adding it to the kernel command line when installing RHEL4.5: no joy -
the assertions still failed and anaconda died.

I was hoping that one of the experiments above would pinpoint the culprit
unambiguously, but it's still a mystery. It seems specific to the DL585g2 at
this point. I'd appreciate any suggestions on how to get past this.


Comment 4 David Lawrence 2007-03-13 17:33:02 UTC
Did this occur with earlier updates of RHEL4 such as U4?

Comment 5 Nick Dokos 2007-03-13 17:36:48 UTC
No - RHEL4U4 installed and works fine.

Comment 6 David Lawrence 2007-03-13 17:38:38 UTC
Adding Regression keyword.

Comment 8 David Cantrell 2007-03-13 18:31:28 UTC
I've patched disk_dos.c with what I think will work, but without having that
particular system to reproduce the problem on and test, I'm going to post my fix
here and ask you to test it.

You will find a patch, a new SRPM for parted on RHEL4U5, and a binary RPM for
this patched parted on x86_64.  Can you test out this build of parted(8) on
RHEL4U5 and see if it solves your problem?  I realize it's probably difficult to
get U5 installed on the target system, but whatever you can do to test this
build on that platform on U5 would help.  I would suggest installing RHEL4U4 and
then doing an upgrade to U5.  Anaconda does not partition in those cases, so you
would be fine.

Let me know if this parted solves the problem or breaks in new and interesting ways.
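
The patch itself is only available as the attachment below and is not reproduced
here. As a rough, hypothetical illustration (an assumption about the approach,
not the contents of attachment 149968): fixes in this area often replace a hard
assertion on an out-of-range probed geometry with a fallback to the conventional
255-head, 63-sector layout. The helper name sanitize_geometry is invented for
the example.

/*
 * Hypothetical sketch only -- NOT the attached patch.  It illustrates a
 * common way to avoid aborting when the probed geometry cannot be
 * expressed in the DOS CHS scheme: fall back to the conventional
 * 255-head, 63-sectors-per-track layout instead of asserting.
 */
struct geometry { unsigned heads, sectors; };

static struct geometry
sanitize_geometry(unsigned probed_heads, unsigned probed_sectors)
{
    struct geometry g = { probed_heads, probed_sectors };

    /* Values like these would trip the assertions quoted in the
     * description, so clamp them rather than abort. */
    if (g.heads == 0 || g.heads > 255 || g.sectors == 0 || g.sectors > 63) {
        g.heads = 255;
        g.sectors = 63;
    }
    return g;
}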

Comment 9 David Cantrell 2007-03-13 18:33:04 UTC
Created attachment 149968 [details]
Patch to probe_partition_for_geom() for DL585g systems

Comment 10 David Cantrell 2007-03-13 18:34:08 UTC
Created attachment 149969 [details]
New parted source RPM

Comment 11 David Cantrell 2007-03-13 18:34:53 UTC
Created attachment 149970 [details]
New parted rpm for x86_64

Comment 12 David Cantrell 2007-03-13 18:35:48 UTC
Created attachment 149971 [details]
New parted-devel rpm for x86_64

Comment 13 David Cantrell 2007-03-13 18:36:53 UTC
Created attachment 149972 [details]
New parted-debuginfo rpm for x86_64

Comment 14 Nick Dokos 2007-03-13 22:30:51 UTC
I have not been able to try the patch but I have new information that may make
it unnecessary. I said in comment #5 that RHEL4U4 installed and ran with no
problem: that's true but it's not the whole story. I tried installing it again
on the new disk with the intention of doing an update install to RHEL4.5snap1
and smashed into the same brick wall.

It turns out that we had added more volumes to the RAID array and there seems to
be a threshold: the original installation was done with one volume configured
(no problem there) and I've been trying to install with eight volumes configured
(problems galore here). We deleted six volumes, recreated one, and tried to
install RHEL4U4 on that (the third configured volume) - that was successful. I
have not tried RHEL4.5snap1 yet and I have not tried to find exactly where the
threshold is, but I'll do that tomorrow and let you know.

BTW, RHEL5 does not hit this problem at all (most of the deleted volumes had
versions of RHEL5 on them).

Comment 15 Nick Dokos 2007-03-14 15:10:47 UTC
I added a fourth volume to the RAID array and installed RHEL4.5snap1 with no
problem. I have not determined the threshold yet, but it is clear that the
problem is not parted: it just gets bum information. We'll try to interpose
a newer device driver at installation time, once we determine where the failure
threshold is.

Comment 16 David Cantrell 2007-03-14 15:42:58 UTC
That is good to hear [that it's not parted], but I'm interested to know what is
happening.  Thanks for the feedback.

Comment 17 Nick Dokos 2007-03-14 16:31:05 UTC
We've gone all the way back to eight volumes, step by step (i.e. adding one
volume at a time and installing RHEL4.5snap1 on each newly added volume) and I
*still* don't see the former failure - so there is no "threshold".

The only explanation that we can think of is the following:

o before blowing away (almost) all the volumes, we updated the firmware on
the box and then on the P400 controller. Neither of these solved the problem at
the time.

o we also blew away the one volume that I was trying to install on and recreated
it (with the new firmware in place). That also was unsuccessful.

o but now that we've recreated six of the volumes from scratch with the new
firmware in place, the problem has disappeared.

The situation is deeply unsatisfying but there it is.

For the record, the new firmware on the box says:

A07 (12/02/2006)

and the firmware on the P400 controller says:

v2.08

The previous version of the controller firmware was v1.18.


Comment 18 David Lawrence 2007-03-15 03:07:43 UTC
In your opinion do you think this can be closed then?

Comment 19 Nick Dokos 2007-03-15 14:22:39 UTC
I think so - it's almost certainly *not* a parted problem,
probably a firmware issue with the P400.


Comment 20 David Lawrence 2007-03-15 14:34:56 UTC
Ok closing then. Please reopen if you acquire additional information about this.
Thanks for the report.