Bug 50294

Summary:	Installing RH 7.0 corrupts SCSI sector translation table
Product:	[Retired] Red Hat Linux	Reporter:	jac
Component:	anaconda	Assignee:	Brent Fox <bfox>
Status:	CLOSED RAWHIDE	QA Contact:	Brock Organ <borgan>
Severity:	high	Docs Contact:
Priority:	medium
Version:	7.0
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2001-09-25 15:44:10 UTC	Type:	---

Description jac 2001-07-30 01:31:20 UTC

Description of Problem:
RH 7.0 installed and booted, but it prevented a previously installed copy
of RH 6.2 in a different set of partitions from booting.  6.2 complained of
partitions not ending on cylinder boundaries.  The factory configuration of
the drive was 9.1 GB in 1105 cylinders.  RH 7.0 had changed this to 8669
cylinders, and it didn't ask permission first.  The fix was to install RH
6.0 over 7.0, and use the 6.0 fdisk to repair the translation table.  6.0
boots from the MBR, but 6.2 must be booted from a floppy because its
/sbin/lilo still has problems with the partition table; so the fix is
incomplete.  Have yet to try editing /etc/lilo.conf in the 6.0 installation
to also boot the 6.2 installation.
Since tampering with either the sector translation table or the partition
table, especially without asking permission first, is unacceptable in a
multi-boot installation, I consider RH 7.0 completely unusable.  Therefore
I marked the severity "high".
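For reference, the two cylinder counts above can be compared by simple arithmetic. The sketch below assumes 512-byte sectors (the report does not state the sector size) and uses the head and sector counts 255/63 and 64/32 given in Comment 6 of this report. The function name is illustrative only.

```python
SECTOR_BYTES = 512  # assumption: standard 512-byte sectors


def total_sectors(cylinders, heads, sectors_per_track):
    """Number of sectors addressable under a given CHS geometry."""
    return cylinders * heads * sectors_per_track


old = total_sectors(1105, 255, 63)   # geometry as configured by the 6.0 fdisk
new = total_sectors(8669, 64, 32)    # geometry shown by the 7.0 installer

print(old, old * SECTOR_BYTES)  # 17751825 9088934400
print(new, new * SECTOR_BYTES)  # 17754112 9090105344
```

Both products come to roughly 9.1 GB, which is consistent with the same drive being described under two different geometries rather than being resized.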


How Reproducible:
Unwilling to make the experiment on my hard disk.  If you want me to
disconnect the hard disk and test on a 1 GB Jaz drive, I could do that.

Additional Information:
System is a 486 with 48 MB RAM, Adaptec AHA-1542 controller, Seagate
ST410800N drive at /dev/sda in LBA mode.  Working partitions for main Linux
installation are: /boot on /dev/sda2, / on /dev/sda8, /var on /dev/sda6,
/usr on /dev/sda10, /usr/local on /dev/sda12, and /home on /dev/sda13.  The
alternate set of partitions for testing new Linux installations is /boot on
/dev/sda3, / on /dev/sda9, /var on /dev/sda7, /usr on /dev/sda11. 
/usr/local and /home are common to the main and test installations. 
/dev/sda1 was originally intended for DOS, but that was never installed. 
/dev/sda5 is swap.  /dev/sda1, 2, 3 are primary partitions and /dev/sda4 is
the extended partition.
More details if you want them.

Comment 1 Brent Fox 2001-07-30 14:23:00 UTC

Did you create a new partition with Disk Druid in order to do the 7.0 installation?

Comment 2 jac 2001-08-02 17:06:27 UTC

	No.  I have never run Disk Druid at all.  I selected fdisk because I was
required to choose one or the other.  In fdisk I didn't make any changes to the
partition table, and exited without saving anything.  I created all the
partitions about a year and a half ago, and never changed them through several
cycles of installing RH 6.0 and 6.2 in one set of partitions or the other.
	RH 7.0 apparently didn't change the partition table either, but it
spontaneously changed the sector translation table, causing the partition table
to become invalid because the partitions no longer ended on cylinder boundaries.
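The effect described here can be illustrated with a small check. This is a sketch only: the partition end used below is hypothetical, not taken from the reporter's disk, and the function name is an assumption. A partition "ends on a cylinder boundary" when the sector following its last sector is the first sector of a cylinder under the geometry in use.

```python
def ends_on_cylinder_boundary(end_lba, heads, sectors_per_track):
    """True if the sector after end_lba begins a new cylinder."""
    sectors_per_cylinder = heads * sectors_per_track
    return (end_lba + 1) % sectors_per_cylinder == 0


# Hypothetical partition whose last sector closes cylinder 9 under 255/63.
end_lba = 10 * 255 * 63 - 1

print(ends_on_cylinder_boundary(end_lba, 255, 63))  # True
print(ends_on_cylinder_boundary(end_lba, 64, 32))   # False
```

The partition table itself is unchanged between the two calls; only the geometry used to interpret it differs.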

Comment 3 jac 2001-08-07 01:56:54 UTC

It occurs to me that the problem could be in the version of /sbin/lilo shipped
with 7.0.  I let anaconda install the bootloader in the Master Boot Record of
/dev/sda.  Presumably, anaconda runs lilo to write the MBR, so if lilo wrote
past the boot sector, it could have trashed the sector translation table.

Comment 4 Brent Fox 2001-08-15 02:48:59 UTC

Yes, the installer just runs LILO.  I guess it is possible that LILO modified
your partition table, but I'd be very surprised.  I haven't heard of anyone
having this problem before.  
The move to GRUB in the next release of Red Hat Linux should avoid this problem
in the future.  Since you aren't willing to try to reproduce on your original
hard drive, I don't know if the problem is reproducible.  I haven't seen it in
any of my testing.
I'm resolving this as Rawhide since the next release of Red Hat will use GRUB.

Comment 5 jac 2001-08-22 01:41:42 UTC

  I've heard good things about GRUB.  Perhaps as part of the QA before putting
it into the mainstream distribution, it might be well to audit the code to make
sure it doesn't write outside the MBR.  Doing anything at all to either the
partition table or the sector translation table would be an extremely serious
bug.  In my case it was the translation table that got corrupted, not the
partition table.
  I think you can understand why I'm reluctant to do this kind of experiment on
my hard disk.  I was extremely lucky to get back access to everything in my
/home partition without having to reformat and re-install from tape.  I'm
planning to replace this system in a few months; I'll be willing to do these
kinds of tests after my files are moved over.
  Perhaps this bug wasn't previously reported because all-SCSI systems aren't
all that common, especially on older platforms such as 486s.  IDE drives don't
have translation tables.  However, you might hear more about bugs that affect
only SCSI drives.  Athlon motherboards with built-in Ultra-160 SCSI interfaces
are showing up in some of the newer high-performance boxes.  As it happens,
that's what I'm planning for my next desktop machine.

Comment 6 jac 2001-09-01 03:28:12 UTC

   Found method of reproducing bug without writing on disk.  (For reference, the
hard disk is a Seagate ST410800N.)
1.  Perform RH 6.0 installation procedure, up to the filesystem setup screen. 
This screen shows correctly that the cylinder-head-sector geometry is
1105/255/63 as configured earlier by 6.0 fdisk.  Reset system.
2.  Perform RH 7.0 installation in expert text mode, up to the filesystem setup
screen.  This screen shows incorrectly that the CHS geometry is 8669/64/32.  It
ignores the actual geometry written on the disk.  Reset system.
3.  Repeat step 1.  The 6.0 filesystem setup screen still shows that the
geometry is 1105/255/63.  Reset system.
4.  Boot the existing RH 6.2 partitions with a boot floppy as described in the
original bug report.  It still boots and connects to the Internet on command.
   With this procedure, the bug is reproduced every time.  The behavior implies
pretty strongly that Anaconda is involved in the bug, although it may not be the
only program involved.

Comment 7 Matt Wilson 2001-09-05 23:50:34 UTC

Very odd.  The term 'sector translation table' isn't very common these days. 
It's an old CP/M concept, is it not?  We're not in the days of SECTRAN anymore -
where is this sector translation table you're talking about?

Comment 8 jac 2001-09-06 01:40:01 UTC

   It's possible I might be using the wrong terminology.  I read about this 3 or
4 years ago in a description of how SCSI is used on the PC architecture. 
Unfortunately, I've been unable to find the documentation again, despite a long
search through the bookcases.  As I recall, SCSI drives inherently use linear
block addressing, but some systems and programs insist on viewing them through a
CHS addressing scheme just the same.  So a "sector translation table" maps the
linear addressing to CHS.  This information is written on the disk itself. 
Apparently, it's a group of binary numbers written somewhere in the first sector
of the drive, where the BIOS can read it.  fdisk is able to rewrite this
information, by going to the "x" submenu and using the "c", "h", and "s"
commands.  That's what I used to restore my disk's geometry to the proper
1105/255/63.
   (If I do have the terminology wrong, "sector translation table" might refer
to an internal map in the drive that maps its physical geometry to the linear
addressing that the outside world sees.  In that case, the mapping between CHS
and LBA is determined by some data written in the MBR, for which I don't know
the correct name.)
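The mapping between CHS and linear addressing described above is usually expressed by the conventional formula LBA = (C * heads + H) * sectors_per_track + (S - 1), where sectors are numbered from 1. The sketch below applies that formula; the function names are illustrative, and the sample address is hypothetical.

```python
def chs_to_lba(cylinder, head, sector, heads, spt):
    """Convert a CHS address to a linear block address (sectors count from 1)."""
    if not (0 <= head < heads and 1 <= sector <= spt):
        raise ValueError("head or sector out of range for this geometry")
    return (cylinder * heads + head) * spt + (sector - 1)


def lba_to_chs(lba, heads, spt):
    """Convert a linear block address back to (cylinder, head, sector)."""
    cylinder, remainder = divmod(lba, heads * spt)
    head, sector_index = divmod(remainder, spt)
    return cylinder, head, sector_index + 1


# The first sector of cylinder 100 under 255/63 ...
lba = chs_to_lba(100, 0, 1, heads=255, spt=63)
print(lba)                                # 1606500
# ... is not the start of any cylinder under 64/32.
print(lba_to_chs(lba, heads=64, spt=32))  # (784, 27, 5)
```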
   RH installation programs for 6.2 and earlier apparently read this information
from the disk when the filesystem setup screen is displayed, and preserve it
when the MBR is written back to the disk.  With Anaconda (7.0 and later), what
appears to be happening is that the setup screen gets the total number of
sectors right, but ignores the geometry that's written on the disk, and blindly
substitutes 8669/64/32.  (As I discovered, this happens even when neither fdisk
nor Disk Druid is executed.)  That renders the partition table invalid, because
the partitions no longer start on "cylinder" boundaries.  Apparently some
compensating hack lets 7.0's LILO rewrite the MBR with the trashed CHS values
and the pre-existing partitions that are assigned to 7.0's mount points, but
neither 7.0's LILO nor 6.2's LILO can set the system up to dual-boot with a
pre-existing set of partitions in which an earlier Red Hat version is still
installed.
   Looks like someone who understands the raw binary data in the MBR is needed
to figure out what's really happening.  I don't have the right documentation for
that, or any idea where to find it.  fdisk does include a command to do a
formatted binary dump of the MBR, but that might not be the best tool for
analyzing and repairing it.  I can dump my MBR to a file and e-mail it, if that
will produce useful clues.
   But this behavior looks like enough information to suggest where in the
source code to look.
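As a starting point for reading the raw data, here is a sketch of decoding the partition table from a 512-byte boot sector dump. It assumes the conventional PC MBR layout: four 16-byte entries at offset 446, followed by the signature bytes 0x55 0xAA at offset 510. Each entry holds a start and end position in packed CHS form plus a starting LBA and a sector count. That layout has no dedicated geometry field, so partitioning tools commonly infer heads and sectors per track from the end CHS values of existing entries. The function names and the sample entry are hypothetical.

```python
import struct


def decode_chs(packed):
    """Decode the 3-byte packed CHS field of an MBR partition entry."""
    head = packed[0]
    sector = packed[1] & 0x3F                       # low 6 bits
    cylinder = ((packed[1] & 0xC0) << 2) | packed[2]  # 10 bits in total
    return cylinder, head, sector


def parse_partition_table(boot_sector):
    """Return the non-empty primary partition entries of an MBR."""
    if len(boot_sector) < 512 or boot_sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR boot sector")
    entries = []
    for i in range(4):
        raw = boot_sector[446 + 16 * i: 446 + 16 * (i + 1)]
        status, start, ptype, end, lba_start, count = struct.unpack("<B3sB3sII", raw)
        if ptype == 0:  # type 0 marks an unused slot
            continue
        entries.append({"type": ptype, "start": decode_chs(start),
                        "end": decode_chs(end),
                        "lba_start": lba_start, "sectors": count})
    return entries


def guess_geometry(entries):
    """Guess (heads, sectors_per_track), assuming the first partition
    ends on a cylinder boundary."""
    if not entries:
        return None
    _, head, sector = entries[0]["end"]
    return head + 1, sector


# Synthetic example: one Linux partition ending at C9/H254/S63.
entry = struct.pack("<B3sB3sII", 0x80, b"\x01\x01\x00", 0x83,
                    b"\xfe\x3f\x09", 63, 160587)
sample = bytes(446) + entry + bytes(48) + b"\x55\xaa"
parts = parse_partition_table(sample)
print(parts[0]["end"], guess_geometry(parts))  # (9, 254, 63) (255, 63)
```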

Comment 9 Matt Wilson 2001-09-06 02:46:20 UTC

7.0 is using the same partition manipulation code as 6.2.  I'm not sure what
could have changed to modify the in-MBR geometry.

Comment 10 Matt Wilson 2001-09-06 02:47:42 UTC

Try turning off the 'linear' option in the LILO configuration window.  The
default changed from "off" in 6.2 to "on" in 7.0.

Comment 11 jac 2001-09-07 02:12:07 UTC

   I don't get as far as the LILO configuration window.  The bug manifests
itself earlier, at the filesystem setup screen.  If I ran LILO, I'd probably get
a corrupted drive again, and not be able to access my /home or /usr/local
partitions.  (FYI, I don't display any kind of "window" during installation.  I
must use text mode installation, because my monitor is not capable of syncing to
any VGA mode.  Graphic display is possible only with a hand-edited XF86Config
file with a fixed-frequency video board, and that's not available until all
post-install configuration work is completed.)
   I have always had to explicitly set LBA mode to "on" to make an installation
on this disk work at all.
   Anyway, I'll shut down now and run the 6.2 installation up to the filesystem
setup screen, and report what I see.

Comment 12 jac 2001-09-07 03:44:27 UTC

   Following up your message, I ran the 6.2 installation program as far as the
filesystem setup screen.  It did indeed display the same erroneous CHS geometry
as the 7.0 Anaconda screen.
   Why this didn't interfere with dual-booting RH 6.0, back when I first
installed RH 6.2, is a mystery.  Possibly this part of the installation program
interacts with LILO in a different way at 6.2 than at 7.0.
   At any rate, tackling this bug may make it easier to understand any other
problems hiding behind it.  While skimming the LILO User's Guide, Version 20,
pages 27-28 tonight, I noticed an "unsafe" option that stops LILO from reading
the boot sector at map creation time -- this may be a clue.  Another wild
hypothesis is that maybe either the format or the location of the CHS fields is
different for 6.0's LILO than for 6.2 and 7.0's LILO.  If true, that might be
the reason the later filesystem setup screens don't see the CHS data written by
6.0's fdisk, and make an incorrect default assumption.
   By itself, and even without considerations of dual-booting and preserving
access to pre-existing partitions, spontaneously altering the CHS geometry in
this way has two undesirable effects:
	1.  It slightly reduces the number of sectors available for partitions;
	2.  It radically reduces the number of sectors available for bootable
partitions located entirely below "track 1023".
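The second effect can be put in numbers. The sketch below assumes 512-byte sectors and the traditional BIOS limit of 1024 addressable cylinders (0 through 1023) for boot loaders that rely on CHS addressing. The constant and function names are illustrative.

```python
SECTOR_BYTES = 512          # assumption: standard 512-byte sectors
BIOS_MAX_CYLINDERS = 1024   # cylinders 0..1023 fit in a 10-bit field


def bootable_bytes(heads, sectors_per_track):
    """Bytes lying entirely within the first 1024 cylinders."""
    return BIOS_MAX_CYLINDERS * heads * sectors_per_track * SECTOR_BYTES


print(bootable_bytes(255, 63))  # 8422686720
print(bootable_bytes(64, 32))   # 1073741824
```

Under 255/63 nearly the whole drive lies inside the limit; under 64/32 only the first 1 GiB does.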
   Any more tests I can run to help you figure out what's going on -- without
doing scary things to my boot sector?  For that matter, do you know of a Linux
utility that can save a boot sector to a floppy and restore it, so I can afford
to be more adventurous?  I don't think I want to try hacking raw hex code for
the boot sector, unless I can find clear and detailed docs on its format.

Comment 13 Matt Wilson 2001-09-07 16:28:45 UTC

To save the boot sector:

dd if=/dev/sda of=saved-bootsector bs=512 count=1

To restore:

dd if=saved-bootsector of=/dev/sda bs=512 count=1
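A scripted equivalent of the save step, with a read-back check, might look like the sketch below. It is not part of any Red Hat tool; the function names are hypothetical, and the paths are whatever the caller supplies (reading a real device such as /dev/sda normally requires root).

```python
BOOT_SECTOR_SIZE = 512


def save_boot_sector(device_path, backup_path):
    """Copy the first 512 bytes of device_path to backup_path and return them."""
    with open(device_path, "rb") as dev:
        data = dev.read(BOOT_SECTOR_SIZE)
    if len(data) != BOOT_SECTOR_SIZE:
        raise IOError("short read: expected %d bytes" % BOOT_SECTOR_SIZE)
    with open(backup_path, "wb") as out:
        out.write(data)
    return data


def backup_matches(device_path, backup_path):
    """True if the backup is identical to the device's current boot sector."""
    with open(device_path, "rb") as dev, open(backup_path, "rb") as bak:
        return dev.read(BOOT_SECTOR_SIZE) == bak.read(BOOT_SECTOR_SIZE)
```

Note that restoring all 512 bytes rewrites the partition table (bytes 446 to 509) along with the boot code, so a backup should only be restored onto a disk whose partitions have not changed since it was taken.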

Did you try toggling the linear option in 7.0?  (You have to do a custom install
or an upgrade to see it.)

Comment 14 Brent Fox 2001-09-17 18:21:14 UTC

Any more info here?

Comment 15 Brent Fox 2001-09-25 15:44:06 UTC

An email received from jac.com:

Date: Mon, 24 Sep 2001 21:24:00 -0400 (EDT)
From: Jack Carroll <jac.com>
Subject: No new test data on 50294
To: <bfox>, <borgan>

        I've been swamped at work all week, so I've done almost no work on
this.  The only thing new is that I bought an old Compaq Deskpro at a
going-out-of-business sale, to turn into an iptables firewall.  It has an
IDE drive.  I've established that the RH 6.0 and 7.0 install programs
agree on its CHS geometry.  So this still looks like a SCSI-specific bug.
        What I think is going on is that the CHS information for a SCSI
drive is located in a different place in the boot sector, between RH 6.0
LILO and RH 7.0 LILO.  Quite likely the 7.0 location matches Microsoft's,
because it clobbers the geometry exactly the way a DOS 5.0 installation
attempt did a couple of years ago.  But I haven't built the two versions
of the boot sector and DD'd them to a floppy for analysis yet, so this
remains a hypothesis.

Best regards,
Jack Carroll
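The comparison proposed in the email above could be sketched as a byte-level diff of two 512-byte boot sector dumps. The region boundaries assume the conventional MBR layout (boot code in bytes 0 to 445, partition table in bytes 446 to 509, signature in bytes 510 to 511). The function names are hypothetical.

```python
def region(offset):
    """Name the conventional MBR region that a byte offset falls in."""
    if offset < 446:
        return "boot code"
    if offset < 510:
        return "partition table"
    return "signature"


def diff_boot_sectors(first, second):
    """List (offset, byte_in_first, byte_in_second, region) for differing bytes."""
    if len(first) != 512 or len(second) != 512:
        raise ValueError("both dumps must be exactly 512 bytes")
    return [(i, first[i], second[i], region(i))
            for i in range(512) if first[i] != second[i]]
```

If the differences between the two dumps all fall under "boot code", the CHS values in the partition table entries were left alone by the boot loader install; differences under "partition table" would point the other way.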

Comment 16 Brent Fox 2001-09-25 15:47:56 UTC

Ok well, if you get a chance to run the install with the 'linear' option
toggled, let us know how that turns out.  Also, I'd be interested if you see the
same problem with 7.1 or the public beta for the next release.  If you have
time, that is.

I'm resolving the report as Rawhide, because all the partitioning code has been
replaced in our current internal trees, so I expect this problem to not happen
in the future.  Please reopen this report if you see it crop up in future
releases.  Thanks for your report.