Bug 50294
Summary: | Installing RH 7.0 corrupts SCSI sector translation table | | |
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | jac |
Component: | anaconda | Assignee: | Brent Fox <bfox> |
Status: | CLOSED RAWHIDE | QA Contact: | Brock Organ <borgan> |
Severity: | high | Docs Contact: | |
Priority: | medium | | |
Version: | 7.0 | | |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | i386 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2001-09-25 15:44:10 UTC | Type: | --- |
Description
jac
2001-07-30 01:31:20 UTC
Did you create a new partition with Disk Druid in order to do the 7.0 installation?

No. I have never run Disk Druid at all. I selected fdisk because I was required to choose one or the other. In fdisk I didn't make any changes to the partition table, and exited without saving anything. I created all the partitions about a year and a half ago, and never changed them through several cycles of installing RH 6.0 and 6.2 in one set of partitions or the other. RH 7.0 apparently didn't change the partition table either, but it spontaneously changed the sector translation table, causing the partition table to become invalid because the partitions no longer ended on cylinder boundaries.

It occurs to me that the problem could be in the version of /sbin/lilo shipped with 7.0. I let anaconda install the bootloader in the Master Boot Record of /dev/sda. Presumably, anaconda runs lilo to write the MBR, so if lilo wrote past the boot sector, it could have trashed the sector translation table.

Yes, the installer just runs LILO. I guess it is possible that LILO modified your partition table, but I'd be very surprised. I haven't heard of anyone having this problem before. The move to GRUB in the next release of Red Hat Linux should avoid this problem in the future.

Since you aren't willing to try to reproduce on your original hard drive, I don't know if the problem is reproducible. I haven't seen it in any of my testing. I'm resolving this as Rawhide since the next release of Red Hat will use GRUB.

I've heard good things about GRUB. Perhaps as part of the QA before putting it into the mainstream distribution, it might be well to audit the code to make sure it doesn't write outside the MBR. Doing anything at all to either the partition table or the sector translation table would be an extremely serious bug. In my case it was the translation table that got corrupted, not the partition table.

I think you can understand why I'm reluctant to do this kind of experiment on my hard disk.
I was extremely lucky to get back access to everything in my /home partition without having to reformat and re-install from tape. I'm planning to replace this system in a few months; I'll be willing to do these kinds of tests after my files are moved over.

Perhaps this bug wasn't previously reported because all-SCSI systems aren't all that common, especially on older platforms such as 486s. IDE drives don't have translation tables. However, you might hear more about bugs that affect only SCSI drives. Athlon motherboards with built-in Ultra-160 SCSI interfaces are showing up in some of the newer high-performance boxes. As it happens, that's what I'm planning for my next desktop machine.

Found method of reproducing bug without writing on disk. (For reference, the hard disk is a Seagate ST410800N.)

1. Perform RH 6.0 installation procedure, up to the filesystem setup screen. This screen shows correctly that the cylinder-head-sector geometry is 1105/255/63 as configured earlier by 6.0 fdisk. Reset system.
2. Perform RH 7.0 installation in expert text mode, up to the filesystem setup screen. This screen shows incorrectly that the CHS geometry is 8669/64/32. It ignores the actual geometry written on the disk. Reset system.
3. Repeat step 1. The 6.0 filesystem setup screen still shows that the geometry is 1105/255/63. Reset system.
4. Boot the existing RH 6.2 partitions with a boot floppy as described in the original bug report. It still boots and connects to the Internet on command.

With this procedure, the bug is reproduced every time. The behavior implies pretty strongly that Anaconda is involved in the bug, although it may not be the only program involved.

Very odd. The term 'sector translation table' isn't very common in this day. It's an old CP/M concept, is it not? We're not in the days of SECTRAN anymore - where is this sector translation table you're talking about?

It's possible I might be using the wrong terminology.
I read about this 3 or 4 years ago in a description of how SCSI is used on the PC architecture. Unfortunately, I've been unable to find the documentation again, despite a long search through the bookcases.

As I recall, SCSI drives inherently use linear block addressing, but some systems and programs insist on viewing them through a CHS addressing scheme just the same. So a "sector translation table" maps the linear addressing to CHS. This information is written on the disk itself. Apparently, it's a group of binary numbers written somewhere in the first sector of the drive, where the BIOS can read it. fdisk is able to rewrite this information, by going to the "x" submenu and using the "c", "h", and "s" commands. That's what I used to restore my disk's geometry to the proper 1105/255/63.

(If I do have the terminology wrong, "sector translation table" might refer to an internal map in the drive that maps its physical geometry to the linear addressing that the outside world sees. In that case, the mapping between CHS and LBA is determined by some data written in the MBR, for which I don't know the correct name.)

RH installation programs for 6.2 and earlier apparently read this information from the disk when the filesystem setup screen is displayed, and preserve it when the MBR is written back to the disk. With Anaconda (7.0 and later), what appears to be happening is that the setup screen gets the total number of sectors right, but ignores the geometry that's written on the disk, and blindly substitutes 8669/64/32. (As I discovered, this happens even when neither fdisk nor Disk Druid is executed.) That renders the partition table invalid, because the partitions no longer start on "cylinder" boundaries.
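The boundary problem described above can be checked with plain arithmetic. The sketch below uses only the two geometries quoted in this report (1105/255/63 and 8669/64/32) and the conventional LBA-to-CHS conversion. The actual partition layout on the disk is not given in the report, so the "partition start" positions here are hypothetical: every whole-cylinder offset under the 255/63 geometry.

```python
# Minimal sketch: compare the two CHS geometries quoted in the report.
# The partition start positions are hypothetical (every cylinder boundary
# of the 255/63 geometry); the report does not list the real layout.

OLD_GEOMETRY = (1105, 255, 63)   # cylinders, heads, sectors per track
NEW_GEOMETRY = (8669, 64, 32)


def total_sectors(geometry):
    """Number of sectors addressable under a C/H/S geometry."""
    cylinders, heads, sectors = geometry
    return cylinders * heads * sectors


def cylinder_size(geometry):
    """Sectors per cylinder (heads * sectors per track)."""
    _, heads, sectors = geometry
    return heads * sectors


def lba_to_chs(lba, heads, sectors):
    """Conventional conversion; CHS sector numbers start at 1."""
    cylinder, remainder = divmod(lba, heads * sectors)
    head, sector_index = divmod(remainder, sectors)
    return cylinder, head, sector_index + 1


def misaligned_boundaries(old, new):
    """Old-geometry cylinder starts that are not cylinder starts in new."""
    old_size = cylinder_size(old)
    new_size = cylinder_size(new)
    return [k for k in range(1, old[0]) if (k * old_size) % new_size != 0]


# Start of cylinder 1 under the old geometry, viewed through both geometries.
first_boundary = cylinder_size(OLD_GEOMETRY)
old_view = lba_to_chs(first_boundary, OLD_GEOMETRY[1], OLD_GEOMETRY[2])
new_view = lba_to_chs(first_boundary, NEW_GEOMETRY[1], NEW_GEOMETRY[2])
bad = misaligned_boundaries(OLD_GEOMETRY, NEW_GEOMETRY)
```

Under these assumptions a cylinder is 16065 sectors in the old geometry and 2048 sectors in the new one. Because 16065 is odd, no whole-cylinder offset below cylinder 1105 is also a multiple of 2048, so any partition that started on a cylinder boundary under 255/63 lands mid-cylinder under 64/32.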
Apparently some compensating hack lets 7.0's LILO rewrite the MBR with the trashed CHS values and the pre-existing partitions that are assigned to 7.0's mount points, but neither 7.0's LILO nor 6.2's LILO can set the system up to dual-boot with a pre-existing set of partitions in which an earlier Red Hat version is still installed.

Looks like someone who understands the raw binary data in the MBR is needed to figure out what's really happening. I don't have the right documentation for that, or any idea where to find it. fdisk does include a command to do a formatted binary dump of the MBR, but that might not be the best tool for analyzing and repairing it. I can dump my MBR to a file and e-mail it, if that will produce useful clues. But this behavior looks like enough information to suggest where in the source code to look.

7.0 is using the same partition manipulation code as 6.2. I'm not sure what could have changed to modify the in-MBR geometry. Try turning off the 'linear' option in the lilo configuration window. The default changed from "off" in 6.2 to "on" in 7.0.

I don't get as far as the LILO configuration window. The bug manifests itself earlier, at the filesystem setup screen. If I ran LILO, I'd probably get a corrupted drive again, and not be able to access my /home or /usr/local partitions. (FYI, I don't display any kind of "window" during installation. I must use text mode installation, because my monitor is not capable of syncing to any VGA mode. Graphic display is possible only with a hand-edited XF86Config file with a fixed-frequency video board, and that's not available until all post-install configuration work is completed.) I have always had to explicitly set LBA mode to "on" to make an installation on this disk work at all. Anyway, I'll shut down now and run the 6.2 installation up to the filesystem setup screen, and report what I see.

Following up your message, I ran the 6.2 installation program as far as the filesystem setup screen.
It did indeed display the same erroneous CHS geometry as the 7.0 Anaconda screen. Why this didn't interfere with dual-booting RH 6.0, back when I first installed RH 6.2, is a mystery. Possibly this part of the installation program interacts with LILO in a different way at 6.2 than at 7.0. At any rate, tackling this bug may make it easier to understand any other problems hiding behind it.

While skimming the LILO User's Guide, Version 20, pages 27-28 tonight, I noticed an "unsafe" option that stops LILO from reading the boot sector at map creation time -- this may be a clue. Another wild hypothesis is that maybe either the format or the location of the CHS fields is different for 6.0's LILO than for 6.2 and 7.0's LILO. If true, that might be the reason the later filesystem setup screens don't see the CHS data written by 6.0's fdisk, and make an incorrect default assumption.

By itself, and even without considerations of dual-booting and preserving access to pre-existing partitions, spontaneously altering the CHS geometry in this way has two undesirable effects:

1. It slightly reduces the number of sectors available for partitions;
2. It radically reduces the number of sectors available for bootable partitions located entirely below "track 1023".

Any more tests I can run to help you figure out what's going on -- without doing scary things to my boot sector? For that matter, do you know of a Linux utility that can save a boot sector to a floppy and restore it, so I can afford to be more adventurous? I don't think I want to try hacking raw hex code for the boot sector, unless I can find clear and detailed docs on its format.

To save and restore the boot sector:

    dd if=/dev/sda of=saved-bootsector bs=512 count=1

To restore:

    dd if=saved-bootsector of=/dev/sda bs=512 count=1

Did you try toggling the linear option in 7.0? (You have to do a custom install or an upgrade to see it.)

Any more info here?
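A copy saved with the dd command above can be examined offline, without touching the disk. The sketch below assumes the conventional PC boot sector layout: four 16-byte partition entries starting at byte 446 and the 0x55AA signature in the last two bytes. One place CHS values do appear in that sector is inside the partition entries themselves, so the sketch decodes them and guesses a heads/sectors pair from each entry's ending CHS. That guess is only a heuristic and assumes partitions end on cylinder boundaries. The function and class names are illustrative, not taken from any Red Hat tool.

```python
import struct
from typing import NamedTuple

SECTOR_SIZE = 512
TABLE_OFFSET = 446          # first of four partition entries
ENTRY_SIZE = 16
SIGNATURE = b"\x55\xaa"     # last two bytes of a valid boot sector


class Partition(NamedTuple):
    index: int
    bootable: bool
    type_id: int
    start_chs: tuple        # (cylinder, head, sector)
    end_chs: tuple
    start_lba: int
    num_sectors: int


def unpack_chs(head, sector_cyl, cyl_low):
    """Decode the packed 3-byte CHS field of a partition entry."""
    sector = sector_cyl & 0x3F                      # low 6 bits
    cylinder = ((sector_cyl & 0xC0) << 2) | cyl_low  # 10-bit cylinder
    return cylinder, head, sector


def parse_boot_sector(data):
    """Return the non-empty partition entries of a saved boot sector."""
    if len(data) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    if data[510:512] != SIGNATURE:
        raise ValueError("missing 0x55AA boot signature")
    partitions = []
    for i in range(4):
        offset = TABLE_OFFSET + i * ENTRY_SIZE
        fields = struct.unpack("<8BII", data[offset:offset + ENTRY_SIZE])
        status, sh, ssc, scl, type_id, eh, esc, ecl, start_lba, count = fields
        if type_id == 0:    # unused slot
            continue
        partitions.append(Partition(
            i, status == 0x80, type_id,
            unpack_chs(sh, ssc, scl), unpack_chs(eh, esc, ecl),
            start_lba, count))
    return partitions


def implied_geometry(partitions):
    """Heuristic (heads, sectors) guesses from each entry's ending CHS.

    Assumes every partition ends on the last head and sector of a cylinder.
    """
    return {(p.end_chs[1] + 1, p.end_chs[2]) for p in partitions}


def last_cylinder(partition, heads, sectors):
    """Cylinder holding the partition's final sector under a geometry."""
    end_lba = partition.start_lba + partition.num_sectors - 1
    return end_lba // (heads * sectors)
```

To use it, read the saved file in binary mode, for example `parse_boot_sector(open("saved-bootsector", "rb").read())`, and compare `implied_geometry()` against the geometry the installer reports. `last_cylinder()` shows how the choice of geometry moves a partition relative to the 1024-cylinder limit: cylinder 1024 begins after 1024 × 16065 = 16,450,560 sectors under 255/63, but after only 1024 × 2048 = 2,097,152 sectors (1 GiB) under 64/32.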
An email received from jac.com:

Date: Mon, 24 Sep 2001 21:24:00 -0400 (EDT)
From: Jack Carroll <jac.com>
Subject: No new test data on 50294
To: <bfox>, <borgan>

I've been swamped at work all week, so I've done almost no work on this. The only thing new is that I bought an old Compaq Deskpro at a going-out-of-business sale, to turn into an iptables firewall. It has an IDE drive. I've established that the RH 6.0 and 7.0 install programs agree on its CHS geometry. So this still looks like a SCSI-specific bug.

What I think is going on is that the CHS information for a SCSI drive is located in a different place in the boot sector, between RH 6.0 LILO and RH 7.0 LILO. Quite likely the 7.0 location matches Microsoft's, because it clobbers the geometry exactly the way a DOS 5.0 installation attempt did a couple of years ago. But I haven't built the two versions of the boot sector and DD'd them to a floppy for analysis yet, so this remains a hypothesis.

Best regards,
Jack Carroll

Ok well, if you get a chance to run the install with the 'linear' option toggled, let us know how that turns out. Also, I'd be interested if you see the same problem with 7.1 or the public beta for the next release. If you have time, that is.

I'm resolving the report as Rawhide, because all the partitioning code has been replaced in our current internal trees, so I expect this problem to not happen in the future. Please reopen this report if you see it crop up in future releases. Thanks for your report.