Bug 508491 - for raid1 array, disabling one disk gives Hard Disk Error, disabling the other gives grub>
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: 11
Hardware: All
OS: Linux
low
medium
Target Milestone: ---
Assignee: Hans de Goede
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-06-27 21:08 UTC by Bob Gustafson
Modified: 2009-09-01 09:21 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-01 09:21:59 UTC
Type: ---
Embargoed:



Description Bob Gustafson 2009-06-27 21:08:09 UTC
Description of problem:

I have a system with two SCSI disks running software raid1. Fedora 11 was installed (full wipe).

There was some confusion on my part about whether RAID was actually working (see Bug 508478). Since this is software RAID, mdadm is the applicable utility, not dmraid.
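For the question of whether a software RAID array is healthy, the kernel's /proc/mdstat status is the usual place to look. The following is a minimal sketch in Python, assuming the common mdstat layout in which a member map such as [UU] or [U_] follows each array; the function name and the sample text are illustrative and do not come from the reporter's machine.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map (e.g. [U_]) shows a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines look like "... [2/2] [UU]"; "_" marks a missing member.
        status = re.search(r"\[([U_]+)\]", line)
        if status and current and "_" in status.group(1):
            degraded.append(current)
    return degraded

# Illustrative sample only (hypothetical device names and sizes).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      71577536 blocks [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a live system the text would come from reading /proc/mdstat; an empty list would suggest that all arrays have their full set of members.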

However, in the course of testing whether raid was working, I disabled one of the disks and when the system was rebooted, I saw:

Hard Disk Error

No grub, no nothing.

When this disk was enabled and the other disabled, I saw

grub>

and was able to boot with that disk disabled after some fiddling with the grub command line.

-----

The fact that I got Hard Disk Error with one raid disk enabled and grub> with the other seems to indicate that grub was not written properly to both disks during the Anaconda install.

To test this hypothesis, I switched the two disks. If one disk still has the Hard Disk Error even though it is in a different slot, it indicates that it has the bad grub in its MBR.

This was the case - the Hard Disk Error followed the physical disk, no matter which slot it was in.

---

A further check of the MBR contents of both disks shows differences:

[root@hoho2 user1]# dd if=/dev/sdb bs=512 count=1 | od -c > sdb.mbr
1+0 records in
1+0 records out
512 bytes (512 B) copied, 7.5149e-05 s, 6.8 MB/s
[root@hoho2 user1]# dd if=/dev/sda bs=512 count=1 | od -c > sda.mbr
1+0 records in
1+0 records out
512 bytes (512 B) copied, 4.8149e-05 s, 10.6 MB/s
[root@hoho2 user1]# diff sdb.mbr sda.mbr
1,5c1,5
< 0000000 353   H 220   l   l   l   l   l   l   l   l   l   l   l   l   l
< 0000020   l   l   l   l   l   l   l   l   l   l   l   l   l   l   l   l
< *
< 0000060   l   l   l   l   l   l   l   l   l   l   l   l   l   l 003 002
< 0000100 201  \0  \0 200   3   % 001  \0  \0  \b 372 220 220 366 302 200
---
> 0000000 353   H 220 216 320 274  \0   | 213 364   P  \a   P 037 373 374
> 0000020 277  \0 006 271  \0 001 362 245 352 035 006  \0  \0 276 276  \a
> 0000040 263 004 200   < 200   t 016 200   <  \0   u 034 203 306 020 376
> 0000060 313   u 357 315 030 213 024 213   L 002 213 356 203 306 003 002
> 0000100 200  \0  \0 200   3   % 001  \0  \0  \b 372 220 220 366 302 200
28c28
< 0000660  \0  \0  \0  \0  \0  \0  \0  \0   l   l   l   l   l   l 200 001
---
> 0000660  \0  \0  \0  \0  \0  \0  \0  \0 267 264 006  \0  \0  \0 200 001
[root@hoho2 user1]# 
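
The od/diff comparison above shows that the sectors differ, but not which part of the MBR is affected. As a rough sketch, the same comparison can be done per region in Python, assuming the classic DOS-style MBR layout (boot code, disk signature, two reserved bytes, partition table, boot signature). The function names are illustrative, and reading /dev/sda or /dev/sdb directly would need root.

```python
# Classic (DOS-style) MBR layout; offsets are byte positions in the 512-byte sector.
MBR_REGIONS = [
    ("boot code", 0, 440),
    ("disk signature", 440, 444),
    ("reserved", 444, 446),
    ("partition table", 446, 510),
    ("boot signature", 510, 512),
]

def read_mbr(path):
    """Read the first 512-byte sector of a disk or image file."""
    with open(path, "rb") as f:
        sector = f.read(512)
    if len(sector) != 512:
        raise ValueError(f"{path}: expected 512 bytes, got {len(sector)}")
    return sector

def differing_regions(a, b):
    """Return the names of the MBR regions in which sectors a and b differ."""
    return [name for name, start, end in MBR_REGIONS if a[start:end] != b[start:end]]
```

A call such as differing_regions(read_mbr("/dev/sda"), read_mbr("/dev/sdb")) would then name the regions that differ, which makes it easier to tell a boot-code mismatch from a partition-table mismatch.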


Version-Release number of selected component (if applicable):

 Anaconda on Fedora 11 release disk.

How reproducible:

 Only tried removing disks on this system.

Steps to Reproduce:

1. Install Fedora 11 (full wipe) on system with two disks as raid1
2. Shut down the system and remove one disk
3. Boot and check whether the remaining disk reaches grub>
4. If grub> shows, shut down, reinstall the removed disk, and remove the other
5. Boot and check whether the remaining disk reaches grub>
6. If the boot shows 'Hard Disk Error', that disk does not have grub written properly to its MBR
  
Actual results:

 grub> with one disk removed
 Hard Disk Error with the other disk removed.

Expected results:

 grub> should show, no matter which disk removed.

Additional info:

Comment 1 Bob Gustafson 2009-06-27 21:13:25 UTC
To keep beating this horse, I rewrote grub on both sda and sdb disks.

grub> root (hd0,0)
root (hd0,0)
 Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd0,0)
setup (hd0,0)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd0,0)"... failed (this is not fatal)
 Running "embed /grub/e2fs_stage1_5 (hd0,0)"... failed (this is not fatal)
 Running "install /grub/stage1 (hd0,0) /grub/stage2 p /grub/grub.conf "... succeeded
Done.
grub> root (hd1,0)
root (hd1,0)
 Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd1,0)
setup (hd1,0)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd1,0)"... failed (this is not fatal)
 Running "embed /grub/e2fs_stage1_5 (hd1,0)"... failed (this is not fatal)
 Running "install /grub/stage1 (hd1,0) /grub/stage2 p /grub/grub.conf "... succeeded
Done.
grub> quit
quit
[root@hoho2 user1]# 

Dumping both MBRs again and comparing them still shows differences:

[root@hoho2 user1]# dd if=/dev/sda bs=512 count=1 | od -c > sda2.mbr
1+0 records in
1+0 records out
512 bytes (512 B) copied, 5.0018e-05 s, 10.2 MB/s
[root@hoho2 user1]# dd if=/dev/sdb bs=512 count=1 | od -c > sdb2.mbr
1+0 records in
1+0 records out
512 bytes (512 B) copied, 5.0869e-05 s, 10.1 MB/s
[root@hoho2 user1]# diff sdb2.mbr sda2.mbr
1,5c1,5
< 0000000 353   H 220   l   l   l   l   l   l   l   l   l   l   l   l   l
< 0000020   l   l   l   l   l   l   l   l   l   l   l   l   l   l   l   l
< *
< 0000060   l   l   l   l   l   l   l   l   l   l   l   l   l   l 003 002
< 0000100 201  \0  \0 200   3   % 001  \0  \0  \b 372 220 220 366 302 200
---
> 0000000 353   H 220 216 320 274  \0   | 213 364   P  \a   P 037 373 374
> 0000020 277  \0 006 271  \0 001 362 245 352 035 006  \0  \0 276 276  \a
> 0000040 263 004 200   < 200   t 016 200   <  \0   u 034 203 306 020 376
> 0000060 313   u 357 315 030 213 024 213   L 002 213 356 203 306 003 002
> 0000100 200  \0  \0 200   3   % 001  \0  \0  \b 372 220 220 366 302 200
28c28
< 0000660  \0  \0  \0  \0  \0  \0  \0  \0   l   l   l   l   l   l 200 001
---
> 0000660  \0  \0  \0  \0  \0  \0  \0  \0 267 264 006  \0  \0  \0 200 001
[root@hoho2 user1]# 

[root@hoho4 argouml]# 

Also, removing one disk and booting still gives Hard Disk Error, whereas removing the other disk and booting gives a clean bootup without having to fiddle with the grub command line.

This is really puzzling.

Comment 2 Bob Gustafson 2009-06-28 03:22:47 UTC
Hmm, I think there were some errors in my process.

Should be
  grub> setup (hd0)
instead of
  grub> setup (hd0,0)

Repeating with the above changes:

grub> root (hd0,0)
root (hd0,0)
 Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd0)
setup (hd0)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd0)"...  28 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd0) (hd0)1+28 p (hd0,0)/grub/stage2 /grub/grub.conf"... succeeded
Done.
grub> setup (hd1)
setup (hd1)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd1)"...  28 sectors are embedded.
succeeded
 Running "install /grub/stage1 d (hd1) (hd1)1+28 p (hd0,0)/grub/stage2 /grub/grub.conf"... succeeded
Done.
grub> quit
quit
[root@hoho2 user1]# dd if=/dev/sda bs=512 count=1 | od -c > sda5.mbr
1+0 records in
1+0 records out
512 bytes (512 B) copied, 5.4559e-05 s, 9.4 MB/s
[root@hoho2 user1]# dd if=/dev/sdb bs=512 count=1 | od -c > sdb5.mbr
1+0 records in
1+0 records out
512 bytes (512 B) copied, 7.4893e-05 s, 6.8 MB/s
[root@hoho2 user1]# diff sda5.mbr sdb5.mbr
1,5c1,5
< 0000000 353   H 220 216 320 274  \0   | 213 364   P  \a   P 037 373 374
< 0000020 277  \0 006 271  \0 001 362 245 352 035 006  \0  \0 276 276  \a
< 0000040 263 004 200   < 200   t 016 200   <  \0   u 034 203 306 020 376
< 0000060 313   u 357 315 030 213 024 213   L 002 213 356 203 306 003 002
< 0000100 377  \0  \0     001  \0  \0  \0  \0 002 372 220 220 366 302 200
---
> 0000000 353   H 220   l   l   l   l   l   l   l   l   l   l   l   l   l
> 0000020   l   l   l   l   l   l   l   l   l   l   l   l   l   l   l   l
> *
> 0000060   l   l   l   l   l   l   l   l   l   l   l   l   l   l 003 002
> 0000100 201  \0  \0     001  \0  \0  \0  \0 002 372 220 220 366 302 200
28c28
< 0000660  \0  \0  \0  \0  \0  \0  \0  \0 267 264 006  \0  \0  \0 200 001
---
> 0000660  \0  \0  \0  \0  \0  \0  \0  \0   l   l   l   l   l   l 200 001
[root@hoho2 user1]# 

[root@hoho4 argouml]# 

Still differences in the two MBR blocks.

Note that the
  grub> root (hd0,0)
was done only once, and then followed by
  grub> setup (hd0)
  grub> setup (hd1)

The thought here was that root (hd0,0) defines where the grub files are copied from.  Maybe this is not true.

The two setup commands should write the same information into the MBR blocks. However, the diff command shows that different information is in the MBR blocks.
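
One possible reading of the diff, offered as an assumption rather than a diagnosis: in GRUB Legacy's stage1, the byte at offset 0x40 (the first column of the 0000100 line in the od output) is, to the best of my knowledge, the drive that stage1 loads the next stage from, with 0xFF meaning "the drive the BIOS booted from". In the dump above that byte is 377 on sda and 201 on sdb, which matches the extra "d" in the install command for (hd1). The Python sketch below just decodes that byte; the constant and function names are illustrative.

```python
GRUB_STAGE1_BOOT_DRIVE = 0x40  # assumed offset of the boot-drive byte in GRUB Legacy stage1
GRUB_NO_DRIVE = 0xFF           # assumed marker for "use the drive the BIOS booted from"

def stage1_boot_drive(sector):
    """Describe the drive a GRUB Legacy stage1 sector would load its next stage from."""
    drive = sector[GRUB_STAGE1_BOOT_DRIVE]
    if drive == GRUB_NO_DRIVE:
        return "whichever drive the BIOS booted from"
    return f"BIOS drive 0x{drive:02x}"

# Synthetic sectors carrying only the byte of interest from the diff above.
sda_like = bytearray(512)
sda_like[GRUB_STAGE1_BOOT_DRIVE] = 0o377
sdb_like = bytearray(512)
sdb_like[GRUB_STAGE1_BOOT_DRIVE] = 0o201

print(stage1_boot_drive(sda_like))
print(stage1_boot_drive(sdb_like))
```

If this reading is right, a disk whose stage1 points at BIOS drive 0x81 would fail when it is the only disk present, because the BIOS would then present it as drive 0x80. That would be consistent with the Hard Disk Error following one physical disk.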

Still a puzzle.

Comment 3 Bob Gustafson 2009-06-28 04:49:16 UTC
Looking at
  http://en.wikipedia.org/wiki/Master_boot_record

it seems as though the information in the 0000660 line includes the partition table in sda, but not sdb.
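
To relate an od line label to the MBR layout described on that page, the octal label can be converted to a byte offset and looked up. This is a small Python sketch assuming the classic MBR layout; the function name is illustrative. Under that layout, the six differing columns on the 0000660 line (offsets 440 to 445) would fall in the disk signature and the two reserved bytes, while the partition table would start at offset 446, where both dumps show the same 200 001.

```python
def mbr_field(offset):
    """Name the classic MBR field that contains the given byte offset."""
    if not 0 <= offset < 512:
        raise ValueError("offset must be within the 512-byte sector")
    if offset < 440:
        return "boot code"
    if offset < 444:
        return "disk signature"
    if offset < 446:
        return "reserved"
    if offset < 510:
        return "partition table"
    return "boot signature"

# od prints 16 bytes per line and labels each line with an octal offset.
line_start = int("0000660", 8)
for column in range(16):
    print(line_start + column, mbr_field(line_start + column))
```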

At one point in this check of whether raid is working, I was able to cleanly boot from either sda or sdb with the other disk removed.

At the moment, this is not so. When I try to boot with only sdb, I again get
Hard Disk Error

I will try to get back to the spot where I can cleanly boot from either disk.

Comment 4 Bob Gustafson 2009-06-29 04:26:02 UTC
Rewriting the mbr of both /dev/sda and /dev/sdb did the trick.

Now it boots cleanly from either disk with the other removed. It acts like a RAID1 should act.
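
The comment does not say which commands were used for the rewrite. One commonly described approach for GRUB Legacy on RAID1, shown here purely as an assumption, is to map each disk to (hd0) in turn with the grub shell's device command before running root and setup, so that each MBR is written as if its disk were the first BIOS drive. The Python sketch below only generates such a command list; the function name is illustrative and the partition number is a placeholder.

```python
def grub_batch_script(disks, boot_partition=0):
    """Build GRUB Legacy shell commands that install stage1 on each disk as if it were (hd0)."""
    lines = []
    for disk in disks:
        lines.append(f"device (hd0) {disk}")          # treat this disk as the first BIOS drive
        lines.append(f"root (hd0,{boot_partition})")  # partition holding the /grub files
        lines.append("setup (hd0)")                   # write stage1 to this disk's MBR
    lines.append("quit")
    return "\n".join(lines) + "\n"

print(grub_batch_script(["/dev/sda", "/dev/sdb"]), end="")
```

The generated text is meant to be reviewed before being typed or fed into the grub shell; it is not a record of what was actually run on this system.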

Comment 5 Hans de Goede 2009-09-01 09:21:59 UTC
The grub writing code for mdraid has been rewritten in rawhide, and should handle this properly now.

