Bug 176179 (md_d0)
Summary: | software raid panic on reboot due to not mounting arrays | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Andy Burns <fedora> |
Component: | mkinitrd | Assignee: | Peter Jones <pjones> |
Status: | CLOSED RAWHIDE | QA Contact: | David Lawrence <dkl> |
Severity: | high | Priority: | medium |
Version: | rawhide | CC: | carl, clydekunkel7734, davej, jarodwilson, mishu, oliva, paul, tmus, wtogami |
Hardware: | i686 | OS: | Linux |
Fixed In Version: | 5.0.16-1 | Doc Type: | Bug Fix |
Last Closed: | 2006-01-03 21:14:33 UTC | | |
Description
Andy Burns
2005-12-19 20:44:22 UTC
Whoops, minor typo -- should have been:

```
sda1 + sdb1 -> 100MiB md0 (RAID1) -> ext3 /boot
sda2 + sdb2 -> 100GiB md1 (RAID1) -> ext3 /
sda3        -> 1GiB   swap0
sdb3        -> 1GiB   swap1
sda4 + sdb4 -> 300GiB md2 (RAID0) -> ext3 /home
```

*** This bug has been marked as a duplicate of 169059 ***

(I'm reopening this one so I can handle it separately, since it's still a relatively clean and readable bug report.)

OK, so I think what's happening here is that the kernel is picking it up as a raid device which is partitioned, rather than several partitions which form a raid device. As to why that's happening, I'm really not sure. Can you look at the initrd which is showing this, and do:

```
mkdir /tmp/initrd
cd /tmp/initrd
zcat /boot/initrd-$BAD.img | cpio -idv
```

and then attach the file /tmp/initrd/init to this bug? Also, can you show me /proc/partitions?

Notting was able to replicate this on one of his machines, and he says my guess is right -- raid autorun is misdetecting a normal raid as a partitioned raid, so the devices are being created with different names from what is expected. Reassigning to kernel.

*** Bug 169059 has been marked as a duplicate of this bug. ***

Hic! Just arrived back from the pub, so probably not in the best state to give meaningful bugzilla feedback, but once the haze clears in the morning I'll revisit this and give you chapter and verse ;-)

I run FC4 with all the latest upgrades. When I upgraded two software-raid-using servers from 2.6.14-1.1644_FC4smp to 2.6.14-1.1653_FC4smp, one of them failed to boot with filesystem problems ("EXT3-fs: unable to read superblock ... mount: error 22 mounting ext3 ... ERROR opening /dev/console!!!!: 2"). The other server boots fine. So my impression is that some change in 2.6.14-1.1653_FC4smp triggered this problem with mounting. Going back to 2.6.14-1.1644_FC4smp allows me to boot my server.

I did wonder whether it ought to even be looking for a partition table, rather than expecting a filesystem directly on the raid block device. Today's rawhide won't boot for me (panics at udev, but that's unrelated -- it happens without raid and is already bugzilla'ed), so I can't continue testing this problem yet ...

Rawhide has been non-installable for me for a while, but since kernel 1805/1806 it's usable again. I didn't see anything that indicated this might be fixed yet, but I also know that buildsys held back a few releases, so I thought I might have missed something, and I've tested this problem again. Existing disk contents were zeroed and empty partitions written with fdisk, then a fresh rawhide install was performed, using LVM above a combination of software raid0 and raid1 disks:

```
/dev/sda1 (100MB) + /dev/sdb1 (100MB) -> /dev/md0 (100MB) raid1 = /boot (100MB) ext3
/dev/sda2 (50GB)  + /dev/sdb2 (50GB)  -> /dev/md1 (50GB)  raid1 -> vg00
    -> lv00 = /     (10GB) ext3
     + lv01 = /var  (10GB) ext3
     + lv02 = /tmp  (10GB) ext3
     + 20GB free
/dev/sda3 (1GB) swap
/dev/sdb3 (1GB) swap
/dev/sda4 (180GB) + /dev/sdb4 (180GB) -> /dev/md2 (360GB) raid0 -> vg01
    -> lv03 = /home  (100GB) ext3
     + lv04 = /extra (200GB) ext3
     + 60GB free
```

Installation proceeded OK, but at reboot, when the arrays are assembled they are still being recognised as md_d0, md_d1 and md_d2; the PVs/VGs are not being found, and so none of the partitions are mounted. Comment #5 still applies. I hope now the festivities are over this can get some attention before FC5T2?
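For readers unfamiliar with the md_dN names in this thread: a partitionable md array gets a different device name than a plain one, which is why everything layered on top (LVM, fstab) stops resolving. The toy C sketch below illustrates only that naming split; the `md_name` helper is hypothetical, not kernel code, and assumes only what the thread itself establishes (plain arrays are mdN, partitionable ones md_dN).

```c
/* Toy illustration (not kernel source) of the device-name split
 * between plain and partitionable md arrays in 2.6-era kernels. */
#include <stdio.h>

/* partitioned != 0 mimics autorun deciding the array is
 * partitionable, which yields md_dN instead of mdN. */
static void md_name(int unit, int partitioned, char *buf, size_t len)
{
    snprintf(buf, len, partitioned ? "md_d%d" : "md%d", unit);
}

int main(void)
{
    char plain[16], part[16];
    md_name(1, 0, plain, sizeof(plain));
    md_name(1, 1, part, sizeof(part));
    /* LVM was scanning for /dev/md1, but autorun created /dev/md_d1 */
    printf("expected /dev/%s, got /dev/%s\n", plain, part);
    return 0;
}
```

The boot log that follows shows exactly this: autorun creates md_d0, md_d1 and md_d2, and the scan for volume group vg00 then fails.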
```
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq led clo pio slum part
ata1: SATA max UDMA/133 cmd 0xF8828100 ctl 0x0 bmdma 0x0 irq 66
ata2: SATA max UDMA/133 cmd 0xF8828180 ctl 0x0 bmdma 0x0 irq 66
ata3: SATA max UDMA/133 cmd 0xF8828200 ctl 0x0 bmdma 0x0 irq 66
ata4: SATA max UDMA/133 cmd 0xF8828280 ctl 0x0 bmdma 0x0 irq 66
ata1: dev 0 cfg 49:2f00 82:746b 83:7f01 84:4023 85:7469 86:3c01 87:4023 88:207f
ata1: dev 0 ATA-7, max UDMA/133, 488397168 sectors: LBA48
ata1: dev 0 configured for UDMA/133
scsi0 : ahci
ata2: dev 0 cfg 49:2f00 82:746b 83:7f01 84:4023 85:7469 86:3c01 87:4023 88:207f
ata2: dev 0 ATA-7, max UDMA/133, 488397168 sectors: LBA48
ata2: dev 0 configured for UDMA/133
scsi1 : ahci
ata3: no device found (phy stat 00000000)
scsi2 : ahci
ata4: no device found (phy stat 00000000)
scsi3 : ahci
  Vendor: ATA       Model: WDC WD2500KS-00M  Rev: 02.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sda: drive cache: write back
 sda: sda1 sda2 sda3 sda4
sd 0:0:0:0: Attached scsi disk sda
  Vendor: ATA       Model: WDC WD2500KS-00M  Rev: 02.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sdb: drive cache: write back
SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sdb: drive cache: write back
 sdb: sdb1 sdb2 sdb3 sdb4
sd 1:0:0:0: Attached scsi disk sdb
Loading raid1.ko module
md: raid1 personality registered as nr 3
Loading jbd.ko module
Loading ext3.ko module
Loading dm-mod.ko module
device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm-devel
Loading dm-mirror.ko module
Loading dm-zero.ko module
Loading dm-snapshot.ko module
Making device-mapper control node
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdb4 ...
md: adding sdb4 ...
md: sdb2 has different UUID to sdb4
md: sdb1 has different UUID to sdb4
md: adding sda4 ...
md: sda2 has different UUID to sdb4
md: sda1 has different UUID to sdb4
md: created md_d2
md: bind<sda4>
md: bind<sdb4>
md: running: <sdb4><sda4>
md: personality 2 is not loaded!
md: do_md_run() returned -22
md: md_d2 stopped.
md: unbind<sdb4>
md: export_rdev(sdb4)
md: unbind<sda4>
md: export_rdev(sda4)
md: considering sdb2 ...
md: adding sdb2 ...
md: sdb1 has different UUID to sdb2
md: adding sda2 ...
md: sda1 has different UUID to sdb2
md: created md_d1
md: bind<sda2>
md: bind<sdb2>
md: running: <sdb2><sda2>
raid1: raid set md_d1 active with 2 out of 2 mirrors
md: considering sdb1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md_d0
md: bind<sda1>
md: bind<sdb1>
md: running: <sdb1><sda1>
raid1: raid set md_d0 active with 2 out of 2 mirrors
md: ... autorun DONE.
Scanning logical volumes
Reading all physical volumes. This may take a while...
cdrom: open failed.
No volume groups found
Activating logical volumes
cdrom: open failed.
Unable to find volume group "vg00"
md_d1: unknown partition table
Trying to resume from LABEL=SWAP-sdb3
md_d0: unknown partition table
Label SWAP-sdb3 not found
Unable to access resume device (LABEL=SWAP-sdb3)
Creating root device.
Mounting root filesystem.
mount: error No such device or address mounting /dev/root on /sysroot as ext3
Setting up other filesystems.
```
**** Beware: I had to untangle the panic ****
****  from the last few lines of logging ****
****        on the serial console        ****

```
Kernel panic - not syncing: Attempted to kill init!
 [<c0122622>] panic+0x3e/0x16f
 [<c0124fe8>] do_exit+0x6e/0x374
 [<c01253a5>] sys_exit_group+0x0/0x
 [<c0103f19>] syscall_call+0x7/0xb
```

I did a rawhide install yesterday and got the same issues (raid0 hda1+hdb1). Booted with boot.iso (2006-01-02) and anaconda seems to find the partition (/dev/md0). Which kernel is used in boot.iso? Or is this problem not kernel related?

Rescue cd works for me too (I think I had to manually assemble and mount to avoid a crash, but that was on an older rawhide). I think there was suspicion on the init script or nash itself?

I've been experiencing this bug for a while -- or at least something giving similar symptoms -- but haven't had the time to look at it until now. I can confirm that with kernel 2.6.14-1.1805_FC5, mkinitrd-5.0.10 produces a bootable initrd, whereas 5.0.15 mis-identifies my raid partitions as /dev/md_d0 & /dev/md_d1 when they should be /dev/md0 & /dev/md1. The following patch to mkinitrd fixes this problem *for*me* -- it reverses a change which occurred sometime between -10 and -15. It isn't clear to me at this point in time whether the bug is in mkinitrd or the kernel. Haven't caught up with 1807, so I hope it hasn't stolen my thunder :-)

```
--- mkinitrd-5.0.15/nash/nash.c.orig	2005-12-19 19:22:59.000000000 +0000
+++ mkinitrd-5.0.15/nash/nash.c	2006-01-02 20:13:46.000000000 +0000
@@ -1059,7 +1059,7 @@
         return 1;
     }
 
-    if (ioctl(fd, RAID_AUTORUN)) {
+    if (ioctl(fd, RAID_AUTORUN, 0)) {
         eprintf("raidautorun: RAID_AUTORUN failed: %s\n", strerror(errno));
         close(fd);
         return 1;
```

Hmmm -- I'd say this is a straightforward mkinitrd bug. The RAID_AUTORUN ioctl expects an argument, and that argument is specifically whether it should set up partitioned MD devices. Not supplying the argument in nash has to be wrong, and the expected result exactly fits the observed symptoms. Supplying a fixed argument of 0 is probably wrong as well -- it should depend on whether mkinitrd finds partitioned RAID devices.

Good detective work :) I agree, this looks like a nash bug.

Building a new mkinitrd after applying the change in comment 14 does the trick for me too; a previously unbootable system now boots fine, at least with respect to the software RAID partitions.

Thanks for the patch; it's applied in 5.0.16-1, which I'll put at http://people.redhat.com/pjones/mkinitrd/ just as soon as it's done building. It'll be in rawhide tomorrow. I'm leaving it as 0, since we don't support partitioned raid devices at all in the installer.

Pleased to say that today's rawhide installs (and works) on my convoluted mixture of ext3, swap, raid0, raid1 & lvm partitions :-)

Confirmed fixed for me as well.

Fixed for me as well. Thanks.

I am experiencing this behaviour after upgrading my F9 system to F10. Any ideas?
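To make the patched ioctl discussed above concrete, here is a minimal standalone sketch of what nash's raidautorun command boils down to after the fix. It is an illustration under the 2.6-era md ioctl interface, not nash's actual source; the `/dev/md0` path is an assumed example (nash takes the device node from the init script inside the initrd).

```c
/* Minimal sketch of a post-patch raidautorun call, assuming the
 * 2.6-era md ioctl interface; not nash's actual source. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/major.h>      /* MD_MAJOR */
#include <linux/raid/md_u.h>  /* RAID_AUTORUN */

int main(void)
{
    /* Assumed device node; nash opens whatever path the initrd's
     * init script hands to its raidautorun command. */
    int fd = open("/dev/md0", O_RDWR);
    if (fd < 0) {
        fprintf(stderr, "open: %s\n", strerror(errno));
        return 1;
    }
    /* The third argument is the flag the comments above describe:
     * 0 asks the kernel to assemble plain /dev/mdN arrays, nonzero
     * partitionable /dev/md_dN ones.  ioctl() is variadic, so the
     * pre-5.0.16 call that omitted it passed whatever happened to
     * be in the argument slot -- hence plain arrays surfacing as
     * md_dN and the mounts failing. */
    if (ioctl(fd, RAID_AUTORUN, 0)) {
        fprintf(stderr, "RAID_AUTORUN failed: %s\n", strerror(errno));
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
```

Passing a literal 0 matches the decision recorded above: since the installer never creates partitioned raid devices, hard-coding the plain-array behaviour was judged acceptable for 5.0.16-1.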