Bug 176179 (md_d0)
Summary: | software raid panic on reboot due to not mounting arrays | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Andy Burns <fedora> |
Component: | mkinitrd | Assignee: | Peter Jones <pjones> |
Status: | CLOSED RAWHIDE | QA Contact: | David Lawrence <dkl> |
Severity: | high | Priority: | medium |
Version: | rawhide | CC: | carl, clydekunkel7734, davej, jarodwilson, mishu, oliva, paul, tmus, wtogami |
Hardware: | i686 | OS: | Linux |
Fixed In Version: | 5.0.16-1 | Doc Type: | Bug Fix |
Last Closed: | 2006-01-03 21:14:33 UTC | | |
Description
Andy Burns
2005-12-19 20:44:22 UTC
Whoops, minor typo -- should have been:

```
sda1 + sdb1 -> 100MiB md0 (RAID1) -> ext3 /boot
sda2 + sdb2 -> 100GiB md1 (RAID1) -> ext3 /
sda3        -> 1GiB   swap0
sdb3        -> 1GiB   swap1
sda4 + sdb4 -> 300GiB md2 (RAID0) -> ext3 /home
```

*** This bug has been marked as a duplicate of 169059 ***

(I'm reopening this one so I can handle it separately, since it's still a relatively clean and readable bug report.)

OK, so I think what's happening here is that the kernel is picking it up as a raid device which is partitioned, rather than several partitions which form a raid device. As to why that's happening, I'm really not sure. Can you look at the initrd which is showing this, and do:

```
mkdir /tmp/initrd
cd /tmp/initrd
zcat /boot/initrd-$BAD.img | cpio -idv
```

and then attach the file /tmp/initrd/init to this bug? Also, can you show me /proc/partitions?

Notting was able to replicate this on one of his machines, and he says my guess is right -- raid autorun is misdetecting a normal raid as a partitioned raid, so the devices are being created with different names from what is expected. Reassigning to kernel.

*** Bug 169059 has been marked as a duplicate of this bug. ***

Hic! Just arrived back from the pub, so probably not in the best state to give meaningful bugzilla feedback, but once the haze clears in the morning I'll revisit this and give you chapter and verse ;-)

I run FC4 with all the latest upgrades. When I upgraded two software-raid-using servers from 2.6.14-1.1644_FC4smp to 2.6.14-1.1653_FC4smp, one of them failed to boot with filesystem problems ("EXT3-fs: unable to read superblock ... mount: error 22 mounting ext3 ... ERROR opening /dev/console!!!!: 2"). The other server boots fine. So my impression is that some change in 2.6.14-1.1653_FC4smp triggered this problem with mounting. Going back to 2.6.14-1.1644_FC4smp allows me to boot my server.

I did wonder whether it ought to even be looking for a partition table, rather than expecting a filesystem directly on the raid block device. Today's rawhide won't boot for me (panics at udev, but that's unrelated -- it happens without raid and is already bugzilla'ed), so I can't continue testing this problem yet ...

Rawhide has been non-installable for me for a while, but since kernel 1805/1806 it's usable again. I didn't see anything that indicated this might be fixed yet, but I also know that buildsys held back a few releases, so I thought I might have missed something, and I've tested this problem again. Existing disk contents were zeroed and empty partitions written with fdisk, then a fresh rawhide install was performed, using LVM above a combination of software raid0 and raid1 disks:

```
/dev/sda1 (100MB) + /dev/sdb1 (100MB) -> /dev/md0 (100MB) raid1 = /boot (100MB) ext3
/dev/sda2 (50GB)  + /dev/sdb2 (50GB)  -> /dev/md1 (50GB)  raid1 -> vg00
    -> lv00 = /     (10GB) ext3
     + lv01 = /var  (10GB) ext3
     + lv02 = /tmp  (10GB) ext3
     + 20GB free
/dev/sda3 (1GB) swap
/dev/sdb3 (1GB) swap
/dev/sda4 (180GB) + /dev/sdb4 (180GB) -> /dev/md2 (360GB) raid0 -> vg01
    -> lv03 = /home  (100GB) ext3
     + lv04 = /extra (200GB) ext3
     + 60GB free
```

Installation proceeded OK, but at reboot, when the arrays are assembled they are still being recognised as md_d0, md_d1 and md_d2; the PVs/VGs are not being found, and so none of the partitions are mounted. Comment #5 still applies. I hope now the festivities are over this can get some attention before FC5T2?
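For readers unfamiliar with the md_dN names in this thread: a partitionable md array gets a different device name than a plain one, which is why everything layered on top (LVM, fstab) stops resolving. The toy C sketch below illustrates only that naming split; the `md_name` helper is hypothetical, not kernel code, and assumes only what the thread itself establishes (plain arrays are mdN, partitionable ones md_dN).

```c
/* Toy illustration (not kernel source) of the device-name split
 * between plain and partitionable md arrays in 2.6-era kernels. */
#include <stdio.h>

/* partitioned != 0 mimics autorun deciding the array is
 * partitionable, which yields md_dN instead of mdN. */
static void md_name(int unit, int partitioned, char *buf, size_t len)
{
    snprintf(buf, len, partitioned ? "md_d%d" : "md%d", unit);
}

int main(void)
{
    char plain[16], part[16];
    md_name(1, 0, plain, sizeof(plain));
    md_name(1, 1, part, sizeof(part));
    /* LVM was scanning for /dev/md1, but autorun created /dev/md_d1 */
    printf("expected /dev/%s, got /dev/%s\n", plain, part);
    return 0;
}
```

The boot log that follows shows exactly this: autorun creates md_d0, md_d1 and md_d2, and the scan for volume group vg00 then fails.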
```
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq led clo pio slum part
ata1: SATA max UDMA/133 cmd 0xF8828100 ctl 0x0 bmdma 0x0 irq 66
ata2: SATA max UDMA/133 cmd 0xF8828180 ctl 0x0 bmdma 0x0 irq 66
ata3: SATA max UDMA/133 cmd 0xF8828200 ctl 0x0 bmdma 0x0 irq 66
ata4: SATA max UDMA/133 cmd 0xF8828280 ctl 0x0 bmdma 0x0 irq 66
ata1: dev 0 cfg 49:2f00 82:746b 83:7f01 84:4023 85:7469 86:3c01 87:4023 88:207f
ata1: dev 0 ATA-7, max UDMA/133, 488397168 sectors: LBA48
ata1: dev 0 configured for UDMA/133
scsi0 : ahci
ata2: dev 0 cfg 49:2f00 82:746b 83:7f01 84:4023 85:7469 86:3c01 87:4023 88:207f
ata2: dev 0 ATA-7, max UDMA/133, 488397168 sectors: LBA48
ata2: dev 0 configured for UDMA/133
scsi1 : ahci
ata3: no device found (phy stat 00000000)
scsi2 : ahci
ata4: no device found (phy stat 00000000)
scsi3 : ahci
  Vendor: ATA       Model: WDC WD2500KS-00M  Rev: 02.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sda: drive cache: write back
 sda: sda1 sda2 sda3 sda4
sd 0:0:0:0: Attached scsi disk sda
  Vendor: ATA       Model: WDC WD2500KS-00M  Rev: 02.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sdb: drive cache: write back
SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
SCSI device sdb: drive cache: write back
 sdb: sdb1 sdb2 sdb3 sdb4
sd 1:0:0:0: Attached scsi disk sdb
Loading raid1.ko module
md: raid1 personality registered as nr 3
Loading jbd.ko module
Loading ext3.ko module
Loading dm-mod.ko module
device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm-devel
Loading dm-mirror.ko module
Loading dm-zero.ko module
Loading dm-snapshot.ko module
Making device-mapper control node
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdb4 ...
md: adding sdb4 ...
md: sdb2 has different UUID to sdb4
md: sdb1 has different UUID to sdb4
md: adding sda4 ...
md: sda2 has different UUID to sdb4
md: sda1 has different UUID to sdb4
md: created md_d2
md: bind<sda4>
md: bind<sdb4>
md: running: <sdb4><sda4>
md: personality 2 is not loaded!
md: do_md_run() returned -22
md: md_d2 stopped.
md: unbind<sdb4>
md: export_rdev(sdb4)
md: unbind<sda4>
md: export_rdev(sda4)
md: considering sdb2 ...
md: adding sdb2 ...
md: sdb1 has different UUID to sdb2
md: adding sda2 ...
md: sda1 has different UUID to sdb2
md: created md_d1
md: bind<sda2>
md: bind<sdb2>
md: running: <sdb2><sda2>
raid1: raid set md_d1 active with 2 out of 2 mirrors
md: considering sdb1 ...
md: adding sdb1 ...
md: adding sda1 ...
md: created md_d0
md: bind<sda1>
md: bind<sdb1>
md: running: <sdb1><sda1>
raid1: raid set md_d0 active with 2 out of 2 mirrors
md: ... autorun DONE.
Scanning logical volumes
Reading all physical volumes. This may take a while...
cdrom: open failed.
No volume groups found
Activating logical volumes
cdrom: open failed.
Unable to find volume group "vg00"
md_d1: unknown partition table
Trying to resume from LABEL=SWAP-sdb3
md_d0: unknown partition table
Label SWAP-sdb3 not found
Unable to access resume device (LABEL=SWAP-sdb3)
Creating root device.
Mounting root filesystem.
mount: error No such device or address mounting /dev/root on /sysroot as ext3
Setting up other filesystems.
```
**** Beware: I had to untangle the panic ****
****  from the last few lines of logging ****
****        on the serial console        ****

```
Kernel panic - not syncing: Attempted to kill init!
 [<c0122622>] panic+0x3e/0x16f
 [<c0124fe8>] do_exit+0x6e/0x374
 [<c01253a5>] sys_exit_group+0x0/0x
 [<c0103f19>] syscall_call+0x7/0xb
```

I did a rawhide install yesterday and got the same issues (raid0 hda1+hdb1). Booted with boot.iso (2006-01-02) and anaconda seems to find the partition (/dev/md0). Which kernel is used in boot.iso? Or is this problem not kernel related?

Rescue cd works for me too (I think I had to manually assemble and mount to avoid a crash, but that was on an older rawhide). I think there was suspicion on the init script or nash itself?

I've been experiencing this bug for a while -- or at least something giving similar symptoms -- but haven't had the time to look at it until now. I can confirm that with kernel 2.6.14-1.1805_FC5, mkinitrd-5.0.10 produces a bootable initrd, whereas 5.0.15 mis-identifies my raid partitions as /dev/md_d0 & /dev/md_d1 when they should be /dev/md0 & /dev/md1. The following patch to mkinitrd fixes this problem *for*me* -- it reverses a change which occurred sometime between -10 and -15. It isn't clear to me at this point in time whether the bug is in mkinitrd or the kernel. Haven't caught up with 1807, so I hope it hasn't stolen my thunder :-)

```
--- mkinitrd-5.0.15/nash/nash.c.orig	2005-12-19 19:22:59.000000000 +0000
+++ mkinitrd-5.0.15/nash/nash.c	2006-01-02 20:13:46.000000000 +0000
@@ -1059,7 +1059,7 @@
         return 1;
     }
 
-    if (ioctl(fd, RAID_AUTORUN)) {
+    if (ioctl(fd, RAID_AUTORUN, 0)) {
         eprintf("raidautorun: RAID_AUTORUN failed: %s\n", strerror(errno));
         close(fd);
         return 1;
```

Hmmm -- I'd say this is a straightforward mkinitrd bug. The RAID_AUTORUN ioctl expects an argument, and that argument is specifically whether it should set up partitioned MD devices. Not supplying the argument in nash has to be wrong, and the expected result exactly fits the observed symptoms. Supplying a fixed argument of 0 is probably wrong as well -- it should depend on whether mkinitrd finds partitioned RAID devices.

Good detective work :) I agree, this looks like a nash bug.

Building a new mkinitrd after applying the change in comment 14 does the trick for me too; a previously unbootable system now boots fine, at least with respect to the software RAID partitions.

Thanks for the patch; it's applied in 5.0.16-1, which I'll put at http://people.redhat.com/pjones/mkinitrd/ just as soon as it's done building. It'll be in rawhide tomorrow. I'm leaving it as 0, since we don't support partitioned raid devices at all in the installer.

Pleased to say that today's rawhide installs (and works) on my convoluted mixture of ext3, swap, raid0, raid1 & lvm partitions :-)

Confirmed fixed for me as well.

Fixed for me as well. Thanks.

I am experiencing this behaviour after upgrading my F9 system to F10. Any ideas?
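To make the patched ioctl discussed above concrete, here is a minimal standalone sketch of what nash's raidautorun command boils down to after the fix. It is an illustration under the 2.6-era md ioctl interface, not nash's actual source; the `/dev/md0` path is an assumed example (nash takes the device node from the init script inside the initrd).

```c
/* Minimal sketch of a post-patch raidautorun call, assuming the
 * 2.6-era md ioctl interface; not nash's actual source. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/major.h>      /* MD_MAJOR */
#include <linux/raid/md_u.h>  /* RAID_AUTORUN */

int main(void)
{
    /* Assumed device node; nash opens whatever path the initrd's
     * init script hands to its raidautorun command. */
    int fd = open("/dev/md0", O_RDWR);
    if (fd < 0) {
        fprintf(stderr, "open: %s\n", strerror(errno));
        return 1;
    }
    /* The third argument is the flag the comments above describe:
     * 0 asks the kernel to assemble plain /dev/mdN arrays, nonzero
     * partitionable /dev/md_dN ones.  ioctl() is variadic, so the
     * pre-5.0.16 call that omitted it passed whatever happened to
     * be in the argument slot -- hence plain arrays surfacing as
     * md_dN and the mounts failing. */
    if (ioctl(fd, RAID_AUTORUN, 0)) {
        fprintf(stderr, "RAID_AUTORUN failed: %s\n", strerror(errno));
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
```

Passing a literal 0 matches the decision recorded above: since the installer never creates partitioned raid devices, hard-coding the plain-array behaviour was judged acceptable for 5.0.16-1.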