744217 – kernel-2.6.40-4.fc15.x86_64 fails to boot due to failure to start MD RAID

Summary: kernel-2.6.40-4.fc15.x86_64 fails to boot due to failure to start MD RAID

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mdadm
Sub Component:
Version:	16
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Doug Ledford
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	AcceptedBlocker
Depends On:
Blocks:	F16Blocker, F16FinalBlocker

Reported:	2011-10-07 14:07 UTC by Doug Ledford
Modified:	2011-10-25 03:30 UTC
CC List:	14 users
Fixed In Version:	mdadm-3.2.2-12.fc16
Clone Of:	736387
Environment:
Last Closed:	2011-10-25 03:30:41 UTC
Type:	---
Embargoed:
Dependent Products:


Description Doug Ledford 2011-10-07 14:07:51 UTC

Cloning for f16

+++ This bug was initially created as a clone of Bug #736387 +++

+++ This bug was initially created as a clone of Bug #729205 +++


--- Additional comment from michael.wuersch on 2011-09-02 10:54:14 EDT ---

I have exactly the same problem, which occurred after updating from fc14 to fc15 and thus getting a new kernel. However, my kernel is 2.6.40.3-0.fc15.x86_64.

I followed the advice above and executed:

su -c 'yum update --enablerepo=updates-testing mdadm-3.2.2-9.fc15'

Then I rebuilt the initramfs image with:

sudo dracut initramfs-2.6.40.3-0.fc15.x86_64.img 2.6.40.3-0.fc15.x86_64 --force

Error persists after reboot.

--- Additional comment from michael.wuersch on 2011-09-02 11:05:02 EDT ---

Sorry, just noticed that the output of dmesg differs slightly:

dracut: Autoassembling MD Raid
dracut Warning: No root device "block:/dev/disk/by-uuid/812eb062-d765-4065-be34-4a2cf4160064" found

--- Additional comment from dledford on 2011-09-02 13:44:29 EDT ---



--- Additional comment from michael.wuersch on 2011-09-05 11:18:03 EDT ---

Thanks, Doug, for your time. Below is the output when I remove the rhgb and quiet options:

...
dracut: dracut-009-12.fc15
udev[164]: starting version 167
dracut: Starting plymouth daemon
pata_jmicron 0000:0500.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
scsi6: pata_jmicron
scsi7: pata_jmicron
ata7: PATA max UDMA/100 cmd 0xr400 ctl 0xec400 bdma 0xe480 irq 16
ata8: PATA max UDMA/100 cmd 0xr400 ctl 0xec880 bdma 0xe488 irq 16
firewire_ohci 0000:06:05.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
firewire_ohci: Added fw-ohci device 0000:06:05.0, OHCI v1.10, 4 IRQ +9 IT contexts, quirks 0x2
firewire_core: created device fw0 GUID 0030480000206d38, S400
dracut: Autoassembling MD Raid
dracut Warning: No root device "block:/dev/disk/by-uuid/812eb062-d756-4065-be34-4a2cf4160064" found


Dropping to debug shell.

sh: can't access tty; job control turned off
dracut:/#


Kernel 2.6.35.14-95.fc14.x86_64 boots perfectly with the same kernel parameters. Let me know, if I can provide any other helpful information.

Michael

--- Additional comment from michael.wuersch on 2011-09-07 05:57:35 EDT ---

Same problem with Kernel 2.6.40.4-5.fc15.x86_64.

Regards,

Michael

--- Additional comment from dledford on 2011-09-07 10:56:21 EDT ---

OK, this bug is getting overly confusing because we are getting different problems reported under the same bug.

First, Rodney, your original bug was this:
dracut: mdadm: Container /dev/md127 has been assembled with 2 drives
dracut: mdadm (IMSM): Unsupported attributes: 40000000
dracut: mdadm IMSM metadata load not allowed due to attribute incompatibility

In response to that specific bug (about the unsupported attributes) I built a new mdadm with a patch to fix the issue.  Your system still doesn't boot now, so the question is why.  You then posted these messages:
md: raid1 personality registered for level 1
bio: create slab <bio-1> at 1 [Not sure this is relevant, but it's here in the middle of the others.]
dracut: mdadm: array /dev/md126 now has 2 devices
dracut Warning: No root device "block:/dev/mapper/vg_hostname-lv_root" found
dracut Warning: LVM vg_host/lv_root not found
dracut Warning: LVM vg_host/lv_swap not found

The important thing to note here is that mdadm is no longer rejecting your array, and in fact it started your raid device.  Now, what's happening is that the lvm PV on top of your raid device isn't getting started.  Regardless of the fact that your system isn't up and running yet, the original bug in the bug report *has* been fixed and verified.  So, this bug is no longer appropriate for any other problem reports because the specific issue in this bug is resolved.

Of course, that doesn't get your system or any of the other posters' systems running, so we need to open new bug(s) for tracking the remaining issues.

I've not heard back from Charlweed on what his problem is.  Rodney, your new problem appears to be that the raid device is started, but the lvm PV on top of your raid device is not.  Michael, unless you edited lines out of your debug messages you posted, I can't see where your hard drives are being detected and can't see where the raid array is even attempting to start.  Dracut is starting md autoassembly, but it's not finding anything to assemble and so it does nothing.  So I'll clone this twice to track the two different issues.  This bug, however, is now verified and ready to be closed out when the package is pushed live.

--- Additional comment from michael.wuersch on 2011-09-07 13:46:55 EDT ---

Thanks for cloning the bug. I am not familiar with the internals of the early Linux boot process, so until now I was not aware that the bugs weren't related.

I did not edit any lines out after the first line (i.e., the line 'dracut: dracut-009-12.fc15'). Can I contribute anything else to help in resolving this issue?

Michael

--- Additional comment from dledford on 2011-09-08 20:45:39 EDT ---

In the other bug I cloned from this one a fact came up that might be relevant here.  Can you try grabbing the dracut package from your install media and downgrading your copy of dracut to what was shipped with f15, then rebuild the initramfs that fails to boot with the old dracut and try booting again?

--- Additional comment from michael.wuersch on 2011-09-09 02:51:17 EDT ---

I did not use any media but instead relied on PreUpgrade to get to fc15. But I will download an ISO quickly and try as advised.

--- Additional comment from michael.wuersch on 2011-09-09 03:32:06 EDT ---

No luck, so far.

I have checked the version of dracut on the DVD: dracut-009-10.fc15.noarch.rpm, whereas I had installed 009-12.fc15.

Since I can boot with 2.6.35.14-95.fc14.x86_64, I booted it and ran:

sudo yum downgrade dracut

Output:
...
Running Transaction
  Installing : dracut-009-10.fc15.noarch
  Cleanup    : dracut-009-12.fc15.noarch

Removed:
  dracut.noarch 0:009-12.fc15

Installed:
  dracut.noarch 0:009-10.fc15

Then I ran:

sudo dracut initramfs-2.6.40.4-5.fc15.x86_64.img 2.6.40.4-5.fc15.x86_64 --force

and did a reboot. Same error message as before.

Michael

--- Additional comment from dledford on 2011-09-09 12:02:12 EDT ---

For some reason, on your system, the hard drives are not being found.  Can you boot into the working kernel, then run dmesg and post the output of that into this bug please?

--- Additional comment from michael.wuersch on 2011-09-09 13:29:28 EDT ---

Created attachment 522371
dmesg output

I have attached the log.

--- Additional comment from dledford on 2011-09-09 14:24:36 EDT ---

OK, so when the machine boots up successfully, it is starting drives sda and sdb as an imsm raid array, so when you try to boot the new kernel, it drops you to a debug shell.  From that debug shell, I need you to do a few things.

First, verify that /dev/sda and /dev/sdb exist.  Next, if they exist, try to assemble them using mdadm via the following commands:

/sbin/mdadm -I /dev/sda
/sbin/mdadm -I /dev/sdb

If those commands work, then you should now have a new md device.  Try running this command on that new device:

/sbin/mdadm -I /dev/md<device_number>

If that gets you your raid array up and running, then the question becomes "Why isn't this happening automatically like it's supposed to?"

To try and answer that, make sure that the files /lib/udev/rules.d/64-md-raid.rules and /lib/udev/rules.d/65-md-incremental.rules exist.

Let me know what you find out.
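
The rules-file check above can be scripted so it is easy to repeat, either in the debug shell or against an unpacked initramfs tree. This is only a minimal sketch and not part of mdadm or dracut: the function name check_md_rules and its optional root-prefix argument are assumptions made for illustration; only the two rules file paths come from this report.

```shell
#!/bin/sh
# check_md_rules (hypothetical helper): report whether the two md udev rules
# files named in this report are present.
# $1 is an optional root prefix (assumption): leave it empty to check the
# running system, or pass the directory of an unpacked initramfs.
check_md_rules() {
    root="${1:-}"
    for rules in 64-md-raid.rules 65-md-incremental.rules; do
        if [ -e "$root/lib/udev/rules.d/$rules" ]; then
            echo "present: $rules"
        else
            echo "missing: $rules"
        fi
    done
}
```

Calling check_md_rules with no argument checks /lib/udev/rules.d on the system it runs on. A "missing" line only says the file is absent; it does not by itself explain why auto-assembly fails.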

--- Additional comment from pb on 2011-09-11 11:54:57 EDT ---

Perhaps my note https://bugzilla.redhat.com/show_bug.cgi?id=729205#c15 helps. In my case, downgrading to mdadm-3.1.5-2 and recreating the initramfs files results in a properly working newer kernel. An initramfs containing the mdadm binary from either mdadm-3.2.2-6 or 3.2.2-9 does not work in my case and results in a broken boot.

Any hints on how to debug the mdadm problem in the dracut shell?

--- Additional comment from michael.wuersch on 2011-09-12 03:49:00 EDT ---

I booted into dracut debug shell and entered:

/sbin/mdadm -I /dev/sda
/sbin/mdadm -I /dev/sdb

Output was:

mdadm: no RAID superblock on /dev/sda and /dev/sdb, respectively.

/lib/udev/rules.d/64-md-raid.rules does exist, whereas /lib/udev/rules.d/65-md-incremental.rules does not.

Here's the raid info from the "good" kernel:

---
[user ~]$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : imsm
     Raid Level : container
  Total Devices : 2

Working Devices : 2

  Member Arrays : /dev/md127

    Number   Major   Minor   RaidDevice

       0       8        0        -        /dev/sda
       1       8       16        -        /dev/sdb

---
sudo mdadm --detail /dev/md127
/dev/md127:
      Container : /dev/md0, member 0
     Raid Level : raid1
     Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
  Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 2

    Update Time : Mon Sep 12 09:16:20 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      active sync   /dev/sdb

---
cat /proc/mdstat
Personalities : [raid1] 
md127 : active raid1 sda[1] sdb[0]
      1953511424 blocks super external:/md0/0 [2/2] [UU]
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>

--- Additional comment from dledford on 2011-09-12 11:19:48 EDT ---

Peter, Michael: if you boot into an initramfs that does not work, then what do you get when you run mdadm -E /dev/sda?  Does it simply say there is no superblock at all, or does it say it finds one but it's invalid, and if it does say it's invalid, does it say why?

--- Additional comment from pb on 2011-09-12 15:25:47 EDT ---

mdadm -E /dev/sda shows proper output like
/dev/sda
 Magic: Intel Raid ISM Cfg. Sig.
 ...
 Attributes: All supported
 ...

[OS] (name of configured RAID1 set in BIOS)
 ...
 Migrate State: repair (because of all these failed boots...)
 ...


cat /proc/mdstat shows
 md127 : inactive sda[1] sdb[0]
          ... blocks super external:-md0/0

md0 : inactive sdb[1](S) sda[0](S)
       .. blocks super external: imsm


To me it looks like the new version of mdadm simply forgets to activate the RAID, while the old version does.

--- Additional comment from michael.wuersch on 2011-09-16 03:13:19 EDT ---

Sorry for the delay, here's the output of mdadm:

dracut:/# /sbin/mdadm -E /dev/sd?
/dev/sda:
	Magic : Intel Raid ISM Cfg Sig.
	Version : 1.1.00
	Orig Family : 0932e0b0
	Family : 0932e0b0
	Generation : 00261fa8
	Attributes : All supported
	UUID : ...:...:...
	Checksum : 045764af correct
	MPB Sectors : 1
	Disks : 2
	RAID Devices : 1
	
	Disk00 Serial : JK11A8B9JL8X5F
	State : active
	Id : 00000000
	Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

[System:]
	UUID : ...:...:...
	RAID LEVEL : 1
	Members : 2
	SLOTS : [UU]
	FAILED DISK : none
	This Slot : 0
	Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
	Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
	Sector Offset : 0
	Num Stripes : 15261808
	Chunk Size : 64 KiB
	Reserved : 0
	Migrate State : idle
	Map State : normal
	Dirty State : dirty

	Disk00 Serial : JK11A8B9JL8X5F
	State : active
	Id : 00000000
	Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

... (pretty much the same for /dev/sdb, as above)
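
When comparing mdadm -E output between a working and a failing boot, only a few of these fields have come up in this thread (Attributes, Migrate State, Map State, Dirty State). A minimal sketch for pulling out just those lines follows. The function name imsm_state_fields is an assumption for illustration, it relies on sed supporting -E, and sed may not be available inside every initramfs.

```shell
#!/bin/sh
# imsm_state_fields (hypothetical helper): read `mdadm -E` output on stdin and
# print only the state-related fields discussed in this report, normalized to
# "Name: value". All other lines are dropped.
imsm_state_fields() {
    sed -n -E 's/^[[:space:]]*(Attributes|Migrate State|Map State|Dirty State)[[:space:]]*:[[:space:]]*(.*)$/\1: \2/p'
}
```

Intended usage would be something like `/sbin/mdadm -E /dev/sda | imsm_state_fields`, run once on the working boot and once in the debug shell, so the two results can be compared side by side.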

--- Additional comment from dledford on 2011-09-16 14:01:17 EDT ---

Peter, you left out part of the contents of /proc/mdstat, what does the personality line read on a failed boot? (And I would like the same info from you Michael, aka the full contents of /proc/mdstat on a failed boot)

--- Additional comment from pb on 2011-09-16 14:18:54 EDT ---

Next notes:

1. always successful boot with old mdadm:
Personalities : [raid1]

2. I have now also booted successfully with the new mdadm from an array that the BIOS reports as "NORMAL". But here the resync starts immediately.

[    3.260537] md: md0 stopped.
[    3.263234] md: bind<sda>
[    3.263338] md: bind<sdb>
[    3.263490] dracut: mdadm: Container /dev/md0 has been assembled with 2 drives
[    3.272304] md: md127 stopped.
[    3.272514] md: bind<sdb>
[    3.272653] md: bind<sda>
[    3.273900] md: raid1 personality registered for level 1
[    3.274490] md/raid1:md127: not clean -- starting background reconstruction
^^^^ BIOS reported "NORMAL"!

[    3.274564] md/raid1:md127: active with 2 out of 2 mirrors
[    3.274643] md127: detected capacity change from 0 to 160038912000
[    3.282507] md: md127 switched to read-write mode.
[    3.282761] md: resync of RAID array md127
[    3.282790] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[    3.282826] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
[    3.282882] md: using 128k window, over a total of 156288132k.
[    3.292149] dracut: mdadm: Started /dev/md127 with 2 devices
[    3.401892]  md127: p1 p2 p3 p4 < p5 p6 p7 p8 >
[    3.724356] md: md1 stopped.
[    3.727722] md: bind<sdc1>
[    3.730487] md: bind<sdd1>
[    3.734248] md/raid1:md1: active with 2 out of 2 mirrors
[    3.736817] md1: detected capacity change from 0 to 160039174144
[    3.739379] dracut: mdadm: /dev/md1 has been started with 2 drives.
[    3.743099]  md1: unknown partition table

Note that I run two RAID1 arrays across four drives:
/dev/sd{a,b} is IMSM (dual boot with Windows)
/dev/sd{c,d} is a Linux only software RAID

3. Rebooting during this running resync results in the BIOS showing "VERIFY" (note that I think something like storing the current sync position is shown during shutdown). Booting with the new mdadm now results in a broken boot, where

Personalities : [raid1]

and md1 (the Linux software RAID) is active, while md127 is inactive

So, as others have already seen, if the IMSM RAID is in "VERIFY" mode, mdadm will not start the RAID.

--- Additional comment from michael.wuersch on 2011-09-20 05:18:15 EDT ---

cat /proc/mdstat does not list anything when dropped to the dracut debug shell, i.e.:

dracut:/# cat /proc/mdstat
Personalities :
unused devices: <none>

Output for the old kernel (the one which is able to boot) is:

Personalities : [raid1] 
md127 : active raid1 sda[1] sdb[0]
      1953511424 blocks super external:/md0/0 [2/2] [UU]
      [>....................]  resync =  0.0% (1727872/1953511556) finish=5236.1min speed=6212K/sec
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>
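
The two /proc/mdstat listings above differ in exactly the places that matter here: the Personalities line and the per-array state. A minimal POSIX shell sketch for extracting just those parts is below. The function name summarize_mdstat and its optional file argument are assumptions for illustration, and it assumes the mdstat layout shown in this report.

```shell
#!/bin/sh
# summarize_mdstat (hypothetical helper): print the registered personalities
# and the state (active/inactive) of each md array.
# $1 is an optional path to an mdstat-format file; defaults to /proc/mdstat.
summarize_mdstat() {
    mdstat="${1:-/proc/mdstat}"
    while read -r name sep state rest; do
        case "$name" in
            Personalities)
                # Empty list means no RAID personality has been registered.
                echo "personalities: ${state:-none}${rest:+ $rest}"
                ;;
            md[0-9]*)
                if [ "$sep" = ":" ]; then
                    echo "$name: $state"
                fi
                ;;
        esac
    done < "$mdstat"
}
```

Note that in the working listing above the IMSM container (md0) is shown as inactive while the member array (md127) is active, so an "inactive" line for the container is not by itself a sign of failure.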

--- Additional comment from dledford on 2011-09-20 15:54:44 EDT ---

OK, I've got enough info to try and reproduce it here.  I'll see if I can work up a fix to this.  It seems that the mdadm-3.2.2 binary is misinterpreting some of the bits in the imsm superblock so that it doesn't assemble arrays in VERIFY state and when the BIOS thinks an array is clean, mdadm thinks it is dirty and starts a rebuild.

--- Additional comment from pb on 2011-10-06 14:07:08 EDT ---

I ran additional tests because, even after downgrading to mdadm-3.1.5-2.fc15.i686, the rebuild starts on a clean array, which keeps my system very busy for 90 minutes after each reboot.

Cross-downgrading to mdadm-3.1.3-0.git20100804.3.fc14 from F14 and creating a new ramdisk finally solves the issue.

Please check all changes from 3.1.3 to 3.1.5/3.2.2.

Comment 1 Fedora Update System 2011-10-20 08:20:41 UTC

mdadm-3.2.2-11.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/mdadm-3.2.2-11.fc16

Comment 2 Peter Bieringer 2011-10-20 18:12:07 UTC

Using version 3.2.2-10.fc15 from koji, the behavior looks good again: after the second boot it resumes syncing from the last position (like the old version did), and the device is properly detected in verify state.

Comment 3 Fedora Update System 2011-10-20 22:14:54 UTC

Package mdadm-3.2.2-11.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing mdadm-3.2.2-11.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14682
then log in and leave karma (feedback).

Comment 4 Jes Sorensen 2011-10-22 08:27:25 UTC

Neil Brown (mdadm maintainer) spotted a bug in one of my fixes. I'll update
and push a fixed version.

Comment 5 Fedora Update System 2011-10-22 15:15:45 UTC

mdadm-3.2.2-12.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/mdadm-3.2.2-12.fc16

Comment 6 Fedora Update System 2011-10-22 17:06:33 UTC

Package mdadm-3.2.2-12.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing mdadm-3.2.2-12.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14767
then log in and leave karma (feedback).

Comment 7 Jes Sorensen 2011-10-24 08:32:13 UTC

Suggest blocker for F16.

This one is really painful for people who get bitten by it and it can
prevent booting and/or install.

Jes

Comment 8 Sandro Mathys 2011-10-24 08:52:51 UTC

Failed QA test criteria:

"This install verifies that installing on a BIOS RAID device works properly.", i.e. "System boots successfully recognizing filesystems created on the BIOS RAID device(s)", see
https://fedoraproject.org/wiki/QA:Testcase_Install_to_BIOS_RAID

Comment 9 Adam Williamson 2011-10-24 16:43:34 UTC

Discussed at 2011-10-24 blocker review meeting. As we understand this bug, it causes the system to fail to boot properly if a RAID array is degraded: as such, we accept it as a blocker under the intersection of criteria "The installer must be able to create and install to software, hardware or BIOS RAID-0, RAID-1 or RAID-5 partitions for anything except /boot" and "Following on from the previous criterion, after firstboot is completed and on subsequent boots, a system installed according to any of the above criteria (or the appropriate Beta or Final criteria, when applying this criterion to those releases) must boot to a working graphical environment without unintended user intervention. This includes correctly accessing any encrypted partitions when the correct passphrase is supplied" - a system with a degraded RAID array should boot and allow you to rebuild the array, not fail to boot.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 10 Fedora Update System 2011-10-25 03:30:41 UTC

mdadm-3.2.2-12.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.
