Bug 736387

Summary: kernel-2.6.40-4.fc15.x86_64 fails to boot due to failure to start MD RAID
Product: [Fedora] Fedora Reporter: Doug Ledford <dledford>
Component: mdadm    Assignee: Doug Ledford <dledford>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 15CC: agajania, agk, brian.broussard, bugzilla, cb20777, c.bradley, cjg9411, dev, dledford, harald, Jes.Sorensen, maciej.patelczyk, mbroz, michael.wuersch, msmsms10079, pb, rhbugzilla, serge, vezza
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: mdadm-3.2.2-15.fc15 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 729205
Cloned To: 744217 744219    Environment:
Last Closed: 2011-12-14 23:37:16 UTC
Attachments:
  dmesg output
  dmesg with rdshell rdinitdebug
  dmesg with rdshell rdinitdebug
  dmesg with rdshell rdbreak rdinitdebug quiet loglevel=9 log_buf_len=1M
  /run/initramfs/init.log from a failed boot
  Output from mdadm -E /dev/sda
  Output from mdadm -E /dev/sdb
  /etc/mdadm.conf from a failed boot

Description Doug Ledford 2011-09-07 15:04:07 UTC
+++ This bug was initially created as a clone of Bug #729205 +++


--- Additional comment from michael.wuersch on 2011-09-02 10:54:14 EDT ---

I have exactly the same problem, which occurred after updating from fc14 to fc15 and thus getting a new kernel. However, my kernel is 2.6.40.3-0.fc15.x86_64.

I followed the advice above and executed:

su -c 'yum update --enablerepo=updates-testing mdadm-3.2.2-9.fc15'

Then I rebuilt the initramfs image with:

sudo dracut initramfs-2.6.40.3-0.fc15.x86_64.img 2.6.40.3-0.fc15.x86_64 --force

Error persists after reboot.

--- Additional comment from michael.wuersch on 2011-09-02 11:05:02 EDT ---

Sorry, just noticed that the output of dmesg differs slightly:

dracut: Autoassembling MD Raid
dracut Warning: No root device "block:/dev/disk/by-uuid/812eb062-d765-4065-be34-4a2cf4160064" found

--- Additional comment from dledford on 2011-09-02 13:44:29 EDT ---



--- Additional comment from michael.wuersch on 2011-09-05 11:18:03 EDT ---

Thanks, Doug, for your time. Below is the output when I remove the rhgb and quiet options:

...
dracut: dracut-009-12.fc15
udev[164]: starting version 167
dracut: Starting plymouth daemon
pata_jmicron 0000:0500.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
scsi6: pata_jmicron
scsi7: pata_jmicron
ata7: PATA max UDMA/100 cmd 0xr400 ctl 0xec400 bdma 0xe480 irq 16
ata8: PATA max UDMA/100 cmd 0xr400 ctl 0xec880 bdma 0xe488 irq 16
firewire_ohci 0000:06:05.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
firewire_ohci: Added fw-ohci device 0000:06:05.0, OHCI v1.10, 4 IRQ +9 IT contexts, quirks 0x2
firewire_core: created device fw0 GUID 0030480000206d38, S400
dracut: Autoassembling MD Raid
dracut Warning: No root device "block:/dev/disk/by-uuid/812eb062-d765-4065-be34-4a2cf4160064" found


Dropping to debug shell.

sh: can't access tty; job control turned off
dracut:/#


Kernel 2.6.35.14-95.fc14.x86_64 boots perfectly with the same kernel parameters. Let me know if I can provide any other helpful information.

Michael

--- Additional comment from michael.wuersch on 2011-09-07 05:57:35 EDT ---

Same problem with Kernel 2.6.40.4-5.fc15.x86_64.

Regards,

Michael

--- Additional comment from dledford on 2011-09-07 10:56:21 EDT ---

OK, this bug is getting overly confusing because we are getting different problems reported under the same bug.

First, Rodney, your original bug was this:
dracut: mdadm: Container /dev/md127 has been assembled with 2 drives
dracut: mdadm (IMSM): Unsupported attributes: 40000000
dracut: mdadm IMSM metadata load not allowed due to attribute incompatibility

In response to that specific bug (about the unsupported attributes) I built a new mdadm with a patch to fix the issue.  Your system still doesn't boot now, so the question is why.  You then posted these messages:
md: raid1 personality registered for level 1
bio: create slab <bio-1> at 1 [Not sure this is relevant, but it's here in the
middle of the others.]
dracut: mdadm: array /dev/md126 now has 2 devices
dracut Warning: No root device "block:/dev/mapper/vg_hostname-lv_root" found
dracut Warning: LVM vg_host/lv_root not found
dracut Warning: LVM vg_host/lv_swap not found

The important thing to note here is that mdadm is no longer rejecting your array, and in fact it started your raid device.  Now, what's happening is that the lvm PV on top of your raid device isn't getting started.  Regardless of the fact that your system isn't up and running yet, the original bug in the bug report *has* been fixed and verified.  So, this bug is no longer appropriate for any other problem reports because the specific issue in this bug is resolved.

Of course, that doesn't get yours or any of the other posters' systems running, so we need to open new bugs to track the remaining issues.

I've not heard back from Charlweed on what his problem is.  Rodney, your new problem appears to be that the raid device is started, but the lvm PV on top of your raid device is not.  Michael, unless you edited lines out of the debug messages you posted, I can't see where your hard drives are being detected and can't see where the raid array is even attempting to start.  Dracut is starting md autoassembly, but it's not finding anything to assemble and so it does nothing.  So I'll clone this twice to track the two different issues.  This bug, however, is now verified and ready to be closed out when the package is pushed live.

Comment 1 Michael Würsch 2011-09-07 17:46:55 UTC
Thanks for cloning the bug - I am not familiar with the internals of the early linux boot process and therefore, up to now, I was not aware that the bugs weren't related.

I did not edit any lines out after the first line (i.e., the line 'dracut: dracut-009-12.fc15'). Can I contribute anything else to help in resolving this issue?

Michael

Comment 2 Doug Ledford 2011-09-09 00:45:39 UTC
In the other bug I cloned from this one a fact came up that might be relevant here.  Can you try grabbing the dracut package from your install media and downgrading your copy of dracut to what was shipped with f15, then rebuild the initramfs that fails to boot with the old dracut and try booting again?
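(For reference, a minimal sketch of the downgrade-and-rebuild procedure described above, assuming the F15 install media is mounted at /mnt/dvd and the failing kernel is 2.6.40.4-5.fc15.x86_64 - adjust the mount point and versions to your system:)

  # install the dracut package shipped on the install media over the newer one
  rpm -Uvh --oldpackage /mnt/dvd/Packages/dracut-009-10.fc15.noarch.rpm
  # rebuild the initramfs for the failing kernel using the downgraded dracut
  dracut -f /boot/initramfs-2.6.40.4-5.fc15.x86_64.img 2.6.40.4-5.fc15.x86_64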

Comment 3 Michael Würsch 2011-09-09 06:51:17 UTC
I did not use any media but instead relied on PreUpgrade to get to fc15. But I will download an ISO quickly and try as advised.

Comment 4 Michael Würsch 2011-09-09 07:32:06 UTC
No luck, so far.

I have checked the version of dracut on the DVD: dracut-009-10.fc15.noarch.rpm, whereas I had installed 009-12.fc15.

Since I can boot with 2.6.35.14-95.fc14.x86_64, I booted into it and ran:

sudo yum downgrade dracut

Output:
...
Running Transaction
  Installing : dracut-009-10.fc15.noarch                                                                                                                         
  Cleanup    : dracut-009-12.fc15.noarch

Removed:
  dracut.noarch 0:009-12.fc15

Installed:
  dracut.noarch 0:009-10.fc15

Then I ran:

sudo dracut initramfs-2.6.40.4-5.fc15.x86_64.img 2.6.40.4-5.fc15.x86_64 --force

and did a reboot. Same error message as before.

Michael

Comment 5 Doug Ledford 2011-09-09 16:02:12 UTC
For some reason, on your system, the hard drives are not being found.  Can you boot into the working kernel, then run dmesg and post the output of that into this bug please?

Comment 6 Michael Würsch 2011-09-09 17:29:28 UTC
Created attachment 522371 [details]
dmesg output

I have attached the log.

Comment 7 Doug Ledford 2011-09-09 18:24:36 UTC
OK, so when the machine boots up successfully, it is starting drives sda and sdb as an imsm raid array, so when you try to boot the new kernel, it drops you to a debug shell.  From that debug shell, I need you to do a few things.

First, verify that /dev/sda and /dev/sdb exist.  Next, if they exist, try to assemble them using mdadm via the following commands:

/sbin/mdadm -I /dev/sda
/sbin/mdadm -I /dev/sdb

If those commands work, then you should now have a new md device.  Try running this command on that new device:

/sbin/mdadm -I /dev/md<device_number>

If that gets you your raid array up and running, then the question becomes "Why isn't this happening automatically like it's supposed to?"

To try and answer that, make sure that the files /lib/udev/rules.d/64-md-raid.rules and /lib/udev/rules.d/65-md-incremental.rules exist.

Let me know what you find out.
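(A minimal sketch of the checks described above, as run from the dracut debug shell; the md device number shown, md127, is just an example and may differ on your system:)

  ls -l /dev/sda /dev/sdb      # confirm the disks were detected at all
  /sbin/mdadm -I /dev/sda      # incremental assembly of each member
  /sbin/mdadm -I /dev/sdb
  /sbin/mdadm -I /dev/md127    # then try to start the member array from the container
  cat /proc/mdstat             # check what actually came up
  ls /lib/udev/rules.d/64-md-raid.rules /lib/udev/rules.d/65-md-incremental.rules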

Comment 8 Peter Bieringer 2011-09-11 15:54:57 UTC
Perhaps my note https://bugzilla.redhat.com/show_bug.cgi?id=729205#c15 helps; at least in my case, downgrading to mdadm-3.1.5-2 and recreating the initramfs files results in a properly working newer kernel. An initramfs containing the mdadm binary from either mdadm-3.2.2-6 or 3.2.2-9 does not work in my case and results in a broken boot.

Any hints how to debug the mdadm problem in dracut shell?

Comment 9 Michael Würsch 2011-09-12 07:49:00 UTC
I booted into dracut debug shell and entered:

/sbin/mdadm -I /dev/sda
/sbin/mdadm -I /dev/sdb

Output was:

mdadm: no RAID superblock on /dev/sda and /dev/sdb, respectively.

/lib/udev/rules.d/64-md-raid.rules  does exist, whereas /lib/udev/rules.d/65-md-incremental.rules does not.

Here's the raid info from the "good" kernel:

---
[user ~]$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : imsm
     Raid Level : container
  Total Devices : 2

Working Devices : 2

  Member Arrays : /dev/md127

    Number   Major   Minor   RaidDevice

       0       8        0        -        /dev/sda
       1       8       16        -        /dev/sdb

---
sudo mdadm --detail /dev/md127
/dev/md127:
      Container : /dev/md0, member 0
     Raid Level : raid1
     Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
  Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 2

    Update Time : Mon Sep 12 09:16:20 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      active sync   /dev/sdb

---
cat /proc/mdstat
Personalities : [raid1] 
md127 : active raid1 sda[1] sdb[0]
      1953511424 blocks super external:/md0/0 [2/2] [UU]
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>

Comment 10 Doug Ledford 2011-09-12 15:19:48 UTC
Peter, Michael: if you boot into an initramfs that does not work, then what do you get when you run mdadm -E /dev/sda?  Does it simply say there is no superblock at all, or does it say it finds one but it's invalid, and if it does say it's invalid, does it say why?

Comment 11 Peter Bieringer 2011-09-12 19:25:47 UTC
mdadm -E /dev/sda shows proper output like
/dev/sda
 Magic: Intel Raid ISM Cfg. Sig.
 ...
 Attributes: All supported
 ...

[OS] (name of configured RAID1 set in BIOS)
 ...
 Migrate State: repair (because of all these failed boots...)
 ...


cat /proc/mdstat tells
 md127 : inactive sda[1] sdb[0]
          ... blocks super external:/md0/0

md0 : inactive sdb[1](S) sda[0](S)
       .. blocks super external: imsm


For me it looks like the new version of mdadm simply forgets to activate the RAID, while the old version does.

Comment 12 Michael Würsch 2011-09-16 07:13:19 UTC
Sorry for the delay, here's the output of mdadm:

dracut:/# /sbin/mdadm -E /dev/sd?
/dev/sda:
	Magic : Intel Raid ISM Cfg Sig.
	Version : 1.1.00
	Orig Family : 0932e0b0
	Family : 0932e0b0
	Generation : 00261fa8
	Attributes : All supported
	UUID : ...:...:...
	Checksum : 045764af correct
	MPB Sectors : 1
	Disks : 2
	RAID Devices : 1
	
	Disk00 Serial : JK11A8B9JL8X5F
	State : active
	Id : 00000000
	Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

[System:]
	UUID : ...:...:...
	RAID LEVEL : 1
	Members : 2
	SLOTS : [UU]
	FAILED DISK : none
	This Slot : 0
	Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
	Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
	Sector Offset : 0
	Num Stripes : 15261808
	Chunk Size : 64 KiB
	Reserved : 0
	Migrate State : idle
	Map State : normal
	Dirty State : dirty

	Disk00 Serial : JK11A8B9JL8X5F
	State : active
	Id : 00000000
	Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

... (pretty much the same for /dev/sdb, as above)

Comment 13 Doug Ledford 2011-09-16 18:01:17 UTC
Peter, you left out part of the contents of /proc/mdstat; what does the personality line read on a failed boot? (And I would like the same info from you, Michael, i.e. the full contents of /proc/mdstat on a failed boot.)

Comment 14 Peter Bieringer 2011-09-16 18:18:54 UTC
Next notes:

1. always successful boot with old mdadm:
Personalities : [raid1]

2. I did now successfully boot a "NORMAL" (BIOS) array, also with the new mdadm. But here the resync starts immediately.

[    3.260537] md: md0 stopped.
[    3.263234] md: bind<sda>
[    3.263338] md: bind<sdb>
[    3.263490] dracut: mdadm: Container /dev/md0 has been assembled with 2 drives
[    3.272304] md: md127 stopped.
[    3.272514] md: bind<sdb>
[    3.272653] md: bind<sda>
[    3.273900] md: raid1 personality registered for level 1
[    3.274490] md/raid1:md127: not clean -- starting background reconstruction
^^^^ BIOS told "NORMAL" !

[    3.274564] md/raid1:md127: active with 2 out of 2 mirrors
[    3.274643] md127: detected capacity change from 0 to 160038912000
[    3.282507] md: md127 switched to read-write mode.
[    3.282761] md: resync of RAID array md127
[    3.282790] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[    3.282826] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
[    3.282882] md: using 128k window, over a total of 156288132k.
[    3.292149] dracut: mdadm: Started /dev/md127 with 2 devices
[    3.401892]  md127: p1 p2 p3 p4 < p5 p6 p7 p8 >
[    3.724356] md: md1 stopped.
[    3.727722] md: bind<sdc1>
[    3.730487] md: bind<sdd1>
[    3.734248] md/raid1:md1: active with 2 out of 2 mirrors
[    3.736817] md1: detected capacity change from 0 to 160039174144
[    3.739379] dracut: mdadm: /dev/md1 has been started with 2 drives.
[    3.743099]  md1: unknown partition table

Just a note here: I run 2 RAID1 arrays across 4 drives:
/dev/sd{a,b} is IMSM (dual boot with Windows)
/dev/sd{c,d} is a Linux-only software RAID

3. Rebooting now, during this running resync, results in BIOS "VERIFY" (just note that I think something like storing the current sync position is shown during shutdown). Booting with the new mdadm now results in a broken boot, where

Personalities : [raid1]

and md1 (the Linux software RAID) is active, while md127 is inactive

So, as others have already seen, if the IMSM RAID is in "VERIFY" mode, mdadm will not start the RAID.

Comment 15 Michael Würsch 2011-09-20 09:18:15 UTC
cat /proc/mdstat does not list anything when dropped to the dracut debug shell, i.e.:

dracut:/# cat /proc/mdstat
Personalities :
unused devices: <none>

Output for the old kernel (the one which is able to boot) is:

Personalities : [raid1] 
md127 : active raid1 sda[1] sdb[0]
      1953511424 blocks super external:/md0/0 [2/2] [UU]
      [>....................]  resync =  0.0% (1727872/1953511556) finish=5236.1min speed=6212K/sec
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>

Comment 16 Doug Ledford 2011-09-20 19:54:44 UTC
OK, I've got enough info to try and reproduce it here.  I'll see if I can work up a fix to this.  It seems that the mdadm-3.2.2 binary is misinterpreting some of the bits in the imsm superblock, so that it doesn't assemble arrays in VERIFY state, and when the BIOS thinks an array is clean, mdadm thinks it is dirty and starts a rebuild.

Comment 17 Peter Bieringer 2011-10-06 18:07:08 UTC
I ran additional tests because, even after downgrading to mdadm-3.1.5-2.fc15.i686, the rebuild starts on a clean array, which keeps my system very busy for 90 minutes after each reboot.

Cross-downgrading to the F14 package mdadm-3.1.3-0.git20100804.3.fc14 and creating a new ramdisk finally solves the issue.

Please check all changes from 3.1.3 to 3.1.5/3.2.2.

Comment 18 Jes Sorensen 2011-10-07 15:19:34 UTC
I have been trying to bisect my way through 3.1.5 to 3.2.2 and not really
had much luck with it. I am running a setup where I have 2 drives in a raid1,
and one drive for the OS, so I do not depend on assembly during the initramfs
state.

I did notice that in some cases, if I ran mdadm -I manually, the raid would
suddenly come up and start syncing after 3-4 tries. In other cases it would
show up as inactive, in PENDING state.

I'm still catching up on this, so I'm not sure what causes a raid to end up
marked PENDING.

Could we have a race with a missing memory barrier or something?

I'll try and go back to 3.1.3 as well.

Cheers,
Jes

Comment 19 Doug Ledford 2011-10-07 15:46:28 UTC
Jes: for clarification's sake, when you say PENDING, do you mean something BIOS related, or do you mean the array is inactive and marked PENDING in the output of /proc/mdstat?

Comment 20 Jes Sorensen 2011-10-07 15:47:56 UTC
Doug: This is /proc/mdstat output. I am pretty sure the word was PENDING,
but I'll have to double check as I am not near the box showing the problems
right now.

Comment 21 Jes Sorensen 2011-10-09 11:28:53 UTC
Ok just to confirm, if I use 3.1.5 I sometimes get the md device into PENDING,
like this:

[root@mahomaho mdadm-nbrown]# cat /proc/mdstat 
Personalities : [raid1] 
md126 : active (read-only) raid1 sda[1] sdb[0]
      41943040 blocks super external:/md127/0 [2/2] [UU]
      	resync=PENDING
      
md127 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>

This is after I run 'mdadm -I /dev/sda ; mdadm -I /dev/sdb' for a raid1
md device.

Going back to 3.1.3 seems to make it start resync'ing reliably.

Jes

Comment 22 Jes Sorensen 2011-10-09 11:52:16 UTC
Hi,

More testing and some bad news .... I can reproduce this with 3.1.3 as well!
It is just a bit harder to reproduce with the older version. My guess is we
have a race condition somewhere, and something happened along the way that
altered the timing.

It is fairly easy to reproduce if you are looking for it. Sometimes I have to
try 30-40 times, but it shows up in the end, even with 3.1.3. My setup is
fairly simple: IMSM raid in raid1 mode over two drives (sd[ab]) and my OS
installed on sdc. Boot, then manually run this:

./mdadm -S /dev/md126 ; ./mdadm -S /dev/md127 ; cat /proc/mdstat
./mdadm -I /dev/sda ; ./mdadm -I /dev/sdb ; cat /proc/mdstat

Repeat until the array shows up in PENDING (like in the previous post) or
inactive. In some cases md126 isn't found, in other cases both show up as
inactive, and in some cases I get PENDING.

Jes

Comment 23 Charles Butterfield 2011-10-10 13:16:17 UTC
I was recently bitten by this, presumably after my F15 kernel (or mdadm) was routinely updated.

Here is my information in case there are any helpful clues:

I have the following on my system:
kernel: 2.6.40.6-0.fc15.x86_64
kernel: 2.6.40.4-5.fc15.x86_64
kernel: 2.6.40.3-0.fc15.x86_64 - using this one for now
mdadm: 3.2.2-9.fc15
IMSM mirror: boot on an MD partition, root and the rest on LVM on a second MD partition.

If I boot into any of the above kernels with the IMSM in "VERIFY" mode, the boot fails with a dracut error trying to access the files under LVM.

If I boot into the "4-5" kernel, with IMSM in "NORMAL" mode, the boot fails with dracut not finding the LVM volumes, and the IMSM gets set to "VERIFY".  What I do to recover is to boot from an F15 DVD, enter rescue mode, and wait a few hours for the MD raid to rebuild and get set to NORMAL again.

I have not yet tried to boot in "6-0" with the IMSM in "NORMAL" mode.

Questions:

1) Is this problem well enough understood that there is a fix somewhere?  I couldn't see anything in "testing"?

2) How do you guys get all that debug info exported from a system that fails to boot and drops into the very limited debug shell?  I've been using pencil and paper -- very laborious.

Comment 24 Jes Sorensen 2011-10-10 15:46:58 UTC
Charles,

Thanks for the data! The problem you are seeing with the 4-5 kernel may have
been fixed in Fedora 16, but I am not 100% sure it is safe to ask you to
update your dracut binary to this one:
https://koji.fedoraproject.org/koji/buildinfo?buildID=266766
Maybe Harald can comment on this.

With regard to understanding the problem: unfortunately no, it isn't
well enough understood yet to say what is causing this. I am going to do a
fresh Fedora 15 install and play with the two kernel versions you mention; it
could give us a hint.

Last, on how to copy data across: I find the simplest way is to use a USB
stick. Switch to the console with CTRL-ALT-F1, mount it, then copy /tmp/*log
to the USB stick.

Jes
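(A minimal sketch of the USB-stick approach described above, assuming the stick shows up as /dev/sde1 - check the tail of dmesg after plugging it in for the real device name:)

  mkdir -p /mnt/usb
  mount /dev/sde1 /mnt/usb
  cp /tmp/*log /mnt/usb/
  umount /mnt/usb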

Comment 25 Jes Sorensen 2011-10-10 15:48:00 UTC
Harald,

Can you comment on whether the mdraid changes you made to dracut are applicable
to the latest version of Fedora 15 as well, per the two previous comments?

Thanks,
Jes

Comment 26 Jes Sorensen 2011-10-10 19:55:18 UTC
Charles,

I am seeing it here too. I had a clean raid1, booted it into
kernel 2.6.40.6-0.fc15.x86_64, and it got marked dirty. Taking it
offline and re-adding it, it behaves as previously reported in
this bug.

I will try and roll back to kernel: 2.6.40.3-0.fc15.x86_64

Cheers,
Jes

Comment 27 Jes Sorensen 2011-10-10 20:16:49 UTC
Tried 2.6.40.3-0 and I still see the same - then rolled back to
2.6.38.6-26.rc1.fc15.x86_64, and there I also see the problem with the
array refusing to start syncing.....

Comment 28 Gerhard 2011-10-11 05:55:54 UTC
Hej guys,

I also have the same line-up with my workstation as Charles Butterfield has (and so the same problem).
@Jes: If I can do any dirty testing (I already saved my data to another disk), let me know. ;-)

Greetz,
   Gerhard

Comment 29 Maciej Patelczyk 2011-10-12 11:54:21 UTC
If after upgrading to mdadm 3.2.2 you see a message like this:

"First, Rodney, you're original bug was this:
dracut: mdadm: Container /dev/md127 has been assembled with 2 drives
dracut: mdadm (IMSM): Unsupported attributes: 40000000
dracut: mdadm IMSM metadata load not allowed due to attribute incompatibility"

which is in the first comment by Doug, then I suggest you try the following patch from Neil's repo:

commit id: 418f9b368a1200370695527d22aba8c3606172c5

    IMSM: allow some array attribute bits to be ignored.
    
    Some bits are not handled by mdadm, but their presence should not
    cause failure.
    In particular MPB_ATTRIB_NEVER_USE appears harmless.
    
    Reported-by: Thomas Steinborn <thestonewell>
    Signed-off-by: NeilBrown <neilb>

Doug could you try this?
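(A minimal sketch of test-building mdadm with that patch applied; the upstream repository URL and the tag name are assumptions, not taken from this report:)

  git clone git://neil.brown.name/mdadm mdadm-test   # assumed upstream location
  cd mdadm-test
  git checkout mdadm-3.2.2                           # assumed release tag
  git cherry-pick 418f9b368a1200370695527d22aba8c3606172c5
  make
  ./mdadm --version                                  # test the locally built binary; don't overwrite the packaged one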

Comment 30 Jes Sorensen 2011-10-13 08:23:53 UTC
Gerhard,

Just to be sure, in your case are you trying to boot off the raid1 device
or is it a secondary device in the system that doesn't get assembled correctly
at boot?

Thanks,
Jes

Comment 31 Gerhard 2011-10-13 09:52:49 UTC
Hej Jes,

Exactly. I try to boot from the raid1 device and after a while I get an error message and the dracut shell.

The only difference from Charles is that I don't use LVM - just four primary partitions.

Greetz,
   Gerhard

Comment 32 Fedora Update System 2011-10-20 08:12:11 UTC
mdadm-3.2.2-10.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/mdadm-3.2.2-10.fc15

Comment 33 Aram Agajanian 2011-10-21 16:01:26 UTC
*** Bug 727696 has been marked as a duplicate of this bug. ***

Comment 34 Jes Sorensen 2011-10-22 08:27:03 UTC
Neil Brown (mdadm maintainer) spotted a bug in one of my fixes. I'll update
and push a fixed version.

Comment 35 Fedora Update System 2011-10-22 08:29:32 UTC
Package mdadm-3.2.2-10.fc15:
* should fix your issue,
* was pushed to the Fedora 15 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing mdadm-3.2.2-10.fc15'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14760
then log in and leave karma (feedback).

Comment 36 Fedora Update System 2011-10-22 15:13:26 UTC
mdadm-3.2.2-12.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/mdadm-3.2.2-12.fc15

Comment 37 Peter Bieringer 2011-10-23 19:20:58 UTC
Hmm, this version still behaves strangely.
- updated to mentioned version
- initramfs rebuilt
- boot
- array is recognized (good)
- array in resync mode (good)
- reboot after 10 min (sync still ongoing)
- array still in resync mode (good)
- wait until array is 100% resync'ed
- reboot, BIOS shows "Normal"
- array starts resyncing again.

It also looks like there is a difference between normal reboots and an ALT-SYSRQ boot (needed sometimes because the shutdown hangs in the "unmount" state) when the array is in "Normal" state: ALT-SYSRQ B keeps "Normal", while a normal reboot triggers a resync.

Comment 38 Charles Butterfield 2011-10-23 20:30:51 UTC
mdadm-3.2.2-10.fc15 works well enough to allow me to use my system.  Last night I installed mdadm-3.2.2-10.fc15, and recreated the initramfs for all 3 of my most recent kernels.  I asked how to rebuild the initramfs and got the following guidance, which I am sharing:

To rebuild various initramfs, do the following:

cd /boot
dracut -f initramfs-<kernel version>.img <kernel version> # for each version

where <kernel version> = 2.6.40.3-0.fc15.x86_64, etc

Lastly, ignore the warnings about missing modules, they seem to be benign (at least in my case).

Thanks guys!

Comment 39 Jes Sorensen 2011-10-24 08:16:47 UTC
Thanks for the feedback!

Note that mdadm-3.2.2-12 is out, fixing a bug in the previous version.
However if -10 works for you, not a problem, the bug is not malicious.

Peter, I have seen the reboot issue occasionally. I don't think it is
related to this particular problem, but rather an issue with raids not
being shut down correctly at reboot. If you could file a separate BZ on
that issue, that would be good.

Cheers,
Jes

Comment 40 Michael Würsch 2011-10-24 09:44:49 UTC
I did:

sudo yum update --enablerepo=updates-testing mdadm-3.2.2-10.fc15
sudo dracut initramfs-2.6.40.4-5.fc15.x86_64.img 2.6.40.4-5.fc15.x86_64 --force

and rebooted, selecting the corresponding kernel. Still fails to boot and instead drops into the dracut debug shell.

Cheers,

Michael

Comment 41 Jes Sorensen 2011-10-24 09:53:45 UTC
Michael,

Please grab mdadm-3.2.2-12.fc15

Then as root run 'dracut -f "" 2.6.40.4-5.fc15.x86_64'

The symptoms you are seeing sound very much like it is picking up
the old initramfs image.

Jes

Comment 42 Michael Würsch 2011-10-24 10:05:18 UTC
Thanks, Jes and sorry for bothering you - I did not see the latest comment. I will check the new version and report back as soon as it is available (it does not seem to have arrived in the updates-testing repo yet).

Michael

Comment 43 Michael Würsch 2011-10-24 10:44:29 UTC
No luck, so far:

I grabbed the rpm in the meantime from
http://kojipkgs.fedoraproject.org/packages/mdadm/3.2.2/12.fc15/x86_64/mdadm-3.2.2-12.fc15.x86_64.rpm.

I have also used yum update to get the latest kernel version
(2.6.40.6-0.fc15.x86_64).

Installed the mdadm update, rebuilt the initramfs for the latest kernel (but
4-5 does not work either), and rebooted with:

sudo yum install /home/wuersch/mdadm-3.2.2-12.fc15.x86_64.rpm
sudo dracut initramfs-2.6.40.6-0.fc15.x86_64.img 2.6.40.6-0.fc15.x86_64 --force
sudo shutdown -r now

Still, I am getting to the dracut debug shell with dmesg showing:

dracut: Autoassembling MD Raid
dracut Warning: No root device
"block:/dev/disk/by-uuid/812eb062-d765-4065-be34-4a2cf4160064"

(as mentioned a couple of weeks ago, I can still boot with
2.6.35.14-95.fc14.x86_64)

Let me know if I can provide any additional information to sort this out.

Michael

Comment 44 Jes Sorensen 2011-10-24 11:04:59 UTC
Michael,

Very odd - could you try and grab a copy of /init.log and a snapshot of the
screen when it goes wrong? You should be able to mount a usb stick from the
dracut shell.

Thanks,
Jes

Comment 45 Michael Würsch 2011-10-24 12:02:17 UTC
Created attachment 529846 [details]
dmesg with rdshell rdinitdebug

Comment 46 Michael Würsch 2011-10-24 12:02:40 UTC
There's no /init.log (I have removed rhgb quiet from the kernel command line and added rdshell rdinitdebug instead). See attachment above.

Comment 47 Jes Sorensen 2011-10-24 12:10:31 UTC
The init.log might be in a different directory.

However your dmesg output seems to be a cycle of messages about a Fedora 15
disc in the DVD drive filling the log. Could you try booting without this 
disc in the drive?

Thanks,
Jes

Comment 48 Michael Würsch 2011-10-24 12:32:34 UTC
Created attachment 529854 [details]
dmesg with rdshell rdinitdebug

Removed the DVD as requested.

--Michael

Comment 49 Jes Sorensen 2011-10-24 13:09:00 UTC
Still mostly messages from the DVD drive - please add "log_buf_len=1M"

Comment 50 Jes Sorensen 2011-10-24 13:09:50 UTC
Ok input from Harald, please try with these parameters added:

rdshell rdbreak rdinitdebug quiet loglevel=9 log_buf_len=1M

Comment 51 Jes Sorensen 2011-10-24 13:10:19 UTC
Btw. init.log should be either /init.log or /run/initramfs/init.log

Comment 52 Michael Würsch 2011-10-24 14:05:16 UTC
Created attachment 529876 [details]
dmesg with rdshell rdbreak rdinitdebug quiet loglevel=9 log_buf_len=1M

Here's the dmesg output again. Do you still need init.log in addition?

Comment 53 Harald Hoyer 2011-10-24 14:27:23 UTC
[    4.509309] dracut: + /sbin/mdadm -As --auto=yes --run

Comment 54 Doug Ledford 2011-10-24 15:26:26 UTC
Michael, on a failed boot can you please grab the contents of /etc/mdadm.conf from the rdshell.  Also, mdadm -E /dev/sda and mdadm -E /dev/sdb would be helpful.  Finally, you can try removing any instances of rd_MD_UUID from the grub command line and see if it boots successfully that way.
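(A minimal sketch of capturing the requested information from the rdshell to a USB stick, assuming the stick appears as /dev/sde1 and is already formatted:)

  mkdir -p /mnt/usb
  mount /dev/sde1 /mnt/usb
  cat /etc/mdadm.conf  > /mnt/usb/mdadm.conf.txt
  mdadm -E /dev/sda    > /mnt/usb/mdadm-E-sda.txt 2>&1
  mdadm -E /dev/sdb    > /mnt/usb/mdadm-E-sdb.txt 2>&1
  umount /mnt/usb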

Comment 55 Jes Sorensen 2011-10-25 07:25:52 UTC
Michael,

In addition, once you get dropped into the rdshell, are there any references
on the screen at that point about 'reshape' and the lack of a data file?

Thanks,
Jes

Comment 56 Michael Würsch 2011-10-25 09:23:20 UTC
Created attachment 530036 [details]
/run/initramfs/init.log from a failed boot

Comment 57 Michael Würsch 2011-10-25 09:24:07 UTC
Created attachment 530037 [details]
Output from mdadm -E /dev/sda

Comment 58 Michael Würsch 2011-10-25 09:24:37 UTC
Created attachment 530038 [details]
Output from mdadm -E /dev/sdb

Comment 59 Michael Würsch 2011-10-25 09:25:27 UTC
Created attachment 530039 [details]
/etc/mdadm.conf from a failed boot

Comment 60 Michael Würsch 2011-10-25 09:28:05 UTC
Jes,

I see neither 'reshape' nor any data file mentioned on the screen. The only output (except for the stuff also in the dmesg output) is:

Dropping to debug shell.

sh: can't access tty; job control turned off
dracut:/#

Thanks for investigating,

Michael

Comment 61 Michael Würsch 2011-10-25 09:49:52 UTC
Doug,

Removing rd_MD_UUID from the kernel parameters via grub still brings me to the debug shell.

Michael

Comment 62 Fedora Update System 2011-11-10 09:56:51 UTC
mdadm-3.2.2-14.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/mdadm-3.2.2-14.fc15

Comment 63 brian.broussard 2011-11-15 20:00:25 UTC
This does not fix the issue ....

Issue a sudo shutdown -r 0.

Fresh build with 2.6.40.6-0, then yum-updated to mdadm-3.2.2-14.

It worked fine with root running shutdown -r 0 and shutdown -h 0, but when a user that has sudo rights to /sbin/shutdown rebooted, the system failed with:

sh: can't access tty; job control turned off
dracut:/#

Target is a Dell Optiplex 990 with Intel Matrix RAID.

Comment 64 Fedora Update System 2011-11-23 10:49:50 UTC
mdadm-3.2.2-15.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/mdadm-3.2.2-15.fc15

Comment 65 brian.broussard 2011-11-23 15:21:30 UTC
Fresh build, yum update, and added mdadm-3.2.2-15.fc15.
Note: took a minimal build from the DVD, then yum update -y.

as root reboot (OK)
make user
login as user
su 
reboot (failed)

sh: can't access tty; job control turned off
dracut:/#

Target systems are Dell 960 & 990 with Intel Matrix RAID.

Comment 66 Jes Sorensen 2011-11-23 15:38:45 UTC
same as with -14, right?

Comment 67 brian.broussard 2011-11-23 16:06:06 UTC
Yes.

Note the reboot was issued with "su", not "su -", so I am not sure if it is an environment issue, but when the system comes back up the RAID states it is in Verify (not Normal or Initial)...

Question: should mdadm be able to recover from this? Is there a way to get away from 0.90 metadata? Note that in F16 I have not been able to reproduce this issue; I just have a number of other issues, mainly with third-party components.
  

(In reply to comment #66)
> same as with -14, right?

Comment 68 Jes Sorensen 2011-11-23 16:23:26 UTC
Brian,

This is still the issue where it goes wrong when a user reboots, but not
when root issues the reboot? If that is the case, I would still expect
it to be related to the environment when the command is issued.

However this BZ is about problems with IMSM RAIDs, which is different
metadata than 0.90, so I don't quite understand how the two are intersecting.

Jes

Comment 69 brian.broussard 2011-11-23 16:55:16 UTC
Thanks; I was not seeing the link, I just saw a lot of comments on the metadata and wanted to make sure I did not miss something. I have looked over the source code and do not understand all the differences between the F15 2.6.x and the F16 3.1.x kernels, or how their associated files respond differently to the same hardware. We are looking at porting our robotic solution to F16 and just moving forward. In your opinion, why does this not happen in the latest F16? If F16 will not see this issue because of some design/logic differences, then we will just move forward.

These IMSM concerns are at the forefront of my mind today, as all my field systems have them; most are FC11 and running great, and for the last few months we have been shipping FC15 with kernel 2.6.38..., also doing fine, until a yum update lost the machine on the next reboot.

Also note: with mdadm-3.2.2-13 on three identical machines running FC15 (2.6.38 & 2.6.40) and FC16 (3.1.1), I only see the issue with the FC15 2.6.40 kernel and its associated files.

I do not like not understanding the WHY, but I need to move on with a working solution, so I will let you know if I see this in the F16 solution.

thanks
brian
  

(In reply to comment #68)
> Brian,
> 
> This is still the issue where it goes wrong when a user reboots, but not
> when root issues the reboot? If that is the case, I would still expect
> it to be related to the environment when the command is issued.
> 
> However this BZ is about problems with IMSM RAIDs, which is different
> metadata than 0.90, so I don't quite understand how the two are intersecting.
> 
> Jes

Comment 70 Jes Sorensen 2011-11-24 12:09:51 UTC
Brian,

There are some differences in how dracut assembles the raid between F15 and
F16 - I suspect this is where it goes wrong. The mdadm packages should be
pretty much identical between the two Fedora releases.

If F16 works for you, I'd recommend going down that path.

Jes

Comment 71 brian.broussard 2011-11-24 15:16:23 UTC
thanks
(In reply to comment #70)
> Brian,
> 
> There are some differences in how dracut assembles the raid between F15 and
> F16 - I suspect this is where it goes wrong. The mdadm packages should be
> pretty much identical between the two Fedora releases.
> 
> If F16 works for you, I'd recommend going down that path.
> 
> Jes

Comment 72 Doug Ledford 2011-12-02 15:23:11 UTC
(In reply to comment #70)
> Brian,
> 
> There are some differences in how dracut assembles the raid between F15 and
> F16 - I suspect this is where it goes wrong. The mdadm packages should be
> pretty much identical between the two Fedora releases.
> 
> If F16 works for you, I'd recommend going down that path.
> 
> Jes

Just to elaborate on this a little bit: Fedora is moving to a different initramfs scheme in which the initramfs brings the raid devices up, then switches to the real root and execs systemd as init. When you shut down, systemd kills off everything that was started on the real root filesystem but leaves things started from the initrd alive, switches the initrd root back to being the system root, and then tears everything down in the initrd in the reverse order that the initrd started it up. This is a complex set of operations that they don't have finished yet, and it won't appear in f15, but they have started to lay the groundwork in the dracut package, IIUTC.

Now, if you install the latest dracut and the latest kernel on your system, and the bug is actually in dracut and not the kernel, it will appear to be a kernel bug because only new kernels are affected. In fact it is a dracut bug: dracut is used to build initramfs images, and those images are not rebuilt just because dracut is updated, so a new dracut only shows its bug on new kernel installs. The kernel version can therefore be a big red herring when it comes to mdadm/raid bootup issues. The real culprit in many of these cases is the initramfs image (either because dracut made a bad one, or there is a bad mdadm binary in it, or bad udev rules files in it, etc.). So I wouldn't be so sure that your problem is related to the 2.6.40 kernel on f15 is my point ;-)  I'd probably be looking more closely at one or both of systemd and dracut on that failing f15 box.
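(Following up on the point that the initramfs, not the kernel, is often the real variable: a minimal sketch of checking what went into a given initramfs image with dracut's lsinitrd tool; the image name below is an example:)

  # package versions currently installed
  rpm -q dracut mdadm systemd
  # list the image contents and confirm the mdadm binary and md udev rules made it in
  lsinitrd /boot/initramfs-2.6.40.6-0.fc15.x86_64.img | grep -E 'mdadm|md-raid|md-incremental'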

Comment 73 Fedora Update System 2011-12-14 23:37:16 UTC
mdadm-3.2.2-15.fc15 has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.