Bug 751667

Summary: dracut attempts to assemble mdraid before all drives are initialized
Product: Fedora
Component: dracut
Version: 15
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: H. Peter Anvin <hpa>
Assignee: Harald Hoyer <harald>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: agk, dan.j.williams, dledford, harald, Jes.Sorensen, jonathan, mbroz
Fixed In Version: dracut-013-19.fc16
Doc Type: Bug Fix
Last Closed: 2011-11-23 23:28:50 UTC
Attachments:
- Boot messages (after manual restart of md1)
- Screenshot - dropping to dracut shell
- Screenshot - /proc/mdstat
- Screenshot - manual restart of /dev/md1
- dmesg, as requested

Description H. Peter Anvin 2011-11-07 05:07:07 UTC
Description of problem:
On one machine I have, half the drives are on ahci and the other half on mvsas.  During initialization, dracut attempts to assemble the boot array (/dev/md1) when only the ahci drives are available, resulting in failure.  Assembling it manually from the dracut error console works, after which boot proceeds.

Version-Release number of selected component (if applicable):
dracut-009-12.fc15.noarch
kernel-2.6.40.6-0.fc15.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Set up a machine as described above
2. Install Fedora 15 (may require slipstreamed boot media as the mvsas driver on the F15 install medium doesn't work)
3. Try to boot
  
Actual results:


Expected results:


Additional info:

Comment 1 H. Peter Anvin 2011-11-07 05:07:58 UTC
Created attachment 531962 [details]
Boot messages (after manual restart of md1)

Comment 2 H. Peter Anvin 2011-11-07 05:13:04 UTC
Created attachment 531963 [details]
Screenshot - dropping to dracut shell

Comment 3 H. Peter Anvin 2011-11-07 05:13:26 UTC
Created attachment 531964 [details]
Screenshot - /proc/mdstat

Comment 4 H. Peter Anvin 2011-11-07 05:13:50 UTC
Created attachment 531965 [details]
Screenshot - manual restart of /dev/md1

Comment 5 Harald Hoyer 2011-11-07 09:06:22 UTC
Can you try dracut from Fedora 16? I changed the mdadm assemble strategy there.

http://koji.fedoraproject.org/koji/buildinfo?buildID=271877

# rpm -Uvh http://kojipkgs.fedoraproject.org/packages/dracut/013/18.fc16/noarch/dracut-013-18.fc16.noarch.rpm
# dracut -f
# reboot

Comment 6 H. Peter Anvin 2011-11-09 00:59:13 UTC
Sorry for the delay... this is a machine on which I have to schedule reboots, so I will try it as soon as I can get a reboot window.

Comment 7 H. Peter Anvin 2011-11-09 16:59:51 UTC
Bad news... this change actually made it worse, not better.  Instead of failing to assemble only the root device, /dev/md1, it now partially assembles and then fails on all four md devices, requiring all of them to be torn down and re-assembled manually from the dracut shell.

Comment 8 H. Peter Anvin 2011-11-09 17:04:20 UTC
Furthermore, after this change systemd doesn't boot all the way to the login prompt anymore; logging in via ssh shows the following in ps:

 2762 ?        Ss     0:00 /bin/plymouth --wait
 2763 ?        Ss     0:00 /bin/plymouth quit

Comment 9 Harald Hoyer 2011-11-10 12:41:27 UTC
Please add "rd.debug log_buf_len=1M" to the kernel command line and attach dmesg.

Comment 10 H. Peter Anvin 2011-11-12 23:41:52 UTC
Created attachment 533306 [details]
dmesg, as requested

Comment 11 Harald Hoyer 2011-11-15 08:40:26 UTC
[    0.000000] Command line: ro root=UUID=28d969db-6776-497f-8bea-f967fd464a6e vga=0x317 selinux=off SYSFONT=latarcyrheb-sun16 LANG=en_US.utf8 KEYTABLE=us nomodeset rd.debug log_buf_len=1M

So, you did not specify rd.md.uuid=<md raid uuid>, which means dracut tries to assemble _every_ raid device it sees.

$ man dracut.kernel
...
       rd.md.uuid=<md raid uuid>
           only activate the raid sets with the given UUID. This parameter can be specified multiple times.
...

Because no rd.md.uuid exists on the kernel command line and /etc/mdadm.conf exists (it was copied into the initramfs), dracut is calling:

# mdadm -As --auto=yes

several times, but mdadm fails to add the newly appearing devices to the array, which, in my humble opinion, it should do.
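
For reference, restricting assembly to specific raid sets would look something like this (the UUID below is illustrative, not taken from this machine).  First read the array's UUID:

# mdadm --detail /dev/md1 | grep UUID
           UUID : 84849595:98959990:90949596:92959190

then append one rd.md.uuid=<uuid> per wanted array to the kernel command line:

rd.md.uuid=84849595:98959990:90949596:92959190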

Comment 12 Jes Sorensen 2011-11-15 09:19:01 UTC
Peter,

What is your RAID config? I presume you are using standard mdadm RAID
with recent metadata? RAID1, RAID5, or?

Once the RAID is assembled, could you post the output of mdadm --detail?

I have a box here with an hpt controller in it; I might be able to set up
a RAID that spans both controllers (if I can convince the hpt not to export
the devices in AHCI mode).

Cheers,
Jes

Comment 13 Jes Sorensen 2011-11-15 09:25:35 UTC
Peter,

One more note, I fixed a race in the assembly code in mdadm-3.2.2-10.
Could you verify that you have at least -10 or later on the system as
well?

Thanks,
Jes

Comment 14 Doug Ledford 2011-11-15 17:06:15 UTC
(In reply to comment #11)
> Because no rd.md.uuid exists on the kernel command line and /etc/mdadm.conf
> exists (was copied in the initramfs), dracut is calling:
> 
> # mdadm -As --auto=yes
> 
> several times, but mdadm fails to add the new (appearing) devices to the array,
> which, in my humble opinion it should do.

Humble opinion aside, if you are calling mdadm -A and expecting it to add devices to an already existing array, then your code is broken.

Assemble does one thing and one thing only: takes a list of currently free devices and tries to make runnable arrays out of them.  It does not touch already assembled (or partially assembled) arrays, and it does not touch component devices that are already claimed in some way.

It would take a major rearchitecting of assemble mode to support adding drives to existing arrays, which is why, when we wanted to support that, we wrote a new mode: incremental.  If you want mdadm to support adding newly found devices to already created and partially populated arrays, then use incremental support.  Anything else is a bug.
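
To illustrate the difference (device names here are made up): in incremental mode each member is offered to mdadm individually, typically from a udev rule, and the array is run once enough members have appeared, roughly:

# mdadm -I /dev/sda2    # offer one member; the array is created but not yet started
# mdadm -I /dev/sdb2    # each further member is added as udev sees it appear
# mdadm -IRs            # finally, force-run arrays that are startable but still incomplete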

Comment 15 H. Peter Anvin 2011-11-16 04:43:13 UTC
The system has a mix of RAID1 (/boot) and RAID6 (the others); the metadata is 0.90 because these drives were pulled from an older system as-is.

And no, I haven't specified rd.md.uuid, although with the earlier dracut it would successfully assemble all arrays *except* /dev/md1 (/).

Comment 16 H. Peter Anvin 2011-11-16 04:56:51 UTC
mdadm is: mdadm-3.2.2-9.fc15.x86_64

There is no -10 in the F15 repos.  I guess I could try upgrading to 16.

Comment 17 H. Peter Anvin 2011-11-16 06:44:39 UTC
Well, I upgraded to Fedora 16, and it made absolutely no difference, including systemd never giving me a shell prompt (which has been the case since getting the fc16 dracut as requested in #5).  Note that it never actually shows any kind of graphical display.

ps still shows:

 2112 ?        Ss     0:00 /bin/plymouth --wait
 2115 ?        Ss     0:00 /bin/plymouth quit

Comment 18 Jes Sorensen 2011-11-16 09:46:46 UTC
(In reply to comment #15)
> The system has a mix of RAID1 (/boot) and RAID6 (the others); the metadata is
> 0.90 because these drives were pulled from an older system as-is.
> 
> And no, I haven't specified rd.md.uuid, although with the earlier dracut it
> would successfully assemble all arrays *except* /dev/md1 (/).

v0.90? Uh oh, then you really want to be careful - you will want
mdadm-3.2.2-14 (should be the latest), as there is a bug in the old
version where you can destroy your raids if you try to upgrade to
new drives > 2TB and then grow the raid beyond 2TB. For F15 the new
version will limit you to 2TB per drive, and it should do 4TB per
drive in F16 - that said, I would recommend migrating to newer
metadata at some point when you can.

I was trying to reproduce your problem using my hpt622, but I wasn't
able to get it to run in non-AHCI mode, so I will try with a sil
controller as soon as it arrives.

Cheers,
Jes

Comment 19 Doug Ledford 2011-11-16 17:47:47 UTC
HPA, the underlying problem appears to be that, for whatever reason, modprobe scsi_wait_scan is broken and does not wait until all scsi scans are complete.  If dracut were doing assembly incrementally, instead of relying on scsi_wait_scan to tell it that it can now run mdadm -As, then you would be OK: the incremental assembly would happen eventually and the boot could continue after that.  To test that idea, can you add rd_MD_UUID= lines to the boot command line in grub and see if dracut's incremental mode works any better than its assemble mode?

Harald, maybe the proper thing to do here would be to always use incremental assembly, and if the root device isn't available within the 180-second (or so) timeout, then drop to the rdshell.

In any case, due to dracut's reliance on scsi_wait_scan, which appears broken, combined with its use of mdadm -As, which is *not* tolerant of a broken scsi_wait_scan and can *not* be used incrementally, we aren't booting.
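
Concretely, that test means appending entries like these to the kernel command line in grub, one per array to activate (UUIDs illustrative, as reported by mdadm --detail):

rd_MD_UUID=84849595:98959990:90949596:92959190 rd_MD_UUID=86849390:95929995:90919692:95969991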

Comment 20 Fedora Update System 2011-11-17 10:36:15 UTC
dracut-013-19.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/dracut-013-19.fc16

Comment 21 Harald Hoyer 2011-11-17 12:32:06 UTC
please retry with dracut-013-19.fc16

Comment 22 Fedora Update System 2011-11-19 06:00:14 UTC
Package dracut-013-19.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing dracut-013-19.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-16098/dracut-013-19.fc16
then log in and leave karma (feedback).

Comment 23 H. Peter Anvin 2011-11-20 03:35:29 UTC
Tested.  Worked marvellously.

Comment 24 H. Peter Anvin 2011-11-22 01:47:55 UTC
I was thinking some more about this change, and something that really concerns me (if it is done wrong) is that it could probabilistically start the system with one or more arrays in degraded mode, even though all the drives are there.  Even if those drives are added later, they would by then be stale and require an entire resynchronization cycle, during which the array is not redundant.

Comment 25 Doug Ledford 2011-11-22 05:36:05 UTC
What change would seem to probabilistically start the system with one or more arrays in degraded mode?  You have to be more specific with your comments.

Comment 26 H. Peter Anvin 2011-11-22 19:08:12 UTC
My understanding of the dracut change (in dracut-013-19.fc16) was that instead of relying on scsi_wait_scan it would run mdadm incrementally until the array can be started.  This doesn't mean the array is complete, however, and arguably there isn't any way to know if the array is ever going to complete.

Consider the case of a RAID6: the array is startable with N-2 drives, but it isn't complete until N drives.  Do you start it at N-2?  Do you wait for N (what if a drive is missing?)  Do you wait for N but time out after some time T if you have at least N-2 drives available?

Comment 27 H. Peter Anvin 2011-11-22 19:15:31 UTC
For the record: I have verified that the system can boot even with one drive physically removed.
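
(A quick way to check whether anything actually came up degraded - the output below is illustrative, for a 6-drive RAID6 with one member missing:

# cat /proc/mdstat
md1 : active raid6 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1]
      1953519616 blocks level 6, 64k chunk, algorithm 2 [6/5] [_UUUUU]

Here [6/5] means only 5 of the 6 members are active; a complete array would show [6/6] [UUUUUU].)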

Comment 28 Doug Ledford 2011-11-23 08:10:39 UTC
mdadm arrays now start as soon as possible in a state called auto-read-only.  This state will not dirty the array unless the filesystem above initiates a write.  In this way, we keep the array clean until the last possible minute, and if no final drives have shown up by the time the filesystem finally starts issuing writes, then we go ahead and switch to read-write mode and treat the array as degraded.

As a practical matter, since udev processes events sequentially during boot up, we generally have all of our devices before the filesystem ever writes to the device (this is because the queued-up device events generally come before the queued-up filesystem-available udev event, although this behavior is not guaranteed).

However, in the event that the device goes live before all devices are present, and you want to minimize resync time, I suggest you add a bitmap to the device, as that will limit resyncs to just those sections of the drives that were dirtied prior to the drive being re-added.  This can reduce resync times from days on huge arrays to just minutes.  It does, however, come at the cost of some small overhead and latency on write requests.
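
Adding (and, if the write overhead turns out to matter, later removing) a write-intent bitmap is a one-liner on a running array, for example:

# mdadm --grow --bitmap=internal /dev/md1    # add an internal write-intent bitmap
# mdadm --grow --bitmap=none /dev/md1        # remove it again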

Comment 29 Fedora Update System 2011-11-23 23:28:50 UTC
dracut-013-19.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 30 H. Peter Anvin 2012-02-06 22:42:49 UTC
This fix as currently implemented is NOT SAFE.

I just had an array failure because of this -- dracut brought up the array with 3 drives (insufficient for the array to operate, as this is a RAID-6 with 6 drives) and now the kernel refuses to load the rest of the drives as their serial numbers don't match.

mdadm --assemble --force seems to work, and I'm hoping for minimal loss of actual content, but in effect this "fix" has promoted a boot failure into a data loss event.
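
(For the record, the recovery sequence was essentially the following - device names illustrative:

# mdadm --stop /dev/md1
# mdadm --assemble --force /dev/md1 /dev/sd[a-f]1

where --force makes mdadm assemble the array despite the mismatched event counts, updating the stale superblocks in the process.)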

Comment 31 Doug Ledford 2012-02-06 23:07:02 UTC
There should be no data loss, but there also shouldn't have been an event counter update.  Can you elaborate on dracut bringing the raid array up with three devices, please?  What happened afterwards: did the machine attempt to continue booting up, was it power cycled, and how did you end up getting to the point where the drives no longer saw each other as in sync?  If you could relay the entire sequence of events, starting from the last time the machine was shut down until you issued the --assemble --force, that would be most helpful.

Comment 32 H. Peter Anvin 2012-02-06 23:14:28 UTC
This is what I *know* of the sequence:

- The machine went down for a kernel update on Feb 4.
- The machine never came back on.  I incorrectly guessed that this was due to an SELinux relabel.
- I didn't have a console on the machine, so I reset it after about 24 hours.
- When it didn't come back online after several hours, I attached a monitor, and found that it was sitting at the dracut shell, with three drives pulled into /dev/md1 and /dev/md3.  Oddly enough /dev/md2 (on the same drives) was correctly assembled.