Bug 1203049 - systemd cannot fsck a software RAID array on boot due to MDADM/UDEV locking and dev creation
Keywords:
Status: CLOSED DUPLICATE of bug 912735
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 21
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-03-18 02:02 UTC by Sebastian Weigand
Modified: 2015-03-20 00:26 UTC
30 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 912735
Environment:
Last Closed: 2015-03-20 00:26:51 UTC
Type: Bug
Embargoed:



Description Sebastian Weigand 2015-03-18 02:02:18 UTC
Hi folks!

Not much to add here, except that this bug seems to persist in Fedora 21. Essentially, if one installs Fedora 21 onto a system which uses software RAID (in my case IMSM), the next boot fails after entries are created in /etc/fstab, as the device is in use before fsck runs.

I'd really love to get this RAID stuff working, as everyone seems to be having issues with it. Ubuntu won't assemble the array due to fakeraid confusion, and Arch has this identical bug. I'm hoping the wonderful Fedora / Red Hat team will come through!

Cheers,

-Sebastian Weigand

+++ This bug was initially created as a clone of Bug #912735 +++

--- Additional comment from Tony Marchese on 2013-02-19 09:26:41 EST ---

I am now running:

mdadm-3.2.6-14.fc18.x86_64
dracut-024-25.git20130205.fc18.x86_64
kernel-3.7.8-202.fc18.x86_64

I overlooked that mdadm-3.2.6-14.fc18.x86_64 and dracut-024-25.git20130205.fc18.x86_64 were only available through the updates-testing repo. After installing them I rebooted several times, running dracut -f and issuing the command mdmon --all --takeover --offroot in the pre-mount shell invoked via the boot parameter rd.break=pre-mount.

Here is my fstab:


# /etc/fstab
# Created by anaconda on Tue Feb 12 19:33:07 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=7afe1956-b93a-4bdf-bb5b-83f0ae011a83 /                       ext4    defaults        1 1
UUID=db1a19a4-8db3-4d12-be7c-bcc28e3ce471 /boot                   ext4    defaults        1 2
UUID=3393924f-fdd0-4150-8e23-55f8fd679f1e swap                    swap    defaults        0 0
UUID=7f6f71c8-784e-4ec7-bdc0-11a48b6fa9e7 /home 		  ext4	  defaults,nofail 0 2

# mdadm -D /dev/md126
/dev/md126:
      Container : /dev/md/imsm0, member 0
     Raid Level : raid1
     Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
  Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 2

          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0


           UUID : 6862ff21:72014ea5:67fa1e10:f1d2a26b
    Number   Major   Minor   RaidDevice State
       1       8       16        0      active sync   /dev/sdb
       0       8       32        1      active sync   /dev/sdc

# mdadm -D /dev/md127
/dev/md127:
        Version : imsm
     Raid Level : container
  Total Devices : 2

Working Devices : 2


           UUID : bd9c2866:4fbc5b7b:3ba8e429:d291d6d7
  Member Arrays : /dev/md/Volume0_0

    Number   Major   Minor   RaidDevice

       0       8       16        -        /dev/sdb
       1       8       32        -        /dev/sdc


The behaviour is actually that the system boots (the nofail in fstab helps), but the RAID-1 volume is not mounted. Below is an extract from my journalctl -xb output:

...skipping...
feb 19 15:00:50 tonyhome kernel: md/raid1:md126: active with 2 out of 2 mirrors
feb 19 15:00:50 tonyhome kernel: md126: detected capacity change from 0 to 2000395698176
feb 19 15:00:50 tonyhome kernel:  md126: unknown partition table
feb 19 15:00:50 tonyhome kernel: asix 2-5.3:1.0 eth0: register 'asix' at usb-0000:00:1d.7-5.3, ASIX AX88772 USB 2.0 Ethernet, 00:50:b6:54:89:0c
feb 19 15:00:50 tonyhome kernel: usbcore: registered new interface driver asix
feb 19 15:00:50 tonyhome kernel: Adding 14336916k swap on /dev/sda3.  Priority:-1 extents:1 across:14336916k SS
feb 19 15:00:50 tonyhome systemd-fsck[593]: /dev/sda1: clean, 375/128016 files, 165808/512000 blocks
feb 19 15:00:50 tonyhome systemd-fsck[599]: /dev/md126 is in use.
feb 19 15:00:50 tonyhome systemd-fsck[599]: e2fsck: Cannot continue, aborting.
feb 19 15:00:50 tonyhome systemd-fsck[599]: fsck failed with error code 8.
feb 19 15:00:50 tonyhome systemd-fsck[599]: Ignoring error.
feb 19 15:00:50 tonyhome mount[606]: mount: /dev/md126 is already mounted or /home busy
feb 19 15:00:50 tonyhome kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
feb 19 15:00:50 tonyhome kernel: SELinux: initialized (dev sda1, type ext4), uses xattr
feb 19 15:00:50 tonyhome kernel: md: export_rdev(sdc)
feb 19 15:00:50 tonyhome kernel: md: export_rdev(sdb)
feb 19 15:00:50 tonyhome kernel: md: md126 switched to read-write mode.
feb 19 15:00:51 tonyhome kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:03:00.1/sound/card2/input17
feb 19 15:00:51 tonyhome kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:03:00.1/sound/card2/input18
feb 19 15:00:51 tonyhome kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:03:00.1/sound/card2/input19
feb 19 15:00:51 tonyhome kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:03:00.1/sound/card2/input20
feb 19 15:00:51 tonyhome fedora-storage-init[627]: Setting up Logical Volume Management:   No volume groups found
feb 19 15:00:51 tonyhome fedora-storage-init[627]: [  OK  ]
feb 19 15:00:51 tonyhome fedora-storage-init[635]: Setting up Logical Volume Management:   No volume groups found
feb 19 15:00:51 tonyhome fedora-storage-init[635]: [  OK  ]
feb 19 15:00:51 tonyhome lvm[642]: No volume groups found
feb 19 15:00:51 tonyhome auditd[645]: Started dispatcher: /sbin/audispd pid: 648
...skipping...

Afterwards I can log in to the system as root and manually run mount -a, which mounts the RAID-1 volume on /home:

Feb 19 15:01:38 tonyhome kernel: [   55.531726] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: (null)

From there on the system works normally until the next reboot...

I don't know whether this issue is still related to this bug or is about something else.
Thank you for analyzing!

--- Additional comment from Doug Ledford on 2013-02-19 09:32:39 EST ---

Tony, since your problem is occurring with the latest software, and given that your problem (now) is not the same as the one in this bug report, I'm cloning just your last comment into a new bug.

--- Additional comment from Doug Ledford on 2013-02-19 09:50:33 EST ---

The basic problem here, according to the logs, is that mdadm is creating a new device as a result of a udev event on a different device, and during that creation process mdadm holds an exclusive open on the new device's device file.  Systemd (or udev, however you want to look at it), being the speedy little daemon that it is, does not wait for mdadm to complete the creation process, attempts to open the device before mdadm has released its exclusive open, fails, and the system then does not mount the RAID device.  Of course, less than a second later mdadm finishes, the exclusive open is released, and so when the user attempts to mount things manually, it all just works.

Or I guess the problem could be systemd and not udev.  If the newly created device is not having fsck run on it as a result of a udev rule, but instead systemd is picking up the existence of the newly created device directly and immediately going to work on it, then it is systemd that would need to be made aware of the fact that the device is not yet ready for use.  So, not sure where this belongs, I'm just sure it's a race condition on the newly created device.

--- Additional comment from Michal Schmidt on 2013-02-19 11:08:38 EST ---

systemd ships udev rules (in /lib/udev/rules.d/99-systemd.rules) that are meant to delay the moment when systemd sees the device as ready:

# Ignore raid devices that are not yet assembled and started
SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="md*", TEST!="md/array_state", ENV{SYSTEMD_READY}="0"
SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="md*", ATTR{md/array_state}=="|clear|inactive", ENV{SYSTEMD_READY}="0"


Is there any other way that full readiness of an md array can be detected from udev rules?

Or would it be possible for mdadm to release the exclusive open before causing the final change of the array_state attribute?

Or would it be possible for the kernel to flip the attribute only after the process that holds the exclusive open closes it?
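For illustration, the readiness test that those udev rules encode can be expressed in plain C. This is only a sketch, not systemd code: `array_ready` is a hypothetical helper, and a real implementation would read the array's actual sysfs attribute (e.g. /sys/block/md126/md/array_state) rather than an arbitrary path.

```c
#include <stdio.h>
#include <string.h>

/* Mirror the two rules above: report "not ready" when the
 * md/array_state attribute is absent (TEST!="md/array_state"),
 * empty, "clear" or "inactive"; otherwise report "ready". */
static int array_ready(const char *state_path)
{
    char state[64] = "";
    FILE *f = fopen(state_path, "r");

    if (!f)
        return 0;                 /* attribute missing: not yet assembled */
    if (fgets(state, sizeof(state), f))
        state[strcspn(state, "\n")] = '\0';
    fclose(f);

    /* "|clear|inactive" in the rule matches empty, "clear", "inactive" */
    if (state[0] == '\0' ||
        strcmp(state, "clear") == 0 ||
        strcmp(state, "inactive") == 0)
        return 0;

    return 1;                     /* e.g. "clean", "active", "readonly" */
}
```

As the thread goes on to explain, this tells you the array has been started, but says nothing about whether some process still holds an exclusive open on it.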

--- Additional comment from Doug Ledford on 2013-02-19 11:56:21 EST ---

These rules conflate two separate things.  The moment the array becomes ready is one question; whether a process holds an exclusive open on the array is another.  They are orthogonal.  It may be mdadm that has the exclusive open, but running fsck on the array also holds it open exclusively, as does mounting the array.

So, is there another way to tell if the array is fully ready?  No, this test is good.  It just isn't testing the right thing in this case.

Would it be possible for mdadm to release the lock early?  No, not without creating new race conditions (multiple mdadm instances spawned by udev for multiple constituent devices would cause us to race on which device actually triggers the array start as well as a few other things).

Would it be possible to flip the attributes on close of the device file?  Maybe, would need upstream buy in to do that.

It might be easier to modify mdadm so that, any time we do incremental assembly, we open the md device file and create a temporary lock file named after the md device as the kernel sees it (if we are creating /dev/md/home, it is still /dev/md127 to the kernel, so create /dev/md127.lock, or maybe /dev/md127.lock.$PID).  Only after we hold the lock file do we manipulate and start the array; when we are done, we first close the md device file, then close/rm the lock file.  You then add a test to your udev rule above that spins for as long as a $DEVNAME.lock* file is present.  This wouldn't require kernel changes, so it might be easier to get upstream buy-in than the kernel modification.  Myself, though, I think the kernel modification to flip the ready status on close would be a better solution.
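To make the lock-file idea concrete, here is a rough sketch of the handshake. All names here (`take_lock`, `drop_lock`, `wait_unlocked`) are hypothetical, the lock lives at an ordinary scratch path rather than under /dev, and real mdadm/udev code would of course look different:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* "mdadm" side: advertise that assembly is in progress.
 * O_CREAT|O_EXCL makes creation atomic, so if several mdadm
 * instances race (one per constituent device), exactly one wins. */
static int take_lock(const char *lock_path)
{
    int fd = open(lock_path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;        /* someone else already holds the lock */
    close(fd);
    return 0;
}

/* Called only after the md device fd itself has been closed */
static void drop_lock(const char *lock_path)
{
    unlink(lock_path);
}

/* "udev rule" side: spin until the lock disappears or we give up */
static int wait_unlocked(const char *lock_path, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (access(lock_path, F_OK) != 0 && errno == ENOENT)
            return 0;             /* no lock: device is safe to touch */
        usleep(100 * 1000);       /* poll every 100 ms */
    }
    return -1;                    /* still locked: timed out */
}
```

The ordering matters: the assembler must close the md device file before removing the lock, so that a waiter that sees the lock gone can never race against a still-held exclusive open.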

--- Additional comment from Harald Hoyer on 2013-02-26 08:35:14 EST ---

Why not have a sysfs ATTR{} that flips when mdadm has done its job and a "change" uevent is emitted?

--- Additional comment from Tony Marchese on 2013-02-26 08:45:13 EST ---

In meantime my workaround for this issue was to create /etc/rc.d/rc.local with the following content:

#!/bin/sh
############# mounting /dev/md126 until bug 912735 is solved
/bin/mount -a

--- Additional comment from Doug Ledford on 2013-02-27 12:17:02 EST ---

You know, in hindsight, I want to rethink my position on how this should be solved.

The change from SysV init to udev + systemd has changed a lot.  One of the primary changes is that it took what used to be a serialized startup and made it parallel.  OK, I'm fine with that.  But with the change from serialized to parallel, you need to have proper locking around certain events.  This is to be expected.

But mdadm is already *doing* the proper locking.  It's doing the same locking as the kernel does when it takes a device and mounts it, or when it takes a device and adds it to another virtual device.  The exclusive holding open of a device, whether by a user space program or by the kernel itself, is the authoritative locking around a device.  Everything else is secondary.

So the udev test is fine in that it tests that the md array has been brought up live (something you don't have to worry about on real devices, but is common to all virtual devices).

It does nothing to test if it is available for use.  And in truth, udev *shouldn't* be testing for that.  The proper test for whether or not the device is available for use is to spin on attempting to open the file until either the file is opened, or a timeout passes.  And it should be systemd that does this, not the kernel and not mdadm.

We've been thinking about this from a boot perspective, and in that instance I can sort of see where systemd might want to have the kernel or mdadm fix this issue.  But this isn't a kernel or mdadm issue, it's a parallel startup locking issue.  The parallel startup locking is handled by systemd.  It's what added the parallel bootup, it's where all the other parallel bootup locking is done, so it's where this locking should be too.

For non-boot scenarios, it would be entirely valid for a program that wants to create an md device to do so itself (without the use of mdadm; think some of the gnome disk utility programs, or anaconda) and then to immediately transition to using the device exclusively.  There is nothing preventing this.  So, the idea of flipping state on device close means that such a program would have to open the device, create the md array, close the device, reopen the device, and then use the array.  The clunkiness of such a usage scenario (despite it being contrived in the sense that no one actually does this) points out that delaying the state transition to close is a hack for this problem, not the right fix.

The proper fix here is for systemd to attempt to open a device before it attempts to call fsck/mount on the device.  If you do the open in a thread/process (presumably the same thread/process that you spawned/forked for the fsck/mount operations), then it doesn't interfere with the rest of systemd's operation and you can do something simple like:

    /* Give the device up to 5 seconds to become usable.  On a block
     * device, open() with O_EXCL does not block: it fails immediately
     * with EBUSY while someone else holds an exclusive open, so poll
     * rather than relying on alarm()/EINTR. */
    for (tries = 0; tries < 50; tries++) {
        fd = open(device_path, O_RDONLY | O_EXCL);
        if (fd >= 0 || errno != EBUSY)
            break;
        usleep(100000); /* wait 100 ms before retrying */
    }
    if (fd == -1 && errno == EBUSY)
        /* We timed out, the device isn't available for use */
        return <whatever>;
    if (fd == -1) {
        /* Non-timeout error, make a note of it */
        perror("open");
        return <whatever>;
    }
    close(fd);
    /* Proceed with fsck and possible mount */

This has the advantage of being generic and applicable to all virtual devices (and real devices too); it provides a bit of insulation against udev-triggered access races, in that we will wait up to 5 seconds if some other udev-triggered program beat us to the device; and it's simple.  So, in my opinion, this bug needs to be switched over to systemd and the fix put in place there.

--- Additional comment from Fedora End Of Life on 2013-12-21 06:32:12 EST ---

This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

--- Additional comment from Fedora End Of Life on 2014-02-05 14:18:31 EST ---

Fedora 18 changed to end-of-life (EOL) status on 2014-01-14. Fedora 18 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 1 Zbigniew Jędrzejewski-Szmek 2015-03-20 00:26:51 UTC

*** This bug has been marked as a duplicate of bug 912735 ***

