Bug 184570

Summary: multiple problems with md autodetect
Product: Red Hat Enterprise Linux 4
Reporter: Charlie Brady <charlieb-redhat-bugzilla>
Component: kernel
Assignee: Doug Ledford <dledford>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 4.0
CC: bugzilla, jbaron
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-08-27 23:55:33 UTC

Description Charlie Brady 2006-03-09 21:53:20 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20060202 CentOS/1.0.7-1.4.3.centos4 Firefox/1.0.7

Description of problem:
I've been bitten by a rather disconcerting problem with RAID autostart. I've had a system running for a while with a RAID1 pair, with two partitions (/root and /) on each disk. To simulate a disk failure and recovery, I removed one disk, rebooted, then shut down and added another old disk. After startup I was dismayed to discover that the old system on the second disk was running, not the new system on the first disk.

[root@test7 ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdb2[1]
      4088448 blocks [2/1] [_U]

md1 : active raid1 hdb1[1]
      104320 blocks [2/1] [_U]

unused devices: <none>
[root@test7 ~]#

Looking at dmesg, I can see that RAID autostart considered the autodetect partitions on the second disk first, and was then unable to start the first disk's partitions because the md devices were already running:

md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
...
md: raid1 personality registered as nr 3
md: Autodetecting RAID arrays.
md: autorun ...
md: considering hdb2 ...
md:  adding hdb2 ...
md: hdb1 has different UUID to hdb2
md: hda2 has different UUID to hdb2
md: hda1 has different UUID to hdb2
md: created md2
md: bind<hdb2>
md: running: <hdb2>
raid1: raid set md2 active with 1 out of 2 mirrors
md: considering hdb1 ...
md:  adding hdb1 ...
md: hda2 has different UUID to hdb1
md: hda1 has different UUID to hdb1
md: created md1
md: bind<hdb1>
md: running: <hdb1>
raid1: raid set md1 active with 1 out of 2 mirrors
md: considering hda2 ...
md:  adding hda2 ...
md: hda1 has different UUID to hda2
md: md2 already running, cannot run hda2
md: export_rdev(hda2)
md: considering hda1 ...
md:  adding hda1 ...
md: md1 already running, cannot run hda1
md: export_rdev(hda1)
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: considering hda1 ...
md:  adding hda1 ...
md: hda2 has different UUID to hda1
md: md1 already running, cannot run hda1
md: export_rdev(hda1)
md: considering hda2 ...
md:  adding hda2 ...
md: md2 already running, cannot run hda2
md: export_rdev(hda2)
md: ... autorun DONE.
cdrom: open failed.
kjournald starting.  Commit interval 5 seconds
...

I've found a patch from the 2.4 timeframe which seems to address this issue:

http://www.ussg.iu.edu/hypermail/linux/kernel/0111.3/1644.html

And a more recent post from the linux-lvm list:

https://www.redhat.com/archives/linux-lvm/2005-April/msg00111.html

The md documentation in the kernel source suggested that I could override autodetection via command line arguments, so I tried that:

[root@test7 ~]# cat /proc/cmdline
ro root=/dev/vg_primary/lv_root raid=noautodetect md=1,/dev/hda1 md=2,/dev/hda2
[root@test7 ~]#  
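
For reference, the override would normally be set on the kernel line of the grub stanza; a minimal sketch follows (the root (hd0,0) location and the vmlinuz/initrd file names are assumptions based on the kernel-2.6.9-11.EL version given in the Version-Release field below, not copied from the report):

# /boot/grub/grub.conf (sketch)
title Red Hat Enterprise Linux (2.6.9-11.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-11.EL ro root=/dev/vg_primary/lv_root raid=noautodetect md=1,/dev/hda1 md=2,/dev/hda2
        initrd /initrd-2.6.9-11.EL.img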

I saw no difference, other than an acknowledgement that the md code had seen the command-line arguments:

[root@test7 ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdb2[1]
      4088448 blocks [2/1] [_U]

md1 : active raid1 hdb1[1]
      104320 blocks [2/1] [_U]

unused devices: <none>
[root@test7 ~]#

Kernel command line: ro root=/dev/vg_primary/lv_root raid=noautodetect md=1,/dev/hda1 md=2,/dev/hda2
md: Will configure md1 (super-block) from /dev/hda1, below.
md: Will configure md2 (super-block) from /dev/hda2, below.
...
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
...
md: raid1 personality registered as nr 3
md: Autodetecting RAID arrays.
md: autorun ...
md: considering hdb2 ...
md:  adding hdb2 ...
md: hdb1 has different UUID to hdb2
md: hda2 has different UUID to hdb2
md: hda1 has different UUID to hdb2
md: created md2
md: bind<hdb2>
md: running: <hdb2>
raid1: raid set md2 active with 1 out of 2 mirrors
md: considering hdb1 ...
md:  adding hdb1 ...
md: hda2 has different UUID to hdb1
md: hda1 has different UUID to hdb1
md: created md1
md: bind<hdb1>
md: running: <hdb1>
raid1: raid set md1 active with 1 out of 2 mirrors
md: considering hda2 ...
md:  adding hda2 ...
md: hda1 has different UUID to hda2
md: md2 already running, cannot run hda2
md: export_rdev(hda2)
md: considering hda1 ...
md:  adding hda1 ...
md: md1 already running, cannot run hda1
md: export_rdev(hda1)
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: considering hda1 ...
md:  adding hda1 ...
md: hda2 has different UUID to hda1
md: md1 already running, cannot run hda1
md: export_rdev(hda1)
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: considering hda1 ...
md:  adding hda1 ...
md: hda2 has different UUID to hda1
md: md1 already running, cannot run hda1
md: export_rdev(hda1)
md: considering hda2 ...
md:  adding hda2 ...
md: md2 already running, cannot run hda2
md: export_rdev(hda2)
md: ... autorun DONE.

I see two problems here.

The first, huge, problem is that an old system ends up running from the second drive. This would be very surprising to most admins, I would think, and it makes the current data on the first disk inaccessible and at risk of being wiped and "rejoined" to the mirror.

The second problem is that the md/raid kernel command-line arguments don't work as advertised.



Version-Release number of selected component (if applicable):
kernel-2.6.9-11.EL, kernel-smp-2.6.9-11.EL

How reproducible:
Always

Steps to Reproduce:
1. Break a two disk mirror set by removing second disk
2. Add second disk which was previously part of a mirror pair
3. Reboot
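
[Editorial sketch, not part of the original report: a quick way to confirm which disk's superblocks actually won after the reboot is to compare the UUID of the running array with the UUID stored in each member partition's superblock.]

mdadm --detail /dev/md2 | grep UUID      # UUID of the array that is actually running
mdadm --examine /dev/hda2 | grep UUID    # UUID recorded on the first disk
mdadm --examine /dev/hdb2 | grep UUID    # UUID recorded on the second disk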
  

Additional info:

Comment 1 Charlie Brady 2006-08-27 10:22:52 UTC
I've discovered a technique to ensure that the md devices with the correct UUIDs
are started.

The mdadm package source code includes a file mdassemble.c. If you run "make
mdassemble", you obtain a statically linked program of about 150 kB.

If you include this program and a suitable /etc/mdadm.conf in an initrd file,
and replace these lines:

raidautorun /dev/md1
raidautorun /dev/md2

in /init with:

mknod /dev/md1 b 9 1
mknod /dev/md2 b 9 2
mdassemble

then the raid arrays are correctly assembled.

A suitable /etc/mdadm.conf can be created by capturing the output of "mdadm
--examine --scan":

[root@test7 ~]# mdadm --examine --scan
ARRAY /dev/md2 level=raid1 num-devices=2  \
   UUID=a347e7f8:61d99d7b:bbbfa329:a5175902
   devices=/dev/hda2,/dev/hdb2
ARRAY /dev/md1 level=raid1 num-devices=2 \
   UUID=294182c7:af337fcf:2513036f:3c6231a9
   devices=/dev/hda1,/dev/hdb1
[root@test7 ~]#
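
[Editorial sketch: to apply this technique by hand rather than via mkinitrd, the initrd can be unpacked and repacked roughly as below. This assumes the RHEL4 initrd is a gzipped cpio archive with a nash /init script, and that mdassemble has already been built with "make mdassemble" and copied to /sbin; file names are illustrative.]

mkdir /tmp/initrd-work && cd /tmp/initrd-work
zcat /boot/initrd-$(uname -r).img | cpio -idmv        # unpack the existing initrd

cp /sbin/mdassemble sbin/                             # add the static assembler
mkdir -p etc
echo 'DEVICE partitions' > etc/mdadm.conf
mdadm --examine --scan >> etc/mdadm.conf              # ARRAY lines with the UUIDs above

# edit ./init: replace the "raidautorun /dev/mdN" lines with the mknod
# and /sbin/mdassemble lines shown earlier, then repack:
find . | cpio -o -H newc | gzip -9 > /boot/initrd-$(uname -r).md.img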

[Note that although I've reported this against RHEL4, I haven't seen anything to
suggest the same problem doesn't also affect recent FC releases. I consider this
a serious problem (people commonly reuse disks, and don't expect the last disk
added to be the one which actually boots). I'm surprised this issue hasn't
received any attention from RedHat.]

Comment 2 Doug Ledford 2006-08-27 23:55:33 UTC
The linux kernel's autodetect feature is working as best it can.  In this
situation, you have presented it with two different sets of raid devices that
both claim to be the same md devices (md1 and md2 here), both have valid, but
different, uuids and both think they are up to date.  The kernel has *no* way
of knowing which one is right, so you only have a 50/50 chance of getting the
right device started.  In order to avoid a situation like this, there are
multiple options:

1) Don't use autodetect and instead use manual startup (which is what your
second post describes, although you really want to leave the device lines out of
each array definition and add the line 'DEVICE partitions' to your mdadm.conf; a
minimal example is sketched after this list).

2) Before taking a disk out of service where it might be reused elsewhere, wipe
the partition table so that it will be seen as clean when you put it into
another machine.

3) If you don't want to wipe the partition table, you can at least set the
partition type to Linux instead of Raid Autodetect, which will keep the drives
from interfering with the normal startup of raid arrays in whatever machine you
put them into (I should also note that if you are using manual startup by uuid,
as in your second post, you can switch all of your partitions to linux instead
of raid autodetect and mdassemble will still assemble them just fine).

4) If you do put the drive into a machine and hit the problem you posted about,
all you need to do is run fdisk on the new drive, switch its partition types to
linux, and reboot; you'll now be on your original raid arrays.  Then hot-add the
replacement disk's partitions into your running arrays, which will overwrite the
superblocks on the new disk so that it matches your running arrays.  Finally,
re-enable the autodetect partition type on the replacement disk's partitions and
reboot; the system will come up on the correct drive with the arrays fully
assembled.
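
[Editorial sketch, not part of the reply above: for option 1, a minimal mdadm.conf of the kind described, built from the UUIDs in the reporter's "mdadm --examine --scan" output in comment 1, could look like the first part below. The sfdisk lines illustrate options 3/4 and assume the sfdisk of that era, which accepts --change-id; type fd is "Linux raid autodetect", 83 is plain "Linux".]

# /etc/mdadm.conf -- scan all partitions, identify arrays by UUID only
DEVICE partitions
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=294182c7:af337fcf:2513036f:3c6231a9
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=a347e7f8:61d99d7b:bbbfa329:a5175902

# options 3/4: mark the newly inserted disk's partitions (hdb in the
# reporter's logs) as plain Linux so autodetect skips them
sfdisk --change-id /dev/hdb 1 83
sfdisk --change-id /dev/hdb 2 83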

So, the long and short of it is that we don't support both A) using autodetect
and B) putting disks with valid superblocks and raid autodetect partitions into
a system that already has existing md devices with the same md device names and
is already using autodetect.  It is the system administrator's responsibility to
control which devices are labelled with RAID superblocks and tagged in the
partition table for autodetect startup, especially when shifting drives between
machines.

Comment 3 Charlie Brady 2006-08-28 05:14:13 UTC
> The linux kernel's autodetect feature is working as best it can. 

That's debatable. I think it would be less surprising, and more likely to be
correct, if it searched devices first to last, rather than last to first.

> The kernel has *no* way of knowing which one is right, so you only
> have a 50/50 chance of getting the right device started. 

Correct. This is why I believe RedHat should not leave it to the kernel, but
should provide some assistance via initrd.

> 1) Don't use autodetect and instead use manual startup (which is what your
> second post describes, although you really want to leave the device lines
> out of each array definition and add the line 'DEVICE partitions' to your
> mdadm.conf).

Isn't this what RedHat/FC should do, so that the mounted root partition is the
correct one - matching the booting kernel and the grub entry?

> So, the long and short of it is that we don't support both [A and B]

I think your product would be more reliable if you did(*), and it would be a
simple modification to mkinitrd and mdadm to make it so.

However, if you choose not to make it so, then I think you should document this
gotcha. If, as you say, the sysadmin has a responsibility to avoid this problem,
then he/she should be made aware of that responsibility. It's certainly a
surprising one.

(*) It seems that Debian's initrd makes efforts to ensure that the correct uuid
is mounted as root. See, e.g.:

http://www.mail-archive.com/debian-bugs-closed@lists.debian.org/msg84008.html
http://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg227364.html
http://lists.debian.org/debian-kernel/2005/03/msg00180.html


Comment 4 Charlie Brady 2006-08-31 04:35:39 UTC
This change to the mdadm spec file adds mdassemble to the system:

@@ -30,6 +30,7 @@
 %build
 make CXFLAGS="$RPM_OPT_FLAGS" SYSCONFDIR="%{_sysconfdir}" mdadm
 make CXFLAGS="$RPM_OPT_FLAGS" SYSCONFDIR="%{_sysconfdir}" -C mdmpd mdmpd
+make CXFLAGS="$RPM_OPT_FLAGS" SYSCONFDIR="%{_sysconfdir}" mdassemble

 %install
 make DESTDIR=$RPM_BUILD_ROOT MANDIR=%{_mandir} BINDIR=/sbin install
@@ -40,6 +41,8 @@
 mkdir -p -m 700 $RPM_BUILD_ROOT/var/run/mdmpd
 mkdir -p -m 700 $RPM_BUILD_ROOT/var/run/mdadm

+install -D -m750 mdassemble $RPM_BUILD_ROOT/sbin/mdassemble
+
 %clean
 [ $RPM_BUILD_ROOT != / ] && rm -rf $RPM_BUILD_ROOT

@@ -75,6 +78,9 @@
 %attr(0700,root,root) %dir /var/run/mdadm

 %changelog
+* Mon Aug 28 2006 Charlie Brady <charlieb> 1.6.0-3sme01
+- Add mdassemble
+
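
[Editorial sketch: to turn the spec change into packages, the usual RHEL4 steps would be roughly the following; the exact source rpm name is assumed from the 1.6.0-3 version in the changelog entry.]

rpm -ivh mdadm-1.6.0-3.src.rpm                 # unpacks sources under /usr/src/redhat
# apply the spec change above to /usr/src/redhat/SPECS/mdadm.spec, then:
rpmbuild -ba /usr/src/redhat/SPECS/mdadm.spec  # rebuilds mdadm, now including /sbin/mdassemble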

And this patch to mkinitrd makes it use mdassemble instead of raidautorun in the initrd:

@@ -702,8 +702,12 @@
 if [ -n "$startraid" ]; then
     for dev in $raiddevices; do
        cp -a /dev/${dev} $MNTIMAGE/dev
-       echo "raidautorun /dev/${dev}" >> $RCFILE
     done
+    cp -a /sbin/mdassemble $MNTIMAGE/sbin
+    mkdir -p $MNTIMAGE/etc
+    echo DEVICE partitions > $MNTIMAGE/etc/mdadm.conf
+    mdadm --examine --scan | sed '/devices=/d' >> $MNTIMAGE/etc/mdadm.conf
+    echo "/sbin/mdassemble" >> $RCFILE
 fi

 if [ -z "$USE_UDEV" ]; then


Comment 5 Charlie Brady 2006-08-31 06:02:30 UTC
This mkinitrd patch actually works. The '/devices=/d' sed expression in the
previous patch stripped rather too much of the config, since it also matched the
"num-devices=" text on the ARRAY lines. I don't understand why we need mknod
rather than static copies of the device node files, but it seems we do.

@@ -705,9 +705,16 @@

 if [ -n "$startraid" ]; then
     for dev in $raiddevices; do
-       cp -a /dev/${dev} $MNTIMAGE/dev
-       echo "raidautorun /dev/${dev}" >> $RCFILE
+       echo mknod /dev/${dev} b 9 $(echo $dev | sed s/md//) >> $RCFILE
     done
+    cp -a /sbin/mdassemble $MNTIMAGE/sbin
+    mkdir -p $MNTIMAGE/etc
+    echo DEVICE partitions > $MNTIMAGE/etc/mdadm.conf
+    mdadm --examine --scan | \
+     sed -r \
+       -e '/^ +devices=/d' \
+       -e 's/ num-devices=[0-9]+//' >> $MNTIMAGE/etc/mdadm.conf
+    echo "/sbin/mdassemble" >> $RCFILE
 fi

 if [ -z "$USE_UDEV" ]; then
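
[Editorial note: with the reporter's two arrays, the mdadm.conf generated inside the initrd by this patch would come out roughly as follows, reconstructed from the "mdadm --examine --scan" output in comment 1.]

DEVICE partitions
ARRAY /dev/md2 level=raid1 UUID=a347e7f8:61d99d7b:bbbfa329:a5175902
ARRAY /dev/md1 level=raid1 UUID=294182c7:af337fcf:2513036f:3c6231a9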