Bug 808774 - make mdadm --add not do stupid things
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 17
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Assigned To: Jes Sorensen
QA Contact: Fedora Extras Quality Assurance
Reported: 2012-03-31 13:55 EDT by Doug Ledford
Modified: 2012-07-29 12:58 EDT (History)
CC: 7 users

Doc Type: Bug Fix
Clone Of: 807743
Related: 808776
Last Closed: 2012-06-06 22:49:23 EDT

Description Doug Ledford 2012-03-31 13:55:35 EDT
+++ This bug was initially created as a clone of Bug #807743 +++


--- Additional comment from franta@hanzlici.cz on 2012-03-31 05:38:08 EDT ---

(In reply to comment #2)
> Package mdadm-3.2.3-7.fc17:
> * should fix your issue,
> * was pushed to the Fedora 17 testing repository,
> * should be available at your local mirror within two days.
> Update it with:
> # su -c 'yum update --enablerepo=updates-testing mdadm-3.2.3-7.fc17'
> as soon as you are able to.
> Please go to the following url:
> https://admin.fedoraproject.org/updates/FEDORA-2012-4846/mdadm-3.2.3-7.fc17
> then log in and leave karma (feedback).

I just tested mdadm-3.2.3-7.fc16.i686; it still does not work for me:

# mdadm --incremental --run /dev/sda7
mdadm: failed to add /dev/sda7 to /dev/md4: Invalid argument.

# mdadm /dev/md4 --re-add /dev/sda7
mdadm: --re-add for /dev/sda7 to /dev/md4 is not possible

# mdadm /dev/md4 --add /dev/sda7
mdadm: /dev/sda7 reports being an active member for /dev/md4, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sda7 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sda7" first.

# mdadm --examine /dev/sdb7
/dev/sdb7:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x0
     Array UUID : aba28f0e:c4cf2667:c6e4ae59:86859548
           Name : localhost.localdomain:3
  Creation Time : Tue Oct 26 18:12:49 2010
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 590895085 (281.76 GiB 302.54 GB)
     Array Size : 590894820 (281.76 GiB 302.54 GB)
  Used Dev Size : 590894820 (281.76 GiB 302.54 GB)
    Data Offset : 2048 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 27f49748:2e609ce9:e9eb7f9e:b5c4bc44

    Update Time : Sat Mar 31 08:54:35 2012
       Checksum : cac8a2b7 - correct
         Events : 698


   Device Role : Active device 1
   Array State : .A ('A' == active, '.' == missing)


# mdadm --examine /dev/sda7
/dev/sda7:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x0
     Array UUID : aba28f0e:c4cf2667:c6e4ae59:86859548
           Name : localhost.localdomain:3
  Creation Time : Tue Oct 26 18:12:49 2010
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 590895085 (281.76 GiB 302.54 GB)
     Array Size : 590894820 (281.76 GiB 302.54 GB)
  Used Dev Size : 590894820 (281.76 GiB 302.54 GB)
    Data Offset : 2048 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 700ef5e0:032a67af:e629ec27:3588cc7b

    Update Time : Fri Mar 30 17:13:50 2012
       Checksum : e96aaa00 - correct
         Events : 228


   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)

# cat /proc/mdstat 
Personalities : [raid1] 
md4 : active raid1 sdb7[1]
      295447410 blocks super 1.1 [2/1] [_U]
      
md2 : active raid1 sda5[0] sdb5[1]
      16553736 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md3 : active raid1 sdb6[1]
      157285308 blocks super 1.1 [2/1] [_U]
      
md0 : active raid1 sdb2[1] sda2[0]
      409588 blocks super 1.0 [2/2] [UU]
      
md1 : active raid1 sdb4[1] sda4[0]
      16587704 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk


In fact, I have several (6+) F16 i686 machines with md RAID1 devices (nearly always configured as md0=/boot, md1=/, md2=/home, plus one to three other RAID1 md devices mounted somewhere under /mnt). On all of them it happens that after a reboot some arrays come up degraded. md0 and md1 (/boot and /) always seem to be OK, while the others are degraded. I also think (though I am not sure) that the command
"mdadm /dev/mdX --re-add /dev/sdYN" usually works on arrays with 1.2 metadata, while on the others it fails with the message
"mdadm: --re-add for /dev/sda7 to /dev/md4 is not possible"
and "mdadm /dev/mdX --add /dev/sdYN" ends with the message:
mdadm: /dev/sda7 reports being an active member for /dev/md4, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sda7 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sda7" first.

Zeroing the superblock works, but it is a rather drastic step, and I never had to use it on Fedora 14 and older. So I have one more problem in addition to this one: why are RAID1 members in a degraded state after a Fedora reboot? :(

--- Additional comment from Jes.Sorensen@redhat.com on 2012-03-31 06:30:20 EDT ---

Frantisek,

The arrays you show don't have bitmaps, so the change in 3.2.3-7 will not
affect you at all; that is expected.

Second, the --re-add change was introduced because the old behaviour was
dangerous. In a plain raid1 there is no way for mdadm to know which of the
two disks is authoritative, which means you could end up with the drive
holding old data being started first and the one holding fresh data then
being overwritten in a re-add. That is why mdadm explicitly asks you to
zero the superblock first.

I hope that clarifies the issue.

Jes

--- Additional comment from franta@hanzlici.cz on 2012-03-31 07:38:53 EDT ---

Yeah, I have read about this now, sorry for the noise. I was thinking that mdadm decides which data are newer on a timestamp basis, something like the "Update Time" field. IMO, when the system starts with only one of the two disks, the active disk must have the newer data, am I right? Then I must zero the superblock on the inactive one and add it back to the array. I see one impractical thing here: when the system cannot decide which disk contains the newer data, how is a human operator supposed to decide?
Umhh, anyway, I still need to find out why some RAID1 arrays on my F16 machines are degraded after a regular reboot.

--- Additional comment from dledford@redhat.com on 2012-03-31 13:52:20 EDT ---

(In reply to comment #3)
> (In reply to comment #2)
> > Package mdadm-3.2.3-7.fc17:
> > * should fix your issue,
> > * was pushed to the Fedora 17 testing repository,
> > * should be available at your local mirror within two days.
> > Update it with:
> > # su -c 'yum update --enablerepo=updates-testing mdadm-3.2.3-7.fc17'
> > as soon as you are able to.
> > Please go to the following url:
> > https://admin.fedoraproject.org/updates/FEDORA-2012-4846/mdadm-3.2.3-7.fc17
> > then log in and leave karma (feedback).

I'm replying to Frank's comment, but this is as much for Jes' benefit as anything else.

> I was just testing mdadm-3.2.3-7.fc16.i686, still not work for me:
> 
> # mdadm --incremental --run /dev/sda7
> mdadm: failed to add /dev/sda7 to /dev/md4: Invalid argument.

This makes sense.  Incremental mode is normally an automated mode: /dev/sda7 can't be brought into the array because it's out of date, so it gets kicked out.  However, please note that neither add nor re-add is an automated mode; they are manual modes.

> # mdadm /dev/md4 --re-add /dev/sda7
> mdadm: --re-add for /dev/sda7 to /dev/md4 is not possible

This makes sense.  We can't re-add the device to the array because it doesn't have a bitmap, and the whole deal behind a re-add is that it uses the bitmap to know which portions of the disk are out of date and only resyncs those portions from the current disk to the out-of-date disk.

However, I would note that if we *did* have a bitmap, mdadm would happily re-add the device to the running array and we would be *explicitly* wiping out part of /dev/sda7 with the contents of the running array.  This is *no* different from adding /dev/sda7 *except* that we don't wipe out all of /dev/sda7; we only wipe out the portion the bitmap says is out of date.

So, to address Jes' comments about not knowing which of the two disks is authoritative: having a bitmap does not solve that problem in any way.  We still don't know which disk is authoritative during a re-add; we simply copy from the active array to the re-added disk (we have to; if we did anything else, the upper-layer block device would suddenly see inconsistent data as the new data copied over the previously existing data).

The trick to both re-add and add then is to always add the stale device to a running array built from the current device.

> # mdadm /dev/md4 --add /dev/sda7
> mdadm: /dev/sda7 reports being an active member for /dev/md4, but a --re-add
> fails.
> mdadm: not performing --add as that would convert /dev/sda7 in to a spare.
> mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sda7" first.

This, on the other hand, is the decision that I find to be wholly unjustified.

First, the user didn't ask for a re-add, they asked for an add.  So telling me that a re-add won't work is neither helpful nor what I want.  If I wanted a re-add, I would have *used* re-add.

Second, the device doesn't have a bitmap, so automatically upgrading the procedure from an add to a re-add doesn't make sense; it *never* had a chance of working.  At a minimum, if the user asks for an add, you should only upgrade it to a re-add if the device you are adding to has a bitmap.  Otherwise, it should remain an add.  And even if you upgrade to re-add and fail, it's not clear that you shouldn't then fall back to add and try again, since add is what the user asked for in the first place.

Third, this nanny-state user protectionism about "we don't know which device is current" is total hooey.  The add process *defines* which device is treated as current: the active array is current and the added device is *always* the one we overwrite.  This is no different from re-add.  In fact, if you were to re-add a device with a bitmap where the bitmap was all set to 1, the net result would be *exactly* the same as an add, yet we don't prompt users to zero the superblock first in that case; we simply go ahead and do the re-add and completely wipe out the re-added disk.

One of the first principles of any nanny-state protection scheme is that it at least needs to do what it is supposed to do.  In this case, it is intended to protect the user against accidentally wiping out the more current device with the less current device's contents (and that's a valid issue).  But this change doesn't do that.

In the case of re-add it doesn't actually do *anything* towards this goal, it leaves things just like they used to be with add.  In the case of add, it pushes the problem off on the user by requiring an extra step that might result in the user inspecting things and catching their error (if there is even one to catch).  But not all users know that the event counter is what will tell them which device is more current.  Nor do they know that the event counters of a raid1 device can diverge such that they are both incremented independently and it might be impossible to tell even by the event counters which raid1 device is most current.  And the error message doesn't help them figure this out, it just instructs them to zero their superblock.  This is no safer than just doing the add in the first place.

And it takes a previously working command (add) and renders it almost useless, because there is no option for "don't automatically upgrade my add to re-add, do the add that I requested damn you".  I have already received a customer complaint about this, and it is a legitimate one: the customer manages remote systems and doesn't want their local users to have to zero *any* superblocks, which they consider a far riskier operation than just adding a drive to an array.

Now, what we should be doing, and what is sufficient for 90% of all cases, is checking the event counter of the array against the device being added.  If the device being added is behind the current count, then the add or re-add should succeed.  If its event counter is ahead, then we should refuse to do the add, citing that the other device is "fresher" than our current array, and noting that because our current array is up and running, the other device and our current array are diverging at this very moment.  In that case we would suggest shutting down, booting into rescue mode, recovering whatever you can from each version of the array (possibly by backing up each member of the raid1 array separately), merging the two arrays back together, then restarting the machine and using the backups to merge the divergent data manually.
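
The event-counter check proposed above could be sketched roughly like this. This is a hypothetical illustration of the policy, not mdadm's actual code; the function name and return values are invented for the example:

```python
# Sketch of the proposed safety check for --add/--re-add, based on the
# Events counters in the v1.x superblock. Illustration only, not mdadm's
# real implementation.

def check_add_allowed(array_events: int, candidate_events: int) -> str:
    """Decide whether a stale member may be (re-)added to a running array.

    array_events     -- Events counter of the running array
    candidate_events -- Events counter from the candidate's superblock
    """
    if candidate_events <= array_events:
        # The candidate is behind (or equal): the running array is
        # authoritative, so overwriting the candidate is safe.
        return "allow"
    # The candidate is *ahead* of the running array: the copies have
    # diverged, and proceeding would silently discard the fresher data.
    return "refuse: candidate device is fresher than the running array"

# Using the Events values from the --examine output above
# (sdb7 in the running array at 698, kicked sda7 at 228):
print(check_add_allowed(698, 228))  # allow
```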

That is about the best we can do given the current superblock.  It does not, however, address the case where the current array is ahead of the device being added, but the device being added has in fact been mounted since it was kicked from the array and has diverged.  In that case, we would silently lose that divergent data.  If Neil *really* wants to solve this problem properly, there is only one way to do so (and fortunately only raid1, and maybe certain specific layouts of raid10, need this; the stripe-based raid arrays need not worry about it).

Make the event counter exist both in the superblock and in a newly created per-device table that we would track in each superblock.  That table would record, for every device in the array, the device's uuid, its state, and its event counter at the last sighting.  Then, when re-adding or adding (either one), if the superblock of the device being added has a recognized device uuid, and the event counter in that device's superblock matches the last-sighting event counter in the other device's table, we know the device being added has not been mounted and used since it was kicked from the array: no data changes have been made since it was last part of this array, so it may be safely added back (in either re-add or add mode) without loss of data; it will essentially be fast-forwarded to our current state.  If the device has a differing event counter, then we know the two arrays have been mounted separately and their contents have diverged (really only a problem for raid1 arrays; nothing with stripes or parity has this problem, which limits the number of table entries we need and means we might be able to squeeze this into the main superblock).  In that case we can print this information out to the user and refuse to do the add until the user has determined what data might be lost and then zeroed the superblock on the device they have already recovered their data from.
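
The proposed "last sighting" table could be modeled like this. The data structures and function are purely hypothetical (the real superblock format is fixed by the md driver); the device UUIDs are the ones from the --examine output above:

```python
# Hypothetical model of the proposed per-device "last sighting" table.
# Each member's superblock would record, for every device in the array,
# the Events counter at the moment that device was last seen.

last_seen = {  # as recorded in the surviving member's (sdb7's) superblock
    "27f49748:2e609ce9:e9eb7f9e:b5c4bc44": 698,  # sdb7, still active
    "700ef5e0:032a67af:e629ec27:3588cc7b": 228,  # sda7, when it was kicked
}

def safe_to_readd(device_uuid: str, device_events: int) -> bool:
    """A device may be fast-forwarded back in only if its own Events
    counter still matches what the array recorded when it last saw the
    device, i.e. it has not been assembled and written to separately."""
    return last_seen.get(device_uuid) == device_events

# sda7 still carries Events=228, matching the last sighting: safe.
print(safe_to_readd("700ef5e0:032a67af:e629ec27:3588cc7b", 228))  # True
# Had sda7 been assembled alone and written to, its counter would have
# advanced past 228 and the add would be refused.
print(safe_to_readd("700ef5e0:032a67af:e629ec27:3588cc7b", 231))  # False
```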

I'll be bringing this up with Neil (and you can do so in person, Jes, if you read this before you see him).  But I think this change needs to be backed out, and a simple test of the event counter is all we need unless and until Neil implements the superblock change I outlined above.  I'll look into making this happen in our mdadm package.  Will clone this to another bug to track the issue there.

> # mdadm --examine /dev/sdb7
> /dev/sdb7:
>     Update Time : Sat Mar 31 08:54:35 2012
>        Checksum : cac8a2b7 - correct
>          Events : 698
> 
> 
>    Device Role : Active device 1
>    Array State : .A ('A' == active, '.' == missing)
 
> # mdadm --examine /dev/sda7
> /dev/sda7:
>     Update Time : Fri Mar 30 17:13:50 2012
>        Checksum : e96aaa00 - correct
>          Events : 228
> 
> 
>    Device Role : Active device 0
>    Array State : AA ('A' == active, '.' == missing)

Frank: the Events counter above is how mdadm knows whether two array devices are in sync.  The date is not as important, because you could start a raid1 device with only one member, check it in read-only mode, then stop the device.  That would update the time on the superblock, but the event counter is only updated when something is actually written to the device, giving it new data.
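
For anyone scripting this comparison, the Events counter can be pulled out of `mdadm --examine` output with a small parser like the following (a sketch; exact field spacing can vary between mdadm versions):

```python
import re

def parse_events(examine_output: str) -> int:
    """Extract the Events counter from `mdadm --examine` output."""
    m = re.search(r"^\s*Events\s*:\s*(\d+)", examine_output, re.MULTILINE)
    if m is None:
        raise ValueError("no Events field found in --examine output")
    return int(m.group(1))

# Excerpt from the --examine output quoted above:
sample = """\
/dev/sda7:
    Update Time : Fri Mar 30 17:13:50 2012
       Checksum : e96aaa00 - correct
         Events : 228
"""
print(parse_events(sample))  # 228
```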

Now, as for why your devices are coming up degraded, that I can't answer but it is certainly something that needs to be addressed (although not in this bug).
Comment 1 Frantisek Hanzlik 2012-03-31 15:03:43 EDT
Doug, many thanks for your explanation. Now I realize that what I wrote before (".. command 'mdadm /dev/mdX --re-add /dev/sdYN' usually works on arrays with 1.2 metadata, and on others fails..") should be put a bit differently: the command was working on md arrays with a bitmap and failing on the others (perhaps anaconda or newer mdadm defaults create arrays with a bitmap, while the older ones I created manually were without one). In /proc/mdstat I see the bitmap line mainly for newly created arrays; the older ones are without it.

I would be happy with the mdadm behavior you propose here. Thanks in advance!
Comment 2 Doug Ledford 2012-03-31 15:51:40 EDT
Frank, you're correct in that it "works on arrays with bitmaps and not on others".  The later Anaconda installers (f15, or maybe f16, and later) will use a bitmap by default on larger arrays (something like 10GB or larger), but not on small arrays (arrays used for /boot, and maybe swap, depending on how you set things up).  The bitmap causes a small but perceptible performance degradation under write operations.  However, the bitmap is mainly used so that when an array needs to be resynced, you only resync the parts of the array where a bit marks the array as dirty.  On multi-terabyte arrays, this can mean the difference between a resync taking hours (or even longer than a day) and one taking just minutes.  On very small arrays, where it only takes a minute or two to resync the entire array, there really is no benefit to the bitmap.  So: manually created arrays where you didn't specify a bitmap, or small arrays - that's where you wouldn't expect to find one.  You can, however, add a bitmap after creation with this command:

mdadm --grow /dev/md<whatever> --bitmap=internal
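
To see which existing arrays already carry a bitmap, the "bitmap:" line can be picked out of /proc/mdstat. A small sketch (an illustration, not part of mdadm; the function name is invented):

```python
# Sketch: detect which arrays listed in /proc/mdstat carry a write-intent
# bitmap. A "bitmap:" line appears under an array's entry only if one exists.

def arrays_with_bitmap(mdstat_text: str) -> set:
    found, current = set(), None
    for line in mdstat_text.splitlines():
        if line and not line[0].isspace() and " : " in line:
            current = line.split(" : ")[0]        # e.g. "md2"
        elif "bitmap:" in line and current:
            found.add(current)
    return found

# Excerpt from the /proc/mdstat output quoted above:
sample = """\
md4 : active raid1 sdb7[1]
      295447410 blocks super 1.1 [2/1] [_U]

md2 : active raid1 sda5[0] sdb5[1]
      16553736 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
"""
print(arrays_with_bitmap(sample))  # {'md2'}
```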
Comment 3 Frantisek Hanzlik 2012-04-08 09:17:18 EDT
According to the 16 Feb 2012 "F16 occasionally breaks RAID1 (md) on boot" thread on the Fedora mailing list, Sam Varshavchik states that the "mdraid_start" routine in the initramfs is not reliable. The workaround is to put "rd.md.uuid=RAID_UUID" on the kernel command line for every RAID1 md device and let the kernel initialize them. That seems to have worked fine for me too.
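
For example (a sketch of the workaround; the UUID shown is the Array UUID from the --examine output above, and the exact bootloader file to edit depends on your setup):

```shell
# Find each array's UUID:
mdadm --detail /dev/md4 | grep UUID
# Then append one rd.md.uuid=<uuid> per array to the kernel line in the
# bootloader configuration, e.g.:
#   rd.md.uuid=aba28f0e:c4cf2667:c6e4ae59:86859548
```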
Comment 4 Fedora Update System 2012-05-02 06:09:54 EDT
mdadm-3.2.3-9.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.3-9.fc17
Comment 5 Fedora Update System 2012-05-02 06:15:59 EDT
mdadm-3.2.3-9.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/mdadm-3.2.3-9.fc16
Comment 6 Fedora Update System 2012-05-02 06:19:46 EDT
mdadm-3.2.3-9.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/mdadm-3.2.3-9.fc15
Comment 7 Fedora Update System 2012-05-02 16:34:51 EDT
Package mdadm-3.2.3-9.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing mdadm-3.2.3-9.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-7105/mdadm-3.2.3-9.fc17
then log in and leave karma (feedback).
Comment 8 Fedora Update System 2012-05-10 10:49:23 EDT
mdadm-3.2.4-2.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.4-2.fc17
Comment 9 Fedora Update System 2012-05-10 10:51:59 EDT
mdadm-3.2.4-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/mdadm-3.2.4-2.fc16
Comment 10 Fedora Update System 2012-05-10 10:54:38 EDT
mdadm-3.2.4-2.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/mdadm-3.2.4-2.fc15
Comment 11 Fedora Update System 2012-05-15 12:36:44 EDT
mdadm-3.2.4-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.4-3.fc17
Comment 12 Fedora Update System 2012-05-15 12:43:07 EDT
mdadm-3.2.4-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/mdadm-3.2.4-3.fc16
Comment 13 Fedora Update System 2012-05-15 12:47:53 EDT
mdadm-3.2.4-3.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/mdadm-3.2.4-3.fc15
Comment 14 Fedora Update System 2012-05-21 11:08:53 EDT
mdadm-3.2.5-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.5-1.fc17
Comment 15 Fedora Update System 2012-05-21 11:15:09 EDT
mdadm-3.2.5-1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/mdadm-3.2.5-1.fc16
Comment 16 Fedora Update System 2012-05-21 11:20:54 EDT
mdadm-3.2.5-1.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/mdadm-3.2.5-1.fc15
Comment 17 Fedora Update System 2012-06-06 22:49:23 EDT
mdadm-3.2.5-1.fc15 has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 18 Fedora Update System 2012-06-06 22:52:36 EDT
mdadm-3.2.5-1.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 19 Fedora Update System 2012-06-26 03:17:03 EDT
mdadm-3.2.5-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.5-3.fc17
Comment 20 Fedora Update System 2012-07-11 19:54:39 EDT
mdadm-3.2.5-3.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.
