Bug 807743 - mdadm write intent map not functioning as before
Summary: mdadm write intent map not functioning as before
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 17
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jes Sorensen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 769323
Blocks: 791159
 
Reported: 2012-03-28 15:20 UTC by Jes Sorensen
Modified: 2012-07-11 23:55 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 769323
Cloned to: 808774
Environment:
Last Closed: 2012-07-11 23:55:05 UTC
Type: ---
Embargoed:


Attachments

Description Jes Sorensen 2012-03-28 15:20:52 UTC
+++ This bug was initially created as a clone of Bug #769323 +++

Description of problem:

When a temporarily missing RAID 1 component becomes available again, incremental assembly should consult the write intent map and resynchronize only those blocks recorded as changed.  Instead, "mdadm -I --run /dev/sdc1" rejects the returning component with an "Invalid argument" error.
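
For reference, the contents of a write intent bitmap can be inspected with mdadm's --examine-bitmap (-X) mode.  A minimal sketch, using the bitmap path recorded in /etc/mdadm.conf below and one of the member devices (adjust for your own setup):

mdadm -X /var/preserve/extern0    # external bitmap file: shows the chunk size and how many chunks are marked dirty
mdadm -X /dev/sdc1                # for an array with an internal bitmap, point -X at a member device instead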

Version-Release number of selected component (if applicable):

mdadm-3.2.2-125.fc16.x86_64
kernel-3.1.5-6.fc16.x86_64

How reproducible:

Always.

Steps to Reproduce:
1. Create a RAID 1 device as follows:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 --write-mostly /dev/sdc1 --bitmap=/var/preserve/vg_extern0 --assume-clean

2. Record the result of "mdadm -D --brief /dev/md0" in /etc/mdadm.conf:

ARRAY /dev/md0 metadata=1.2 bitmap=/var/preserve/extern0 name=myhost:md0 UUID=c84d260e:f10d4241:af5e0c24:1f18d143

3. Use the RAID device (e.g., create a volume group on it and put a file system on one of the logical volumes).  This works as expected.

4. Shut down the system.

5. Physically disconnect the /dev/sdc device.  Leave /dev/sdb connected.

6. Boot the system.  /dev/md0 will be assembled and run with the one available component.

7. Use the RAID 1 device.  Activate the logical volume, mount the filesystem from it, and write a file in that file system.

8. Shut down the system.

9. Physically reattach the /dev/sdc device.

10. Boot the system.

11. /dev/md0 will be assembled with only the /dev/sdb1 component participating.

12. sudo mdadm -I --run /dev/sdc1
  
Actual results:

mdadm: failed to add /dev/sdc1 to /dev/md0: Invalid argument.

Expected results:

/dev/sdc1 should join the RAID 1 array and be shown as synchronizing in the output of "cat /proc/mdstat".  If the writes made to the lone RAID component were small, the synchronization may finish before you can observe it, and /proc/mdstat will then show both components up to date ("[UU]").
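
If you want to watch the resynchronization as it happens (device names as in the steps above), something like:

watch cat /proc/mdstat       # live view of the recovery progress and the [_U]/[UU] member state
mdadm --detail /dev/md0      # per-array view; while resyncing it includes a rebuild/resync progress line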

Additional info:

This worked as expected in Fedora 14.

--- Additional comment from Jes.Sorensen on 2012-01-03 05:33:13 EST ---

Stephen,

What is the output from mdadm when you try to add the second device?
There has to be more to it than just 'Invalid argument'.

Second, you shouldn't need to add --run when specifying -I; I don't know
whether that makes any difference.
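
As an aside, one way to get more detail than the bare error message, assuming the device name from the original report:

mdadm --incremental --verbose /dev/sdc1    # --verbose makes incremental mode print what it decided and why
dmesg | tail -n 20                         # the kernel md driver usually logs the reason behind an "Invalid argument"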

Jes

--- Additional comment from Jes.Sorensen on 2012-01-03 08:15:21 EST ---

Stephen,

Ok, I tried reproducing the bug here, and it looks like I am seeing the
exact same thing as you.

I will try and investigate further.

Jes

--- Additional comment from Jes.Sorensen on 2012-01-03 10:33:45 EST ---

Ok, this is strange.

If I try to add the drive this way:

mdadm -I --run /dev/sdf3

I get the same error as you. On the other hand if I do the following, it
works just fine:

mdadm -a /dev/md42 /dev/sdf3

Looks like something isn't detected correctly by either the kernel or mdadm
when doing an Incremental add. This happens with the latest mdadm from
Neil's git tree and Linus' top of tree kernel as well btw.

Jes

--- Additional comment from sschaefer on 2012-01-04 16:35:50 EST ---

Thanks for pursuing this and sorry for not responding; I'd been under the weather.  I'll see if the "-a" version does what I need - if the write intent map gets ignored, I end up with a 16-hour sync that needs to finish in a 10-hour window.

--- Additional comment from Jes.Sorensen on 2012-01-05 03:43:09 EST ---

Stephen,

No worries, I've been battling being under the weather myself. I think
you have a genuine bug here, but -a seems to work as a workaround for me.
I think it respects the bitmap, but I would certainly appreciate it if you
can confirm that it does.

I will continue investigating why -I doesn't do the right thing.

Cheers,
Jes

--- Additional comment from sschaefer on 2012-01-05 11:49:28 EST ---

I'm pleased to report that, with the -a workaround, the write intent map works as hoped.  Thanks,

    - Stephen
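
A quick way to confirm that only the dirty blocks were copied, using the bitmap path from the original report (the exact counts will differ):

mdadm -X /var/preserve/extern0 | grep -i bitmap    # the dirty-chunk count should drop back to 0 once the add completes
cat /proc/mdstat                                   # a bitmap-driven recovery of a few small writes finishes almost immediately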

--- Additional comment from Jes.Sorensen on 2012-01-06 01:32:24 EST ---

Stephen,

Thanks for confirming this. This means at least the core infrastructure
seems to be working as expected. I'll look into why it doesn't do the
right thing with -I

Cheers,
Jes

Comment 1 Fedora Update System 2012-03-28 15:54:24 UTC
mdadm-3.2.3-7.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.3-7.fc17

Comment 2 Fedora Update System 2012-03-28 19:37:15 UTC
Package mdadm-3.2.3-7.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing mdadm-3.2.3-7.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-4846/mdadm-3.2.3-7.fc17
then log in and leave karma (feedback).

Comment 3 Frantisek Hanzlik 2012-03-31 09:38:08 UTC
(In reply to comment #2)
> Package mdadm-3.2.3-7.fc17:
> * should fix your issue,
> * was pushed to the Fedora 17 testing repository,
> * should be available at your local mirror within two days.
> Update it with:
> # su -c 'yum update --enablerepo=updates-testing mdadm-3.2.3-7.fc17'
> as soon as you are able to.
> Please go to the following url:
> https://admin.fedoraproject.org/updates/FEDORA-2012-4846/mdadm-3.2.3-7.fc17
> then log in and leave karma (feedback).

I was just testing mdadm-3.2.3-7.fc16.i686, and it still does not work for me:

# mdadm --incremental --run /dev/sda7
mdadm: failed to add /dev/sda7 to /dev/md4: Invalid argument.

# mdadm /dev/md4 --re-add /dev/sda7
mdadm: --re-add for /dev/sda7 to /dev/md4 is not possible

# mdadm /dev/md4 --add /dev/sda7
mdadm: /dev/sda7 reports being an active member for /dev/md4, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sda7 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sda7" first.

# mdadm --examine /dev/sdb7
/dev/sdb7:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x0
     Array UUID : aba28f0e:c4cf2667:c6e4ae59:86859548
           Name : localhost.localdomain:3
  Creation Time : Tue Oct 26 18:12:49 2010
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 590895085 (281.76 GiB 302.54 GB)
     Array Size : 590894820 (281.76 GiB 302.54 GB)
  Used Dev Size : 590894820 (281.76 GiB 302.54 GB)
    Data Offset : 2048 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 27f49748:2e609ce9:e9eb7f9e:b5c4bc44

    Update Time : Sat Mar 31 08:54:35 2012
       Checksum : cac8a2b7 - correct
         Events : 698


   Device Role : Active device 1
   Array State : .A ('A' == active, '.' == missing)


# mdadm --examine /dev/sda7
/dev/sda7:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x0
     Array UUID : aba28f0e:c4cf2667:c6e4ae59:86859548
           Name : localhost.localdomain:3
  Creation Time : Tue Oct 26 18:12:49 2010
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 590895085 (281.76 GiB 302.54 GB)
     Array Size : 590894820 (281.76 GiB 302.54 GB)
  Used Dev Size : 590894820 (281.76 GiB 302.54 GB)
    Data Offset : 2048 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 700ef5e0:032a67af:e629ec27:3588cc7b

    Update Time : Fri Mar 30 17:13:50 2012
       Checksum : e96aaa00 - correct
         Events : 228


   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)

# cat /proc/mdstat 
Personalities : [raid1] 
md4 : active raid1 sdb7[1]
      295447410 blocks super 1.1 [2/1] [_U]
      
md2 : active raid1 sda5[0] sdb5[1]
      16553736 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md3 : active raid1 sdb6[1]
      157285308 blocks super 1.1 [2/1] [_U]
      
md0 : active raid1 sdb2[1] sda2[0]
      409588 blocks super 1.0 [2/2] [UU]
      
md1 : active raid1 sdb4[1] sda4[0]
      16587704 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk


In fact, I have several (6+) F16 i686 machines with md RAID 1 devices (nearly always configured as md0=/boot, md1=/, md2=/home, plus one to three other RAID 1 md devices mounted somewhere under /mnt).  On all of them, situations occur where some arrays are degraded after a reboot.  It seems that md0 and md1 (/boot and /) are always OK and the others are degraded.  And I think (not sure) the command
"mdadm /dev/mdX --re-add /dev/sdYN" usually works on arrays with 1.2 metadata, and on the others fails with the message
"mdadm: --re-add for /dev/sda7 to /dev/md4 is not possible"
And "mdadm /dev/mdX --add /dev/sdYN" ends with the message:
mdadm: /dev/sda7 reports being an active member for /dev/md4, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sda7 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sda7" first.

Zeroing the superblock works, but it is rather drastic, and I never had to use it on Fedora 14 and older.  So I have one problem in addition to this one: why are RAID 1 disks in a degraded state after a Fedora reboot? :(

Comment 4 Jes Sorensen 2012-03-31 10:30:20 UTC
Frantisek,

The degraded arrays you show don't have bitmaps, so the change in 3.2.3-7 will not
affect you at all; that is expected.

Second, the --re-add behaviour is a change that was introduced because the old
behaviour was dangerous. In a plain raid1 there is no way for mdadm to
know which of the two disks is authoritative, which means you could
end up with the drive holding old data being started first and the one
holding fresh data then being overwritten by a re-add. That is why mdadm
explicitly asks you to zero the superblock first.
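
For completeness, the sequence mdadm is asking for there looks roughly like this, using the devices from comment #3 (be sure first that the running array really is the copy you want to keep, because the zeroed member is then rebuilt from it and its old contents are lost):

mdadm --zero-superblock /dev/sda7    # discard the stale metadata on the out-of-date member
mdadm /dev/md4 --add /dev/sda7       # add it back as a fresh device; a full resync from /dev/sdb7 follows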

I hope that clarifies the issue.

Jes

Comment 5 Frantisek Hanzlik 2012-03-31 11:38:53 UTC
Yeah, I read about this now; sorry for the noise.  I was thinking that mdadm decides which data are newer on some timestamp basis, such as the "Update Time" field - IMO when the system starts with only one of the two disks, the active disk must have the newer data, am I right?  Then I must zero the superblock on the inactive one and add it back to the array.  There is something impractical about that, though: when the system isn't able to decide which disk contains the newer data, how can a human operator decide?
Umhh, anyway, I still need to find out why some RAID 1 disks on my F16 machines are degraded after a regular reboot.

Comment 6 Doug Ledford 2012-03-31 17:52:20 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > Package mdadm-3.2.3-7.fc17:
> > * should fix your issue,
> > * was pushed to the Fedora 17 testing repository,
> > * should be available at your local mirror within two days.
> > Update it with:
> > # su -c 'yum update --enablerepo=updates-testing mdadm-3.2.3-7.fc17'
> > as soon as you are able to.
> > Please go to the following url:
> > https://admin.fedoraproject.org/updates/FEDORA-2012-4846/mdadm-3.2.3-7.fc17
> > then log in and leave karma (feedback).

I'm replying to Frank's comment, but this is as much for Jes' benefit as anything else.

> I was just testing mdadm-3.2.3-7.fc16.i686, still not work for me:
> 
> # mdadm --incremental --run /dev/sda7
> mdadm: failed to add /dev/sda7 to /dev/md4: Invalid argument.

This makes sense.  Incremental mode is normally an automated mode: /dev/sda7 can't be brought into the array because it's out of date, so it gets kicked out.  However, please note that neither add nor re-add is an automated mode; they are manual modes.

> # mdadm /dev/md4 --re-add /dev/sda7
> mdadm: --re-add for /dev/sda7 to /dev/md4 is not possible

This makes sense.  We can't re-add the device to the array because it doesn't have a bitmap, and the whole deal behind a re-add is that it uses the bitmap to know which portions of the disk are out of date and resyncs only those portions from the current disk to the out-of-date disk.

However, I would note that if we *did* have a bitmap, mdadm would happily re-add the device to the running array and we would be *explicitly* wiping out part of /dev/sda7 with the contents of the running array.  This is *no* different from adding /dev/sda7 *except* that we don't wipe out all of /dev/sda7; we only wipe out the portion the bitmap says is out of date.

So, to address Jes' comments about not knowing which of the two disks is authoritative: having a bitmap does not solve that problem in any way.  We still don't know which disk is authoritative during a re-add; we simply copy from the active array to the re-added disk (we have to; if we did anything else, the upper-layer block device would suddenly see inconsistent data as the new data copied over the previously existing data).

The trick to both re-add and add then is to always add the stale device to a running array built from the current device.
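
A minimal sketch of that ordering, with the device names from comment #3 (this assumes the array is not already running from the stale member):

mdadm --assemble --run /dev/md4 /dev/sdb7    # start the array degraded, from the current member only
mdadm /dev/md4 --re-add /dev/sda7            # then bring the stale member back in (this only does a partial resync if a bitmap exists, as discussed above)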

> # mdadm /dev/md4 --add /dev/sda7
> mdadm: /dev/sda7 reports being an active member for /dev/md4, but a --re-add
> fails.
> mdadm: not performing --add as that would convert /dev/sda7 in to a spare.
> mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sda7" first.

This, on the other hand, is the decision that I find to be wholly unjustified.

First, the user didn't ask for a re-add, they asked for an add.  So telling me that a re-add won't work is neither helpful nor what I want.  If I wanted a re-add, I would have *used* re-add.

Second, the device doesn't have a bitmap, so automatically upgrading the procedure from an add to a re-add didn't make sense; it *never* had a chance of working.  At a minimum, if the user asks for an add, you should only upgrade it to a re-add if the device you are adding to has a bitmap.  Otherwise, it should remain an add.  And even if you upgrade to re-add and it fails, it's not clear that you shouldn't then fall back to add and try again, since add is what the user asked for in the first place.

Third, this nanny-state user protectionism about "we don't know which device is current" is all total hooey.  The add process *defines* which device is treated as current: the active array is current and the added device is *always* the one we overwrite.  This is no different from re-add.  In fact, if you were to re-add a device with a bitmap where the bitmap was all set to 1, then the net result is *exactly* the same as an add, but we don't prompt users to zero the superblock first in that case; we just go ahead and do the re-add and completely wipe out the re-added disk.

One of the first principles of any nanny-state protection scheme is that it at least needs to do what it is supposed to do.  In this case, it is intended to protect the user against accidentally wiping out the more current device with the less current device's contents (and that's a valid issue).  But this change doesn't do that.

In the case of re-add it doesn't actually do *anything* towards this goal, it leaves things just like they used to be with add.  In the case of add, it pushes the problem off on the user by requiring an extra step that might result in the user inspecting things and catching their error (if there is even one to catch).  But not all users know that the event counter is what will tell them which device is more current.  Nor do they know that the event counters of a raid1 device can diverge such that they are both incremented independently and it might be impossible to tell even by the event counters which raid1 device is most current.  And the error message doesn't help them figure this out, it just instructs them to zero their superblock.  This is no safer than just doing the add in the first place.

And it takes a previously working command (add) and renders it almost useless, because there is no option for "don't automatically upgrade my add to re-add, do the add that I requested damn you" (something about which I have already received a customer complaint, and it is a legitimate complaint: the customer manages remote systems and doesn't want their local users to have to zero *any* superblocks; they consider that a far riskier situation than just adding a drive to an array).

Now, what we should be doing, and what is sufficient for 90% of all cases, is checking the event counter of the array versus the device being added.  If the device being added is behind the current count, then the add or re-add should succeed.  If its event counter is ahead, then we should refuse to do the add, citing that the other device is "fresher" than our current array and noting that, because our current array is up and running, the other device and our current array are diverging at this very moment.  We should suggest a shutdown, followed by booting into rescue mode, recovering whatever you can from each version of the array (possibly by backing each member of the raid1 array up separately), merging the two arrays back together, and then restarting the machine and using the backups to merge the divergent data manually.
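
A rough sketch of that rescue-mode inspection; the md node and mount point here are purely illustrative, and each member is handled one at a time so the two diverged halves never run together:

mdadm --assemble --run --readonly /dev/md100 /dev/sda7   # bring up one half, degraded and read-only
mount -o ro /dev/md100 /mnt/rescue                       # back up whatever you need from this copy
umount /mnt/rescue
mdadm --stop /dev/md100                                  # then repeat the same steps with the other member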

That is about the best we can do given the current superblock.  It does not, however, address the case where the current array is ahead of the device being added, but the device being added has in fact been mounted since it was kicked from the array and has diverged.  In that case, we would silently lose that data divergence.  If Neil *really* wants to solve this problem properly, there is only one way to do so (and fortunately only raid1 and maybe certain specific layouts of raid10 need this; the stripe-based raid arrays need not worry about it).

Make the event counter exist both in the superblock and in a newly created per-device table that we would track in each superblock; that table would record the device uuid of every device in the array, the device's state, and the device's event counter at the last sighting.  Then, when re-adding or adding, if the superblock of the device being added has a recognized device uuid, and the event counter in that device's superblock matches the last-sighting event counter in the other device's table, we know that the device being added/re-added has not been mounted and used since it was kicked from the array, that no data changes have been made since it was last part of this array, and that it may be safely added back into the array (in either re-add or add mode) without loss of data; it will essentially be fast-forwarded to our current state.  If the device has a differing event counter, then we know that the two arrays have been mounted separately and that their contents have diverged (really only a problem for raid1 arrays; nothing with stripes or parity has this problem, which limits the number of table entries we need and means we might be able to squeeze this into the main superblock).  We can print this information out to the user and refuse to do the add until the user has determined what data might be lost and then zeroed the superblock on the device they have already recovered their data from.

I'll be bringing this up with Neil (and you can do so in person, Jes, if you read this before you see him).  But I think this change needs to be backed out, and a simple test of the event counter is all we need unless and until Neil implements the superblock change I outlined above.  I'll look into making this happen in our mdadm package, and will clone this to another bug to track the issue there.

> # mdadm --examine /dev/sdb7
> /dev/sdb7:
>     Update Time : Sat Mar 31 08:54:35 2012
>        Checksum : cac8a2b7 - correct
>          Events : 698
> 
> 
>    Device Role : Active device 1
>    Array State : .A ('A' == active, '.' == missing)
 
> # mdadm --examine /dev/sda7
> /dev/sda7:
>     Update Time : Fri Mar 30 17:13:50 2012
>        Checksum : e96aaa00 - correct
>          Events : 228
> 
> 
>    Device Role : Active device 0
>    Array State : AA ('A' == active, '.' == missing)

Frank: the Events counter above is how mdadm knows whether two array members are in sync.  The date is not as important: it is possible to start a raid1 device with only one member, check it in read-only mode, and then stop the device.  That would update the time on the superblock, but the event counter is only updated when something is written to the device that gives it new data.
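
For Frank's case, a quick way to compare those counters side by side (device names from comment #3):

mdadm --examine /dev/sda7 /dev/sdb7 | grep -E 'Events|Update Time'    # the member with the higher Events count holds the state the array will trust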

Now, as for why your devices are coming up degraded, that I can't answer, but it is certainly something that needs to be addressed (although not in this bug).

Comment 7 Fedora Update System 2012-05-02 10:10:22 UTC
mdadm-3.2.3-9.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.3-9.fc17

Comment 8 Fedora Update System 2012-05-10 14:49:49 UTC
mdadm-3.2.4-2.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.4-2.fc17

Comment 9 Fedora Update System 2012-05-15 16:37:08 UTC
mdadm-3.2.4-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.4-3.fc17

Comment 10 Fedora Update System 2012-05-21 15:09:34 UTC
mdadm-3.2.5-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.5-1.fc17

Comment 11 Fedora Update System 2012-06-26 07:17:49 UTC
mdadm-3.2.5-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/mdadm-3.2.5-3.fc17

Comment 12 Fedora Update System 2012-07-11 23:55:05 UTC
mdadm-3.2.5-3.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

