Bug 600900 - Hot-pluggable RAID components sometimes not assembled correctly
Summary: Hot-pluggable RAID components sometimes not assembled correctly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 13
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-06-06 15:30 UTC by Piergiorgio Sartor
Modified: 2010-12-07 20:14 UTC
CC List: 3 users

Fixed In Version: mdadm-3.1.3-0.git20100804.2.fc13
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 616596
Environment:
Last Closed: 2010-12-07 20:12:56 UTC
Type: ---
Embargoed:


Attachments
dmesg and mdstat from six boots of F13 (46.34 KB, application/x-gzip)
2010-06-17 19:22 UTC, Ron Yorston
Another five boot attempts. (46.22 KB, application/x-gzip)
2010-07-22 20:09 UTC, Ron Yorston

Description Piergiorgio Sartor 2010-06-06 15:30:38 UTC
Description of problem:
I have a bunch of RAID-6 volumes, composed of different partitions of 10 USB disks.

These disks are connected together using some USB hubs, so there is a single USB cable which can be connected to the PC.

Once the cable is connected the 10 disks are initialized and incremental assembly of the different RAIDs starts.

There seems to be some sort of race condition, so that sometimes partitions belonging to the same RAID volume are assembled as different RAID volumes, leading to two incomplete volumes and the related mess.

So, for example, the partitions /dev/sd[d-m]1 form one RAID-6 volume.

The incremental assembly sometimes creates two RAIDs, /dev/md127 and /dev/md126, composed, for example, one of /dev/sdd1 and the other of /dev/sd[e-m]1, which, of course, does not work.

Version-Release number of selected component (if applicable):
mdadm-3.1.2-10.fc13.x86_64

How reproducible:
Sometimes; usually after the first hot-plug it seems to work better.

Steps to Reproduce:
1. Hot-plug the HDDs.
2. Wait for the mess to finish.
3. Check /proc/mdstat.

Actual results:
Sometimes more RAID volumes than expected are assembled, and they do not work.

Expected results:
The incremental assembly should work properly

Additional info:
The different volumes are PVs of an LVM VG, which means LVM also runs on hot-plug.

Hope this helps,

bye

pg

Comment 1 Ron Yorston 2010-06-17 19:21:23 UTC
I'm seeing something similar, but with ATA devices, not USB. I have ten partitions across three ATA drives that are combined into five RAID1 volumes. Here's what they look like in F12:

Personalities : [raid1] 
md123 : active raid1 sda5[0] sdb5[1]
      31463168 blocks [2/2] [UU]
      
md124 : active raid1 sda6[0] sdb6[1]
      30033408 blocks [2/2] [UU]
      
md125 : active raid1 sda7[0] sdb7[1]
      30796480 blocks [2/2] [UU]
      
md126 : active raid1 sda9[0] sdc10[1]
      41953600 blocks [2/2] [UU]
      
md127 : active raid1 sdc12[0] sda8[1]
      30788416 blocks [2/2] [UU]
      
unused devices: <none>

I installed F13 onto the partitions that used to hold F11 but, as is my custom, didn't tell anaconda what to do with the RAID volumes. Later I added them to fstab in F13. Initially there were problems that I took to be due to the line 'AUTO +imsm +1.x -all' that anaconda had put in mdadm.conf. All my RAID partitions have 0.9 metadata. I commented out the AUTO line and put in ARRAY lines specifying the UUIDs of the RAID devices.
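
For illustration, the entries I put in look roughly like this (the UUIDs below are placeholders, not my real values):

#AUTO +imsm +1.x -all
ARRAY /dev/md123 UUID=00000000:00000000:00000000:00000000
ARRAY /dev/md124 UUID=11111111:11111111:11111111:11111111
ARRAY /dev/md125 UUID=22222222:22222222:22222222:22222222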

Now I find that more often than not F13 fails to correctly assemble the arrays. F12 always succeeds. In six boots of F13 the arrays were only properly built once. The failures are all different.  Here's one example:

Personalities : [raid1] 
md127 : active (auto-read-only) raid1 sda8[1]
      30788416 blocks [2/1] [_U]
      
md125 : active raid1 sda7[0] sdb7[1]
      30796480 blocks [2/2] [UU]
      
md123 : inactive sda5[0](S)
      31463168 blocks
       
md126 : active (auto-read-only) raid1 sdc10[1]
      41953600 blocks [2/1] [_U]
      
md124 : active raid1 sdb6[1] sda6[0]
      30033408 blocks [2/2] [UU]
      
unused devices: <none>

I'll attach some more information in case anyone can see a pattern in this. I certainly can't.

Comment 2 Ron Yorston 2010-06-17 19:22:42 UTC
Created attachment 424916 [details]
dmesg and mdstat from six boots of F13

Comment 3 Ron Yorston 2010-06-20 16:22:23 UTC
I seem to have got this working more reliably by removing rd_NO_MD from the kernel line in grub.conf.  At least, I've been able to boot Fedora 13 five times now and the RAID arrays have been assembled correctly every time.

Without rd_NO_MD the arrays are assembled earlier in the boot process, though I don't know why that would make any difference.
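
For reference, the change amounts to dropping rd_NO_MD from the kernel line in /boot/grub/grub.conf; the kernel version and root device below are placeholders:

# before:
kernel /vmlinuz-<version> ro root=<root-device> rd_NO_LUKS rd_NO_MD rd_NO_DM rhgb quiet
# after, with rd_NO_MD removed:
kernel /vmlinuz-<version> ro root=<root-device> rd_NO_LUKS rd_NO_DM rhgb quiet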

Comment 4 Doug Ledford 2010-07-20 21:29:19 UTC
@Ron: The difference you are seeing is that earlier in the boot process udev is likely processing disk add events sequentially instead of in parallel.  Evidently there is a race condition when devices are added in parallel.

OK, after some code inspection, I've found the race.  Specifically, when two devices belonging to the same array are assembled in parallel and the array is not yet listed in the md-device-map file, each parallel instance opens a lock file and then attempts an exclusive lock on it.  One process gets the lock, the other waits.  The process that got the lock adds the array and calls map_update to write out the new map entry.  Finally it unlocks the file and then unlinks the lock file.  The problem is that an instance already waiting on the lock doesn't care that the file was unlinked and acquires a lock on the now-unlinked file, while a completely different instance of mdadm creates a new lock file and locks that one instead.  The result is two instances holding exclusive locks on two different lock files, so both are allowed to run in parallel, which causes this problem.

My solution is to change the locking mechanism to pass the flags O_CREAT and O_EXCL to the open call, which makes the open fail unless the calling process is the one that actually created the file.  For as long as the open fails because the file already exists, we keep retrying.  Once we manage to create the file, we already hold the lock and are free to run.  The fix for this will be in the next mdadm update (mdadm-3.1.3-0.git07202010.1 or later).
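
As a sketch of the pattern only (not the actual mdadm source, and the lock path is a placeholder), the same create-exclusive idea can be written in shell, where a noclobber redirection fails if the file already exists, much like open() with O_CREAT|O_EXCL:

LOCK=/var/run/mdadm-map.lock   # placeholder path, not mdadm's real lock file
# Creating the file *is* taking the lock; if it already exists, keep retrying.
until ( set -o noclobber; echo $$ > "$LOCK" ) 2>/dev/null; do
    sleep 0.1
done
# ... critical section: update the map file and assemble the array ...
rm -f "$LOCK"                  # releasing the lock is simply removing the file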

Comment 5 Fedora Update System 2010-07-22 15:36:31 UTC
mdadm-3.1.3-0.git20100722.1.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.1.fc13

Comment 6 Ron Yorston 2010-07-22 20:08:43 UTC
I reinstated rd_NO_MD in grub.conf and installed mdadm-3.1.3-0.git20100722.2.fc13.i686.rpm.  It doesn't seem to help, though.  In five boots the arrays weren't properly built once.  And mdadm segfaults, which probably isn't good.

Comment 7 Ron Yorston 2010-07-22 20:09:43 UTC
Created attachment 433800 [details]
Another five boot attempts.

Comment 8 Piergiorgio Sartor 2010-07-22 20:45:30 UTC
Hi all,

reconsidering this bug, I must say that maybe it does not belong to "mdadm", but rather to "udev".

Let's consider the following script:

for i in /dev/sd*
do
  mdadm -I $i &
done

I guess that if I complained this does not work, the answer would have been something like "remove the '&'"...

Nevertheless, "udev" does the same and nobody complains (as I understand it).
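
The way "udev" does it is, roughly, a rule that runs mdadm -I for each block-device add event. This is just a paraphrase of the idea, not the exact Fedora rule text:

SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", RUN+="/sbin/mdadm -I $env{DEVNAME}"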

Is there any way to tell "udev" to serialize the operations under certain conditions? Or for certain rules?

IMHO that would be the better solution, no?

Thanks,

bye,

pg

Comment 9 Doug Ledford 2010-07-22 21:51:59 UTC
@Ron:  In my testing, I was able to reliably make it fail every time using the below reproducer, and after the package I listed, it never failed once:

for i in 0 2 4 6 8 10 12 14; do
  dd if=/dev/zero bs=1024k count=100 of=/tmp/block$i
  dd if=/dev/zero bs=1024k count=100 of=/tmp/block$[ $i + 1 ]
  losetup /dev/loop$i /tmp/block$i
  losetup /dev/loop$[ $i + 1 ] /tmp/block$[ $i + 1 ]
  mdadm -C /dev/md/test$i -l1 -n2 --name=test$i /dev/loop$i /dev/loop$[ $i + 1 ]
done

mdadm -S /dev/md/test*

for i in /dev/loop{0..15}; do
  mdadm -I $i &
done

As far as your current issue goes, it is *definitely* something different (but it needs to be figured out nonetheless).  You will note that in *none* of your original dmesg outputs did mdadm segfault; it simply didn't add all the disks (which is the problem I fixed).  In the last 5 dmesg outputs, all of the failures were the result of segfaults in mdadm.  The race condition I fixed is gone, but you are being affected by something else now.  I'll need a new bug to track the new problem, if you could open one please.

@Piergiorgio: no, the answer would be that it should work, and the latest mdadm fixes the problem you describe.

Comment 10 Fedora Update System 2010-07-23 02:38:59 UTC
mdadm-3.1.3-0.git20100722.2.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'.
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.2.fc13

Comment 11 Piergiorgio Sartor 2010-07-31 17:00:46 UTC
Hi,

I tried the latest mdadm from updates-testing and I'm sorry to report that it does not work.

Unlike before, I did not get "multiple" md devices, but I got around 15 "mdadm -I ..." processes hanging, using the maximum CPU available.

After killing these mdadm processes and removing the incomplete md devices, I was able to assemble the arrays, but not to stop them smoothly.

Specifically, "mdadm --stop /dev/md/12X" hangs, but when it is interrupted (Ctrl-C or kill) the md device turns out to be removed.

There are also some udevd errors, like:
...
... udevd[669]: worker [20022] failed while handling '/devices/pci0000:00/0000:00:0b.1/usb1/1-7/1-7.3/1-7.3.4/1-7.3.4:1.0/host12/target12:0:0/12:0:0:0/block/sdi/sdi3
...
... udevd-work[24331]: '/sbin/mdadm -I /dev/sdi4' unexpected exit with status 0x000f
...

I guess the status of this bug should be changed to something else, now.

Hope this helps,

bye,

pg

Comment 12 Michal Schmidt 2010-08-05 11:22:33 UTC
(In reply to comment #11)
> Unlike before, I did not get "multiple" md devices, but I got around 15
> "mdadm -I ..." processes hanging, using the maximum CPU available.

Probably related to bug 621524.

Comment 13 Fedora Update System 2010-08-05 14:25:30 UTC
mdadm-3.1.3-0.git20100804.2.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13

Comment 14 Fedora Update System 2010-08-05 14:26:05 UTC
mdadm-3.1.3-0.git20100804.2.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12

Comment 15 Fedora Update System 2010-08-05 14:26:44 UTC
mdadm-3.1.3-0.git20100804.2.fc14 has been submitted as an update for Fedora 14.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14

Comment 16 Piergiorgio Sartor 2010-08-05 18:43:48 UTC
(In reply to comment #13)
> mdadm-3.1.3-0.git20100804.2.fc13 has been submitted as an update for Fedora 13.
> http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13    

Didn't I mention that this does not work?

Actually, this version is a regression compared with the current one, at least for my specific setup.

Would it be possible to have this update pulled back?

Thanks,

bye,

pg

Comment 17 Doug Ledford 2010-08-05 19:02:51 UTC
No, you mentioned that 20100722.2 did not work for you.  This is for version 20100804.1, which was just built yesterday, so unless you downloaded it directly from koji, I'm positive you haven't tested this version yet.

Comment 18 Piergiorgio Sartor 2010-08-05 19:18:37 UTC
Oh, sorry then, I misread the git date.

Then I'll try this one (maybe this weekend).

Thanks!

bye,

pg

Comment 19 Fedora Update System 2010-08-05 23:29:42 UTC
mdadm-3.1.3-0.git20100804.2.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'.
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12

Comment 20 Fedora Update System 2010-08-05 23:53:14 UTC
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'.
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13

Comment 21 Piergiorgio Sartor 2010-08-07 12:35:32 UTC
Hi again,

I tried mdadm-3.1.3-0.git20100804.2.fc13 and it seems to work.

The arrays are properly assembled on hot plug, without partial duplications.

Furthermore, the "spinning" issue (15 mdadm processes hanging) also seems to be solved.

There is still a strange catch.

Once the arrays are auto-assembled, some operations fail.

For example, "mdadm --grow /dev/md121 --bitmap=none" returns "mdadm: failed to remove internal bitmap."

The logs ("dmesg" or "/var/log/messages") report "md: couldn't update array info. -16".

If I stop the arrays and restart them manually, then the above operations work.

Any suggestions?

Thanks,

bye,

pg

Comment 22 Fedora Update System 2010-08-10 01:30:19 UTC
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'.
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14

Comment 23 Piergiorgio Sartor 2010-08-14 17:29:26 UTC
Hi,

regarding comment #21, it seems the arrays are assembled "auto-read-only".

Is this expected?

Or is there something that needs to be tuned?
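
One thing I could try, if the arrays really are stuck in auto-read-only mode, is mdadm's --readwrite operation, which switches an array to read-write and might let the operation from comment 21 go through:

mdadm --readwrite /dev/md121           # switch the auto-read-only array to read-write
mdadm --grow /dev/md121 --bitmap=none  # then retry the operation that failed with -16 (EBUSY)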

Thanks,

bye,

pg

Comment 24 Fedora Update System 2010-12-07 20:12:14 UTC
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 25 Fedora Update System 2010-12-07 20:14:10 UTC
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.

