616596 – Hot-plugable RAID components sometimes not assembled correctly

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mdadm
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Doug Ledford
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-07-20 21:29 UTC by Doug Ledford
Modified:	2010-07-20 23:00 UTC
CC List:	3 users
Fixed In Version:	mdadm-3.1.3-0.git07202010.2
Clone Of:	600900
Clones:	616597
Environment:
Last Closed:	2010-07-20 23:00:55 UTC
Type:	---
Embargoed:
Dependent Products:

Description Doug Ledford 2010-07-20 21:29:58 UTC

+++ This bug was initially created as a clone of Bug #600900 +++

Description of problem:
I have a bunch of RAID-6 volumes, composed of different partitions on 10 USB disks.

These disks are connected together through USB hubs, so a single USB cable connects all of them to the PC.

Once the cable is connected, the 10 disks are initialized and incremental assembly of the different RAIDs starts.

There seems to be some sort of race condition: sometimes partitions belonging to the same RAID volume are assembled as different RAID volumes, leading to two incomplete volumes and the related mess.

So, for example, the partitions /dev/sd[d-m]1 form one RAID-6 volume.

The incremental assembly sometimes creates two RAIDs, /dev/md127 and /dev/md126, with, for example, /dev/sdd1 in one and /dev/sd[e-m]1 in the other, which, of course, does not work.

Version-Release number of selected component (if applicable):
mdadm-3.1.2-10.fc13.x86_64

How reproducible:
Sometimes. After the first hot-plug it usually seems to work better.

Steps to Reproduce:
1. Hot-plug the HDDs.
2. Wait for the mess to finish.
3. Check /proc/mdstat

Actual results:
Sometimes more RAID volumes are assembled than expected, and they do not work.

Expected results:
The incremental assembly should work properly

Additional info:
The different volumes are PVs of an LVM VG, which means LVM also runs on hot-plug.

Hope this helps,

bye

pg

--- Additional comment from rmy on 2010-06-17 15:21:23 EDT ---

I'm seeing something similar, but with ATA devices not USB. I have ten partitions across three ATA drives that are combined into five RAID1 volumes. Here's what they look like in F12:

Personalities : [raid1] 
md123 : active raid1 sda5[0] sdb5[1]
      31463168 blocks [2/2] [UU]
      
md124 : active raid1 sda6[0] sdb6[1]
      30033408 blocks [2/2] [UU]
      
md125 : active raid1 sda7[0] sdb7[1]
      30796480 blocks [2/2] [UU]
      
md126 : active raid1 sda9[0] sdc10[1]
      41953600 blocks [2/2] [UU]
      
md127 : active raid1 sdc12[0] sda8[1]
      30788416 blocks [2/2] [UU]
      
unused devices: <none>

I installed F13 onto the partitions that used to hold F11 but, as is my custom, didn't tell anaconda what to do with the RAID volumes. Later I added them to fstab in F13. Initially there were problems that I took to be due to the line 'AUTO +imsm +1.x -all' that anaconda had put in mdadm.conf. All my RAID partitions have 0.9 metadata. I commented out the AUTO line and put in ARRAY lines specifying the UUIDs of the RAID devices.

Now I find that more often than not F13 fails to correctly assemble the arrays. F12 always succeeds. In six boots of F13 the arrays were only properly built once. The failures are all different.  Here's one example:

Personalities : [raid1] 
md127 : active (auto-read-only) raid1 sda8[1]
      30788416 blocks [2/1] [_U]
      
md125 : active raid1 sda7[0] sdb7[1]
      30796480 blocks [2/2] [UU]
      
md123 : inactive sda5[0](S)
      31463168 blocks
       
md126 : active (auto-read-only) raid1 sdc10[1]
      41953600 blocks [2/1] [_U]
      
md124 : active raid1 sdb6[1] sda6[0]
      30033408 blocks [2/2] [UU]
      
unused devices: <none>

I'll attach some more information in case anyone can see a pattern in this. I certainly can't.

--- Additional comment from rmy on 2010-06-17 15:22:42 EDT ---

Created an attachment (id=424916)
dmesg and mdstat from six boots of F13

--- Additional comment from rmy on 2010-06-20 12:22:23 EDT ---

I seem to have got this working more reliably by removing rd_NO_MD from the kernel line in grub.conf.  At least, I've been able to boot Fedora 13 five times now and the RAID arrays have been assembled correctly every time.

Without rd_NO_MD the arrays are assembled earlier in the boot process, though I don't know why that would make any difference.

--- Additional comment from dledford on 2010-07-20 17:29:19 EDT ---

@Ron: The difference you are seeing is that earlier in the boot process udev is likely processing disk add events sequentially instead of in parallel.  Evidently there is a race condition when devices are added in parallel.

OK, after some code inspection, I've found the race.  If two devices belonging to the same array are assembled in parallel and the array is not yet listed in the md-device-map file, each mdadm instance opens the lock file and then attempts an exclusive lock on it.  One process gets the lock, the other waits.  The process that got the lock then adds the array and calls map_update to write out the new map entry.  Finally it unlocks the lock file and then unlinks it.  The problem is that an instance that was already waiting on the lock doesn't care that the file was unlinked: it gets its lock on the now-unlinked file.  Meanwhile a totally different instance of mdadm creates a new lock file and locks that one.  The result is two instances holding exclusive locks on two different lock files, so both are allowed to run in parallel, which causes this problem.
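The sequence described above can be reproduced deterministically in a small sketch. This is an illustration only, not mdadm's code: it uses Python's `fcntl.flock` as a stand-in for whatever exclusive-lock call mdadm makes, simulates the three instances with three file descriptors in one process, and the function name and lock path are hypothetical.

```python
import fcntl
import os
import tempfile


def demonstrate_stale_lock_race(lock_path):
    """Replay the lock/unlink sequence with three simulated instances.

    Returns (b_blocked, c_got_lock, different_files).
    """
    # Instance A opens the lock file and takes the exclusive lock.
    fd_a = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.flock(fd_a, fcntl.LOCK_EX)

    # Instance B opens the same file; a blocking lock would wait here.
    fd_b = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd_b, fcntl.LOCK_EX | fcntl.LOCK_NB)
        b_blocked = False
    except BlockingIOError:
        b_blocked = True

    # Instance A finishes: unlock first, then unlink the lock file.
    fcntl.flock(fd_a, fcntl.LOCK_UN)
    os.unlink(lock_path)
    os.close(fd_a)

    # B's wait ends: it now locks a file that no longer has a name.
    fcntl.flock(fd_b, fcntl.LOCK_EX | fcntl.LOCK_NB)

    # Instance C arrives, creates a fresh file at the same path, locks it.
    fd_c = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd_c, fcntl.LOCK_EX | fcntl.LOCK_NB)
        c_got_lock = True
    except BlockingIOError:
        c_got_lock = False

    # B and C hold "exclusive" locks, but on two different files.
    different_files = os.fstat(fd_b).st_ino != os.fstat(fd_c).st_ino

    os.close(fd_b)
    os.close(fd_c)
    os.unlink(lock_path)
    return b_blocked, c_got_lock, different_files


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "map.lock")
    print(demonstrate_stale_lock_race(path))
```

If all three values come back true, B was correctly held off while A had the lock, yet B and C both ended up with exclusive locks at the same time, each on a different file.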

My solution is to change the locking mechanism to pass the flags O_CREAT and O_EXCL to the open call, which makes the open fail unless this process is the one that created the file.  For as long as the open fails because the file already exists, we keep retrying.  Once we manage to create the file, we already hold the lock and are free to run.  The fix for this will be in the next mdadm update (mdadm-3.1.3-0.git07202010.1 or later).
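A minimal sketch of that create-exclusive locking idea, written in Python rather than taken from the mdadm patch. The function names, the retry delay, and the optional timeout are assumptions added for illustration; the comment above only says the open is retried until the file can be created.

```python
import os
import time


def acquire_lock(lock_path, retry_delay=0.01, timeout=None):
    """Spin until we are the process that creates lock_path.

    retry_delay and timeout are illustrative assumptions.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL: the open fails if the file already exists,
            # so only one process can ever succeed for a given file.
            return os.open(lock_path,
                           os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        except FileExistsError:
            if deadline is not None and time.monotonic() >= deadline:
                raise TimeoutError("lock still held: " + lock_path)
            time.sleep(retry_delay)


def release_lock(lock_path, fd):
    """Drop the lock by removing the file so the next waiter can create it."""
    os.close(fd)
    os.unlink(lock_path)
```

With this scheme the existence of the file is the lock, so a waiter can never end up holding a lock on a stale, unlinked file. One trade-off of this sketch is that a holder that dies without unlinking leaves the lock in place until something removes the file.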
