Red Hat Bugzilla – Bug 523334
mdadm --stop sometimes fails with EBUSY on non busy arrays
Last modified: 2010-02-20 10:53:38 EST
When running "mdadm --stop /dev/md#" on sets which are a member of a container,
sometimes the command will fail with the EBUSY error message, even though
nothing is using the set. A second run of the same command then will succeed
(and cat /proc/mdstat often shows it did actually stop).
This seems to only happen with sets which are a member of a container, and then
only while they are being rebuilt.
I guess this is another one for forwarding upstream.
I've done some further investigation of this, and there seem to be 2 different
scenarios where this can happen:
1) Sometimes mdadm --stop on a container member will fail with the EBUSY
error message, but if you look at /proc/mdstat the array has actually been
stopped. I've added a workaround to anaconda for this for now.
Note this only happens every now and then; I cannot reproduce it at will.
2) After incrementally assembling an imsm container containing 2 RAID 10 sets,
   doing some stuff, then stopping all members + the container, then
   incrementally assembling the imsm container with the 2 RAID 10 sets again,
   the second set in the container will not stop at all, failing with an EBUSY
   every time; killing the mdmon process for this set fixes this.
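The reproduction steps for scenario 2) can be sketched as a shell script. The member device names and md numbers below are assumptions (the bug report does not name them), and a dry-run mode is added so the sequence can be inspected without root or real imsm hardware:

```shell
#!/bin/sh
# Sketch of the scenario 2) reproduction steps. Device and md names are
# assumptions; DRY_RUN defaults to 1, so this only prints the commands.
# Set DRY_RUN=0 (as root, on a real imsm setup) to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# First incremental assembly of the imsm container and its two RAID 10 sets
for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    run mdadm --incremental "$dev"
done

# ... "doing some stuff" with the arrays goes here ...

# Stop all member sets, then the container itself
run mdadm --stop /dev/md126    # first RAID 10 set
run mdadm --stop /dev/md125    # second RAID 10 set
run mdadm --stop /dev/md127    # the imsm container

# Incrementally assemble everything again; per the report, stopping the
# second set now fails with EBUSY every time until its mdmon is killed.
for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    run mdadm --incremental "$dev"
done
run mdadm --stop /dev/md125
```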
I wonder if this is a hald interaction? I have seen it hold the device open preventing it from being stopped. Can you attach the output of 'lsof | grep md' when this state occurs.
(In reply to comment #2)
> I wonder if this is a hald interaction? I have seen it hold the device open
> preventing it from being stopped. Can you attach the output of 'lsof | grep
> md' when this state occurs.
The only process showing anything related is mdmon, and that only
has sysfs files open from the md device in question.
Also note that I can stop the set after killing mdmon!
While re-testing this I did see one new behaviour: I now have a PV of a
VG on this set, and that VG fails to activate because of a device mapper
device lookup error. Note I'm talking about case 2) from comment #1 here.
I think this might be related to the RAID 10 lockdep issue, so let me try this with a fixed kernel.
Both issue 2) from comment #1 and the "device mapper device lookup error" from
comment #3 persist when using a kernel patched with the raid10 lockdep fix.
Created attachment 361610 [details]
lsof output when set cannot be stopped, as requested
I've reproduced problem 2) from comment #1 again; here is the requested lsof output. Also: array_state says read-auto.
Note I can reproduce this at will, so let me know if you need any more
information. This is with a kernel with the lockdep issue fixed, and mdmon with the known
There are two places where the kernel returns EBUSY to stop requests, and both of them are due to some other thread having the array open. The lsof output does not show anyone holding the array open, so I wonder if this is a race with a udev rule that briefly opens the device to look for a filesystem label?
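If the suspected udev race is the cause, draining the udev event queue before each stop attempt should make the failure go away. A quick check could look like this sketch (the helper name and retry count are arbitrary assumptions, not anything from the report):

```shell
#!/bin/sh
# Sketch: rule out the suspected udev race by waiting for the udev event
# queue to drain before each attempt, retrying a few times on failure.
# Hypothetical helper; usage: retry_settled mdadm --stop /dev/md126
retry_settled() {
    for attempt in 1 2 3; do
        # Wait until queued udev events (e.g. a rule probing for a
        # filesystem label) have been handled; ignore errors so this
        # sketch also works where udevadm is unavailable.
        udevadm settle --timeout=5 2>/dev/null || true
        if "$@"; then
            return 0
        fi
        sleep 1
    done
    return 1
}
```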
(In reply to comment #6)
> There are two places where the kernel returns EBUSY to stop requests, and both
> of them are due to some other thread having the array open. The lsof output
> does not show anyone holding the array open, so I wonder if this is a race
> with a udev rule that briefly opens the device to look for a filesystem label?
Note that we have 2 scenarios:
1) It sometimes returns an error, but looking at /proc/mdstat afterwards, the
array did actually stop
2) It consistently fails once in this state; in this case killing the mdmon
   process related to this set allows one to stop the set, and mdmon has
   a load of sysfs files open related to the set. So it seems that it is
   somehow failing to stop mdmon (which I believe it should do before stopping
   the array).
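The two failure modes above suggest a defensive "stop and verify" helper along these lines. This is only a sketch of the idea; the helper names are hypothetical and this is not what anaconda's actual workaround does:

```shell
#!/bin/sh
# Sketch of a defensive stop helper for the two failure modes above.
# Names (md_listed, stop_md_set) are hypothetical.

# True if the named array (e.g. "md126") still appears in an mdstat file.
md_listed() {
    grep -q "^$1 :" "$2"
}

stop_md_set() {
    name=$1                          # e.g. md126
    if mdadm --stop "/dev/$name"; then
        return 0
    fi
    # Failure mode 1): mdadm reported EBUSY but the array actually stopped,
    # so treat a vanished array as success.
    if ! md_listed "$name" /proc/mdstat; then
        echo "$name: stop reported EBUSY but array is gone, ignoring" >&2
        return 0
    fi
    # Failure mode 2): the array really is stuck; per the report, killing
    # the set's mdmon process is what unsticks it. Leave that to the caller.
    return 1
}
```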
*** Bug 528017 has been marked as a duplicate of this bug. ***
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.
More information and reason for this action is here:
Is this still an issue with the latest mdadm? There are numerous mdmon fixes in the latest mdadm and this might not be an issue any longer.
In the meantime we've worked around this issue in anaconda by simply never stopping mdraid containers (and the sets therein) once set up, as there is no reason to do so.
So I cannot reproduce this anymore (unless I were to create a special anaconda just to see if this still reproduces).
I think it is best to close this as insufficient data at this point.