Red Hat Bugzilla – Bug 523334
mdadm --stop sometimes fails with EBUSY on non busy arrays
Last modified: 2010-02-20 10:53:38 EST
When running "mdadm --stop /dev/md#" on sets which are a member of a container,
sometimes the command will fail with the EBUSY error message, even though
nothing is using the set. A second run of the same command then will succeed
(and cat /proc/mdstat often shows it did actually stop).
This seems to only happen with sets which are a member of a container, and then
only while they are being rebuilt.
I guess this is another one for forwarding upstream.
I've done some further investigation of this, and there seem to be 2 different
scenarios where this can happen:
1) Sometimes mdadm --stop on a container member will fail with the EBUSY
error message, but if you look at /proc/mdstat the array has actually been
stopped. I've added a workaround to anaconda for this for now.
Note this only happens every now and then; I cannot reproduce it at will.
2) After incrementally assembling an imsm container containing 2 RAID 10 sets,
   doing some stuff, then stopping all members + the container, then
   incrementally assembling the imsm container with the 2 RAID 10 sets again,
   the second set in the container will not stop at all, failing with an EBUSY
   every time; killing the mdmon process for this set fixes this.
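The reproduction steps for scenario 2) can be sketched as a shell script. The member device names and md numbers below are assumptions (the bug report does not name them), and a dry-run mode is added so the sequence can be inspected without root or real imsm hardware:

```shell
#!/bin/sh
# Sketch of the scenario 2) reproduction steps. Device and md names are
# assumptions; DRY_RUN defaults to 1, so this only prints the commands.
# Set DRY_RUN=0 (as root, on a real imsm setup) to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# First incremental assembly of the imsm container and its two RAID 10 sets
for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    run mdadm --incremental "$dev"
done

# ... "doing some stuff" with the arrays goes here ...

# Stop all member sets, then the container itself
run mdadm --stop /dev/md126    # first RAID 10 set
run mdadm --stop /dev/md125    # second RAID 10 set
run mdadm --stop /dev/md127    # the imsm container

# Incrementally assemble everything again; per the report, stopping the
# second set now fails with EBUSY every time until its mdmon is killed.
for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    run mdadm --incremental "$dev"
done
run mdadm --stop /dev/md125
```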
I wonder if this is a hald interaction? I have seen it hold the device open preventing it from being stopped. Can you attach the output of 'lsof | grep md' when this state occurs.
(In reply to comment #2)
> I wonder if this is a hald interaction? I have seen it hold the device open
> preventing it from being stopped. Can you attach the output of 'lsof | grep
> md' when this state occurs.
The only process showing anything related is mdmon, and that only
has sysfs files open from the md device in question.
Also note that I can stop the set after killing mdmon!
While re-testing this I did see one new behaviour: I now have a PV of a
VG on this set, and that VG fails to activate because of a device mapper
device lookup error. Note I'm talking about case 2) from comment #1 here.
I think this might be related to the RAID 10 lockdep issue, so let me try this with a fixed kernel.
Both issue 2) from comment #1 and the "device mapper device lookup error" from
comment #3 persist when using a kernel patched with the raid10 lockdep fix.
Created attachment 361610 [details]
lsof output when set cannot be stopped, as requested
I've reproduced problem 2) from comment #1 again; here is the requested lsof output. Also: array_state says read-auto.
Note I can reproduce this at will, so let me know if you need any more
information. This is with a kernel with the lockdep issue fixed, and mdmon with the known
There are two places where the kernel returns EBUSY to stop requests, and both of them are due to some other thread having the array open. The lsof output does not show anyone holding the array open, so I wonder if this is a race with a udev rule that briefly opens the device to look for a filesystem label?
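If the suspected udev race is the cause, draining the udev event queue before each stop attempt should make the failure go away. A quick check could look like this sketch (the helper name and retry count are arbitrary assumptions, not anything from the report):

```shell
#!/bin/sh
# Sketch: rule out the suspected udev race by waiting for the udev event
# queue to drain before each attempt, retrying a few times on failure.
# Hypothetical helper; usage: retry_settled mdadm --stop /dev/md126
retry_settled() {
    for attempt in 1 2 3; do
        # Wait until queued udev events (e.g. a rule probing for a
        # filesystem label) have been handled; ignore errors so this
        # sketch also works where udevadm is unavailable.
        udevadm settle --timeout=5 2>/dev/null || true
        if "$@"; then
            return 0
        fi
        sleep 1
    done
    return 1
}
```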
(In reply to comment #6)
> There are two places where the kernel returns EBUSY to stop requests, and both
> of them are due to some other thread having the array open. The lsof output
> does not show anyone holding the array open, so I wonder if this is a race
> with a udev rule that briefly opens the device to look for a filesystem label?
Note that we have 2 scenarios:
1) It sometimes returns an error, but looking at /proc/mdstat afterwards, the
array did actually stop
2) It consistently fails once in this state; in this case killing the mdmon
   process related to this set allows one to stop the set, and mdmon has
   a load of sysfs files open related to the set. So it seems that it is
   somehow failing to stop mdmon (which I believe it should do before stopping
   the array).
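The two failure modes above suggest a defensive "stop and verify" helper along these lines. This is only a sketch of the idea; the helper names are hypothetical and this is not what anaconda's actual workaround does:

```shell
#!/bin/sh
# Sketch of a defensive stop helper for the two failure modes above.
# Names (md_listed, stop_md_set) are hypothetical.

# True if the named array (e.g. "md126") still appears in an mdstat file.
md_listed() {
    grep -q "^$1 :" "$2"
}

stop_md_set() {
    name=$1                          # e.g. md126
    if mdadm --stop "/dev/$name"; then
        return 0
    fi
    # Failure mode 1): mdadm reported EBUSY but the array actually stopped,
    # so treat a vanished array as success.
    if ! md_listed "$name" /proc/mdstat; then
        echo "$name: stop reported EBUSY but array is gone, ignoring" >&2
        return 0
    fi
    # Failure mode 2): the array really is stuck; per the report, killing
    # the set's mdmon process is what unsticks it. Leave that to the caller.
    return 1
}
```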
*** Bug 528017 has been marked as a duplicate of this bug. ***
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.
More information and reason for this action is here:
Is this still an issue with the latest mdadm? There are numerous mdmon fixes in the latest mdadm and this might not be an issue any longer.
In the meantime we've worked around this issue in anaconda by simply never stopping mdraid containers (and the sets therein) once set up, as there is no reason to do so.
So I cannot reproduce this anymore (unless I were to create a special anaconda just to see if this still reproduces).
I think it is best to close this as insufficient data at this point.