Bug 523334 - mdadm --stop sometimes fails with EBUSY on non-busy arrays
Summary: mdadm --stop sometimes fails with EBUSY on non-busy arrays
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 12
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 528017
Depends On:
Blocks:
 
Reported: 2009-09-14 22:11 UTC by Hans de Goede
Modified: 2010-02-20 15:53 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-02-20 15:53:38 UTC
Type: ---
Embargoed:


Attachments
lsof output when set cannot be stopped, as requested (10.86 KB, text/plain)
2009-09-18 08:22 UTC, Hans de Goede

Description Hans de Goede 2009-09-14 22:11:57 UTC
When running "mdadm --stop /dev/md#" on sets which are members of a container,
sometimes the command will fail with the EBUSY error message, even though
nothing is using the set. A second run of the same command will then succeed
(and cat /proc/mdstat often shows the first run did actually stop the set).

This seems to only happen with sets which are members of a container, and then
only while they are being rebuilt.
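
For reference, a rough sketch of the kind of sequence that triggers this
(the device names are just examples from my test setup):

  # incrementally assemble an imsm container and its member sets
  mdadm --incremental /dev/sda
  mdadm --incremental /dev/sdb
  cat /proc/mdstat            # shows e.g. /dev/md126 rebuilding

  # try to stop a member set while it is rebuilding
  mdadm --stop /dev/md126     # sometimes fails with EBUSY
  mdadm --stop /dev/md126     # a second attempt then succeeds
  cat /proc/mdstat            # often shows the set gone after the 1st attempt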

I guess this is another one for forwarding upstream.

Comment 1 Hans de Goede 2009-09-15 11:25:52 UTC
Ok,

I've done some further investigation of this, and there seem to be 2 different
scenarios where this can happen:

1) Sometimes mdadm --stop on a container member will fail with the EBUSY
   error message, but if you look at /proc/mdstat the array has actually been
   stopped. I've added a workaround to anaconda for this for now (the idea is
   sketched after this list).

   Note this only happens every now and then; I cannot reproduce this variant
   at will.

2) After incrementally assembling an imsm container holding 2 RAID 10 sets,
   doing some work, then stopping all the members plus the container, then
   incrementally assembling the container with its 2 RAID 10 sets again, the
   second set in the container will not stop at all, failing with an EBUSY
   every time. Killing the mdmon process for this set fixes it.
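
The workaround for scenario 1) boils down to treating EBUSY as success when
/proc/mdstat shows the array is in fact gone. Roughly this (just the idea,
not the actual anaconda code; md126 is an example name):

  if ! mdadm --stop /dev/md126; then
      # --stop reported EBUSY; check whether the array really is still there
      if grep -q '^md126 ' /proc/mdstat; then
          echo "md126 really is still active"
      else
          echo "md126 stopped despite the EBUSY error, treat as success"
      fi
  fi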

Comment 2 Dan Williams 2009-09-15 23:12:46 UTC
I wonder if this is a hald interaction?  I have seen hald hold the device open, preventing it from being stopped.  Can you attach the output of 'lsof | grep md' when this state occurs?

Comment 3 Hans de Goede 2009-09-16 09:39:30 UTC
(In reply to comment #2)
> I wonder if this is a hald interaction?  I have seen hald hold the device
> open, preventing it from being stopped.  Can you attach the output of 'lsof |
> grep md' when this state occurs?

The only process showing anything md-related is mdmon, and that only
has sysfs files open for the md device in question.

Also note that I can stop the set after killing mdmon!

While re-testing this I did see one new behaviour: I now have a PV of a
VG on this set, and that VG fails to activate because of a device-mapper
device lookup error. Note I'm talking about case 2) from comment #1 here.

I think this might be related to the RAID 10 lockdep issue, so let me try this with a fixed kernel.

Comment 4 Hans de Goede 2009-09-16 12:14:34 UTC
Both issue 2) from comment #1 and the "device mapper device lookup error" from
comment #3 persist when using a kernel patched with the raid10 lockdep fix.

Comment 5 Hans de Goede 2009-09-18 08:22:42 UTC
Created attachment 361610 [details]
lsof output when set cannot be stopped, as requested

I've reproduced problem 2) from comment #1 again; here is the requested lsof output. Also: array_state says read-auto.

Note I can reproduce this at will, so let me know if you need any more
information. This is with a kernel with the lockdep issue fixed, and an mdmon
with the known segfaults fixed.
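
By array_state I mean the md sysfs attribute of the set in question, e.g.
(md126 here is an example name):

  cat /sys/block/md126/md/array_state
  read-auto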

Comment 6 Dan Williams 2009-10-01 23:06:30 UTC
There are two places where the kernel returns EBUSY for stop requests, and both of them are due to some other thread having the array open.  The lsof output does not show anyone holding the array open, so I wonder if this is a race with a udev rule that briefly opens the device to look for a filesystem label?
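
If that theory is right, the race should be easy to fake by briefly holding the device node open while --stop runs, something like this (md126 is an example name):

  # hold the array node open for a moment, the way a short-lived
  # udev/blkid probe would, then try to stop the array
  ( exec 3</dev/md126; sleep 2 ) &
  mdadm --stop /dev/md126     # should fail with EBUSY while the fd is held
  wait
  mdadm --stop /dev/md126     # should succeed once the opener is gone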

Comment 7 Hans de Goede 2009-10-02 07:20:58 UTC
(In reply to comment #6)
> There are two places where the kernel returns EBUSY for stop requests, and
> both of them are due to some other thread having the array open.  The lsof
> output does not show anyone holding the array open, so I wonder if this is a
> race with a udev rule that briefly opens the device to look for a filesystem
> label?

Note that we have 2 scenarios:

1) It sometimes returns an error, but looking at /proc/mdstat afterwards, the
   array did actually stop.

2) It consistently fails once in this state; in this case killing the mdmon
   process related to this set allows one to stop the set (see the sketch
   after this list). Note that mdmon has a load of sysfs files open related
   to the set, so it seems that mdadm is somehow failing to stop mdmon (which
   I believe it should do before stopping the set).
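
What I end up doing by hand for scenario 2) is roughly this (md126 is an
example name; with more than one container you would need to pick the right
mdmon process):

  if ! mdadm --stop /dev/md126; then
      # mdmon still holds a load of sysfs files open for this set,
      # kill it and retry
      kill $(pgrep mdmon)
      mdadm --stop /dev/md126     # now succeeds
  fi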

Comment 8 Hans de Goede 2009-10-09 06:25:25 UTC
*** Bug 528017 has been marked as a duplicate of this bug. ***

Comment 9 Bug Zapper 2009-11-16 12:23:03 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 10 Doug Ledford 2010-02-19 19:59:10 UTC
Is this still an issue with the latest mdadm?  There are numerous mdmon fixes in the latest mdadm, and this might not be an issue any longer.

Comment 11 Hans de Goede 2010-02-20 11:21:12 UTC
In the meantime we've worked around this issue in anaconda by simply never stopping mdraid containers (and the sets therein) once they are set up, as there is no reason to do so.

So I cannot reproduce this anymore (unless I were to create a special anaconda just to see if this still reproduces).

I think it is best to close this as insufficient data at this point.

Comment 12 Doug Ledford 2010-02-20 15:53:38 UTC
OK, closing.

