Description of problem:

Stratus has seen several crashes when a device is removed from a raid1
mirror. These happen when an rdev is being removed from an mddev's "disks"
list. The reconfig_mutex should prevent this, but some code paths iterate
over the list without holding it.

Version-Release number of selected component (if applicable):

kernel 2.6.18-92.el5, plus a few fixes (copied from upstream) to the raid1
module.

How reproducible:

Definitely reproducible, but under normal circumstances not too frequent.
It has happened between 5 and 10 times in the past few weeks, but I'm not
sure how many total test hours that involves.

Steps to Reproduce:

Set a member of an MD raid1 faulty and remove it (either via mdadm or
through sysfs). If the right combination of activity is in flight at the
moment this is done, a crash can result. I've been able to trigger one of
these crashes very easily with a tight loop of "mdadm -f", "-r", "-a" on a
set of 3 mirrors, plus stress on the mirrors. Another comes out when the
same is done via sysfs (echo faulty (remove) >
/sys/block/mdN/md/dev-sdX/state) while at the same time reading
/proc/mdstat over and over.

Actual results:

Crash.

Expected results:

No crash.

Additional info:

Here's an example:

crash> bt
PID: 12301  TASK: ffff81007bb32860  CPU: 2  COMMAND: "mdadm"
 #0 [ffff8100732b3b90] crash_kexec at ffffffff800aa977
 #1 [ffff8100732b3c50] __die at ffffffff800650af
 #2 [ffff8100732b3c90] do_page_fault at ffffffff80066aa1
 #3 [ffff8100732b3d80] error_exit at ffffffff8005dde9
    [exception RIP: bdevname]
    RIP: ffffffff800ff4dc  RSP: ffff8100732b3e30  RFLAGS: 00010293
    RAX: ffff81007e420018  RBX: 0000000000000002  RCX: ffff8100732b3d58
    RDX: ffff81001a94bb80  RSI: ffff8100732b3e58  RDI: 0000000000000000
    RBP: ffff81007e420000  R8:  00000000ffffffff  R9:  0000000000000020
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff81001a94bb80
    R13: ffff81007b4e7d40  R14: 0000000000040b00  R15: ffff81001a94bb80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff8100732b3e30] md_seq_show at ffffffff8020419f
 #5 [ffff8100732b3eb0] seq_read at ffffffff8003f209
 #6 [ffff8100732b3f10] vfs_read at ffffffff8000b337
 #7 [ffff8100732b3f40] sys_read at ffffffff80011715
 #8 [ffff8100732b3f80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 000000346e6c40d0  RSP: 00007ffffc7a0158  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000002000  RSI: 00007ffffc7a02f0  RDI: 0000000000000003
    RBP: 00007ffffc7a02f0  R8:  00000000ffffffff  R9:  0000000000000000
    R10: 0000000000000022  R11: 0000000000000246  R12: 0000000000002000
    R13: 000000000ca4ea60  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

I posted about this to the linux-raid mailing list:

-----Original Message-----
From: Dailey, Nate
Sent: Monday, July 14, 2008 2:30 PM
To: 'linux-raid.org'
Cc: 'mingo'; 'neilb'
Subject: crash: write_sb_page walks mddev.disks without holding
reconfig_mutex

Hitting several related crashes, and looking for advice/help...

I'm using MD raid1 under RHEL 5 update 2 (kernel 2.6.18-92.el5). I've also
incorporated a few upstream patches to address various bugs, but don't
believe any of these are causing what I'm now seeing.

I've hit 3 different crashes that all involve an rdev being ripped out
from under someone walking the mddev.disks list. It looks like the
reconfig_mutex is supposed to prevent this.
One of these cases has already been fixed upstream (modify rdev_attr_store
to call mddev_lock):

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.25.y.git;a=commit;h=ca38805945edf5d1f5444b283eed95bb954772e8

But I don't see fixes for the other two. Here are the problematic code
paths (I'm using raid1, but I imagine the same thing applies to other
personalities):

- raid1d -> [freeze_array ->] flush_pending_writes -> bitmap_unplug ->
  write_page -> write_sb_page

- raid1d -> md_check_recovery -> bitmap_daemon_work -> write_page ->
  write_sb_page

In both cases, write_sb_page walks the mddev.disks list without holding
the reconfig_mutex. I think these are only problematic when using an
internal bitmap.

I've made changes that seem to fix these for me (patches below). But I'm
not sure I'm doing it in the best way possible, and specifically:

- I'm not using mddev_lock/unlock. This is mainly because mddev_unlock
  wakes up the thread, which does not seem like something you want to do
  unconditionally (my CPUs were monopolized running various raid1d
  threads).

- I'm using mutex_lock_interruptible because that's what mddev_lock uses,
  but I have no idea what to do if this call returns non-zero.

- I'm not sure about using mutex_lock vs. mutex_trylock. The existing code
  in md_check_recovery uses trylock, but I'm not sure why.

So in other words, I have something that seems to work okay, but I don't
know this code well at all and am hoping to get this fixed the "right"
way. Thanks!

Nate Dailey
Stratus Technologies

--- old/drivers/md/raid1.c	2008-07-14 13:51:02.000000000 -0400
+++ new/drivers/md/raid1.c	2008-07-14 13:51:22.000000000 -0400
@@ -625,7 +625,17 @@ static int flush_pending_writes(conf_t *
 		spin_unlock_irq(&conf->device_lock);
 		/* flush any pending bitmap writes to
 		 * disk before proceeding w/ I/O */
-		bitmap_unplug(conf->mddev->bitmap);
+		/* Need the reconfig_mutex when calling bitmap_unplug to
+		   prevent an rdev from going away. What to do if mutex_
+		   lock_interruptible returns non-zero? */
+		if (!mutex_lock_interruptible(&conf->mddev->reconfig_mutex)) {
+			bitmap_unplug(conf->mddev->bitmap);
+
+			/* Call only mutex_unlock (not md_wakeup_thread, as
+			   mddev_unlock would do); don't want to wake the
+			   thread. */
+			mutex_unlock(&conf->mddev->reconfig_mutex);
+		}
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;

--- old/drivers/md/md.c	2008-07-09 14:44:13.000000000 -0400
+++ new/drivers/md/md.c	2008-07-11 15:59:21.000000000 -0400
@@ -5360,8 +5381,18 @@ void md_check_recovery(mddev_t *mddev)
 	struct list_head *rtmp;
 
-	if (mddev->bitmap)
-		bitmap_daemon_work(mddev->bitmap);
+	if (mddev->bitmap) {
+		/* Need the reconfig mutex when calling bitmap_daemon_work
+		   to prevent an rdev from going away. What to do if
+		   mddev_lock returns non-zero? */
+		if (!mddev_lock(mddev)) {
+			bitmap_daemon_work(mddev->bitmap);
+
+			/* Call mutex_unlock instead of mddev_unlock to
+			   avoid waking the thread. */
+			mutex_unlock(&mddev->reconfig_mutex);
+		}
+	}
 
 	if (mddev->ro)
 		return;

And received the following response:

-----Original Message-----
From: Neil Brown [neilb]
Sent: Monday, July 14, 2008 8:37 PM
To: Dailey, Nate
Cc: linux-raid.org; mingo
Subject: Re: crash: write_sb_page walks mddev.disks without holding
reconfig_mutex

On Monday July 14, Nate.Dailey wrote:
> Hitting several related crashes, and looking for advice/help...
>
> I'm using MD raid1 under RHEL 5 update 2 (kernel 2.6.18-92.el5).
> I've also incorporated a few upstream patches to address various bugs,
> but don't believe any of these are causing what I'm now seeing.
>
> I've hit 3 different crashes that all involve an rdev being ripped out
> from under someone walking the mddev.disks list. It looks like the
> reconfig_mutex is supposed to prevent this.

Thanks for reporting this.

You are right. The mddev.disks list is not being protected properly. It
is only ever changed under the reconfig_mutex, and lots of the accesses
are under the same mutex. However I count three that are not:

  write_sb_page (which you found)
  match_mddev_units
  is_mddev_idle

It is not really appropriate to take the reconfig_mutex in these cases. I
think the best fix would be to use the 'rcu' approach. The following
patch attempts that. If you could test it I would really appreciate it.

Thanks,
NeilBrown

commit ec54a752a284ee3ace5177935bc0385c5ee2c70c
Author: Neil Brown <neilb>
Date:   Tue Jul 15 10:35:28 2008 +1000

    Protect access to mddev->disks list using RCU

    All modifications and most access to the mddev->disks list are made
    under the reconfig_mutex lock. However there are three places where
    the list is walked without any locking. If a reconfig happens at
    this time, havoc (and an oops) can ensue.

    So use RCU to protect these accesses:
      - wrap them in rcu_read_{,un}lock()
      - use list_for_each_entry_rcu
      - add to the list with list_add_rcu
      - delete from the list with list_del_rcu
      - delay the 'free' with call_rcu rather than schedule_work

    Note that export_rdev did a list_del_init on this list. In almost
    all cases the entry was not in the list anymore, so it was a no-op
    and thus safe. It is no longer safe, as after list_del_rcu we may
    not touch the list_head. An audit shows that export_rdev is called:
      - after unbind_rdev_from_array, in which case the delete has
        already been done,
      - after bind_rdev_to_array fails, in which case the delete isn't
        needed,
      - before the device has been put on a list at all (e.g. in
        add_new_disk where reading the superblock fails),
      - and in autorun_devices after a failure when the device is on a
        different list.

    So remove the list_del_init call from export_rdev, and add it back
    immediately before the call to export_rdev for that last case.

    Note also that ->same_set is sometimes used for lists other than
    mddev->disks (e.g. candidates). In these cases rcu is not needed.
Signed-off-by: NeilBrown <neilb>

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index eba83e2..621a272 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -241,10 +241,10 @@ static struct page *read_sb_page(mddev_t *mddev, long offset, unsigned long inde
 static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 {
 	mdk_rdev_t *rdev;
-	struct list_head *tmp;
 	mddev_t *mddev = bitmap->mddev;
 
-	rdev_for_each(rdev, tmp, mddev)
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev)
 		if (test_bit(In_sync, &rdev->flags)
 		    && !test_bit(Faulty, &rdev->flags)) {
 			int size = PAGE_SIZE;
@@ -260,11 +260,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				    + (long)(page->index * (PAGE_SIZE/512))
 				    + size/512 > 0)
 					/* bitmap runs in to metadata */
-					return -EINVAL;
+					goto bad_alignment;
 				if (rdev->data_offset + mddev->size*2
 				    > rdev->sb_start + bitmap->offset)
 					/* data runs in to bitmap */
-					return -EINVAL;
+					goto bad_alignment;
 			} else if (rdev->sb_start < rdev->data_offset) {
 				/* METADATA BITMAP DATA */
 				if (rdev->sb_start
@@ -272,7 +272,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				    + page->index*(PAGE_SIZE/512) + size/512
 				    > rdev->data_offset)
 					/* bitmap runs in to data */
-					return -EINVAL;
+					goto bad_alignment;
 			} else {
 				/* DATA METADATA BITMAP - no problems */
 			}
@@ -282,10 +282,15 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				       size,
 				       page);
 		}
+	rcu_read_unlock();
 
 	if (wait)
 		md_super_wait(mddev);
 	return 0;
+
+ bad_alignment:
+	rcu_read_unlock();
+	return -EINVAL;
 }
 
 static void bitmap_file_kick(struct bitmap *bitmap);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7dcdff6..66ca159 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1397,15 +1397,17 @@ static struct super_type super_types[] = {
 static int match_mddev_units(mddev_t *mddev1, mddev_t *mddev2)
 {
-	struct list_head *tmp, *tmp2;
 	mdk_rdev_t *rdev, *rdev2;
 
-	rdev_for_each(rdev, tmp, mddev1)
-		rdev_for_each(rdev2, tmp2, mddev2)
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev1)
+		rdev_for_each_rcu(rdev2, mddev2)
 			if (rdev->bdev->bd_contains ==
-			    rdev2->bdev->bd_contains)
+			    rdev2->bdev->bd_contains) {
+				rcu_read_unlock();
 				return 1;
-
+			}
+	rcu_read_unlock();
 	return 0;
 }
 
@@ -1472,7 +1474,7 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 		kobject_del(&rdev->kobj);
 		goto fail;
 	}
-	list_add(&rdev->same_set, &mddev->disks);
+	list_add_rcu(&rdev->same_set, &mddev->disks);
 	bd_claim_by_disk(rdev->bdev, rdev->bdev->bd_holder, mddev->gendisk);
 	return 0;
 
@@ -1482,9 +1484,9 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 	return err;
 }
 
-static void md_delayed_delete(struct work_struct *ws)
+static void md_delayed_delete(struct rcu_head *rcu)
 {
-	mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
+	mdk_rdev_t *rdev = container_of(rcu, mdk_rdev_t, rcu_work);
 	kobject_del(&rdev->kobj);
 	kobject_put(&rdev->kobj);
 }
@@ -1497,17 +1499,17 @@ static void unbind_rdev_from_array(mdk_rdev_t * rdev)
 		return;
 	}
 	bd_release_from_disk(rdev->bdev, rdev->mddev->gendisk);
-	list_del_init(&rdev->same_set);
+	list_del_rcu(&rdev->same_set);
 	printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
 	rdev->mddev = NULL;
 	sysfs_remove_link(&rdev->kobj, "block");
 
 	/* We need to delay this, otherwise we can deadlock when
-	 * writing to 'remove' to "dev/state"
+	 * writing to 'remove' to "dev/state". We also need
+	 * to delay it due to rcu usage.
 	 */
-	INIT_WORK(&rdev->del_work, md_delayed_delete);
 	kobject_get(&rdev->kobj);
-	schedule_work(&rdev->del_work);
+	call_rcu(&rdev->rcu_work, md_delayed_delete);
 }
 
 /*
@@ -1560,7 +1562,6 @@ static void export_rdev(mdk_rdev_t * rdev)
 	if (rdev->mddev)
 		MD_BUG();
 	free_disk_sb(rdev);
-	list_del_init(&rdev->same_set);
 #ifndef MODULE
 	if (test_bit(AutoDetected, &rdev->flags))
 		md_autodetect_dev(rdev->bdev->bd_dev);
@@ -4063,8 +4064,10 @@ static void autorun_devices(int part)
 		/* on success, candidates will be empty, on error
 		 * it won't...
 		 */
-		rdev_for_each_list(rdev, tmp, candidates)
+		rdev_for_each_list(rdev, tmp, candidates) {
+			list_del_init(&rdev->same_set);
 			export_rdev(rdev);
+		}
 		mddev_put(mddev);
 	}
 	printk(KERN_INFO "md: ... autorun DONE.\n");
@@ -5528,12 +5531,12 @@ int unregister_md_personality(struct mdk_personality *p)
 static int is_mddev_idle(mddev_t *mddev)
 {
 	mdk_rdev_t * rdev;
-	struct list_head *tmp;
 	int idle;
 	long curr_events;
 
 	idle = 1;
-	rdev_for_each(rdev, tmp, mddev) {
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev) {
 		struct gendisk *disk = rdev->bdev->bd_contains->bd_disk;
 		curr_events = disk_stat_read(disk, sectors[0]) +
 				disk_stat_read(disk, sectors[1]) -
@@ -5565,6 +5568,7 @@ static int is_mddev_idle(mddev_t *mddev)
 			idle = 0;
 		}
 	}
+	rcu_read_unlock();
 	return idle;
 }
 
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index 35da93c..6fa94ff 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -114,7 +114,7 @@ struct mdk_rdev_s
 					 * for reporting to userspace and storing
 					 * in superblock.
 					 */
-	struct work_struct del_work;	/* used for delayed sysfs removal */
+	struct rcu_head rcu_work;	/* used for delayed sysfs removal */
 };
 
 struct mddev_s
@@ -339,6 +339,9 @@ static inline char * mdname (mddev_t * mddev)
 #define rdev_for_each(rdev, tmp, mddev)				\
 	rdev_for_each_list(rdev, tmp, (mddev)->disks)
 
+#define rdev_for_each_rcu(rdev, mddev)				\
+	list_for_each_entry_rcu(rdev, &((mddev)->disks), same_set)
+
 typedef struct mdk_thread_s {
 	void			(*run) (mddev_t *mddev);
 	mddev_t			*mddev;

This patch is against an upstream kernel so it doesn't apply cleanly to
RHEL5 U2. I'm currently attempting to build & test this.
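To make the pattern in Neil's patch concrete, here is a minimal,
self-contained kernel-style sketch of an RCU-protected list. This is
illustrative only; the names (my_rdev, my_disks, my_lock, count_in_sync,
remove_rdev) are invented for the example and do not come from the patch:

#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Stand-in for mdk_rdev_t: an entry on an RCU-protected list. */
struct my_rdev {
	struct list_head same_set;
	struct rcu_head rcu;
	int in_sync;
};

static LIST_HEAD(my_disks);
static DEFINE_SPINLOCK(my_lock);	/* writers still serialize w/ a lock */

/* Reader: no mutex needed, just an RCU read-side critical section.
 * Entries seen during the walk cannot be freed until rcu_read_unlock(),
 * even if a writer deletes them concurrently. */
static int count_in_sync(void)
{
	struct my_rdev *rdev;
	int n = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(rdev, &my_disks, same_set)
		if (rdev->in_sync)
			n++;
	rcu_read_unlock();
	return n;
}

static void my_rdev_free(struct rcu_head *head)
{
	kfree(container_of(head, struct my_rdev, rcu));
}

/* Writer: unlink with list_del_rcu, then defer the free via call_rcu so
 * any reader still traversing the entry stays safe. After list_del_rcu
 * the entry's list_head must not be touched again, which is exactly why
 * the patch removes the list_del_init from export_rdev. */
static void remove_rdev(struct my_rdev *rdev)
{
	spin_lock(&my_lock);
	list_del_rcu(&rdev->same_set);
	spin_unlock(&my_lock);
	call_rcu(&rdev->rcu, my_rdev_free);
}

In md's case the writer side is serialized by the reconfig_mutex rather
than a spinlock, and the deferred callback tears down a kobject instead
of calling kfree, but the shape is the same.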
I'm currently testing a version of the following patch, provided by Neil
Brown:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b80991c6cb9efa607bc4fd6f3ecdf5511c31bb0

(This is very close to the patch quoted above, but with a bug fixed.)

This patch is against an upstream kernel, so the version I'm testing has a
few modifications to build against the RHEL 5 update 2 (-92) kernel. I'll
post my version of the patch after a day or two more of testing (assuming
things go well).
Created attachment 312551
455471.patch
I just attached the patch I've been testing with. As noted in the previous
comment, this is an adaptation of an upstream patch (see that patch for a
description of most of the contents of this patch).

Note that the attached patch also includes parts of the following upstream
fixes:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.25.y.git;a=commit;h=ca38805945edf5d1f5444b283eed95bb954772e8
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.25.y.git;a=commit;h=27c529bb8e906d5d692152bc127cc09477d3629e

These modify rdev_attr_store and rdev_attr_show to take the mddev_lock,
which closes another window where a device can be removed unsafely.

The attached patch also contains part of:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.25.y.git;a=commit;h=7dd5e7c3dbe8c4ffb507ddc0ea8fab07c8b11b0b

Not the whole thing, just the change to match_mddev_units (which is needed
to fix this bug).

There have been lots of other changes upstream, some of which are
reflected in differences between the attached patch and the upstream patch
it's based on (for example, the addition of mddev_delayed_work), but I
don't think any of these are necessary to fix this bug.
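For context, the rdev_attr_store change in the first two commits above has
roughly the following shape. This is a paraphrase reconstructed from the
upstream commit, not a copy of the attached backport, so details in the
RHEL version may differ:

static ssize_t
rdev_attr_store(struct kobject *kobj, struct attribute *attr,
		const char *page, size_t length)
{
	struct rdev_sysfs_entry *entry = container_of(attr,
				struct rdev_sysfs_entry, attr);
	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
	mddev_t *mddev = rdev->mddev;
	ssize_t rv;

	if (!entry->store)
		return -EIO;
	if (!capable(CAP_SYS_ADMIN))
		return -EACCES;
	/* Take the reconfig_mutex (via mddev_lock) around the store
	 * handler so the rdev can't be removed underneath it. */
	rv = mddev ? mddev_lock(mddev) : -EBUSY;
	if (!rv) {
		if (rdev->mddev == NULL)
			/* rdev was removed while we waited for the lock */
			rv = -EBUSY;
		else
			rv = entry->store(rdev, page, length);
		mddev_unlock(mddev);
	}
	return rv;
}

rdev_attr_show is wrapped the same way.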
Tatsukawa-san, I've added yu to the cc list. Please add any additional NEC people necessary to the cc list of this bugzilla.
NEC reports no crashes running with the attached patch between 07/28/2008 and 08/04/2008; they had been able to hit this bug frequently without the patch.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
i686 and x86_64 zstream kernels are available for testing here: http://people.redhat.com/dhoward/bz460128/
Fixed in kernel-2.6.18-108.el5. You can download this test kernel from
http://people.redhat.com/dzickus/el5
Partners, this bug should be fixed in the latest RHEL 5.3 Snapshot. We believe that you have some interest in its correct functionality, so we're making a friendly request to send us some testing feedback. If you have a chance to test it, please share with us your findings. If you have successfully VERIFIED the fix, please add PartnerVerified to the Bugzilla keywords, along with a description of the results. Thanks!
Stratus has verified this fix in two ways: the source of
kernel-2.6.18-124.el5 was examined to confirm the presence of the fix, and
scripts that previously reproduced this panic within one minute were run
for 100 minutes without incident.
An advisory has been issued which should help the problem described in
this bug report. This report is therefore being closed with a resolution
of ERRATA. For more information on the solution and/or where to find the
updated files, please follow the link below. You may reopen this bug
report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html