Description of problem:

Stratus has seen several crashes when a device is removed from a raid1
mirror. These happen when an rdev is being removed from an mddev's "disks"
list. The reconfig_mutex should prevent this, but some code paths iterate
over the list without holding it.

Version-Release number of selected component (if applicable):

kernel 2.6.18-92.el5, plus a few fixes (copied from upstream) to the raid1
module.

How reproducible:

Definitely reproducible, but under normal circumstances not too frequent.
It has happened between 5 and 10 times in the past few weeks, but I'm not
sure how many total test hours that involves.

Steps to Reproduce:

Set a member of an MD raid1 faulty and remove it (either via mdadm or
through sysfs). If the right combination of activity is in flight at the
moment this is done, a crash can result. I've been able to trigger one of
these crashes very easily with a tight loop of "mdadm -f", "-r", "-a" on a
set of 3 mirrors, plus stress on the mirrors. Another comes out when the
same is done via sysfs (echo faulty (remove) >
/sys/block/mdN/md/dev-sdX/state) while at the same time reading
/proc/mdstat over and over.

Actual results:

Crash.

Expected results:

No crash.

Additional info:

Here's an example:

crash> bt
PID: 12301  TASK: ffff81007bb32860  CPU: 2  COMMAND: "mdadm"
 #0 [ffff8100732b3b90] crash_kexec at ffffffff800aa977
 #1 [ffff8100732b3c50] __die at ffffffff800650af
 #2 [ffff8100732b3c90] do_page_fault at ffffffff80066aa1
 #3 [ffff8100732b3d80] error_exit at ffffffff8005dde9
    [exception RIP: bdevname]
    RIP: ffffffff800ff4dc  RSP: ffff8100732b3e30  RFLAGS: 00010293
    RAX: ffff81007e420018  RBX: 0000000000000002  RCX: ffff8100732b3d58
    RDX: ffff81001a94bb80  RSI: ffff8100732b3e58  RDI: 0000000000000000
    RBP: ffff81007e420000  R8:  00000000ffffffff  R9:  0000000000000020
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff81001a94bb80
    R13: ffff81007b4e7d40  R14: 0000000000040b00  R15: ffff81001a94bb80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff8100732b3e30] md_seq_show at ffffffff8020419f
 #5 [ffff8100732b3eb0] seq_read at ffffffff8003f209
 #6 [ffff8100732b3f10] vfs_read at ffffffff8000b337
 #7 [ffff8100732b3f40] sys_read at ffffffff80011715
 #8 [ffff8100732b3f80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 000000346e6c40d0  RSP: 00007ffffc7a0158  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000002000  RSI: 00007ffffc7a02f0  RDI: 0000000000000003
    RBP: 00007ffffc7a02f0  R8:  00000000ffffffff  R9:  0000000000000000
    R10: 0000000000000022  R11: 0000000000000246  R12: 0000000000002000
    R13: 000000000ca4ea60  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

I posted about this to the linux-raid mailing list:

-----Original Message-----
From: Dailey, Nate
Sent: Monday, July 14, 2008 2:30 PM
To: 'linux-raid.org'
Cc: 'mingo'; 'neilb'
Subject: crash: write_sb_page walks mddev.disks without holding
reconfig_mutex

Hitting several related crashes, and looking for advice/help...

I'm using MD raid1 under RHEL 5 update 2 (kernel 2.6.18-92.el5). I've also
incorporated a few upstream patches to address various bugs, but don't
believe any of these are causing what I'm now seeing.

I've hit 3 different crashes that all involve an rdev being ripped out
from under someone walking the mddev.disks list. It looks like the
reconfig_mutex is supposed to prevent this.
One of these cases has already been fixed upstream (modify rdev_attr_store
to call mddev_lock):

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.25.y.git;a=commit;h=ca38805945edf5d1f5444b283eed95bb954772e8

But I don't see fixes for the other two. Here are the problematic code
paths (I'm using raid1, but I imagine the same thing applies to other
personalities):

- raid1d -> [freeze_array ->] flush_pending_writes -> bitmap_unplug ->
  write_page -> write_sb_page

- raid1d -> md_check_recovery -> bitmap_daemon_work -> write_page ->
  write_sb_page

In both cases, write_sb_page walks the mddev.disks list without holding
the reconfig_mutex. I think these are only problematic when using an
internal bitmap.

I've made changes that seem to fix these for me (patches below). But I'm
not sure I'm doing it in the best way possible, and specifically:

- I'm not using mddev_lock/unlock. This is mainly because mddev_unlock
  wakes up the thread, which does not seem like something you want to do
  unconditionally (my CPUs were monopolized running various raid1d
  threads).

- I'm using mutex_lock_interruptible because that's what mddev_lock uses,
  but I have no idea what to do if this call returns non-zero.

- I'm not sure about using mutex_lock vs. mutex_trylock. The existing code
  in md_check_recovery uses trylock, but I'm not sure why.

So in other words, I have something that seems to work okay, but I don't
know this code well at all and am hoping to get this fixed the "right"
way. Thanks!

Nate Dailey
Stratus Technologies

--- old/drivers/md/raid1.c	2008-07-14 13:51:02.000000000 -0400
+++ new/drivers/md/raid1.c	2008-07-14 13:51:22.000000000 -0400
@@ -625,7 +625,17 @@ static int flush_pending_writes(conf_t *
 		spin_unlock_irq(&conf->device_lock);
 		/* flush any pending bitmap writes to
 		 * disk before proceeding w/ I/O */
-		bitmap_unplug(conf->mddev->bitmap);
+		/* Need the reconfig_mutex when calling bitmap_unplug to
+		   prevent an rdev from going away. What to do if mutex_
+		   lock_interruptible returns non-zero? */
+		if (!mutex_lock_interruptible(&conf->mddev->reconfig_mutex)) {
+			bitmap_unplug(conf->mddev->bitmap);
+
+			/* Call only mutex_unlock (not md_wakeup_thread, as
+			   mddev_unlock would do); don't want to wake the
+			   thread. */
+			mutex_unlock(&conf->mddev->reconfig_mutex);
+		}
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;

--- old/drivers/md/md.c	2008-07-09 14:44:13.000000000 -0400
+++ new/drivers/md/md.c	2008-07-11 15:59:21.000000000 -0400
@@ -5360,8 +5381,18 @@ void md_check_recovery(mddev_t *mddev)
 	struct list_head *rtmp;
 
-	if (mddev->bitmap)
-		bitmap_daemon_work(mddev->bitmap);
+	if (mddev->bitmap) {
+		/* Need the reconfig mutex when calling bitmap_daemon_work
+		   to prevent an rdev from going away. What to do if
+		   mddev_lock returns non-zero? */
+		if (!mddev_lock(mddev)) {
+			bitmap_daemon_work(mddev->bitmap);
+
+			/* Call mutex_unlock instead of mddev_unlock to
+			   avoid waking the thread. */
+			mutex_unlock(&mddev->reconfig_mutex);
+		}
+	}
 
 	if (mddev->ro)
 		return;

And received the following response:

-----Original Message-----
From: Neil Brown [neilb]
Sent: Monday, July 14, 2008 8:37 PM
To: Dailey, Nate
Cc: linux-raid.org; mingo
Subject: Re: crash: write_sb_page walks mddev.disks without holding
reconfig_mutex

On Monday July 14, Nate.Dailey wrote:
> Hitting several related crashes, and looking for advice/help...
>
> I'm using MD raid1 under RHEL 5 update 2 (kernel 2.6.18-92.el5).
> I've also incorporated a few upstream patches to address various bugs,
> but don't believe any of these are causing what I'm now seeing.
>
> I've hit 3 different crashes that all involve an rdev being ripped out
> from under someone walking the mddev.disks list. It looks like the
> reconfig_mutex is supposed to prevent this.

Thanks for reporting this.

You are right. The mddev.disks list is not being protected properly. It
is only ever changed under the reconfig_mutex, and lots of the accesses
are under the same mutex. However I count three that are not:

  write_sb_page (which you found)
  match_mddev_units
  is_mddev_idle

It is not really appropriate to take the reconfig_mutex in these cases. I
think the best fix would be to use the 'rcu' approach. The following
patch attempts that. If you could test it I would really appreciate it.

Thanks,
NeilBrown

commit ec54a752a284ee3ace5177935bc0385c5ee2c70c
Author: Neil Brown <neilb>
Date:   Tue Jul 15 10:35:28 2008 +1000

    Protect access to mddev->disks list using RCU

    All modifications and most access to the mddev->disks list are made
    under the reconfig_mutex lock. However there are three places where
    the list is walked without any locking. If a reconfig happens at
    this time, havoc (and an oops) can ensue.

    So use RCU to protect these accesses:
      - wrap them in rcu_read_{,un}lock()
      - use list_for_each_entry_rcu
      - add to the list with list_add_rcu
      - delete from the list with list_del_rcu
      - delay the 'free' with call_rcu rather than schedule_work

    Note that export_rdev did a list_del_init on this list. In almost
    all cases the entry was not in the list anymore, so it was a no-op
    and thus safe. It is no longer safe, as after list_del_rcu we may
    not touch the list_head. An audit shows that export_rdev is called:
      - after unbind_rdev_from_array, in which case the delete has
        already been done,
      - after bind_rdev_to_array fails, in which case the delete isn't
        needed,
      - before the device has been put on a list at all (e.g. in
        add_new_disk where reading the superblock fails),
      - and in autorun_devices after a failure when the device is on a
        different list.

    So remove the list_del_init call from export_rdev, and add it back
    immediately before the call to export_rdev for that last case.

    Note also that ->same_set is sometimes used for lists other than
    mddev->disks (e.g. candidates). In these cases rcu is not needed.
Signed-off-by: NeilBrown <neilb>

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index eba83e2..621a272 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -241,10 +241,10 @@ static struct page *read_sb_page(mddev_t *mddev, long offset, unsigned long inde
 static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 {
 	mdk_rdev_t *rdev;
-	struct list_head *tmp;
 	mddev_t *mddev = bitmap->mddev;
 
-	rdev_for_each(rdev, tmp, mddev)
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev)
 		if (test_bit(In_sync, &rdev->flags)
 		    && !test_bit(Faulty, &rdev->flags)) {
 			int size = PAGE_SIZE;
@@ -260,11 +260,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				    + (long)(page->index * (PAGE_SIZE/512))
 				    + size/512 > 0)
 					/* bitmap runs in to metadata */
-					return -EINVAL;
+					goto bad_alignment;
 				if (rdev->data_offset + mddev->size*2
 				    > rdev->sb_start + bitmap->offset)
 					/* data runs in to bitmap */
-					return -EINVAL;
+					goto bad_alignment;
 			} else if (rdev->sb_start < rdev->data_offset) {
 				/* METADATA BITMAP DATA */
 				if (rdev->sb_start
@@ -272,7 +272,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				    + page->index*(PAGE_SIZE/512) + size/512
 				    > rdev->data_offset)
 					/* bitmap runs in to data */
-					return -EINVAL;
+					goto bad_alignment;
 			} else {
 				/* DATA METADATA BITMAP - no problems */
 			}
@@ -282,10 +282,15 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				       size,
 				       page);
 		}
+	rcu_read_unlock();
 
 	if (wait)
 		md_super_wait(mddev);
 	return 0;
+
+ bad_alignment:
+	rcu_read_unlock();
+	return -EINVAL;
 }
 
 static void bitmap_file_kick(struct bitmap *bitmap);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7dcdff6..66ca159 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1397,15 +1397,17 @@ static struct super_type super_types[] = {
 static int match_mddev_units(mddev_t *mddev1, mddev_t *mddev2)
 {
-	struct list_head *tmp, *tmp2;
 	mdk_rdev_t *rdev, *rdev2;
 
-	rdev_for_each(rdev, tmp, mddev1)
-		rdev_for_each(rdev2, tmp2, mddev2)
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev1)
+		rdev_for_each_rcu(rdev2, mddev2)
 			if (rdev->bdev->bd_contains ==
-			    rdev2->bdev->bd_contains)
+			    rdev2->bdev->bd_contains) {
+				rcu_read_unlock();
 				return 1;
-
+			}
+	rcu_read_unlock();
 	return 0;
 }
 
@@ -1472,7 +1474,7 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 		kobject_del(&rdev->kobj);
 		goto fail;
 	}
-	list_add(&rdev->same_set, &mddev->disks);
+	list_add_rcu(&rdev->same_set, &mddev->disks);
 	bd_claim_by_disk(rdev->bdev, rdev->bdev->bd_holder, mddev->gendisk);
 	return 0;
 
@@ -1482,9 +1484,9 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 	return err;
 }
 
-static void md_delayed_delete(struct work_struct *ws)
+static void md_delayed_delete(struct rcu_head *rcu)
 {
-	mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
+	mdk_rdev_t *rdev = container_of(rcu, mdk_rdev_t, rcu_work);
 	kobject_del(&rdev->kobj);
 	kobject_put(&rdev->kobj);
 }
@@ -1497,17 +1499,17 @@ static void unbind_rdev_from_array(mdk_rdev_t * rdev)
 		return;
 	}
 	bd_release_from_disk(rdev->bdev, rdev->mddev->gendisk);
-	list_del_init(&rdev->same_set);
+	list_del_rcu(&rdev->same_set);
 	printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
 	rdev->mddev = NULL;
 	sysfs_remove_link(&rdev->kobj, "block");
 
 	/* We need to delay this, otherwise we can deadlock when
-	 * writing to 'remove' to "dev/state"
+	 * writing to 'remove' to "dev/state". We also need
+	 * to delay it due to rcu usage.
 	 */
-	INIT_WORK(&rdev->del_work, md_delayed_delete);
 	kobject_get(&rdev->kobj);
-	schedule_work(&rdev->del_work);
+	call_rcu(&rdev->rcu_work, md_delayed_delete);
 }
 
 /*
@@ -1560,7 +1562,6 @@ static void export_rdev(mdk_rdev_t * rdev)
 	if (rdev->mddev)
 		MD_BUG();
 	free_disk_sb(rdev);
-	list_del_init(&rdev->same_set);
 #ifndef MODULE
 	if (test_bit(AutoDetected, &rdev->flags))
 		md_autodetect_dev(rdev->bdev->bd_dev);
@@ -4063,8 +4064,10 @@ static void autorun_devices(int part)
 		/* on success, candidates will be empty, on error
 		 * it won't...
 		 */
-		rdev_for_each_list(rdev, tmp, candidates)
+		rdev_for_each_list(rdev, tmp, candidates) {
+			list_del_init(&rdev->same_set);
 			export_rdev(rdev);
+		}
 		mddev_put(mddev);
 	}
 	printk(KERN_INFO "md: ... autorun DONE.\n");
@@ -5528,12 +5531,12 @@ int unregister_md_personality(struct mdk_personality *p)
 static int is_mddev_idle(mddev_t *mddev)
 {
 	mdk_rdev_t * rdev;
-	struct list_head *tmp;
 	int idle;
 	long curr_events;
 
 	idle = 1;
-	rdev_for_each(rdev, tmp, mddev) {
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev) {
 		struct gendisk *disk = rdev->bdev->bd_contains->bd_disk;
 		curr_events = disk_stat_read(disk, sectors[0]) +
 				disk_stat_read(disk, sectors[1]) -
@@ -5565,6 +5568,7 @@ static int is_mddev_idle(mddev_t *mddev)
 			idle = 0;
 		}
 	}
+	rcu_read_unlock();
 	return idle;
 }
 
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index 35da93c..6fa94ff 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -114,7 +114,7 @@ struct mdk_rdev_s
 					 * for reporting to userspace and storing
 					 * in superblock.
 					 */
-	struct work_struct del_work;	/* used for delayed sysfs removal */
+	struct rcu_head rcu_work;	/* used for delayed sysfs removal */
 };
 
 struct mddev_s
@@ -339,6 +339,9 @@ static inline char * mdname (mddev_t * mddev)
 #define rdev_for_each(rdev, tmp, mddev)				\
 	rdev_for_each_list(rdev, tmp, (mddev)->disks)
 
+#define rdev_for_each_rcu(rdev, mddev)				\
+	list_for_each_entry_rcu(rdev, &((mddev)->disks), same_set)
+
 typedef struct mdk_thread_s {
 	void			(*run) (mddev_t *mddev);
 	mddev_t			*mddev;

This patch is against an upstream kernel so it doesn't apply cleanly to
RHEL5 U2. I'm currently attempting to build & test this.
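To make the pattern in Neil's patch concrete, here is a minimal,
self-contained kernel-style sketch of an RCU-protected list. This is
illustrative only; the names (my_rdev, my_disks, my_lock, count_in_sync,
remove_rdev) are invented for the example and do not come from the patch:

#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Stand-in for mdk_rdev_t: an entry on an RCU-protected list. */
struct my_rdev {
	struct list_head same_set;
	struct rcu_head rcu;
	int in_sync;
};

static LIST_HEAD(my_disks);
static DEFINE_SPINLOCK(my_lock);	/* writers still serialize w/ a lock */

/* Reader: no mutex needed, just an RCU read-side critical section.
 * Entries seen during the walk cannot be freed until rcu_read_unlock(),
 * even if a writer deletes them concurrently. */
static int count_in_sync(void)
{
	struct my_rdev *rdev;
	int n = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(rdev, &my_disks, same_set)
		if (rdev->in_sync)
			n++;
	rcu_read_unlock();
	return n;
}

static void my_rdev_free(struct rcu_head *head)
{
	kfree(container_of(head, struct my_rdev, rcu));
}

/* Writer: unlink with list_del_rcu, then defer the free via call_rcu so
 * any reader still traversing the entry stays safe. After list_del_rcu
 * the entry's list_head must not be touched again, which is exactly why
 * the patch removes the list_del_init from export_rdev. */
static void remove_rdev(struct my_rdev *rdev)
{
	spin_lock(&my_lock);
	list_del_rcu(&rdev->same_set);
	spin_unlock(&my_lock);
	call_rcu(&rdev->rcu, my_rdev_free);
}

In md's case the writer side is serialized by the reconfig_mutex rather
than a spinlock, and the deferred callback tears down a kobject instead
of calling kfree, but the shape is the same.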
I'm currently testing a version of the following patch, provided by Neil
Brown:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b80991c6cb9efa607bc4fd6f3ecdf5511c31bb0

(This is very close to the patch quoted above, but with a bug fixed.)

This patch is against an upstream kernel, so the version I'm testing has a
few modifications to build against the RHEL 5 update 2 (-92) kernel. I'll
post my version of the patch after a day or two more of testing (assuming
things go well).
Created attachment 312551
455471.patch
I just attached the patch I've been testing with. As noted in the previous
comment, this is an adaptation of an upstream patch (see that patch for a
description of most of the contents of this patch).

Note that the attached patch also includes parts of the following upstream
fixes:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.25.y.git;a=commit;h=ca38805945edf5d1f5444b283eed95bb954772e8
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.25.y.git;a=commit;h=27c529bb8e906d5d692152bc127cc09477d3629e

These modify rdev_attr_store and rdev_attr_show to take the mddev_lock,
which closes another window where a device can be removed unsafely.

The attached patch also contains part of:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.25.y.git;a=commit;h=7dd5e7c3dbe8c4ffb507ddc0ea8fab07c8b11b0b

Not the whole thing, just the change to match_mddev_units (which is needed
to fix this bug).

There have been lots of other changes upstream, some of which are
reflected in differences between the attached patch and the upstream patch
it's based on (for example, the addition of mddev_delayed_work), but I
don't think any of these are necessary to fix this bug.
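For context, the rdev_attr_store change in the first two commits above has
roughly the following shape. This is a paraphrase reconstructed from the
upstream commit, not a copy of the attached backport, so details in the
RHEL version may differ:

static ssize_t
rdev_attr_store(struct kobject *kobj, struct attribute *attr,
		const char *page, size_t length)
{
	struct rdev_sysfs_entry *entry = container_of(attr,
				struct rdev_sysfs_entry, attr);
	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
	mddev_t *mddev = rdev->mddev;
	ssize_t rv;

	if (!entry->store)
		return -EIO;
	if (!capable(CAP_SYS_ADMIN))
		return -EACCES;
	/* Take the reconfig_mutex (via mddev_lock) around the store
	 * handler so the rdev can't be removed underneath it. */
	rv = mddev ? mddev_lock(mddev) : -EBUSY;
	if (!rv) {
		if (rdev->mddev == NULL)
			/* rdev was removed while we waited for the lock */
			rv = -EBUSY;
		else
			rv = entry->store(rdev, page, length);
		mddev_unlock(mddev);
	}
	return rv;
}

rdev_attr_show is wrapped the same way.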
Tatsukawa-san, I've added yu to the cc list. Please add any additional NEC people necessary to the cc list of this bugzilla.
NEC reports no crashes running with the attached patch between 07/28/2008 and 08/04/2008; they had been able to hit this bug frequently without the patch.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
i686 and x86_64 zstream kernels are available for testing here: http://people.redhat.com/dhoward/bz460128/
Fixed in kernel-2.6.18-108.el5. You can download this test kernel from
http://people.redhat.com/dzickus/el5
Partners, this bug should be fixed in the latest RHEL 5.3 Snapshot. We believe that you have some interest in its correct functionality, so we're making a friendly request to send us some testing feedback. If you have a chance to test it, please share with us your findings. If you have successfully VERIFIED the fix, please add PartnerVerified to the Bugzilla keywords, along with a description of the results. Thanks!
Stratus has verified this fix in two ways: the source of
kernel-2.6.18-124.el5 was examined to confirm the presence of the fix, and
scripts that previously reproduced this panic within one minute were run
for 100 minutes without incident.
An advisory has been issued which should help the problem described in
this bug report. This report is therefore being closed with a resolution
of ERRATA. For more information on the solution and/or where to find the
updated files, please follow the link below. You may reopen this bug
report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html