Description of problem:
The MD subsystem hangs during a full resync of all the configured mirrors. Afterwards the host is still running, but all access to /proc/mdstat or the raid1 mirrors gets stuck (D state).

Version-Release number of selected component (if applicable):
2.6.9-67

How reproducible:
Requires FlexiPlatform userspace software. Takes 5-30 minutes to reproduce on the given hardware.

Steps to Reproduce:
1. Run the full FlexiPlatform userspace with a vanilla (untainted) kernel. This means the raid1 mirrors have been configured on top of LVM volumes and device-mapper devices.
2. Manually break all the raid1 mirrors, then re-add the spares into the mirrors so that a parallel resync of all the configured mirrors occurs.
3. Before all the mirrors have resynced, the command 'cat /proc/mdstat' hangs, as does 'mount -l'.

Actual results:
The md subsystem hangs.

Expected results:
The raid1 mirror resync completes successfully.

Additional info:
The test system is a 64-bit HP ProLiant rackmount. Some investigation is already present in IT entry 171011.
Created attachment 299338 [details] sysrq-t from hang
So kjournald holds the mddev lock and is blocking in :dm_mod:dm_request __up_read
Note: pdflush is showing frame of :dm_mod:dm_request as well but with offset of +396 (return).
Customer has reproduced on U4 and U5 as well, ruling out the possibility that recent similar upstream bugs could have been backported to RHEL.
File uploaded: recreate_171011.tgz

This event sent from IssueTracker by sprabhu
issue 171011
it_file 128697
I used a Xen machine with an attached 20 G device to recreate the problem. The following steps have to be performed in order:

1) Create a new volume group /dev/local using the 20 G device attached to the machine.
2) Run the script lv.sh. This creates 20 logical volumes named lv-1-1, lv-1-2, lv-2-1, lv-2-2, etc., and then creates 10 raid1 devices in the following manner:
   md1: /dev/local/lv-1-1 /dev/local/lv-1-2
3) Run the script break_mirrors.sh. This is based on the code provided by Nokia, suitably modified for our setup. It runs the following steps in sequence for each of the raid devices:
   a) Mark the first device of the raid device faulty and remove it.
   b) Add the first device back to the raid device.
   c) Wait for resync.
4) On a different terminal, run the script start_stop_array.sh. This starts and stops md9 continuously.
5) On another terminal, run 'watch cat /proc/mdstat'. At some point, you will be unable to read /proc/mdstat. It never seems to recover from this point.

The script del_lv.sh can be used to break up all the raid devices and remove the logical volumes.

Sachin Prabhu
This event sent from IssueTracker by sprabhu
issue 171011
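The per-array fail/remove/re-add cycle from step 3 can be sketched with mdadm. This is a hedged reconstruction, not the actual break_mirrors.sh (which was not attached here); device names follow the md1: /dev/local/lv-1-1 layout above, and a DRY_RUN guard keeps the mdadm commands as printed output only.

```shell
#!/bin/sh
# Sketch of the break/re-add cycle from step 3 above. Hypothetical
# reconstruction of break_mirrors.sh; with DRY_RUN=1 (the default)
# the mdadm commands are printed rather than executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

break_and_readd() {
    md=$1; leg=$2
    # a) mark the first leg of the array faulty and remove it
    run mdadm --manage "$md" --fail "$leg"
    run mdadm --manage "$md" --remove "$leg"
    # b) add the leg back so a resync starts
    run mdadm --manage "$md" --add "$leg"
    # c) wait for the resync to finish
    run mdadm --wait "$md"
}

# one pass over all 10 arrays, matching the lv-N-1 naming above
for i in $(seq 1 10); do
    break_and_readd "/dev/md$i" "/dev/local/lv-$i-1"
done
```

Running several of these loops concurrently (one per array) is what produces the parallel resyncs the bug needs.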
In reply to comment #2 - I'm trying to understand how we're blocking in an up operation - that should always be non-blocking?
There's a spinlock within the semaphore - this was so long ago that I can't remember how I ended up thinking that might have been the case. I did try to document the trail of thought in the IT entry. If it's not obvious from there, I need to look at it again.
Duh.. yeah, that makes sense. I'll take another look at this - Sachin has a reproducer configured so I can use that to work on. Thanks,
Come to think of it, it didn't make sense for obvious reasons. It's not an R (running) task, so it's not spinning there (and wouldn't show much if it were). So that __up was almost certainly stale.
Should I take the comment #6 to mean that this can be reproduced without the flexiplatform software (whatever that is)?
In reply to comment #12: yes.
Doug,

This can be reproduced internally using the scripts provided. I have successfully reproduced this on a PV Xen machine and a VMware machine using standard RHEL software.

Sachin Prabhu
This event sent from IssueTracker by sprabhu
issue 171011
OK, I've replicated what I think people are seeing. It is worth noting, however, that when I replicated the problem and left the machine sitting long enough, it cleared up. In other words, the only hang I've been able to replicate is a temporary condition (note: I'm not saying it's short lived, just not permanent... in my test scenario it clears up within about 5 or 10 minutes, which I'm certain people consider too long for a production environment, but my point is that the machine isn't actually going kaput).

So, here's what I know so far. This happens when you attempt the STOP_ARRAY ioctl on a device while other devices are rebuilding. The more devices there are rebuilding, the more likely you are to hit the problem. When you stop an array that's in the middle of rebuilding with our RHEL4 kernel, this is the normal set of messages you see in the kernel log:

md: syncing RAID array md9
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 511936 blocks.
md: md9: sync done.   <=== sync isn't really done; this is because we are killing the sync thread in do_md_stop()
md: md9 stopped.
md: unbind<dm-17>
md: export_rdev(dm-17)
md: unbind<dm-16>
md: export_rdev(dm-16)
md: md9 stopped.

In this case, the hang happens right after the "sync done." message, and by all accounts the system isn't really hung; it's just waiting for the sync thread to signal that it has exited. The fact that do_md_stop() runs with the mddev locked, and the fact that it is hanging, is the direct cause of access to /proc/mdstat hanging. We could paper over the /proc/mdstat problem by releasing the mddev lock around the wait_for_completion() in md_unregister_thread().
In fact, I'm going to build a kernel with this change alone so that I can more readily confirm that the other md devices are still working normally and that only the device we are stopping is hanging up. Once I've confirmed or ruled that out, I'll move on to the fix for the device we are stopping, which I suspect may just be pulling in the upstream fix of grabbing the task_list lock before clearing mddev->run, so that the resync thread can't notice run == NULL and exit before we send it the signal to exit. Alternatively, I may backport the changes upstream made to convert the raid stack to use the kernel_thread helper functions instead of its own internal thread management. Either way, that appears to be the thing that needs fixing at the moment.
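Since the observed hang is temporary rather than permanent, a bounded read is a handy way to distinguish the two from a test harness instead of wedging a shell on 'cat /proc/mdstat'. A sketch, assuming coreutils timeout(1) is available; the path argument and timeout are parameterized so it isn't tied to /proc/mdstat:

```shell
#!/bin/sh
# Sketch: read a file with a timeout so a blocked read (reader stuck in
# D state) is detected and reported instead of hanging the shell.
# PROBE_TIMEOUT is an arbitrary limit, not anything from the md code.
PROBE_TIMEOUT=${PROBE_TIMEOUT:-5}

probe_mdstat() {
    path=${1:-/proc/mdstat}
    if timeout "$PROBE_TIMEOUT" cat "$path" >/dev/null 2>&1; then
        echo "read of $path completed"
    else
        echo "read of $path blocked or failed after ${PROBE_TIMEOUT}s"
    fi
}
```

Run in a loop alongside start_stop_array.sh, this shows whether /proc/mdstat reads ever complete again once the array stop has begun.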
The backport did not resolve the issue. This means that my second theory about the problem is probably correct. If you look at where each thread is in the kernel when the lockup happens, here is what you'll find.

All the raid1 threads will look like this:

md1_raid1     S 00000018  2964  6392     1  6435  5428 (L-TLB)
f5ad6f9c 00000046 f8a556ae 00000018 00000002 f67fd600 ec9f26e0 ec9f26c0
00000246 ec9f26e0 f8a56556 c1c1dde0 00000001 00000000 64b415c0 000f4acb
f7e210b0 f1f052b0 f1f0541c f735cd08 c0120448 f735cd00 00000000 f5ad6fb8
Call Trace:
 [<f8a556ae>] unplug_slaves+0x77/0xb8 [raid1]
 [<f8a56556>] raid1d+0x181/0x189 [raid1]
 [<c0120448>] prepare_to_wait+0x12/0x4c
 [<c0275d92>] md_thread+0x103/0x168
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c02d8876>] ret_from_fork+0x6/0x14
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c0275c8f>] md_thread+0x0/0x168
 [<c01041f5>] kernel_thread_helper+0x5/0xb

Aka, they'll all be sleeping in unplug_slaves().

All the resync threads (except the one we are attempting to stop) will look like this:

md2_resync    D F69344C0  2904 14718     1 14720 14716 (L-TLB)
edf38e04 00000046 f65da280 f69344c0 ed943400 f1fda704 00000046 f7db2210
c0227baa f7db2210 c0227bd8 c1c1dde0 00000001 000186a0 65006100 000f4acb
f7e210b0 f62ef830 f62ef99c 00000002 f601bf80 f601bfa4 f601bf80 0009c180
Call Trace:
 [<c0227baa>] __generic_unplug_device+0x14/0x2d
 [<c0227bd8>] generic_unplug_device+0x15/0x21
 [<f8a558fd>] device_barrier+0x173/0x1a4 [raid1]
 [<c011e7b9>] default_wake_function+0x0/0xc
 [<c0160759>] bio_add_page+0x1a/0x1e
 [<f8a56643>] sync_request+0x9c/0x4b5 [raid1]
 [<c01174b6>] smp_apic_timer_interrupt+0x9a/0x9c
 [<c02d93b6>] apic_timer_interrupt+0x1a/0x20
 [<c0227bd8>] generic_unplug_device+0x15/0x21
 [<f8a2ecfb>] dm_table_unplug_all+0x24/0x2b [dm_mod]
 [<c0276d44>] md_do_sync+0x3f1/0x84f
 [<c011cd4a>] recalc_task_prio+0x128/0x133
 [<c011cddd>] activate_task+0x88/0x95
 [<c0275dca>] md_thread+0x13b/0x168
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c02d8876>] ret_from_fork+0x6/0x14
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c0275c8f>] md_thread+0x0/0x168
 [<c01041f5>] kernel_thread_helper+0x5/0xb

Aka, they are actively writing/syncing to their devices and blocking on the device unplug.

The thread we are attempting to kill will look like this:

md9_resync    D 00000000  2904 14937     1 14934 (L-TLB)
ecf9be94 00000046 c011e83f 00000000 f20fb870 f62ef830 00000046 f7db2210
c0227baa 00000000 c1c1e7c0 c1c1dde0 00000001 00000000 5899d740 000f4aab
f7e210b0 ed028630 ed02879c c1f2e4cc c0120448 c1f2e4cc c1f2e400 ecf9bed0
Call Trace:
 [<c011e83f>] __wake_up+0x29/0x3c
 [<c0227baa>] __generic_unplug_device+0x14/0x2d
 [<c0120448>] prepare_to_wait+0x12/0x4c
 [<c0276fba>] md_do_sync+0x667/0x84f
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c0275dca>] md_thread+0x13b/0x168
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c02d8876>] ret_from_fork+0x6/0x14
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c0275c8f>] md_thread+0x0/0x168
 [<c01041f5>] kernel_thread_helper+0x5/0xb

The __wake_up is likely stale stack data; we are instead spinning in the __generic_unplug_device code.

What my backport of the upstream thread-handling changes showed is that we didn't miss the task exiting, so grabbing the task lock didn't help. However, releasing the mddev lock while waiting for the thread to exit would at least free up the ability to check /proc/mdstat; it would not solve the underlying problem. In other words, you would be able to see that things are still moving, but it would still take just as long for the operation to complete.

Which brings us to what's going on here. The problem is that the linux raid code was originally written to run on block devices, not LVM devices. It checked block devices to see what physical disk they belonged to, and didn't attempt to run more than one resync/repair operation on any given physical disk.
With this LVM setup, we are technically running on top of one physical disk, so all of our commands eventually filter down to a single block-device queue, yet the md stack doesn't know this. So it happily runs resyncs on all the md devices simultaneously. This means that when one resync thread calls unplug, the call works its way down the stack and eventually unplugs the physical queue. However, other md resync threads are still writing to their devices, so as fast as the unplug operation drains the queue, the other resync threads refill it. This keeps the threads stuck in the unplug section for an indeterminate amount of time.

I'll have to investigate possible means of solving this issue, but I don't think it will be easy, and I don't think upstream has even tackled the problem. The md stack simply wasn't designed to run over LVM volumes, and unless the person setting up the LVM volumes takes special care to specify the exact location of extents on different physical volumes, LVM can remap things such that the remapping defeats the purpose of having raid1 volumes in the first place.
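As a concrete example of the "special care" mentioned above: lvcreate accepts trailing physical-volume arguments that restrict where the extents are allocated, so the two legs of each raid1 device can be pinned to different disks. A hedged sketch only; the PV names /dev/sdb1 and /dev/sdc1, the volume group name, and the LV sizes are placeholders, and a DRY_RUN guard keeps everything as printed commands:

```shell
#!/bin/sh
# Hypothetical sketch: pin each raid1 leg to a distinct physical volume
# so both mirror halves cannot land on the same disk. /dev/sdb1 and
# /dev/sdc1 are placeholder PV names; DRY_RUN=1 only prints the commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

PV_A=/dev/sdb1
PV_B=/dev/sdc1

run vgcreate local "$PV_A" "$PV_B"
for i in $(seq 1 10); do
    # the trailing PV argument restricts allocation to that physical volume
    run lvcreate -L 512M -n "lv-$i-1" local "$PV_A"
    run lvcreate -L 512M -n "lv-$i-2" local "$PV_B"
    run mdadm --create "/dev/md$i" --level=1 --raid-devices=2 \
        "/dev/local/lv-$i-1" "/dev/local/lv-$i-2"
done
```

With the legs on separate PVs, the resync I/O for a given array no longer funnels into the same physical queue as its mirror, which is the condition the analysis above identifies.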
Excellent writeup Doug, thanks. This has been observed before: https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=79949
Since this bug was closed in IT by the reporter - see comment #22 - I am moving this BZ to "deferred".