Bug 439200
| Summary: | md subsystem hang |
|---|---|
| Product: | Red Hat Enterprise Linux 4 |
| Component: | kernel |
| Version: | 4.6 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED DEFERRED |
| Severity: | high |
| Priority: | urgent |
| Reporter: | Janne Karhunen <jkarhune> |
| Assignee: | Doug Ledford <dledford> |
| QA Contact: | Martin Jenner <mjenner> |
| CC: | bmr, pcfe, peterm, tao |
| Target Milestone: | rc |
| Doc Type: | Bug Fix |
| Bug Blocks: | 447953 |
| Last Closed: | 2008-05-27 19:20:08 UTC |
Description
Janne Karhunen 2008-03-27 15:02:48 UTC
Created attachment 299338 [details]
sysrq-t from hang
So kjournald holds the mddev lock and is blocking in :dm_mod:dm_request __up_read. Note: pdflush is showing a frame of :dm_mod:dm_request as well, but with an offset of +396 (return). The customer has reproduced this on U4 and U5 as well, ruling out the possibility that recent similar upstream bugs could have been backported to RHEL.

File uploaded: recreate_171011.tgz
This event sent from IssueTracker by sprabhu issue 171011 it_file 128697

I used a Xen machine with an attached 20 G device to recreate the problem. The following steps have to be performed in order:
1) Create a new volume group /dev/local using the 20 G device attached to the machine.
2) Run the script lv.sh. This creates 20 logical volumes named lv-1-1, lv-1-2, lv-2-1, lv-2-2, etc. It then creates 10 raid1 devices in the following manner:
md1: /dev/local/lv-1-1 /dev/local/lv-1-2
3) Run the script break_mirrors.sh. This is based on the code provided by Nokia, suitably modified for our setup. It runs the following steps in sequence for each of the raid devices:
a) Mark the first device of the raid device faulty and remove it.
b) Add the first device back to the raid device.
c) Wait for resync.
4) On a different terminal, run the script start_stop_array.sh. This starts and stops md9 continuously.
5) On another terminal, run watch cat /proc/mdstat.
At some point, you will be unable to read /proc/mdstat. It never seems to recover from this point. The script del_lv.sh can be used to break up all the raid devices and remove the logical volumes.
Sachin Prabhu
This event sent from IssueTracker by sprabhu issue 171011

In reply to comment #2 - I'm trying to understand how we're blocking in an up operation - that should always be non-blocking?

There's a spinlock within the semaphore - but this was so long ago that I can't remember how I ended up thinking that may have been the case. I did try to document the train of thought in the IT entry. If it's not obvious from there, I need to look at it again.. Duh..
yeah, that makes sense. I'll take another look at this - Sachin has a reproducer configured so I can use that to work on. Thanks.

Come to think of it, it didn't make sense for obvious reasons. It's not an R task, so it's not spinning there (and would not show much if it was). So that __up was most certainly stale.

Should I take comment #6 to mean that this can be reproduced without the flexiplatform software (whatever that is)?

In reply to comment #12: yes.

Doug,
This can be reproduced internally using the scripts provided. I have successfully reproduced this on a PV Xen machine and a VMware machine using standard RHEL software.
Sachin Prabhu
This event sent from IssueTracker by sprabhu issue 171011

OK, I've replicated what I think people are seeing. It is worth noting, however,
that when I replicated the problem and left the machine sitting long enough, it
cleared up. In other words, the only hang I've been able to replicate is a
temporary condition (note: I'm not saying it's short-lived, just not
permanent... in my test scenario it clears up within about 5 or 10 minutes, which
I'm certain people consider too long for a production environment, but my point
is that the machine isn't actually going kaput).
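As an aside, the lv.sh / break_mirrors.sh layout from Sachin's reproduction steps can be sketched as a command generator. The actual scripts were not attached to this bug, so the volume size and exact mdadm flags below are assumptions, purely for illustration:

```python
def repro_commands(vg="local", pairs=10, lv_size="1G"):
    """Generate the lvcreate/mdadm sequence the lv.sh script is
    described as running (size and flags are guesses, not from the
    real script)."""
    cmds = []
    for i in range(1, pairs + 1):
        legs = []
        for j in (1, 2):
            name = f"lv-{i}-{j}"
            cmds.append(f"lvcreate -n {name} -L {lv_size} {vg}")
            legs.append(f"/dev/{vg}/{name}")
        cmds.append(f"mdadm --create /dev/md{i} --level=1 "
                    f"--raid-devices=2 {legs[0]} {legs[1]}")
    return cmds

def break_mirror_commands(vg="local", i=1):
    """Steps (a) and (b) of break_mirrors.sh for one array: fail and
    remove the first leg, then re-add it (the resync wait is omitted)."""
    leg = f"/dev/{vg}/lv-{i}-1"
    return [
        f"mdadm /dev/md{i} --fail {leg}",
        f"mdadm /dev/md{i} --remove {leg}",
        f"mdadm /dev/md{i} --add {leg}",
    ]
```

Printing `repro_commands()` yields the setup sequence for md1 through md10; combined with a start/stop loop on md9 and a `watch cat /proc/mdstat`, that matches the recipe above.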
So, here's what I know so far. This happens when you are attempting to do the
STOP_ARRAY ioctl on the device while other devices are rebuilding. The more
devices there are rebuilding, the more likely you are to hit the problem. When
you stop an array that's in the middle of rebuilding with our RHEL4 kernel, this
is the normal set of messages you see in the kernel log:
md: syncing RAID array md9
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 511936 blocks.
md: md9: sync done. <=== sync isn't really done, this is because we are killing
the sync thread in do_md_stop()
md: md9 stopped.
md: unbind<dm-17>
md: export_rdev(dm-17)
md: unbind<dm-16>
md: export_rdev(dm-16)
md: md9 stopped.
In this case, the hang happens right after the sync done. message, and by all
accounts the system isn't really hung, it's just waiting for the sync thread to
signal that it has exited. The fact that do_md_stop() runs with the mddev
locked, and the fact that it is hanging, is the direct cause of access to
/proc/mdstat hanging. We could paper over the proc/mdstat problem by releasing
the mddev lock around the wait_for_completion() in md_unregister_thread. In
fact, I'm going to build a kernel with this change alone so that I can more
readily confirm that the other md devices are still working normally and only
the device we are stopping is hanging up. Once I've either confirmed or denied
that, then I'll move on to the fix for the device we are stopping, which I
suspect may just be pulling in the upstream fix of grabbing the task_list lock
before clearing mddev->run so that the resync thread can't notice run == NULL
and exit before we send it the signal to exit. Alternatively, I may backport
the changes upstream did to convert the raid stack to use the kernel_thread
helper functions instead of its own internal thread management. Either way,
that appears to be the thing that needs to be fixed at the moment.
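The /proc/mdstat side effect described above can be mimicked in a few lines of user-space Python. This is a toy analogue, not the kernel code: `threading.Lock` stands in for the mddev lock and `Thread.join()` for the `wait_for_completion()` in `md_unregister_thread`:

```python
import threading
import time

def demo(release_during_wait: bool) -> bool:
    """Return True if a /proc/mdstat-style reader could take the lock
    while stop_array() was waiting for the resync thread to exit."""
    mddev_lock = threading.Lock()

    def resync_thread():
        time.sleep(0.5)  # stands in for a resync thread slow to exit

    def stop_array():
        mddev_lock.acquire()            # do_md_stop() runs with mddev locked
        t = threading.Thread(target=resync_thread)
        t.start()
        if release_during_wait:
            mddev_lock.release()        # proposed: drop lock around the wait
            t.join()                    # wait_for_completion() analogue
            mddev_lock.acquire()
        else:
            t.join()                    # current behaviour: wait while locked
        mddev_lock.release()

    stopper = threading.Thread(target=stop_array)
    stopper.start()
    time.sleep(0.1)                     # let stop_array() reach its wait
    reader_got_lock = mddev_lock.acquire(timeout=0.1)
    if reader_got_lock:
        mddev_lock.release()
    stopper.join()
    return reader_got_lock
```

`demo(False)` models the current behaviour (the reader blocks for as long as the thread takes to exit); `demo(True)` models dropping the lock around the wait, which unblocks readers without making the stop itself any faster.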
The backport did not resolve the issue. This means that my second theory about
the problem is probably correct. If you look at where each thread is in the
kernel when the lockup happens, here is what you'll find.
All the raid1 threads will look like this:
md1_raid1 S 00000018 2964 6392 1 6435 5428 (L-TLB)
f5ad6f9c 00000046 f8a556ae 00000018 00000002 f67fd600 ec9f26e0 ec9f26c0
00000246 ec9f26e0 f8a56556 c1c1dde0 00000001 00000000 64b415c0 000f4acb
f7e210b0 f1f052b0 f1f0541c f735cd08 c0120448 f735cd00 00000000 f5ad6fb8
Call Trace:
[<f8a556ae>] unplug_slaves+0x77/0xb8 [raid1]
[<f8a56556>] raid1d+0x181/0x189 [raid1]
[<c0120448>] prepare_to_wait+0x12/0x4c
[<c0275d92>] md_thread+0x103/0x168
[<c012051d>] autoremove_wake_function+0x0/0x2d
[<c02d8876>] ret_from_fork+0x6/0x14
[<c012051d>] autoremove_wake_function+0x0/0x2d
[<c0275c8f>] md_thread+0x0/0x168
[<c01041f5>] kernel_thread_helper+0x5/0xb
Aka, they'll all be sleeping in unplug_slaves()
All the resync threads (except the one we are attempting to stop) will look like
this:
md2_resync D F69344C0 2904 14718 1 14720 14716 (L-TLB)
edf38e04 00000046 f65da280 f69344c0 ed943400 f1fda704 00000046 f7db2210
c0227baa f7db2210 c0227bd8 c1c1dde0 00000001 000186a0 65006100 000f4acb
f7e210b0 f62ef830 f62ef99c 00000002 f601bf80 f601bfa4 f601bf80 0009c180
Call Trace:
[<c0227baa>] __generic_unplug_device+0x14/0x2d
[<c0227bd8>] generic_unplug_device+0x15/0x21
[<f8a558fd>] device_barrier+0x173/0x1a4 [raid1]
[<c011e7b9>] default_wake_function+0x0/0xc
[<c0160759>] bio_add_page+0x1a/0x1e
[<f8a56643>] sync_request+0x9c/0x4b5 [raid1]
[<c01174b6>] smp_apic_timer_interrupt+0x9a/0x9c
[<c02d93b6>] apic_timer_interrupt+0x1a/0x20
[<c0227bd8>] generic_unplug_device+0x15/0x21
[<f8a2ecfb>] dm_table_unplug_all+0x24/0x2b [dm_mod]
[<c0276d44>] md_do_sync+0x3f1/0x84f
[<c011cd4a>] recalc_task_prio+0x128/0x133
[<c011cddd>] activate_task+0x88/0x95
[<c0275dca>] md_thread+0x13b/0x168
[<c012051d>] autoremove_wake_function+0x0/0x2d
[<c02d8876>] ret_from_fork+0x6/0x14
[<c012051d>] autoremove_wake_function+0x0/0x2d
[<c0275c8f>] md_thread+0x0/0x168
[<c01041f5>] kernel_thread_helper+0x5/0xb
Aka, they are actively writing/syncing to their device and blocking on the
device unplug.
The thread we are attempting to kill will look like this:
md9_resync D 00000000 2904 14937 1 14934 (L-TLB)
ecf9be94 00000046 c011e83f 00000000 f20fb870 f62ef830 00000046 f7db2210
c0227baa 00000000 c1c1e7c0 c1c1dde0 00000001 00000000 5899d740 000f4aab
f7e210b0 ed028630 ed02879c c1f2e4cc c0120448 c1f2e4cc c1f2e400 ecf9bed0
Call Trace:
[<c011e83f>] __wake_up+0x29/0x3c
[<c0227baa>] __generic_unplug_device+0x14/0x2d
[<c0120448>] prepare_to_wait+0x12/0x4c
[<c0276fba>] md_do_sync+0x667/0x84f
[<c012051d>] autoremove_wake_function+0x0/0x2d
[<c012051d>] autoremove_wake_function+0x0/0x2d
[<c0275dca>] md_thread+0x13b/0x168
[<c012051d>] autoremove_wake_function+0x0/0x2d
[<c02d8876>] ret_from_fork+0x6/0x14
[<c012051d>] autoremove_wake_function+0x0/0x2d
[<c0275c8f>] md_thread+0x0/0x168
[<c01041f5>] kernel_thread_helper+0x5/0xb
The wake_up is likely stale stack data and we are instead spinning in the
__generic_unplug_device code.
What my backport of the kernel changes showed is that we didn't miss the task
exiting, so grabbing the task lock didn't help. However, releasing the mddev
lock while waiting for the thread to exit would at least free up the
ability to check /proc/mdstat; it would not solve the underlying problem.
In other words, you would be able to see that things are still moving, but it
would still take just as long for the operation to complete.
Which brings us to what's going on here. The problem is that the linux raid
code was originally written to be run on block devices, not lvm devices. It
checked block devices to see what physical disk they belonged to, and didn't
attempt to run more than one resync/repair operation on any given physical disk.
With this lvm setup, we are technically running on top of one physical disk, so
all of our commands eventually filter down to a single block device queue, yet
the md stack doesn't know this. So, it happily runs resyncs on all the md
devices simultaneously. This means that when one resync thread calls unplug, it
works its way down the stack, and eventually unplugs the physical queue.
However, other md resync threads are still writing to their devices, so as fast
as the unplug operation is attempting to drain the queue, other resync threads
are refilling it. This keeps the threads stuck in the unplug section for an
indeterminate amount of time.
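The queue-refill explanation above can be illustrated with a tiny discrete-time model. All the numbers here are made up; the only point is the producers-versus-drain arithmetic:

```python
def ticks_to_drain(producers, fill_per_tick=8, drain_per_tick=16,
                   start_depth=64, max_ticks=1000):
    """Discrete-time sketch of one physical request queue being drained
    by an unplug while 'producers' other resync threads keep refilling
    it. Returns the tick at which the queue empties, or None if it
    never drains within the window."""
    depth = start_depth
    for tick in range(1, max_ticks + 1):
        depth += producers * fill_per_tick      # other resyncs refill
        depth = max(0, depth - drain_per_tick)  # the unplug drains
        if depth == 0:
            return tick
    return None
```

With no other resync running, the queue drains in a few ticks; with one other producer it takes twice as long; and once the producers refill as fast as the unplug drains, it never empties, which is the indeterminate stall the traces show.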
I'll have to investigate possible means of solving this issue, but I don't think
it will be easy and I don't think upstream has even tackled the problem. The md
stack simply wasn't designed to run over lvm volumes and unless the person
setting up the lvm volumes takes special care to specify the exact location of
extents on different physical volumes, lvm can remap things such that the
remapping defeats the purpose of having raid1 volumes in the first place.
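For completeness, the "special care" mentioned above usually amounts to passing an explicit physical volume to lvcreate so the two legs of each raid1 pair land on different spindles. A hypothetical generator (the PV names /dev/sdb and /dev/sdc and the sizes are invented for illustration):

```python
def placed_lv_commands(vg="local", pvs=("/dev/sdb", "/dev/sdc"),
                       pairs=10, size="1G"):
    """Generate lvcreate commands that pin each mirror leg to a
    different PV, so the legs of each raid1 pair share no disk."""
    cmds = []
    for i in range(1, pairs + 1):
        for j, pv in enumerate(pvs[:2], start=1):
            # trailing PV argument restricts allocation to that disk
            cmds.append(f"lvcreate -n lv-{i}-{j} -L {size} {vg} {pv}")
    return cmds
```

With a layout like this, each resync pair at least writes to two distinct physical queues, though it does not change the fact that many arrays can still share the same pair of disks.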
Excellent writeup Doug, thanks.

This has been observed before: https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=79949

Since this bug was closed in IT by the reporter - see comment #22 - I am moving this BZ to "deferred".