Description of problem:
The MD subsystem hangs during a full resync of all the configured mirrors. Afterwards the host is still running, but all access to /proc/mdstat or the raid1 mirrors gets stuck (D state).

Version-Release number of selected component (if applicable):
2.6.9-67

How reproducible:
Requires FlexiPlatform userspace software. Takes 5-30 minutes to reproduce on the given hardware.

Steps to Reproduce:
1. Run the full FlexiPlatform userspace with a vanilla (untainted) kernel. This means the raid1 mirrors have been configured on top of LVM volumes and device-mapper devices.
2. Manually break all the raid1 mirrors, then re-add the spares into the mirrors so that a parallel resync of all the configured mirrors occurs.
3. Before all the mirrors have resynced, the command 'cat /proc/mdstat' hangs, as does 'mount -l'.

Actual results:
The md subsystem hangs.

Expected results:
The raid1 mirror resync completes successfully.

Additional info:
The test system is a 64-bit HP ProLiant rackmount. Some investigation is already present in IT entry 171011.
Created attachment 299338 [details] sysrq-t from hang
So kjournald holds the mddev lock and is blocking in :dm_mod:dm_request __up_read
Note: pdflush is showing frame of :dm_mod:dm_request as well but with offset of +396 (return).
Customer has reproduced on U4 and U5 as well, ruling out the possibility that recent similar upstream bugs could have been backported to RHEL.
File uploaded: recreate_171011.tgz

This event sent from IssueTracker by sprabhu
issue 171011
it_file 128697
I used a Xen machine with an attached 20 G device to recreate the problem. The following steps have to be performed in order:

1) Create a new volume group /dev/local using the 20 G device attached to the machine.
2) Run the script lv.sh. This creates 20 logical volumes named lv-1-1, lv-1-2, lv-2-1, lv-2-2, etc., and then creates 10 raid1 devices in the following manner:
   md1: /dev/local/lv-1-1 /dev/local/lv-1-2
3) Run the script break_mirrors.sh. This is based on the code provided by Nokia, suitably modified for our setup. It runs the following steps in sequence for each of the raid devices:
   a) Mark the first device of the raid device faulty and remove it.
   b) Add the first device back to the raid device.
   c) Wait for resync.
4) On a different terminal, run the script start_stop_array.sh. This starts and stops md9 continuously.
5) On another terminal, run 'watch cat /proc/mdstat'. At some point, you will be unable to read /proc/mdstat. It never seems to recover from this point.

The script del_lv.sh can be used to break up all the raid devices and remove the logical volumes.

Sachin Prabhu
This event sent from IssueTracker by sprabhu
issue 171011
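The per-array fail/remove/re-add cycle from step 3 can be sketched with mdadm. This is a hedged reconstruction, not the actual break_mirrors.sh (which was not attached here); device names follow the md1: /dev/local/lv-1-1 layout above, and a DRY_RUN guard keeps the mdadm commands as printed output only.

```shell
#!/bin/sh
# Sketch of the break/re-add cycle from step 3 above. Hypothetical
# reconstruction of break_mirrors.sh; with DRY_RUN=1 (the default)
# the mdadm commands are printed rather than executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

break_and_readd() {
    md=$1; leg=$2
    # a) mark the first leg of the array faulty and remove it
    run mdadm --manage "$md" --fail "$leg"
    run mdadm --manage "$md" --remove "$leg"
    # b) add the leg back so a resync starts
    run mdadm --manage "$md" --add "$leg"
    # c) wait for the resync to finish
    run mdadm --wait "$md"
}

# one pass over all 10 arrays, matching the lv-N-1 naming above
for i in $(seq 1 10); do
    break_and_readd "/dev/md$i" "/dev/local/lv-$i-1"
done
```

Running several of these loops concurrently (one per array) is what produces the parallel resyncs the bug needs.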
In reply to comment #2 - I'm trying to understand how we're blocking in an up operation - that should always be non-blocking?
There's a spinlock within the semaphore - this was so long ago that I can't remember how I ended up thinking that might have been the case. I did try to document the trail of thought in the IT entry. If it's not obvious from there, I need to look at it again.
Duh.. yeah, that makes sense. I'll take another look at this - Sachin has a reproducer configured so I can use that to work on. Thanks,
Come to think of it, it didn't make sense for obvious reasons. It's not an R (running) task, so it's not spinning there (and wouldn't show much if it were). So that __up was almost certainly stale.
Should I take the comment #6 to mean that this can be reproduced without the flexiplatform software (whatever that is)?
In reply to comment #12: yes.
Doug,

This can be reproduced internally using the scripts provided. I have successfully reproduced this on a PV Xen machine and a VMware machine using standard RHEL software.

Sachin Prabhu
This event sent from IssueTracker by sprabhu
issue 171011
OK, I've replicated what I think people are seeing. It is worth noting, however, that when I replicated the problem and left the machine sitting long enough, it cleared up. In other words, the only hang I've been able to replicate is a temporary condition (note: I'm not saying it's short lived, just not permanent... in my test scenario it clears up within about 5 or 10 minutes, which I'm certain people consider too long for a production environment, but my point is that the machine isn't actually going kaput).

So, here's what I know so far. This happens when you attempt the STOP_ARRAY ioctl on a device while other devices are rebuilding. The more devices there are rebuilding, the more likely you are to hit the problem. When you stop an array that's in the middle of rebuilding with our RHEL4 kernel, this is the normal set of messages you see in the kernel log:

md: syncing RAID array md9
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 511936 blocks.
md: md9: sync done.   <=== sync isn't really done; this is because we are killing the sync thread in do_md_stop()
md: md9 stopped.
md: unbind<dm-17>
md: export_rdev(dm-17)
md: unbind<dm-16>
md: export_rdev(dm-16)
md: md9 stopped.

In this case, the hang happens right after the "sync done." message, and by all accounts the system isn't really hung; it's just waiting for the sync thread to signal that it has exited. The fact that do_md_stop() runs with the mddev locked, and the fact that it is hanging, is the direct cause of access to /proc/mdstat hanging. We could paper over the /proc/mdstat problem by releasing the mddev lock around the wait_for_completion() in md_unregister_thread().
In fact, I'm going to build a kernel with this change alone so that I can more readily confirm that the other md devices are still working normally and that only the device we are stopping is hanging up. Once I've confirmed or ruled that out, I'll move on to the fix for the device we are stopping, which I suspect may just be pulling in the upstream fix of grabbing the task_list lock before clearing mddev->run, so that the resync thread can't notice run == NULL and exit before we send it the signal to exit. Alternatively, I may backport the changes upstream made to convert the raid stack to use the kernel_thread helper functions instead of its own internal thread management. Either way, that appears to be the thing that needs fixing at the moment.
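Since the observed hang is temporary rather than permanent, a bounded read is a handy way to distinguish the two from a test harness instead of wedging a shell on 'cat /proc/mdstat'. A sketch, assuming coreutils timeout(1) is available; the path argument and timeout are parameterized so it isn't tied to /proc/mdstat:

```shell
#!/bin/sh
# Sketch: read a file with a timeout so a blocked read (reader stuck in
# D state) is detected and reported instead of hanging the shell.
# PROBE_TIMEOUT is an arbitrary limit, not anything from the md code.
PROBE_TIMEOUT=${PROBE_TIMEOUT:-5}

probe_mdstat() {
    path=${1:-/proc/mdstat}
    if timeout "$PROBE_TIMEOUT" cat "$path" >/dev/null 2>&1; then
        echo "read of $path completed"
    else
        echo "read of $path blocked or failed after ${PROBE_TIMEOUT}s"
    fi
}
```

Run in a loop alongside start_stop_array.sh, this shows whether /proc/mdstat reads ever complete again once the array stop has begun.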
The backport did not resolve the issue. This means that my second theory about the problem is probably correct. If you look at where each thread is in the kernel when the lockup happens, here is what you'll find.

All the raid1 threads will look like this:

md1_raid1     S 00000018  2964  6392     1  6435  5428 (L-TLB)
f5ad6f9c 00000046 f8a556ae 00000018 00000002 f67fd600 ec9f26e0 ec9f26c0
00000246 ec9f26e0 f8a56556 c1c1dde0 00000001 00000000 64b415c0 000f4acb
f7e210b0 f1f052b0 f1f0541c f735cd08 c0120448 f735cd00 00000000 f5ad6fb8
Call Trace:
 [<f8a556ae>] unplug_slaves+0x77/0xb8 [raid1]
 [<f8a56556>] raid1d+0x181/0x189 [raid1]
 [<c0120448>] prepare_to_wait+0x12/0x4c
 [<c0275d92>] md_thread+0x103/0x168
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c02d8876>] ret_from_fork+0x6/0x14
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c0275c8f>] md_thread+0x0/0x168
 [<c01041f5>] kernel_thread_helper+0x5/0xb

Aka, they'll all be sleeping in unplug_slaves().

All the resync threads (except the one we are attempting to stop) will look like this:

md2_resync    D F69344C0  2904 14718     1 14720 14716 (L-TLB)
edf38e04 00000046 f65da280 f69344c0 ed943400 f1fda704 00000046 f7db2210
c0227baa f7db2210 c0227bd8 c1c1dde0 00000001 000186a0 65006100 000f4acb
f7e210b0 f62ef830 f62ef99c 00000002 f601bf80 f601bfa4 f601bf80 0009c180
Call Trace:
 [<c0227baa>] __generic_unplug_device+0x14/0x2d
 [<c0227bd8>] generic_unplug_device+0x15/0x21
 [<f8a558fd>] device_barrier+0x173/0x1a4 [raid1]
 [<c011e7b9>] default_wake_function+0x0/0xc
 [<c0160759>] bio_add_page+0x1a/0x1e
 [<f8a56643>] sync_request+0x9c/0x4b5 [raid1]
 [<c01174b6>] smp_apic_timer_interrupt+0x9a/0x9c
 [<c02d93b6>] apic_timer_interrupt+0x1a/0x20
 [<c0227bd8>] generic_unplug_device+0x15/0x21
 [<f8a2ecfb>] dm_table_unplug_all+0x24/0x2b [dm_mod]
 [<c0276d44>] md_do_sync+0x3f1/0x84f
 [<c011cd4a>] recalc_task_prio+0x128/0x133
 [<c011cddd>] activate_task+0x88/0x95
 [<c0275dca>] md_thread+0x13b/0x168
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c02d8876>] ret_from_fork+0x6/0x14
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c0275c8f>] md_thread+0x0/0x168
 [<c01041f5>] kernel_thread_helper+0x5/0xb

Aka, they are actively writing/syncing to their devices and blocking on the device unplug.

The thread we are attempting to kill will look like this:

md9_resync    D 00000000  2904 14937     1 14934 (L-TLB)
ecf9be94 00000046 c011e83f 00000000 f20fb870 f62ef830 00000046 f7db2210
c0227baa 00000000 c1c1e7c0 c1c1dde0 00000001 00000000 5899d740 000f4aab
f7e210b0 ed028630 ed02879c c1f2e4cc c0120448 c1f2e4cc c1f2e400 ecf9bed0
Call Trace:
 [<c011e83f>] __wake_up+0x29/0x3c
 [<c0227baa>] __generic_unplug_device+0x14/0x2d
 [<c0120448>] prepare_to_wait+0x12/0x4c
 [<c0276fba>] md_do_sync+0x667/0x84f
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c0275dca>] md_thread+0x13b/0x168
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c02d8876>] ret_from_fork+0x6/0x14
 [<c012051d>] autoremove_wake_function+0x0/0x2d
 [<c0275c8f>] md_thread+0x0/0x168
 [<c01041f5>] kernel_thread_helper+0x5/0xb

The __wake_up is likely stale stack data; we are instead spinning in the __generic_unplug_device code.

What my backport of the upstream thread-handling changes showed is that we didn't miss the task exiting, so grabbing the task lock didn't help. However, releasing the mddev lock while waiting for the thread to exit would at least free up the ability to check /proc/mdstat; it would not solve the underlying problem. In other words, you would be able to see that things are still moving, but it would still take just as long for the operation to complete.

Which brings us to what's going on here. The problem is that the linux raid code was originally written to run on block devices, not LVM devices. It checked block devices to see what physical disk they belonged to, and didn't attempt to run more than one resync/repair operation on any given physical disk.
With this LVM setup, we are technically running on top of one physical disk, so all of our commands eventually filter down to a single block-device queue, yet the md stack doesn't know this. So it happily runs resyncs on all the md devices simultaneously. This means that when one resync thread calls unplug, the call works its way down the stack and eventually unplugs the physical queue. However, other md resync threads are still writing to their devices, so as fast as the unplug operation drains the queue, the other resync threads refill it. This keeps the threads stuck in the unplug section for an indeterminate amount of time.

I'll have to investigate possible means of solving this issue, but I don't think it will be easy, and I don't think upstream has even tackled the problem. The md stack simply wasn't designed to run over LVM volumes, and unless the person setting up the LVM volumes takes special care to specify the exact location of extents on different physical volumes, LVM can remap things such that the remapping defeats the purpose of having raid1 volumes in the first place.
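As a concrete example of the "special care" mentioned above: lvcreate accepts trailing physical-volume arguments that restrict where the extents are allocated, so the two legs of each raid1 device can be pinned to different disks. A hedged sketch only; the PV names /dev/sdb1 and /dev/sdc1, the volume group name, and the LV sizes are placeholders, and a DRY_RUN guard keeps everything as printed commands:

```shell
#!/bin/sh
# Hypothetical sketch: pin each raid1 leg to a distinct physical volume
# so both mirror halves cannot land on the same disk. /dev/sdb1 and
# /dev/sdc1 are placeholder PV names; DRY_RUN=1 only prints the commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

PV_A=/dev/sdb1
PV_B=/dev/sdc1

run vgcreate local "$PV_A" "$PV_B"
for i in $(seq 1 10); do
    # the trailing PV argument restricts allocation to that physical volume
    run lvcreate -L 512M -n "lv-$i-1" local "$PV_A"
    run lvcreate -L 512M -n "lv-$i-2" local "$PV_B"
    run mdadm --create "/dev/md$i" --level=1 --raid-devices=2 \
        "/dev/local/lv-$i-1" "/dev/local/lv-$i-2"
done
```

With the legs on separate PVs, the resync I/O for a given array no longer funnels into the same physical queue as its mirror, which is the condition the analysis above identifies.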
Excellent writeup Doug, thanks. This has been observed before: https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=79949
Since this bug was closed in IT by the reporter - see comment #22 - I am moving this BZ to "deferred".