Description of problem:
Started I/O to a simple mirror and then failed one of its legs. After the failure was detected, the I/O started up once again after about 30 seconds, but any lvm commands were stuck on an flock for about 3 minutes and 20 seconds.

Version-Release number of selected component (if applicable):
[root@taft-04 ~]# rpm -q lvm2
lvm2-2.02.06-6.0.RHEL4
[root@taft-04 ~]# rpm -q device-mapper
device-mapper-1.02.07-4.0.RHEL4
[root@taft-04 ~]# uname -ar
Linux taft-04 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
very
*** Bug 200176 has been marked as a duplicate of this bug. ***
Jon, please report here what you did for cluster mirroring for reference. Thanks
Removed unnecessary prints in the cluster mirror code that were being issued on every failed operation. The removed messages were flooding the console, slowing everything down.
One more thing that may prove insightful... If you fail a mirror device and then do a small amount of I/O, fail-over is almost instantaneous. However, if you load up the system with large writes (dd if=/dev/zero of=/dev/vg/lv bs=4M) and then fail the device, it takes a "long time" to fail over. It almost seems as though it is waiting for the writes to complete before doing the fault handling.
Which is obvious proof that we can't address this in the mirror code. IOW: large I/O requests have already been queued to LLDs and are subject to the retry/timeout handling therein, causing errors to be reported too late for the mirror code to disable access to the failed device. Say an LLD times out after 30s; the mirror code won't change the device state before an error returns that late. Until LLDs pay attention to the BIO_RW_FAILFAST flag, we can't do much about this. At least in the two-mirror situation, for obvious data-integrity reasons it is no good solution to change the code to report success back when just one leg returns immediately while the other *might* return an error much later.
I'm sorry, that's not what I meant. I should have been clearer... "long time" should be read "minutes". In fact, we are dealing with a similar issue in 231230. In cluster mirroring, clvmd can actually time out waiting for device operations (suspend/resume/activate/deactivate) to take place. My guess is that it is the same thing in single-machine mirroring. So, to be clear, the device failure is detected and dmeventd launches 'vgreduce', and _then_ it takes forever. Strangely enough, after the failure has been detected, the speed of the reconfiguration is still incredibly dependent on the amount of I/O to the system.
Jon, thanks for the clarification. Do we have evidence of 'vgreduce' trying to access the failed device in order to update its metadata, hence causing further timeouts? I.e., the list of PVs on which to update metadata includes the dead PV, which it shouldn't. The flock delay reported then looks like a consequence of vgreduce holding the lock until the I/O to the dead drive times out.
We certainly do. However, for my tests I have been issuing:

echo offline > /sys/block/sda/device/state

on each machine in order to kill the device, so the EIOs come back almost immediately. I'm pretty sure that device timeouts are not the issue, but since I don't know why the device-mapper operations are taking so long, I can't say for sure. I will try to rerun my test and capture the clvmd and messages output (along with system tap information). Although this would be a cluster mirror test, I believe it will clearly point out the delays in mirror fail-over, which should also prove useful for this single-machine bug.
Created attachment 150435 [details]
script that shows the problem

When the script says so, kill the device (in this case sda) by doing:

echo offline > /sys/block/sda/device/state

You will have to adapt the script to your environment. You will easily see how long the mirror takes to reconfigure - it would certainly time out in clvmd.
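The attachment itself is not shown inline here; a minimal sketch of the kind of reproduction it performs might look like the following. This is an assumption-heavy reconstruction, not the attached script: the VG/LV names, the leg devices sda1/sdb1, the log device sdc1, and the sizes are all hypothetical. The destructive steps are wrapped in functions and only run when explicitly invoked with "run":

```shell
#!/bin/bash
# Hypothetical reproduction sketch for the mirror fail-over delay.
# Device, VG, and LV names below are assumptions; adapt to your setup.
VG=vg
MIRROR_LV=lv_mirror
FAILED_DEV=sda            # the leg we will take offline
LOG_DEV=/dev/sdc1         # mirror log on a separate device

setup_mirror() {
    # Two-leg mirror with a disk log; legs on sda1/sdb1, log on sdc1.
    lvcreate -m 1 -L 4G -n "$MIRROR_LV" "$VG" /dev/sda1 /dev/sdb1 "$LOG_DEV"
}

start_load() {
    # Heavy sequential writes make the fail-over delay easy to observe.
    dd if=/dev/zero of="/dev/$VG/$MIRROR_LV" bs=4M &
    LOAD_PID=$!
}

fail_leg() {
    # Kill the device from below so EIOs come back almost immediately.
    echo offline > "/sys/block/$FAILED_DEV/device/state"
}

time_reconfig() {
    # Time how long reconfiguring the mirror takes under load.
    time vgreduce --removemissing "$VG"
}

# Guard: nothing destructive happens unless you pass "run".
if [ "$1" = "run" ]; then
    setup_mirror
    start_load
    sleep 10
    fail_leg
    time_reconfig
fi
```

Without load (skip start_load) the vgreduce should return quickly; with the dd running, the same reconfiguration takes minutes.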
Using system tap:

Entering: do_suspend(fd0d)        [dmeventd/25943]
Entering: mirror_presuspend()     [dmeventd/25943]
Exiting : mirror_presuspend()     [dmeventd/25943]
Entering: mirror_postsuspend()    [dmeventd/25943]
Exiting : mirror_postsuspend()    [dmeventd/25943]
Exiting : do_suspend()            [dmeventd/25943]  20414ms

The deactivate/activate commands are taking even longer, but I wasn't profiling those... However, we can see that do_suspend() takes 20 seconds. That's just one of many such operations. [Note: if the time is less than one ms, the time is not printed on exit.]
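The SystemTap script that produced the timings above isn't attached; a sketch of how such entry/exit timing could be captured follows. The dm_mod probe point name (do_suspend) matches the output above, but its availability and exact name in a given kernel build is an assumption; requires systemtap and kernel debuginfo to actually run:

```shell
#!/bin/bash
# Sketch: write a SystemTap script that times do_suspend() in dm_mod,
# mimicking the Entering/Exiting output quoted above.
STP=/tmp/dm_timing.stp
cat > "$STP" <<'EOF'
global entry_time

probe module("dm_mod").function("do_suspend") {
    entry_time[tid()] = gettimeofday_ms()
    printf("Entering: do_suspend() [%s/%d]\n", execname(), pid())
}

probe module("dm_mod").function("do_suspend").return {
    elapsed = gettimeofday_ms() - entry_time[tid()]
    /* Match the original output: only print the time if >= 1ms. */
    if (elapsed >= 1)
        printf("Exiting : do_suspend() [%s/%d] %dms\n",
               execname(), pid(), elapsed)
    else
        printf("Exiting : do_suspend() [%s/%d]\n", execname(), pid())
    delete entry_time[tid()]
}
EOF
# To run it (needs root, systemtap, and matching kernel debuginfo):
# stap "$STP"
```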
Devel ACK - delays are affecting cluster mirror recovery operations.
Jon, the mirror log should still be able to flush, because it sits on a different device (sdc in your example). I don't see a reason for the log suspension delay so far, unless the failed sda (and hence the timeout handling for requests in flight in the LLD) delays I/O on sdc as well because of some LLD stupidity. What's the configuration for sda-c (HBA, driver, paths etc.)?
It is not device timeouts. No more talk of timeouts for failed devices. If it were, the recovery would take (mostly) an equal amount of time whether under load or not. I've done system tap profiling, starting with suspend/resume/dtr/ctr. While those took a while, they didn't account for all the time being spent. So I bumped up to profiling ctl_ioctl; still not enough. I then bumped up to sys_ioctl; still not enough. If I had to guess, I would say that the load on the system is causing the delay between when the sys_ioctls are called and when they are run.
I think this is really part of a larger issue: LVM operations simply take too long under load. For example, I have 3 mirrors with file systems on top. If I put load on those 3 file systems, I get:

[root@neo-06 ~]# time lvs
  LV       VG         Attr   LSize   Origin Snap% Move Log      Copy%
  LogVol00 VolGroup00 -wi-ao  36.62G
  LogVol01 VolGroup00 -wi-ao 512.00M
  lv1      vg         Mwi-ao   4.00G                lv1_mlog 100.00
  lv2      vg         Mwi-ao   4.00G                lv2_mlog 100.00
  lv3      vg         Mwi-ao   4.00G                lv3_mlog 100.00

real 1m19.113s
user 0m0.093s
sys  0m0.026s

There is no failed device here... just I/O running. How is it that it takes over a minute for a simple lvm command to complete? Now imagine running the much more complex 'vgreduce --removemissing' that happens during a failure.
If you do an strace on the 'lvs', you can watch as the command gets stuck trying to do reads from the heavily loaded devices. If you want to fix this, you must find a way for important LVM operations to preempt the I/O of other operations (or somehow skip scanning _any_ devices).
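A sketch of the strace observation described above (hedged: the syscall filter and the awk post-filter are assumptions about how to surface the slow reads; older straces may trace open rather than openat, so plain open is used here). The actual run is guarded behind an explicit argument since it needs a live LVM setup:

```shell
#!/bin/bash
# Watch where lvs stalls: a timestamped syscall trace of its device
# scans, run while the dd load on the mirrors is active.
trace_lvs() {
    # -tt: microsecond timestamps; -T: time spent inside each syscall.
    # lvs scans block devices for PV labels, so under load the read()
    # calls against the busy devices dominate the wall-clock time.
    strace -tt -T -e trace=open,read,ioctl lvs 2>&1 |
        awk '$NF ~ /^</ { print }'   # crude: keep lines ending in <time>
}

# Guarded: only runs when explicitly requested.
if [ "$1" = "run" ]; then
    trace_lvs
fi
```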
Ok, those are plausible explanations. Device timeouts in the error case add to the delays and belong to the study of the whole issue. Following your assumption, the LVM2 library isn't receiving the appropriate priority to access/update its metadata on the PVs, because those are already heavily accessed. IOW: device queues are rather full. We've got two options to attack this:

a) keep a redundant copy of the metadata on some unloaded storage and update it on the PVs with a delay. This is an expensive approach, because we'd need to factor the update of the PV metadata out into a separate update daemon.

b) avoid overloaded PVs by throttling device I/O. This would be a much cheaper case (e.g. introduce semaphores as I/O counters) and hence be good for POC tests. Of course this quick approach has throughput implications.

BTW: as a quick test, you might put the metadata on a separate FS or on the log device only and check for a difference.
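The quick test suggested at the end can be driven with pvcreate's --metadatacopies option, which controls how many metadata areas each PV carries. A hedged sketch (the device layout, i.e. mirror legs on sda1/sdb1 and the log on sdc1, is an assumption carried over from the earlier comments); the destructive part only runs when invoked with "run":

```shell
#!/bin/bash
# Quick test: keep the LVM metadata off the heavily loaded mirror legs
# so metadata updates don't queue behind the mirror I/O.
# Device names are assumptions; adapt to your setup.
setup_metadata_on_log_only() {
    # The busy legs carry no metadata copies...
    pvcreate --metadatacopies 0 /dev/sda1 /dev/sdb1
    # ...the lightly loaded log device carries the only copy.
    pvcreate --metadatacopies 1 /dev/sdc1
    vgcreate vg /dev/sda1 /dev/sdb1 /dev/sdc1
}

# Guarded: destructive, only runs on request.
if [ "$1" = "run" ]; then
    setup_metadata_on_log_only
fi
```

With this layout, a repeat of the 'time lvs' / vgreduce measurements under load would show whether metadata access on busy PVs is the dominant cost.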
Still waiting for Jon's confirmation of my analysis. We need the outlined changes to the tools, and the appropriate stabilization, to do this properly. No short-term production-level fix is possible IMHO.
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
The plan is to review the remedy in RHEL 5 (see bug 235219). Later we can review whether a back-port is possible and whether we hit this often enough. Closing for RHEL 4.