Red Hat Bugzilla – Bug 199724
single node mirror leg failure causes > 3 min lvm deadlock
Last modified: 2007-11-30 17:07:26 EST
Description of problem:
Started I/O to a simple mirror and then failed one of its legs. After the
failure was detected, the I/O started up again after about 30 seconds, but
any lvm commands were stuck in a flock for about 3 minutes and 20 seconds.
Version-Release number of selected component (if applicable):
[root@taft-04 ~]# rpm -q lvm2
[root@taft-04 ~]# rpm -q device-mapper
[root@taft-04 ~]# uname -ar
Linux taft-04 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64
*** Bug 200176 has been marked as a duplicate of this bug. ***
Please report here what you did for cluster mirroring, for reference.
Removed unnecessary prints in the cluster mirror code that were being printed on
every failed operation. The removed messages were causing flooding on the
console... slowing everything down.
One more thing that may prove insightful...
If you fail a mirror device, and then do a small amount of I/O, fail-over is
almost instantaneous. However, if you load up the system with large writes (dd
if=/dev/zero of=/dev/vg/lv bs=4M), and then fail the device, it takes a "long
time" to fail-over. It almost seems as though it is waiting for the writes to
complete before doing the fault handling.
Which is obvious proof that we can't address this in the mirror code.
IOW: large IO requests have already been queued to LLDs and are subject to the
retry/timeout handling therein, causing errors to be reported too late for the
mirror code to disable access to the failed device.
Say an LLD times out after 30s; the mirror code won't change the device state
before an error returns that late. Until LLDs pay attention to the
BIO_RW_FAILFAST flag, we can't do much about this.
At least in the two-leg mirror situation, it is not a good solution to change
the code to report success as soon as one leg returns immediately while the
other *might* return an error much later, for obvious data integrity reasons.
I'm sorry, that's not what I meant. I should have been clearer...
"long time" should be read "minutes". In fact, we are dealing with a similar issue in 231230. In cluster
mirroring, clvmd can actually time-out waiting for device operations (suspend/resume/activate/
deactivate) to take place. My guess is that it is the same thing in single machine mirroring.
So, to be clear, the device failure is detected and dmeventd launches 'vgreduce' and _then_ it takes
forever. Strangely enough, after the failure has been detected, the speed of the reconfiguration is still
incredibly dependent on the amount of I/O to the system.
Jon, thanks for the clarification.
Do we have evidence of 'vgreduce' trying to access the failed device in order to
update its metadata and hence cause further timeouts? I.e. the list of PVs to
update metadata on includes the dead PV, which it shouldn't. The flock delay
reported then looks like a consequence of vgreduce holding the lock until the IO
to the dead drive times out.
We certainly do. However, for my tests I have been issuing:
echo offline > /sys/block/sda/device/state
on each machine in order to kill the device, so the EIOs come back almost
immediately.
I'm pretty sure that device timeouts are not the issue, but since I don't know
why the device-mapper operations are taking so long, I can't say for sure.
I will try to rerun my test and capture the clvmd and messages output (along
with system tap information). Although this would be a cluster mirror test, I
believe it will clearly point out the delays in mirror fail-over, which should
also prove useful for this single machine bug.
Created attachment 150435 [details]
script that shows the problem
When the script says so, kill the device (in this case sda) by doing
echo offline > /sys/block/sda/device/state
You will have to adapt the script to your environment.
You will easily see how long the mirror takes to reconfigure - it would
certainly timeout in clvmd.
Using system tap:
Entering: do_suspend(fd0d) [dmeventd/25943]
Entering: mirror_presuspend() [dmeventd/25943]
Exiting : mirror_presuspend() [dmeventd/25943]
Entering: mirror_postsuspend() [dmeventd/25943]
Exiting : mirror_postsuspend() [dmeventd/25943]
Exiting : do_suspend() [dmeventd/25943] 20414ms
The deactivate/activate commands are taking even longer, but I wasn't profiling
those... However, we can see that the do_suspend() takes 20 sec. That's just
for one of many of those operations. [Note, if the time is less than one ms,
the time is not printed on exit.]
Devel ACK - delays are affecting cluster mirror recovery operations.
the mirror log should still be able to flush, because it sits on a different
device (sdc in your example). I don't see a reason for the log suspension delay
so far, unless the failed sda (and hence the timeout handling for requests in
flight in the LLD) delays IOs on sdc as well because of some LLD stupidity.
What's the configuration for sda-c (HBA, driver, paths etc)?
It is not device timeouts. No more talk on timeouts for failed devices. If it
was, then the recovery would take (mostly) an equal amount of time whether under
load or not.
I've done system tap profiling... starting with suspend/resume/dtr/ctr. While
those took a while, it didn't account for all the time being spent. So, I bumped
up to profiling ctl_ioctl, still not enough... I then bumped up to sys_ioctl,
still not enough. If I had to guess, I would say that the load on the system is
causing delay between when sys_ioctls are called and when they are run.
I think this is really part of a larger issue. That issue is that LVM
operations simply take too long under load. For example, I have 3 mirrors, with
file systems on top. If I put load on those 3 file systems, I get:
[root@neo-06 ~]# time lvs
LV VG Attr LSize Origin Snap% Move Log Copy%
LogVol00 VolGroup00 -wi-ao 36.62G
LogVol01 VolGroup00 -wi-ao 512.00M
lv1 vg Mwi-ao 4.00G lv1_mlog 100.00
lv2 vg Mwi-ao 4.00G lv2_mlog 100.00
lv3 vg Mwi-ao 4.00G lv3_mlog 100.00
There is no failed device here... just I/O running. How is it that it takes > 1
minute for a simple lvm command to complete? Now imagine running the much more
complex 'vgreduce --removemissing' that happens during failure.
If you do an strace on the 'lvs' you can watch as the command gets stuck trying
to do reads to the heavily laden devices. If you want to fix this, you must
find a way for important LVM ops to preempt the I/O of other operations (or
somehow skip scanning _any_ devices.)
OK, those are plausible conclusions.
Device timeouts in the error case add to the delays and belong to the study of
the whole issue.
Following your assumption, the LVM2 library isn't receiving the appropriate
priority to access/update its metadata on the PVs, because those are already
heavily accessed. IOW: the device queues are rather full.
We've got 2 options to attack this:
a) have a redundant copy of the metadata on some unloaded storage and update it
on the PVs with a delay. This is an expensive approach, because we'd need to
factor the update of the PV metadata out into a separate update daemon.
b) avoid overloaded PVs by throttling device IO. This would be a much cheaper
case (eg. introduce semaphores as IO counters) and hence be good for POC
tests. Of course this quick approach has throughput implications.
BTW: as a quick test, you might put the metadata on a separate FS or on the log
device only and check for a difference.
Still waiting for Jon's confirmation of my analysis.
We need the outlined changes to the tools, plus the appropriate stabilization,
to do this properly. No short-term production-level fix is possible IMHO.
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time. This request will be
reviewed for a future Red Hat Enterprise Linux release.
The plan is to review the remedy in RHEL 5 (see bug 235219). Later we can
review and see if a back port is possible and if we hit this often enough.
Closing for RHEL 4