199724 – single node mirror leg failure causes > 3 min lvm deadlock

Bug 199724 - single node mirror leg failure causes > 3 min lvm deadlock

Summary: single node mirror leg failure causes > 3 min lvm deadlock

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	lvm2
Sub Component:
Version:	4.4
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Heinz Mauelshagen
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	200176 (view as bug list)
Depends On:
Blocks:	235219
TreeView+	depends on / blocked

Reported:	2006-07-21 16:04 UTC by Corey Marthaler
Modified:	2007-11-30 22:07 UTC (History)
CC List:	11 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-09-07 20:27:23 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
script that shows the problem (1.16 KB, text/plain) 2007-03-19 21:22 UTC, Jonathan Earl Brassow	no flags	Details
View All

Description Corey Marthaler 2006-07-21 16:04:42 UTC

Description of problem:
Started I/O to a simple mirror and then failed one of it's legs. After the
failure was detected the I/O started up once again after about 30 seconds but
any lvm commands were stuck in an flock for about 3 minutes and 20 seconds. 

Version-Release number of selected component (if applicable):
[root@taft-04 ~]# rpm -q lvm2
lvm2-2.02.06-6.0.RHEL4
[root@taft-04 ~]# rpm -q device-mapper
device-mapper-1.02.07-4.0.RHEL4
[root@taft-04 ~]# uname -ar
Linux taft-04 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64
x86_64 GNU/Linux


How reproducible:
very

Comment 6 Jonathan Earl Brassow 2006-12-05 18:06:01 UTC

*** Bug 200176 has been marked as a duplicate of this bug. ***

Comment 9 Heinz Mauelshagen 2007-03-01 09:04:24 UTC

Jon,

please report here what you did for cluster mirroring for reference.

Thanks

Comment 10 Jonathan Earl Brassow 2007-03-01 15:39:47 UTC

Removed unnecessary prints in the cluster mirror code that were being printed on
every failed operation.  The removed messages were causing flooding on the
console... slowing everything down.

Comment 11 Jonathan Earl Brassow 2007-03-01 15:44:24 UTC

One more thing that may prove insightful...

If you fail a mirror device, and then do a small amount of I/O, fail-over is
almost instantanious.   However, if you load up the system with large writes (dd
if=/dev/zero of=/dev/vg/lv bs=4M), and then fail the device, it takes a "long
time" to fail-over.  It almost seems as though it is waiting for the writes to
complete before doing the fault handling.

Comment 12 Heinz Mauelshagen 2007-03-14 12:31:12 UTC

Which is an obvious prove that we can't address this in the mirror code.

IOW: Large IO requests got already queued to LLDs and subject to retries/timeout
handling therin, causing errors being reported too late for the mirror code to
disable access to the failed device.

Say an LLD timeouts after 30s, the mirror code wont change the device state
before an error returns that late. Until LLDs pay attention to the
BIO_RW_FAILFAST flag, we can't do much about this.
At least in the 2 mirror situation, it is no good solution to change the code to
allow reporting success back while just one leg returns immediately and the
other *might* return error much later for obvious data integrity reasons.

Comment 13 Jonathan Earl Brassow 2007-03-14 14:46:09 UTC

I'm sorry, that's not what I meant.  I should have been clearer...

"long time" should be read "minutes".  In fact, we are dealing with a similar issue in 231230.  In cluster 
mirroring, clvmd can actually time-out waiting for device operations (suspend/resume/activate/
deactivate) to take place.  My guess is that it is the same thing in single machine mirroring.

So, to be clear, the device failure is detected and dmeventd launches 'vgreduce' and _then_ it takes 
forever.  Strangely enough, after the failure has been detected, the speed of the reconfiguration is still 
incredibly dependent on the amount of I/O to the system.

Comment 14 Heinz Mauelshagen 2007-03-14 16:04:13 UTC

Jon, thanks for the clarification.

Do we have evidence of 'vgreduce' trying to access the failed device in order to
update its metadata and hence cause further timeouts? Ie. the list of PVs to
update metadata on includes the dead PV, which it shouldn't. The flock delay
reported then looks like a consequence of vgreduce holding the lock until the IO
to the dead drive times out.

Comment 15 Jonathan Earl Brassow 2007-03-15 01:54:06 UTC

We certainly do.  However, for my tests I have been issuing:
  echo offline > /sys/block/sda/device/state
on each machine in order to kill the device, so the EIOs come back almost
immediately.

I'm pretty sure that device timeouts are not the issue, but since I don't know
why the device-mapper operations are taking so long, I can't say for sure.

I will try to rerun my test and capture the clvmd and messages output (along
with system tap information).  Although this would be a cluster mirror test, I
believe it will clearly point out the delays in mirror fail-over, which should
also prove useful for this single machine bug.

Comment 16 Jonathan Earl Brassow 2007-03-19 21:22:24 UTC

Created attachment 150435 [details]
script that shows the problem

When the script says so, kill the device (in this case sda) by doing
echo offline > /sys/block/sda/device/state

You will have to adapt the script to your environment.

You will easily see how long the mirror takes to reconfigure - it would
certainly timeout in clvmd.

Comment 17 Jonathan Earl Brassow 2007-03-19 21:25:59 UTC

Using system tap:

Entering:      do_suspend(fd0d)                         [dmeventd/25943]
Entering:      mirror_presuspend()                      [dmeventd/25943]
Exiting :      mirror_presuspend()                      [dmeventd/25943]
Entering:      mirror_postsuspend()                     [dmeventd/25943]
Exiting :      mirror_postsuspend()                     [dmeventd/25943]
Exiting :      do_suspend()                             [dmeventd/25943] 20414ms

The deactivate/activate commands are taking even longer, but I wasn't profiling
those...  However, we can see that the do_suspend() takes 20 sec.  That's just
for one of many of those operations.  [Note, if the time is less than one ms,
the time is not printed on exit.]

Comment 18 Kiersten (Kerri) Anderson 2007-03-20 14:36:15 UTC

Devel ACK - delays are affecting cluster mirror recovery operations.

Comment 19 Heinz Mauelshagen 2007-03-22 17:57:01 UTC

Jon,
the mirror log should still be able to flush, because it sits on a different
dervice (sdc in your example). Don't see a reason for the log suspension delay
so far unless the failed sda and hence the timeout handling for requets in
flight in the LLD delays IOs on sdc as well because of some LLD stupidity.
What's the configuration for sda-c (HBA, driver, paths etc)?

Comment 20 Jonathan Earl Brassow 2007-03-22 20:44:43 UTC

It is not device timeouts.  No more talk on timeouts for failed devices.  If it
was, then the recovery would take (mostly) an equal amount of time whether under
load or not.

I've done system tap profiling... starting with suspend/resume/dtr/ctr.  While
those took a while, it didn't acount for all the time being spent.  So, I bumped
up to profiling ctl_ioctl, still not enough...  I then bumped up to sys_ioctl,
still not enough.  If I had to guess, I would say that the load on the system is
causing delay between when sys_ioctls are called and when they are run.

Comment 21 Jonathan Earl Brassow 2007-03-22 21:30:04 UTC

I think this is really part of a larger issue.  That issue is that LVM
operations simply take too long under load.  For example, I have 3 mirrors, with
file systems on top.  If I put load on those 3 file systems, I get:

[root@neo-06 ~]# time lvs
  LV       VG         Attr   LSize   Origin Snap%  Move Log      Copy%
  LogVol00 VolGroup00 -wi-ao  36.62G
  LogVol01 VolGroup00 -wi-ao 512.00M
  lv1      vg         Mwi-ao   4.00G                    lv1_mlog 100.00
  lv2      vg         Mwi-ao   4.00G                    lv2_mlog 100.00
  lv3      vg         Mwi-ao   4.00G                    lv3_mlog 100.00

real    1m19.113s
user    0m0.093s
sys     0m0.026s

There is no failed device here... just I/O running.  How is it that it takes > 1
minute for a simple lvm command to complete?  Now imagine running the much more
complex 'vgreduce --removemissing' that happens durring failure.

Comment 22 Jonathan Earl Brassow 2007-03-22 21:34:46 UTC

If you do an strace on the 'lvs' you can watch as the command gets stuck trying
to do reads to the heavily laden devices.  If you want to fix this, you must
find a way for important LVM ops to preempt the I/O of other operations (or
somehow skip scanning _any_ devices.)

Comment 23 Heinz Mauelshagen 2007-03-23 08:50:43 UTC

Ok, that are practibable realizations.
Device timeouts in the error case add to the delays and belong to the study of
the hole issue.

Following your assumption, the LVM2 lib isn't receiving the appropriate prioity
to access/update its metadata on the PVs, because those are already heavily
accessed. IOW: device queues are rather full.

We've got 2 options to attack tkis:
a) have a redundant copy of the metadata on some unloaded storage and update it
   delayed on the PVs. This is an expensive approach, because we'ld need to
   factor the update of the PV metadata out into a seperate update daemon.

b) avoid overloaded PVs by throttling device IO. This would be a much cheaper
   case (eg. introduce sempahores as IO counters) and hence be good for POC
   tests. Of course this quick approach has throughput implications.

BTW: as a quick test, you might put metadata on a seperate FS or on the log
device only and check for a difference.

Comment 24 Heinz Mauelshagen 2007-04-04 09:10:36 UTC

Waiting for Jons confirmation of my analyisis still.
We need the outlined changes to the tools and the appropriate stabilization to
do this proper. No short term production level fix possible IMHO.

Comment 26 RHEL Program Management 2007-09-07 19:44:08 UTC

This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 27 Rob Kenna 2007-09-07 20:27:23 UTC

The plan is to review the remedy in RHEL 5 (see bug 235219).  Later we can
review and see if a back port is possible and if we hit this often enough. 
Closing for RHEL 4

Note You need to log in before you can comment on or make changes to this bug.