When a mirror device fails under heavy load, each CLVM command can take minutes to complete. This can trigger clvmd time-outs and cause the mirror fault-handling code to abort. The root of the problem is that remote nodes must scan the devices when doing activates/deactivates. Those scans get queued up behind all the other I/O that is in flight and simply take a long time. Either we need to find a completely different way to detect stalled machines, or we need to raise the clvmd timeout (e.g. clvmd -t 100). I'm advocating the latter for now and the former when we have more time to investigate.
To be clear about my request: Let's increase the clvmd timeout.
Index: LVM2/scripts/clvmd_init_rhel4
===================================================================
--- LVM2.orig/scripts/clvmd_init_rhel4
+++ LVM2/scripts/clvmd_init_rhel4
@@ -15,7 +15,7 @@
 VGCHANGE="/usr/sbin/vgchange"
 VGSCAN="/usr/sbin/vgscan"
 VGDISPLAY="/usr/sbin/vgdisplay"
 VGS="/usr/sbin/vgs"
-CLVMDOPTS="-T20"
+CLVMDOPTS="-T20 -t 90"
 [ -f /etc/sysconfig/cluster ] && . /etc/sysconfig/cluster
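In the patched script, /etc/sysconfig/cluster is sourced after CLVMDOPTS is assigned, so a site that needs a longer timeout should be able to override the new default there without editing the init script. A minimal sketch of that ordering follows. It assumes the script passes $CLVMDOPTS to clvmd unchanged; the temporary file stands in for /etc/sysconfig/cluster, and the value 120 is purely illustrative.

```shell
#!/bin/sh
# Sketch only: SYSCONFIG is a hypothetical stand-in for /etc/sysconfig/cluster.
SYSCONFIG=$(mktemp)

# Illustrative site override; 120 is an assumed value, not a recommendation.
echo 'CLVMDOPTS="-T20 -t 120"' > "$SYSCONFIG"

# Default as set by the patched init script.
CLVMDOPTS="-T20 -t 90"

# Same sourcing pattern as the init script: the later assignment wins.
[ -f "$SYSCONFIG" ] && . "$SYSCONFIG"

echo "clvmd would be started with: $CLVMDOPTS"
rm -f "$SYSCONFIG"
```

If the override file does not set CLVMDOPTS, the value from the init script ("-T20 -t 90") is left in place.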
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0046.html