Red Hat Bugzilla – Bug 113226
Service outage on disconnect of fibre channel cable
Last modified: 2009-04-16 16:14:40 EDT
Description of problem:
In some cases (with some drivers), when you disconnect the fibre channel cable, the
timeout for errors is high enough to produce a service outage. The member is not
shot, and services do not fail over.
This has ***NOT*** been reproduced internally, and is thus unverified.
Version-Release number of selected component (if applicable): 1.2.6, 1.2.7
How reproducible: Not known.
I have not been able to reproduce this on parallel SCSI RAID arrays.
From information I received, this may be a repeat of #112300. Please
test 1.2.7 from the beta channel and post any results on the given
Dell has tested 1.2.9; it does not solve this problem. (This is a
different problem from #112300.)
Blocked on hardware availability.
In the meantime, is PowerPath enabled?
No, we are not yet testing with PowerPath.
Reproduced; Clariion (fibre-attached via QLA2312; qla2300 module).
As it turns out, this problem is not Cluster Manager specific.
The kernel sees the SCSI loop down event, but no I/O errors are
returned from the SCSI layer to processes waiting for data. The
cluster service manager ends up stuck in disk-wait /
task-uninterruptible state -- so it's completely dead.
From user-land, it appears to be the same behavior as waiting for an
NFS server to return...
I can work around this in user-land, but it could potentially
eliminate some of the advantages of RHCM 1.2. Thus, it would have to
be a tunable option.
I would think that by default, if a fibre cable got pulled/cut, you
would want it to fail-over. What advantages would your fix eliminate?
I found a simpler workaround which works (and fixes a behavioral bug),
but failover time is *big* on the above configuration (fibre timeout +
failover time + service manager cycle time ~= 105~115 seconds... ouch!)
Created attachment 97277 [details]
Patch from 1.2.9-1 to 1.2.10-0.1; fixes this bug
Suman tested this here yesterday and it did not successfully fail-over
in 1.2.10. However, the failover time as you suggest was about
100 seconds. Is there any way to bring this in?
Could you clarify: It did not successfully failover, but failover time was about 100
Off the top of my head, reducing the time would require one of:
(a) a major workaround by the userspace code (read: feature enhancement/big
(b) changing the FC loop detection reporting mechanism in the kernel, or
(c) lowering the FC timeout.
Basically, (a) requires extra monitoring of the processes' states and killing the node
after a user-defined timeout (something less than 90 seconds, otherwise it's pointless)
if a process is truly stuck. After this user-defined timeout, the cluster still has to wait
for the fail-over period. The obvious disadvantage is that we will significantly
increase the risk of a "false-shootdown" during moderate to high I/O loads to shared
I am not aware of the amount of work for (b) or disadvantages of (c) - probably
mostly device driver modifications.
"process is truly stuck" should be "process is stuck in disk-wait"
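For illustration only, option (a) could be sketched roughly as below in Python. This is not code from clumanager or the attached patch. It assumes the Linux /proc/<pid>/stat layout, where the field after the parenthesised command name is the process state and "D" means disk-wait. The names parse_state, read_state and DiskWaitTracker are hypothetical, as is the timeout value.

```python
def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is parenthesised and may contain spaces or ')' itself,
    # so split on the last ')' rather than on whitespace.
    return stat_line.rsplit(")", 1)[1].split()[0]


def read_state(pid):
    """Read the current state of a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


class DiskWaitTracker:
    """Tracks how long a process has been continuously in disk-wait ('D')."""

    def __init__(self, timeout):
        self.timeout = timeout  # hypothetical user-defined timeout, in seconds
        self.since = None       # time at which the current 'D' streak began

    def update(self, state, now):
        """Record one sample; return True once the streak reaches the timeout."""
        if state != "D":
            self.since = None   # any other state resets the streak
            return False
        if self.since is None:
            self.since = now
        return now - self.since >= self.timeout
```

A caller would sample read_state(pid) periodically, pass the result and the current time to update(), and treat a True return as "stuck in disk-wait". Brief "D" states are normal under heavy I/O, which is the false-shootdown risk described above, so the timeout would need to be chosen conservatively.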
Sorry sorry. Yes, 1.2.10 DOES cause a fail-over to happen, though it
does take 100 seconds.
Thanks - that's what I thought.
I will ensure that this patch makes its way into the next erratum, the presence of any
additional fixes as described above notwithstanding.
Lon, as far as an attempt to drop the fail-over time, how does this
We create a small new watchdog-like app, which runs in the background
upon bootup. Its job is to determine which HBAs are used to connect to
your storage and then monitor /var/log/messages for SCSI(#): loop
down messages. When it sees that the last status for all HBAs is
down, it waits X seconds, confirms that all of the HBAs are still
down, and then invokes a reboot.
In this way, we could set X to something like 30 seconds and have
ultimate control of the fail-over time. Does this sound too loopy?
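A minimal sketch of the decision logic for such a watchdog, in Python, might look like the following. It is not an existing tool. The log-line pattern is an assumption based on the "SCSI(#): loop down" description above and would need adjusting to what the driver actually logs. The LoopWatch class and its method names are hypothetical. Reading /var/log/messages and invoking the reboot are left to the caller.

```python
import re

# Assumed message shape, e.g. "kernel: scsi(0): LOOP DOWN detected".
LOOP_RE = re.compile(r"scsi\((\d+)\):.*\bloop (down|up)\b", re.IGNORECASE)


class LoopWatch:
    """Tracks the last loop status per HBA and how long all have been down."""

    def __init__(self, hbas, grace):
        if not hbas:
            raise ValueError("at least one HBA number is required")
        self.status = {h: "up" for h in hbas}  # last seen status per HBA
        self.grace = grace                     # the "X seconds" wait
        self.all_down_since = None

    def feed(self, line, now):
        """Process one log line observed at time `now` (seconds)."""
        m = LOOP_RE.search(line)
        if m and int(m.group(1)) in self.status:
            self.status[int(m.group(1))] = m.group(2).lower()
        if all(s == "down" for s in self.status.values()):
            if self.all_down_since is None:
                self.all_down_since = now
        else:
            self.all_down_since = None  # any HBA back up cancels the countdown

    def should_reboot(self, now):
        """True once every HBA has stayed down for at least `grace` seconds."""
        return (self.all_down_since is not None
                and now - self.all_down_since >= self.grace)
```

With grace set to 30, a caller would feed each new log line to feed() and poll should_reboot() about once a second. The countdown resets if any monitored HBA reports loop up in the meantime.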
That sounds like it would work great. I suspect it would work best to
have the FC-watchdog start before the clumanager.
Naturally, it would be better for the FC drivers to report errors
immediately up through the SCSI midlayer to userland - but that's not
a quick fix, and it would require lots of engineering and QA time.
(Obvious Note: I can't give any input or review your code unless it's
licensed and distributed under the terms of the GNU General Public
Maybe this might be the sort of thing RH would like to look into one
day, but for now I don't believe Dell will be pursuing this solution.
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.