Bug 113226 - Service outage on disconnect of fibre channel cable
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: clumanager
Version: 3
Hardware: All
OS: Linux
Priority: medium
Severity: high
Assigned To: Lon Hohberger
Depends On:
Blocks: 107563
 
Reported: 2004-01-09 16:51 EST by Lon Hohberger
Modified: 2009-04-16 16:14 EDT
CC: 5 users

Doc Type: Bug Fix
Last Closed: 2004-03-19 14:47:51 EST


Attachments
Patch from 1.2.9-1 to 1.2.10-0.1; fixes this bug (2.50 KB, patch)
2004-01-27 10:04 EST, Lon Hohberger

Description Lon Hohberger 2004-01-09 16:51:34 EST
Description of problem: 
 
In some cases (with some drivers), when you disconnect the fibre channel cable, the 
timeout for errors is high enough to produce a service outage.  The member is not 
shot, and services do not fail over. 
 
This has ***NOT*** been reproduced internally, and is thus unverified. 
 
 
Version-Release number of selected component (if applicable): 1.2.6, 1.2.7 
How reproducible: Not known.
Comment 3 Lon Hohberger 2004-01-09 16:55:27 EST
I have not been able to reproduce this on parallel SCSI RAID arrays. 
Comment 5 Lon Hohberger 2004-01-12 11:43:36 EST
From information I received, this may be a repeat of #112300.  Please
test 1.2.7 from the beta channel and post any results on the given
hardware configuration.
Comment 6 Lon Hohberger 2004-01-23 12:48:08 EST
Dell has tested 1.2.9; it does not solve this problem.  (This is a
different problem from #112300.)

Blocked on hardware availability.

In the meantime, is PowerPath enabled?
Comment 7 Gary Lerhaupt 2004-01-23 12:50:42 EST
No, we are not yet testing with PowerPath.
Comment 8 Lon Hohberger 2004-01-23 13:25:31 EST
Reproduced; Clariion (fibre-attached via QLA2312; qla2300 module).

As it turns out, this problem is not Cluster Manager specific.

The kernel sees the SCSI loop down event, but no I/O errors are
returned from the SCSI layer to processes waiting for data.  The
cluster service manager ends up stuck in disk-wait /
task-uninterruptible state -- so it's completely dead.  

From user-land, it appears to be the same behavior as waiting for an
NFS server to return...
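
[A minimal illustration of what "stuck in disk-wait" looks like from
user-land: the process state reported in /proc/<pid>/stat is "D"
(uninterruptible sleep), and nothing short of the I/O completing will
move it.  This sketch is illustrative only and is not part of
clumanager.]

    #!/usr/bin/env python
    # Sketch: report a process's one-letter state code from
    # /proc/<pid>/stat.  "D" means uninterruptible sleep (disk-wait),
    # the stuck condition described above.  Not part of clumanager.
    import sys

    def proc_state(pid):
        # /proc/<pid>/stat looks like: "1234 (comm) D ppid ...".
        # The comm field may contain spaces, so split after the
        # closing parenthesis to locate the state field reliably.
        data = open("/proc/%d/stat" % pid).read()
        return data.rsplit(")", 1)[1].split()[0]

    if __name__ == "__main__":
        pid = int(sys.argv[1])
        state = proc_state(pid)
        print("pid %d is in state %s" % (pid, state))
        if state == "D":
            print("stuck in disk-wait; even kill -9 will not help")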
Comment 9 Lon Hohberger 2004-01-23 13:30:02 EST
I can work around this in user-land, but it could potentially
eliminate some of the advantages of RHCM 1.2.  Thus, it would have to
be a tunable option.
Comment 10 Gary Lerhaupt 2004-01-23 14:21:46 EST
I would think that by default, if a fibre cable got pulled/cut, you 
would want it to fail over.  What advantages would your fix eliminate?
Comment 11 Lon Hohberger 2004-01-23 16:18:51 EST
I found a simpler workaround which works (and fixes a behavioral bug),
but failover time is *big* on the above configuration (fibre timeout +
failover time + service manager cycle time ~= 105~115 seconds... ouch!)

Test packages:

http://people.redhat.com/lhh/clumanager-1.2.10-0.1.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.10-0.1.src.rpm
Comment 13 Lon Hohberger 2004-01-27 10:04:16 EST
Created attachment 97277 [details]
Patch from 1.2.9-1 to 1.2.10-0.1; fixes this bug
Comment 14 Gary Lerhaupt 2004-01-27 10:13:24 EST
Suman tested this here yesterday and it did not successfully fail-
over in 1.2.10.  However, the failover time was, as you suggested, about 
100 seconds.  Is there any way to bring this in?
Comment 15 Lon Hohberger 2004-01-27 11:08:01 EST
Could you clarify: It did not successfully failover, but failover time was about 100 
seconds? 
 
Off the top of my head, reducing the time would require one of: 
(a) a major workaround by the userspace code (read: feature enhancement/big 
changes), or 
(b) changing the FC loop detection reporting mechanism in the kernel, or 
(c) lowering the FC timeout. 
 
Basically, (a) requires extra monitoring of the processes' states and killing the node 
after a user-defined timeout (something less than 90 seconds, otherwise it's pointless) 
if a process is truly stuck. After this user-defined timeout, the cluster still has to wait 
for the fail-over period.  The obvious disadvantage is that we will significantly 
increase the risk of a "false-shootdown" during moderate to high I/O loads to shared 
storage. 
 
I am not aware of the amount of work for (b) - probably mostly device 
driver modifications - or of the disadvantages of (c). 
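
[For reference, option (c) on the configuration above would mean
shortening the qla2300 driver's port-down handling.  A hypothetical
/etc/modules.conf entry follows, assuming the driver build in use
exposes a qlport_down_retry parameter; the parameter name and units
vary between driver versions, so check the driver's README before
relying on this.]

    # /etc/modules.conf -- hypothetical sketch for option (c).
    # Assumes this qla2300 build accepts qlport_down_retry; verify
    # against the driver's own documentation.  Lowers how long the
    # driver retries a downed port before failing I/O up the SCSI
    # layer, at the cost of less tolerance for transient loop drops.
    options qla2300 qlport_down_retry=5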
Comment 16 Lon Hohberger 2004-01-27 11:11:11 EST
Clarification -  
 
"process is truly stuck" should be "process is stuck in disk-wait" 
Comment 17 Gary Lerhaupt 2004-01-27 11:33:21 EST
Sorry sorry.  Yes, 1.2.10 DOES cause a fail-over to happen, though it 
does take 100 seconds.
Comment 18 Lon Hohberger 2004-01-27 11:34:55 EST
Thanks - that's what I thought. 
Comment 19 Lon Hohberger 2004-01-27 11:39:10 EST
I will ensure that this patch makes its way into the next erratum, whether or not any of 
the additional fixes described above are also made. 
Comment 20 Gary Lerhaupt 2004-02-25 15:37:33 EST
Lon, as far as an attempt to drop the fail-over time goes, how does this 
sound:

We create a small new watchdog-like app, which runs in the background 
upon bootup.  Its job is to determine which HBAs are used to connect to 
your storage and then monitor /var/log/messages for SCSI(#): loop 
down messages.  When it sees that the last status for all HBAs is 
down, it waits X seconds, confirms that all of the HBAs are still 
down, and then invokes a reboot.  

In this way, we could set X to something like 30 seconds and have 
ultimate control of the fail-over time.  Does this sound too loopy?
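
[A rough sketch of that proposal, under stated assumptions: the HBA
numbers are known up front, and the driver logs lines like
"scsi(0): ... LOOP DOWN" / "LOOP UP" to /var/log/messages.  The exact
message text varies by driver version, and a real version would also
have to survive log rotation and start before clumanager, as noted in
the next comment.]

    #!/usr/bin/env python
    # Sketch of the proposed FC watchdog: follow /var/log/messages,
    # track per-HBA loop state from the driver's loop down/up
    # messages, and reboot (self-fence) if every HBA stays down for
    # GRACE seconds.  HBA numbers and message format are assumptions.
    import os, re, time

    HBAS = {0: True, 1: True}   # HBA number -> loop currently up
    GRACE = 30                  # seconds all loops must stay down
    DOWN_RE = re.compile(r"scsi\((\d+)\):.*loop down", re.I)
    UP_RE = re.compile(r"scsi\((\d+)\):.*loop up", re.I)

    def main():
        log = open("/var/log/messages")
        log.seek(0, 2)                  # start at the end of the log
        all_down_since = None
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)           # no new data yet; poll again
            else:
                for regex, up in ((DOWN_RE, False), (UP_RE, True)):
                    m = regex.search(line)
                    if m and int(m.group(1)) in HBAS:
                        HBAS[int(m.group(1))] = up
            if not any(HBAS.values()):  # every monitored loop is down
                if all_down_since is None:
                    all_down_since = time.time()
                elif time.time() - all_down_since >= GRACE:
                    os.system("/sbin/reboot -fn")   # self-fence
            else:
                all_down_since = None

    if __name__ == "__main__":
        main()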
Comment 21 Lon Hohberger 2004-02-26 09:40:43 EST
That sounds like it would work great. I suspect it would work best to
have the FC-watchdog start before clumanager.

Naturally, it would be better for the FC drivers to report errors
immediately up through the SCSI midlayer to userland - but that's not
a quick fix, and it would require lots of engineering and QA time.

(Obvious Note: I can't give any input or review your code unless it's
licensed and distributed under the terms of the GNU General Public
License.)
Comment 22 Gary Lerhaupt 2004-03-02 12:47:31 EST
This might be the sort of thing RH would like to look into one 
day, but for now I don't believe Dell will be pursuing this solution.
Comment 24 Suzanne Hillman 2004-03-19 14:47:51 EST
Verified. 
Comment 25 Lon Hohberger 2007-12-21 10:10:33 EST
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3
