Description of problem: In some cases (with some drivers), when you disconnect the fibre channel cable, the timeout for errors is high enough to produce a service outage. The member is not shot, and services do not fail over. This has ***NOT*** been reproduced internally, and is thus unverified.

Version-Release number of selected component (if applicable): 1.2.6, 1.2.7

How reproducible: Not known.
I have not been able to reproduce this on parallel SCSI RAID arrays.
From information I received, this may be a repeat of #112300. Please test 1.2.7 from the beta channel and post any results on the given hardware configuration.
Dell has tested 1.2.9; it does not solve this problem. (This is a different problem from #112300.) Blocked on hardware availability. In the meantime, is PowerPath enabled?
No, we are not yet testing with PowerPath.
Reproduced; CLARiiON (fibre-attached via QLA2312; qla2300 module). As it turns out, this problem is not Cluster Manager specific. The kernel sees the SCSI loop-down event, but no I/O errors are returned from the SCSI layer to processes waiting for data. The cluster service manager ends up stuck in disk-wait / task-uninterruptible state -- so it's completely dead. From user-land, it appears to be the same behavior as waiting for an NFS server to return...
I can work around this in user-land, but it could potentially eliminate some of the advantages of RHCM 1.2. Thus, it would have to be a tunable option.
I would think that by default, if a fibre cable got pulled/cut, you would want it to fail-over. What advantages would your fix eliminate?
I found a simpler workaround which works (and fixes a behavioral bug), but failover time is *big* on the above configuration (fibre timeout + failover time + service manager cycle time ~= 105-115 seconds... ouch!) Test packages:
http://people.redhat.com/lhh/clumanager-1.2.10-0.1.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.10-0.1.src.rpm
Created attachment 97277 [details] Patch from 1.2.9-1 to 1.2.10-0.1; fixes this bug
Suman tested this here yesterday and it did not successfully fail over in 1.2.10. However, the failover time, as you suggest, was about 100 seconds. Is there any way to bring this down?
Could you clarify: it did not successfully fail over, but the failover time was about 100 seconds?

Off the top of my head, reducing the time would require one of:

(a) a major workaround in the userspace code (read: feature enhancement / big changes), or
(b) changing the FC loop-down detection reporting mechanism in the kernel, or
(c) lowering the FC timeout.

Basically, (a) requires extra monitoring of the processes' states and killing the node after a user-defined timeout (something less than 90 seconds, otherwise it's pointless) if a process is truly stuck. After this user-defined timeout, the cluster still has to wait for the fail-over period. The obvious disadvantage is that we would significantly increase the risk of a "false shootdown" during moderate to high I/O loads on shared storage. I am not aware of the amount of work for (b) or the disadvantages of (c) - probably mostly device driver modifications.
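For what it's worth, the extra monitoring in option (a) could be sketched roughly like this: poll /proc/<pid>/stat and treat a process that stays in uninterruptible disk-wait ('D') across every poll as "truly stuck". This is only an illustrative sketch, not code from the patch; the function names (task_state, stuck_pids) are hypothetical.

```python
def task_state(stat_line):
    """Return the one-letter scheduler state from a /proc/<pid>/stat line.

    The state is the first field after the command name; the command name is
    wrapped in parentheses and may itself contain spaces, so we split on the
    *last* closing parenthesis rather than naively on whitespace.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    return rest[0]

def stuck_pids(snapshots):
    """Given a series of snapshots (each a list of (pid, stat_line) pairs),
    return the pids that were in disk-wait ('D') in every snapshot.

    A real monitor would take the snapshots some seconds apart and, if the
    result stays non-empty past a user-defined timeout, shoot the node."""
    stuck = None
    for snapshot in snapshots:
        d_now = {pid for pid, line in snapshot if task_state(line) == "D"}
        stuck = d_now if stuck is None else (stuck & d_now)
    return stuck or set()
```

Note the false-shootdown risk mentioned above: under heavy I/O load, healthy processes also sit in 'D' briefly, which is why the intersection across several spaced-out snapshots (rather than a single observation) matters.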
Clarification - "process is truly stuck" should be "process is stuck in disk-wait"
Sorry sorry. Yes, 1.2.10 DOES cause a fail-over to happen, though it does take 100 seconds.
Thanks - that's what I thought.
I will ensure that this patch makes its way into the next erratum, regardless of any additional fixes described above.
Lon, as far as an attempt to drop the fail-over time, how does this sound: we create a small new watchdog-like app which runs in the background from bootup. Its job is to determine which HBAs are used to connect to the storage and then monitor /var/log/messages for "SCSI(#): loop down" messages. When it sees that the last status for all HBAs is down, it waits X seconds, confirms that all of the HBAs are still down, and then invokes a reboot. In this way, we could set X to something like 30 seconds and have ultimate control over the fail-over time. Does this sound too loopy?
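The log-scanning part of that watchdog could look roughly like the sketch below. The message pattern is an assumption (qla2300 loop messages vary by driver version, so the regex would need to match the actual text in /var/log/messages), and the function names are hypothetical; a real daemon would tail the log, sleep X seconds after seeing all HBAs down, re-check, and only then reboot.

```python
import re

# ASSUMED message format -- adjust to what the qla2300 driver actually logs,
# e.g. lines resembling "kernel: scsi(0): LOOP DOWN detected".
LOOP_RE = re.compile(r"scsi\((\d+)\):.*LOOP (DOWN|UP)", re.IGNORECASE)

def last_loop_status(log_lines, hbas):
    """Track the most recent loop state seen for each monitored HBA number."""
    status = {hba: "UP" for hba in hbas}  # assume links start healthy
    for line in log_lines:
        m = LOOP_RE.search(line)
        if m and int(m.group(1)) in status:
            status[int(m.group(1))] = m.group(2).upper()
    return status

def all_hbas_down(log_lines, hbas):
    """True only if the *last* status seen for every monitored HBA is DOWN."""
    return all(s == "DOWN" for s in last_loop_status(log_lines, hbas).values())
```

Requiring the last status of *all* HBAs to be down (rather than any one) is what keeps a single-path flap from rebooting a node that still has a working path.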
That sounds like it would work great. I suspect it would work best to have the FC-watchdog start before the clumanager. Naturally, it would be better for the FC drivers to report errors immediately up through the SCSI midlayer to userland - but that's not a quick fix, and it would require lots of engineering and QA time. (Obvious Note: I can't give any input or review your code unless it's licensed and distributed under the terms of the GNU General Public License.)
Maybe this might be the sort of thing RH would like to look into one day, but for now I don't believe Dell will be pursuing this solution.
Verified.
Fixing product name: clumanager on RHEL3 was part of RHCS3, not RHEL3.