Description of problem: In some cases (with some drivers), when you disconnect the fibre channel cable, the timeout for errors is high enough to produce a service outage. The member is not shot, and services do not fail over. This has ***NOT*** been reproduced internally, and is thus unverified.

Version-Release number of selected component (if applicable): 1.2.6, 1.2.7

How reproducible: Not known.
I have not been able to reproduce this on parallel SCSI RAID arrays.
From information I received, this may be a repeat of #112300. Please test 1.2.7 from the beta channel and post any results on the given hardware configuration.
Dell has tested 1.2.9; it does not solve this problem. (This is a different problem from #112300.) Blocked on hardware availability. In the meantime, is PowerPath enabled?
No, we are not yet testing with PowerPath.
Reproduced; CLARiiON (fibre-attached via QLA2312; qla2300 module). As it turns out, this problem is not Cluster Manager specific. The kernel sees the SCSI loop-down event, but no I/O errors are returned from the SCSI layer to processes waiting for data. The cluster service manager ends up stuck in disk-wait / task-uninterruptible state -- so it's completely dead. From user-land, it appears to be the same behavior as waiting for an NFS server to return...
I can work around this in user-land, but it could potentially eliminate some of the advantages of RHCM 1.2. Thus, it would have to be a tunable option.
I would think that by default, if a fibre cable got pulled/cut, you would want it to fail-over. What advantages would your fix eliminate?
I found a simpler workaround which works (and fixes a behavioral bug), but failover time is *big* on the above configuration (fibre timeout + failover time + service manager cycle time ~= 105-115 seconds... ouch!) Test packages:
http://people.redhat.com/lhh/clumanager-1.2.10-0.1.i386.rpm
http://people.redhat.com/lhh/clumanager-1.2.10-0.1.src.rpm
Created attachment 97277 [details] Patch from 1.2.9-1 to 1.2.10-0.1; fixes this bug
Suman tested this here yesterday and it did not successfully fail over in 1.2.10. However, the failover time, as you suggest, was about 100 seconds. Is there any way to bring this down?
Could you clarify: it did not successfully fail over, but the failover time was about 100 seconds?

Off the top of my head, reducing the time would require one of:

(a) a major workaround in the userspace code (read: feature enhancement / big changes), or
(b) changing the FC loop-down detection reporting mechanism in the kernel, or
(c) lowering the FC timeout.

Basically, (a) requires extra monitoring of the processes' states and killing the node after a user-defined timeout (something less than 90 seconds, otherwise it's pointless) if a process is truly stuck. After this user-defined timeout, the cluster still has to wait for the fail-over period. The obvious disadvantage is that we would significantly increase the risk of a "false shootdown" during moderate to high I/O loads on shared storage. I am not aware of the amount of work for (b) or the disadvantages of (c) - probably mostly device driver modifications.
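For what it's worth, the extra monitoring in option (a) could be sketched roughly like this: poll /proc/<pid>/stat and treat a process that stays in uninterruptible disk-wait ('D') across every poll as "truly stuck". This is only an illustrative sketch, not code from the patch; the function names (task_state, stuck_pids) are hypothetical.

```python
def task_state(stat_line):
    """Return the one-letter scheduler state from a /proc/<pid>/stat line.

    The state is the first field after the command name; the command name is
    wrapped in parentheses and may itself contain spaces, so we split on the
    *last* closing parenthesis rather than naively on whitespace.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    return rest[0]

def stuck_pids(snapshots):
    """Given a series of snapshots (each a list of (pid, stat_line) pairs),
    return the pids that were in disk-wait ('D') in every snapshot.

    A real monitor would take the snapshots some seconds apart and, if the
    result stays non-empty past a user-defined timeout, shoot the node."""
    stuck = None
    for snapshot in snapshots:
        d_now = {pid for pid, line in snapshot if task_state(line) == "D"}
        stuck = d_now if stuck is None else (stuck & d_now)
    return stuck or set()
```

Note the false-shootdown risk mentioned above: under heavy I/O load, healthy processes also sit in 'D' briefly, which is why the intersection across several spaced-out snapshots (rather than a single observation) matters.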
Clarification - "process is truly stuck" should be "process is stuck in disk-wait"
Sorry sorry. Yes, 1.2.10 DOES cause a fail-over to happen, though it does take 100 seconds.
Thanks - that's what I thought.
I will ensure that this patch makes its way into the next erratum, regardless of any additional fixes described above.
Lon, as far as an attempt to drop the fail-over time, how does this sound: we create a small new watchdog-like app which runs in the background from bootup. Its job is to determine which HBAs are used to connect to the storage and then monitor /var/log/messages for "SCSI(#): loop down" messages. When it sees that the last status for all HBAs is down, it waits X seconds, confirms that all of the HBAs are still down, and then invokes a reboot. In this way, we could set X to something like 30 seconds and have ultimate control over the fail-over time. Does this sound too loopy?
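The log-scanning part of that watchdog could look roughly like the sketch below. The message pattern is an assumption (qla2300 loop messages vary by driver version, so the regex would need to match the actual text in /var/log/messages), and the function names are hypothetical; a real daemon would tail the log, sleep X seconds after seeing all HBAs down, re-check, and only then reboot.

```python
import re

# ASSUMED message format -- adjust to what the qla2300 driver actually logs,
# e.g. lines resembling "kernel: scsi(0): LOOP DOWN detected".
LOOP_RE = re.compile(r"scsi\((\d+)\):.*LOOP (DOWN|UP)", re.IGNORECASE)

def last_loop_status(log_lines, hbas):
    """Track the most recent loop state seen for each monitored HBA number."""
    status = {hba: "UP" for hba in hbas}  # assume links start healthy
    for line in log_lines:
        m = LOOP_RE.search(line)
        if m and int(m.group(1)) in status:
            status[int(m.group(1))] = m.group(2).upper()
    return status

def all_hbas_down(log_lines, hbas):
    """True only if the *last* status seen for every monitored HBA is DOWN."""
    return all(s == "DOWN" for s in last_loop_status(log_lines, hbas).values())
```

Requiring the last status of *all* HBAs to be down (rather than any one) is what keeps a single-path flap from rebooting a node that still has a working path.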
That sounds like it would work great. I suspect it would work best to have the FC-watchdog start before the clumanager. Naturally, it would be better for the FC drivers to report errors immediately up through the SCSI midlayer to userland - but that's not a quick fix, and it would require lots of engineering and QA time. (Obvious Note: I can't give any input or review your code unless it's licensed and distributed under the terms of the GNU General Public License.)
Maybe this might be the sort of thing RH would like to look into one day, but for now I don't believe Dell will be pursuing this solution.
Verified.
Fixing product name: clumanager on RHEL3 was part of RHCS3, not RHEL3.