Created attachment 1210338 [details]
failover and failback effect seen from the client
Description of problem:
A Windows 2012r2 client is connected to a two-gateway environment with a single LUN. When the gateway owning the LUN is powered off, I/O on the client pauses for over a minute. This is too long for most applications.
Version-Release number of selected component (if applicable):
Each test shows the same outcome
Steps to Reproduce:
1. Windows 2012r2 (maybe RHEL too?) client using one LUN
2. Use fio to generate load on the LUN from the client
3. Power off the gateway node that 'owns' the LUN
Actual results:
Failover time is > 1 minute
Expected results:
Path failover should complete within 20-30 seconds
Additional info:
There may be some best practice changes for Windows that we need to adopt
If Linux is slow to fail over too, could you attach the kernel logs from the initiator and gateway machines?
If only Windows is slow, could you just attach the Windows System event log? Open "Event Viewer" -> "Windows Logs" -> "System", save those events as an .evtx file, and attach it here.
Also if only windows is slow, could you add the output of
Created attachment 1210362 [details]
Failover and failback with a RHEL client
PS C:\Users\Administrator> get-MPIOSetting
PathVerificationState : Disabled
PathVerificationPeriod : 30
PDORemovePeriod : 20
RetryCount : 3
RetryInterval : 1
UseCustomPathRecoveryTime : Disabled
CustomPathRecoveryTime : 40
DiskTimeoutValue : 60
Also added the event records
Looks like the disk timeout (DiskTimeoutValue) is the main cause. Is it defaulting to 60 seconds?
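To make the comparison concrete, here is a minimal sketch that parses the Get-MPIOSetting output pasted above and checks DiskTimeoutValue against the 30 second upper end of the expected failover window. The helper names (parse_mpio_settings, disk_timeout_exceeds_target) and the TARGET_SECONDS constant are made up for this illustration; it only reads pasted text and does not query Windows.

```python
# Sample copied from the Get-MPIOSetting output in this report.
SAMPLE = """\
PathVerificationState : Disabled
PathVerificationPeriod : 30
PDORemovePeriod : 20
RetryCount : 3
RetryInterval : 1
UseCustomPathRecoveryTime : Disabled
CustomPathRecoveryTime : 40
DiskTimeoutValue : 60
"""

# Upper end of the 20-30 second failover expectation stated in this report.
TARGET_SECONDS = 30


def parse_mpio_settings(text):
    """Turn 'Name : Value' lines into a dict (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        # Skip lines without a colon, and lines such as a shell prompt
        # whose left-hand side is not a plain setting name.
        if not sep or not name.isidentifier():
            continue
        settings[name] = int(value) if value.isdigit() else value
    return settings


def disk_timeout_exceeds_target(settings, target=TARGET_SECONDS):
    """True when the disk timeout alone is longer than the failover target."""
    return settings["DiskTimeoutValue"] > target


if __name__ == "__main__":
    parsed = parse_mpio_settings(SAMPLE)
    print(parsed["DiskTimeoutValue"], disk_timeout_exceeds_target(parsed))
```

With the values in this report the disk timeout on its own is already longer than the whole expected failover window.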
Created attachment 1210382 [details]
windows event records
FYI - Linux (RHEL) failover was fine
The issue could not be reproduced in Mike's lab, so it is likely environmental in nature. Mike is investigating further.
It looks like Paul is hitting the worst case, where the command times out. The initiator does not detect the iSCSI-level connection failure, so we have to go through SCSI-level recovery. The timeouts we hit are:
DiskTimeout (60 secs) + SRB timeout (15) + Task Management (20) + Link Down (15) = 110 secs
I think we can safely lower the Disk and SRB Timeout. We cannot control the task management timeout. I think Link Down is pretty low already.
DiskTimeout (25 secs) + SRB timeout (5) + Task Management (20) + Link Down (15)
This still puts us at 65 seconds.
We can get it sub 60 seconds by also setting EnableNOPOut = 1. This will detect the iSCSI target is down, so we do not have to go through the scsi task management process.
To change the disk timeout, set:
DiskTimeout = 25
To change the iSCSI settings, set:
EnableNOPOut = 1
SRBTimeoutDelta = 5
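The worst-case arithmetic in this comment can be checked with a short sketch. It assumes the four stages expire strictly one after another, which is the worst case described above; the dictionary keys are just labels for the stages, not registry value names, and worst_case is a name invented for this example.

```python
# Per-stage timeouts in seconds, as quoted in this comment.
DEFAULTS = {
    "DiskTimeout": 60,
    "SRB timeout": 15,
    "Task Management": 20,
    "Link Down": 15,
}

# Proposed values: only the disk and SRB timeouts are lowered.
# Task Management cannot be controlled and Link Down is left alone.
PROPOSED = {**DEFAULTS, "DiskTimeout": 25, "SRB timeout": 5}


def worst_case(stages):
    """Sum the stages, assuming each one runs to expiry in sequence."""
    return sum(stages.values())


if __name__ == "__main__":
    print("default :", worst_case(DEFAULTS), "secs")   # 110
    print("proposed:", worst_case(PROPOSED), "secs")   # 65
```

The 45 second saving comes entirely from the two values we can lower; the remaining 35 seconds (Task Management + Link Down) is why the total stays above 60.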
(In reply to Mike Christie from comment #9)
> We can get it sub 60 seconds by also setting EnableNOPOut = 1. This will
> detect the iSCSI target is down, so we do not have to go through the scsi
> task management process.
It looks like this is not always true. With IO in flight the worst case timeout is going to be 65 secs with the values suggested in the previous comment.
Just an update.
Hemanth, there are actually two bugs you will want to test for.
1. For a clean shutdown using reboot, the target does not cleanly shut down connections, so we can end up hitting the DiskTimeout. If we cleanly shut down connections, we can get the failover time to around 20-30 seconds, which is closer to Linux.
I am working on a patch for this.
2. For the unclean shutdown, the worst case seems to be the DiskTimeout expiring, as documented in comment #9.
2.A For general DiskTimeout errors, we should lower the values to the ones in comment #9.
2.B Those settings will help the unclean shutdown case, but I am also looking for a TCP timer that might help.
I will update the initiator setup doc for 2.A and 2.B.
Patch was merged in version ceph-iscsi-config-1.3-1.el7cp.
Created attachment 1219556 [details]
Failover on N/W Failure
Failed the primary GW node's network and the failover happened within 15 secs.
Failover is not taking 60 secs now.
Refer to the attachment for the Performance Monitor stats on Windows.
Will update the same after reboot.
Created attachment 1219557 [details]
Failover on Reboot
Rebooted the primary GW node and the failover happened within 30 secs.
Refer to the attached screenshot.
Moving to Verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.