Bug 1384748
| Summary: | iSCSI failover time is too long when a gateway is shutdown | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Paul Cuzner <pcuzner> |
| Component: | RBD | Assignee: | Mike Christie <mchristi> |
| Status: | CLOSED ERRATA | QA Contact: | Hemanth Kumar <hyelloji> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.0 | CC: | ceph-eng-bugs, hnallurv, kdreyer |
| Target Milestone: | rc | | |
| Target Release: | 2.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-iscsi-config-1.3-1.el7cp | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-22 19:32:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1379890 | | |
| Attachments: | | | |

Description
Paul Cuzner
2016-10-14 04:33:44 UTC
If Linux is slow to fail over too, could you attach the kernel logs from the initiator and GW machines? If only Windows is slow, could you just attach the Windows system event viewer logs: open "Event Viewer" -> "Windows Logs" -> "System", save those events as a .evtx file, and attach it here. Also, if only Windows is slow, could you add the output of Get-MPIOSetting from PowerShell?

Created attachment 1210362 [details]
Failover and failback with a RHEL client
PS C:\Users\Administrator> get-MPIOSetting
PathVerificationState : Disabled
PathVerificationPeriod : 30
PDORemovePeriod : 20
RetryCount : 3
RetryInterval : 1
UseCustomPathRecoveryTime : Disabled
CustomPathRecoveryTime : 40
DiskTimeoutValue : 60

Also added the event records. It looks like the disk timeout is the big reason, defaulting to 60 seconds?

Created attachment 1210382 [details]
windows event records
FYI - Linux (RHEL) failover was fine.

The issue could not be reproduced in Mike's lab, so it is likely environmental in nature. Mike is investigating further.

It looks like Paul is hitting the worst case where the command times out. The initiator does not detect the iSCSI-level connection failure, so we have to go through SCSI-level recovery. The timeouts we hit are:
DiskTimeout (60 secs) + SRB timeout (15) + Task Management (20) + Link Down (15)
I think we can safely lower the Disk and SRB Timeout. We cannot control the task management timeout. I think Link Down is pretty low already.
DiskTimeout (25 secs) + SRB timeout (5) + Task Management (20) + Link Down (15)
This still puts us at 65 seconds.
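
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sum in this comment: it assumes the four timers run back to back in the worst case, and the dictionary and function names are made up for the example.

```python
# Worst-case failover budget, assuming the stages run one after another
# (as the "+" in the comment above implies). All values are in seconds.
DEFAULT_TIMEOUTS = {
    "DiskTimeout": 60,
    "SRB timeout": 15,
    "Task Management": 20,
    "Link Down": 15,
}

# Proposed tuning: only DiskTimeout and the SRB timeout are lowered.
# Task Management cannot be controlled and Link Down is left as-is.
TUNED_TIMEOUTS = {**DEFAULT_TIMEOUTS, "DiskTimeout": 25, "SRB timeout": 5}


def worst_case_seconds(timeouts):
    """Return the sum of all stage timeouts in seconds."""
    return sum(timeouts.values())


if __name__ == "__main__":
    for label, timeouts in (("default", DEFAULT_TIMEOUTS), ("tuned", TUNED_TIMEOUTS)):
        print(f"{label}: {worst_case_seconds(timeouts)} secs")
```

With the defaults the sum is 110 secs, and with the lowered Disk and SRB values it is the 65 secs quoted above.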
We can get it sub 60 seconds by also setting EnableNOPOut = 1. This will detect the iSCSI target is down, so we do not have to go through the scsi task management process.
To set the disk timeout set:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk\
DiskTimeout = 25.
To set the iscsi settings set:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\<Instance Number>\Parameters
EnableNOPOut = 1
SRBTimeoutDelta = 5
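
As a rough sketch, the settings above could be rendered as .reg text for review before importing. The key paths and value names are copied verbatim from this comment and are not verified here; in particular, Windows commonly stores the disk timeout under the value name TimeOutValue, so check the name on the initiator first. The `<Instance Number>` placeholder is kept as-is because it differs per host, and the helper names are hypothetical. The code only builds a string and does not touch the registry.

```python
# Sketch only: render the proposed values as .reg text. Nothing is written
# to the registry; key paths and value names come from the comment above.
DISK_KEY = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk"

# "<Instance Number>" is deliberately left as a placeholder; it is host specific.
ISCSI_KEY_TEMPLATE = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class"
    r"\{4D36E97B-E325-11CE-BFC1-08002BE10318}\<Instance Number>\Parameters"
)

# Name as written in the comment; Windows commonly uses "TimeOutValue" for
# the disk timeout, so verify before relying on this.
DISK_TIMEOUT_VALUE_NAME = "DiskTimeout"


def iscsi_key(instance_number):
    """Fill the instance number into the iSCSI parameters key path."""
    return ISCSI_KEY_TEMPLATE.replace("<Instance Number>", instance_number)


def proposed_settings(instance_number):
    """Map each registry key to the DWORD values suggested in the comment."""
    return {
        DISK_KEY: {DISK_TIMEOUT_VALUE_NAME: 25},
        iscsi_key(instance_number): {"EnableNOPOut": 1, "SRBTimeoutDelta": 5},
    }


def render_reg(settings):
    """Render {key_path: {value_name: int}} as .reg text with DWORD values."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key_path, values in settings.items():
        lines.append(f"[{key_path}]")
        for name, number in values.items():
            # .reg DWORD values are written as 8 hexadecimal digits.
            lines.append(f'"{name}"=dword:{number:08x}')
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # The placeholder is passed through unchanged; substitute the real
    # instance number for the iSCSI initiator on the host.
    print(render_reg(proposed_settings("<Instance Number>")))
```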
(In reply to Mike Christie from comment #9)
> We can get it sub 60 seconds by also setting EnableNOPOut = 1. This will
> detect the iSCSI target is down, so we do not have to go through the scsi
> task management process.

It looks like this is not always true. With IO in flight, the worst-case timeout is going to be 65 secs with the values suggested in the previous comment.

Just an update. Hemanth, there are actually two bugs you will want to test for.

1. For a clean shutdown using reboot, the target does not cleanly shut down connections, so we can end up hitting the DiskTimeout. If we cleanly shut down conns, we can get the failover time to around 20 - 30 seconds, which is closer to Linux. I am working on a patch for this.

2. For the unclean shutdown, the worst case seems to be the DiskTimeout expiring, as documented in comment #9.

2.A For general DiskTimeout errors, we should lower the values to the ones in comment #9.

2.B Those settings will help the unclean shutdown case, but I am also looking for possibly a TCP timer to help.

I will update the initiator setup doc for 2.A and 2.B.

Patch was merged in version ceph-iscsi-config-1.3-1.el7cp.

Created attachment 1219556 [details]
Failover on N/W Failure
Hi Paul,
Failed the primary GW node's network, and the failover happened within 15 secs.
Failover is not taking 60 secs now.
Refer to the attachment for the Performance Monitor stats on Windows.
Will update the same after reboot.
Created attachment 1219557 [details]
Failover on Reboot
Rebooted the primary GW node, and the failover happened within 30 secs.
Refer to the attached screenshot.
Moving to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html