Bug 863211 - PRD33 - predictable host timeouts for ha/fencing (backend - vdsNotResponding takes too long before fence host)
Summary: PRD33 - predictable host timeouts for ha/fencing (backend - vdsNotResponding ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Juan Hernández
QA Contact: Artyom
URL:
Whiteboard: sla
Depends On: 917666
Blocks: 1019470 1044088
 
Reported: 2012-10-04 17:15 UTC by Pavel Stehlik
Modified: 2018-12-03 17:51 UTC
CC List: 17 users

Fixed In Version: is5
Doc Type: Enhancement
Doc Text:
This update introduces two parameters that control how quickly the engine reacts when there are connection issues. The vdsConnectionTimeout parameter sets the amount of time allowed for establishing connections to hosts before timing out. The vdsRetries parameter sets the number of times host operations are retried when there are I/O errors.
Clone Of:
: 917666
Environment:
Last Closed: 2014-01-21 17:30:25 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:


Attachments
engine-drop (46.62 KB, text/x-log)
2012-10-04 17:15 UTC, Pavel Stehlik
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2014:0038 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.3.0 update 2014-01-21 22:03:06 UTC
oVirt gerrit 15673 0 None None None Never

Description Pavel Stehlik 2012-10-04 17:15:34 UTC
Created attachment 621747 [details]
engine-drop

Description of problem:
 Once the host network is in trouble it takes ages (~8-9 minutes) before vdsNotResponding shows up, the problematic host is rebooted, and the HA VMs are restarted on another host.

Version-Release number of selected component (if applicable):
si19.1

How reproducible:


Steps to Reproduce:
1. have 2 or more hosts in cluster + HA machine on non-SPM host
2. iptables -I INPUT -s ENGINE_IP -p tcp -j DROP  (on host)
3. 
  
Actual results:
GetCapabilitiesVDS fails at 2012-10-04 10:21:59,673 (ERROR), but it takes another 2 minutes until the host is marked as Non-responsive.

Expected results:
There are two options in the config:
TimeoutToResetVdsInSeconds
VDSAttemptsToResetCount
These two should serve as the limit boundaries.

Additional info:
on host:
=========
iptables -I INPUT -s ENGINE_IP -p tcp --dport 54321 -j DROP

date on engine:
===============
Thu Oct  4 10:15:57 CEST 2012

engine.log:
=============
2012-10-04 10:18:57,642 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-60) XML RPC error in command GetStatsVDS ( HostName = slot-7 ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-10-04 10:18:57,643 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-60) vds::refreshVdsStats Failed getVdsStats,  vds = 0b5eb534-08fc-11e2-a34d-001a4a013f65 : slot-7, error = VDSNetworkException: VDSNetworkException: 
2012-10-04 10:18:57,646 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-60) ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = 0b5eb534-08fc-11e2-a34d-001a4a013f65 : slot-7, VDS Network Error, continuing.
VDSNetworkException: 

2012-10-04 10:20:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-47) Autorecovering hosts is disabled, skipping
2012-10-04 10:20:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-47) Autorecovering storage domains is disabled, skipping

2012-10-04 10:21:59,673 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-74) XML RPC error in command GetCapabilitiesVDS ( HostName = slot-7 ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-10-04 10:21:59,674 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-74) ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = 0b5eb534-08fc-11e2-a34d-001a4a013f65 : slot-7, VDS Network Error, continuing.
VDSNetworkException: 

2012-10-04 10:25:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-5) Autorecovering hosts is disabled, skipping
2012-10-04 10:25:00,001 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-5) Autorecovering storage domains is disabled, skipping
2012-10-04 10:25:01,699 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-79) XML RPC error in command GetCapabilitiesVDS ( HostName = slot-7 ), the error was: java.util.concurrent.TimeoutException, TimeoutException:  
2012-10-04 10:25:01,699 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-79) VDS::handleNetworkException Server failed to respond,  vds_id = 0b5eb534-08fc-11e2-a34d-001a4a013f65, vds_name = slot-7, error = VDSNetworkException: 
2012-10-04 10:25:01,725 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (pool-4-thread-50) ResourceManager::vdsNotResponding entered for Host 0b5eb534-08fc-11e2-a34d-001a4a013f65, 10.34.63.137
2012-10-04 10:25:01,800 INFO  [org.ovirt.engine.core.bll.FencingExecutor] (pool-4-thread-50) Executing <Status> Power Management command, Proxy Host:srh-211, Agent:ipmilan, Target Host:slot-7, Management IP:dell-pem1000e-01-bl7.mgmt.eng.brq.redhat.com, User:root, Options:
2012-10-04 10:25:01,801 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-50) START, FenceVdsVDSCommand(HostName = srh-211, HostId = 04be734c-0d41-11e2-a7ed-001a4a013f65, targetVdsId = 0b5eb534-08fc-11e2-a34d-001a4a013f65, action = Status, ip = dell-pem1000e-01-bl7.mgmt.eng.brq.redhat.com, port = , type = ipmilan, user = root, password = ******, options = ''), log id: d93775
2012-10-04 10:25:02,065 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-50) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: on, log id: d93775
2012-10-04 10:25:02,072 INFO  [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand] (pool-4-thread-50) Running command: VdsNotRespondingTreatmentCommand internal: true. Entities affected :  ID: 0b5eb534-08fc-11e2-a34d-001a4a013f65 Type: VDS
2012-10-04 10:25:02,125 INFO  [org.ovirt.engine.core.bll.StopVdsCommand] (pool-4-thread-50) [46f55832] Running command: StopVdsCommand internal: true. Entities affected :  ID: 0b5eb534-08fc-11e2-a34d-001a4a013f65 Type: VDS
2012-10-04 10:25:02,144 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [46f55832] START, SetVdsStatusVDSCommand(HostName = slot-7, HostId = 0b5eb534-08fc-11e2-a34d-001a4a013f65, status=Reboot, nonOperationalReason=NONE), log id: 2d8753c4
2012-10-04 10:25:02,168 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [46f55832] FINISH, SetVdsStatusVDSCommand, log id: 2d8753c4
2012-10-04 10:25:02,262 INFO  [org.ovirt.engine.core.bll.FencingExecutor] (pool-4-thread-50) [46f55832] Executing <Stop> Power Management command, Proxy Host:srh-211, Agent:ipmilan, Target Host:slot-7, Management IP:dell-pem1000e-01-bl7.mgmt.eng.brq.redhat.com, User:root, Options:
2012-10-04 10:25:02,264 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-50) [46f55832] START, FenceVdsVDSCommand(HostName = srh-211, HostId = 04be734c-0d41-11e2-a7ed-001a4a013f65, targetVdsId = 0b5eb534-08fc-11e2-a34d-001a4a013f65, action = Stop, ip = dell-pem1000e-01-bl7.mgmt.eng.brq.redhat.com, port = , type = ipmilan, user = root, password = ******, options = ''), log id: 65f30b8c

Comment 2 Juan Hernández 2012-10-08 16:18:53 UTC
Some findings (no conclusions yet):

1. When the hypervisor stops responding, each connection attempt should fail after the usual TCP connection timeout. In RHEL that is approximately 1 minute: the TCP stack will try to send the initial SYN packet up to 6 times, with scaling timeouts of 1, 2, 4, 8, 16 and 32 seconds, which makes a total of 63 seconds.

As far as I know this can only be changed globally for everything running in the machine using the net.ipv4.tcp_syn_retries kernel parameter. For example, the following will reduce the maximum timeout to 31 seconds:

  sysctl -w net.ipv4.tcp_syn_retries=4

I think we shouldn't mess with this setting, although in reliable LAN environments it shouldn't be a big problem.
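
For illustration only (a sketch of the arithmetic above, not engine code; the mapping of tcp_syn_retries=4 to five SYN transmissions is my reading of the kernel behaviour described here):

  // Illustration only: cumulative connect timeout implied by the doubling SYN
  // backoff described above. Each transmission waits twice as long as the
  // previous one, starting at 1 second.
  public class SynBackoff {
      static int connectTimeoutSeconds(int synTransmissions) {
          int wait = 1, total = 0;
          for (int i = 0; i < synTransmissions; i++) {
              total += wait;   // wait for this transmission to be answered
              wait *= 2;       // the kernel doubles the timeout for the next one
          }
          return total;        // equals 2^synTransmissions - 1
      }

      public static void main(String[] args) {
          System.out.println(connectTimeoutSeconds(6)); // RHEL default: 63 seconds
          System.out.println(connectTimeoutSeconds(5)); // tcp_syn_retries=4: 31 seconds
      }
  }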

2. The HTTP library that we use (commons-httpclient) automatically retries requests up to 3 times when it detects an I/O error before sending the request headers. That means that when the hypervisor stops responding we may have to wait up to 250 seconds before the I/O error is reported to the upper layers. It may be a good idea to reduce this number of retries so that the error is propagated earlier (see the configuration sketch at the end of this comment).

3. The engine automatically cancels requests to VDSM after 180 seconds (the default value of the "vdsTimeout" option). This means that when the hypervisor is not responding the request will be cancelled before the HTTP library stops retrying.

4. The engine considers a hypervisor failed if two consecutive requests return with a network error, so that makes for a total of 360 seconds (6 minutes).

I am still trying to understand where the rest of those 8~9 minutes come from.
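
As a side note, a minimal sketch of how commons-httpclient 3.x exposes the knobs discussed in points 2 and 3. This is illustrative only, not the actual engine code, and the 2 second connection timeout is a hypothetical value:

  // Sketch only, not the actual engine code: connection timeout, request (read)
  // timeout and automatic retry count in commons-httpclient 3.x.
  import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
  import org.apache.commons.httpclient.HttpClient;
  import org.apache.commons.httpclient.params.HttpMethodParams;

  public class VdsClientConfigSketch {
      public static HttpClient configure() {
          HttpClient client = new HttpClient();
          // Fail fast on connect instead of waiting for the kernel's 63 second limit.
          client.getHttpConnectionManager().getParams().setConnectionTimeout(2000);
          // Keep the per-request read timeout at the 180 second vdsTimeout default.
          client.getHttpConnectionManager().getParams().setSoTimeout(180000);
          // Disable the automatic retry on I/O errors (0 retries).
          client.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                  new DefaultHttpMethodRetryHandler(0, false));
          return client;
      }
  }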

Comment 3 Yaniv Kaul 2012-10-08 18:23:24 UTC
(In reply to comment #2)
> Some findings (no conclusions yet):
> 
> 1. When the hypervisor stops responding each connection attempt should fail
> after the usual TCP connection timeout. In RHEL that is 1 minute approx, the
> TPC stack will try to send the initial SYN packet up to 6 times, with
> scaling timeouts of 1, 2, 4, 8, 16 and 32 seconds, that makes a total of 63
> seconds.
> 

I suspect one of the 'advantages' of persistent sessions is that you'd have a shorter timeout (but I'm not sure). Packets will just be re-transmitted; I am not sure whether the retransmission timeout is as long as it is for the SYN packets.

> As far as I know this can only be changed globally for everything running in
> the machine using the net.ipv4.tcp_syn_retries kernel parameter. For
> example, the following will reduce the maximum timeout to 31 seconds:
> 
>   sysctl -w net.ipv4.tcp_syn_retries 4
> 
> I think we shouldn't mess with this setting, although in reliable LAN
> environments it shouldn't be a big problem.
> 
> 2. The HTTP library that we use (commons-httpclient) automatically retries
> request up to 3 times when it detects an IO error before sending the request
> headers. That means that when the hypervisor stops responding we may have to
> wait up to 250 seconds before the IO error is reported to the upper layers.
> It may be good idea to reduce this number of retries so that the error is
> propagated earlier.
> 
> 3. The engine automatically cancels requests to VDSM after 180 seconds (the
> default value of the "vdsTimeout" option). This means that when the
> hypervisor is not responding the request will be cancelled before the HTTP
> library stops retrying.
> 
> 4. The engine considers a hypervisor failed if two consecutive requests
> return with a network error, so that makes for a total of 360 seconds (6
> minutes).
> 
> I am still trying to understand where the rest of those 8~9 minutes come
> from.

Comment 4 Juan Hernández 2012-10-08 18:25:30 UTC
The logic we currently use to decide if the host has to be reset is that it has to fail at least 2 times in a row (the default value of VDSAttemptsToResetCount) and has to be failing for at least 30 seconds (the default value of TimeoutToResetVdsInSeconds). The first condition, combined with the timeouts in comment #2, means that it will take at least 6 minutes to reset the host if it simply starts dropping packets.

I think we should do the following:

1. Reduce the TCP connection timeout that we currently use. It is 180 seconds because it is copied from the request timeout. I think it is OK to have 3 minutes of timeout per VDSM operation, but it is unreasonable to have 180 seconds for connections (in fact it is effectively 63 seconds, as the kernel gives up after its own SYN retry limit regardless of any larger application timeout).

As we are polling the hosts every VdsRefreshRate seconds (2 by default), I think it is reasonable to use that same parameter as the connection timeout.

2. Don't let the HTTP client retry failed operations.

With these two changes the fencing takes approx TimeoutToResetVdsInSeconds to start. Then it takes some time to perform/validate it.
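
To put numbers on that, a back-of-the-envelope illustration (not engine code; it ignores the polling interval and the time needed to run the fence agent):

  // Rough estimate of the worst-case time before fencing starts, based on the
  // two conditions above (illustrative only).
  public class FencingDelayEstimate {
      static int secondsBeforeFencing(int perAttemptTimeoutSec,
                                      int attemptsToReset,
                                      int timeoutToResetVdsSec) {
          // The host must fail 'attemptsToReset' consecutive times, and must have been
          // failing for at least 'timeoutToResetVdsSec' seconds, whichever takes longer.
          return Math.max(perAttemptTimeoutSec * attemptsToReset, timeoutToResetVdsSec);
      }

      public static void main(String[] args) {
          // Current defaults: 180s request timeout x 2 failures -> ~360s (6 minutes).
          System.out.println(secondsBeforeFencing(180, 2, 30));
          // With a 2s connection timeout, TimeoutToResetVdsInSeconds (30s) dominates.
          System.out.println(secondsBeforeFencing(2, 2, 30));
      }
  }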

Comment 5 Juan Hernández 2012-10-08 18:52:34 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > Some findings (no conclusions yet):
> > 
> > 1. When the hypervisor stops responding each connection attempt should fail
> > after the usual TCP connection timeout. In RHEL that is 1 minute approx, the
> > TPC stack will try to send the initial SYN packet up to 6 times, with
> > scaling timeouts of 1, 2, 4, 8, 16 and 32 seconds, that makes a total of 63
> > seconds.
> > 
> 
> I suspect one of the 'advantages' of persistent sessions is that you'd have
> a shorter timeout (but I'm not sure). Packets will just be re-transmitted -
> not sure if it's as long as it will with the syn packets.

Almost everything about persistent connections is an advantage. But for detecting dead hosts in particular it might be worse, because the application has no way to differentiate an operation that takes a long time to execute from a dead host. TCP timeouts don't help much with that either, as the maximum retransmission timeout is very large, and as far as I know there is no way for the application to change it. According to "man tcp" it can be between 13 and 30 minutes (see the "tcp_retries2" parameter).

TCP keepalives could help, but I would need to check how well that is supported in Java.
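
For reference, the standard Java socket API only exposes an on/off switch for keepalives; the probe timing still comes from kernel tunables (net.ipv4.tcp_keepalive_*). A minimal sketch of how it could be wired (an assumption, not something implemented here):

  // Sketch: enabling TCP keepalives from Java. The JDK exposes only the on/off
  // flag; keepalive probe timing is governed by net.ipv4.tcp_keepalive_* settings.
  import java.io.IOException;
  import java.net.InetSocketAddress;
  import java.net.Socket;

  public class KeepAliveSketch {
      public static Socket openWithKeepAlive(String host, int port) throws IOException {
          Socket socket = new Socket();
          socket.setKeepAlive(true);                                // SO_KEEPALIVE on
          socket.connect(new InetSocketAddress(host, port), 2000);  // 2 second connect timeout
          return socket;
      }
  }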

Comment 6 Juan Hernández 2012-10-08 19:14:33 UTC
The following patch implements the changes in comment #4:

http://gerrit.ovirt.org/8410

Comment 11 Juan Hernández 2013-06-13 14:19:33 UTC
A new change has been submitted for review that changes the default connection timeout from 180 seconds to 2 seconds and the number of retries from 3 to 0 (no retries):

  http://gerrit.ovirt.org/15673

The effect of this change can be tested with a build containing the change, but also with the current build by changing the parameters manually:

  # engine-config -s vdsTimeout=2
  # engine-config -s vdsRetries=0
  # service ovirt-engine restart

Comment 13 Juan Hernández 2013-06-16 12:23:35 UTC
I made a mistake in the instructions in comment 11: the parameter that has to be changed is vdsConnectionTimeout, not vdsTimeout. To do the tests, first revert the change to vdsTimeout:

  # engine-config -s vdsTimeout=180

Then change vdsConnectionTimeout and restart the engine again:

  # engine-config -s vdsConnectionTimeout=2
  # engine-config -s vdsRetries=0
  # service ovirt-engine restart

Comment 15 Artyom 2013-07-14 12:38:37 UTC
Verified on is5
host vdsm-4.11.0-121.git082925a.el6

engine-config -g vdsRetries
vdsRetries: 0 version: general

engine-config -g vdsConnectionTimeout
vdsConnectionTimeout: 2 version: general

It takes the host about 1 minute to move to the non-responsive state.

Comment 16 Vinzenz Feenstra [evilissimo] 2013-07-24 12:04:38 UTC
(In reply to Artyom from comment #15)
> Verified on is5
> host vdsm-4.11.0-121.git082925a.el6
> 
> engine-config -g vdsRetries
> vdsRetries: 0 version: general
> 
> engine-config -g vdsConnectionTimeout
> vdsConnectionTimeout: 2 version: general
> 
> To host take 1 min to move to non-responsive state

According to the initial description it takes about 8-9 minutes for the highly available VM to be started on a different host when the host the HA VM is running on becomes non-responsive.
In my testing scenario the host has already been unreachable for more than 15 minutes and the VM has still not been started on a different host.

Artyom can you please confirm that you also verified that the VM is starting on a different host?

Comment 17 Doron Fediuck 2013-07-24 16:31:54 UTC
(In reply to Vinzenz Feenstra [evilissimo] from comment #16)
> (In reply to Artyom from comment #15)
> > Verified on is5
> > host vdsm-4.11.0-121.git082925a.el6
> > 
> > engine-config -g vdsRetries
> > vdsRetries: 0 version: general
> > 
> > engine-config -g vdsConnectionTimeout
> > vdsConnectionTimeout: 2 version: general
> > 
> > To host take 1 min to move to non-responsive state
> 
> According to the initial description it takes ca 8-9 minutes to have the
> high availability VM started on a different host, when the host the HA-VM is
> running gets non-responsive.
> In my testing scenario is the host already not reachable for more than
> 15minutes and the VM is still not started on a different host.
> 
> Artyom can you please confirm that you also verified that the VM is starting
> on a different host?

Vinzenz,
Can you please verify the values below:

engine-config -g vdsRetries
vdsRetries: 0 version: general
 
engine-config -g vdsConnectionTimeout
vdsConnectionTimeout: 2 version: general

Comment 18 Artyom 2013-07-25 06:20:45 UTC
In your scenario, what kind of problem made the host non-responsive: a network problem or a VDSM problem? If the problem is with the network, the host must have working power management and there must be another host in RHEV-M to fence the problematic host; after the problematic host is fenced, the HA VM moves to another host in the cluster. If the problem is not with the network, RHEV-M just performs soft fencing on the problematic host, working power management is not needed, and the VM stays on the same host.

Soft fencing takes about 1-2 minutes.
Fencing (moving the host to the non-responsive state) takes 1-2 minutes.

Comment 19 Charlie 2013-11-28 00:07:53 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 20 Lee Yarwood 2014-01-07 16:47:23 UTC
*** Bug 1044089 has been marked as a duplicate of this bug. ***

Comment 22 errata-xmlrpc 2014-01-21 17:30:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html

