Created attachment 621747 [details]
engine-drop

Description of problem:
Once the host network is in trouble it takes ages (~8-9 minutes) for vdsNotResponding to show up, the problematic host to be rebooted and the HA VMs to be restarted on another host.

Version-Release number of selected component (if applicable):
si19.1

How reproducible:

Steps to Reproduce:
1. Have 2 or more hosts in a cluster and an HA machine on a non-SPM host.
2. On the host: iptables -I INPUT -s ENGINE_IP -p tcp -j DROP

Actual results:
2012-10-04 10:21:59,673 ERROR ..GetCapabilitiesVDS failed, but it takes another 2 minutes until the host is marked as Non-responsive.

Expected results:
There are 2 options in the config:
TimeoutToResetVdsInSeconds
VDSAttemptsToResetCount
Those two should be the limit boundaries.

Additional info:

on host:
=========
iptables -I INPUT -s ENGINE_IP -p tcp --dport 54321 -j DROP

date on engine:
===============
Thu Oct 4 10:15:57 CEST 2012

engine.log:
=============
2012-10-04 10:18:57,642 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-60) XML RPC error in command GetStatsVDS ( HostName = slot-7 ), the error was: java.util.concurrent.TimeoutException, TimeoutException:
2012-10-04 10:18:57,643 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-60) vds::refreshVdsStats Failed getVdsStats, vds = 0b5eb534-08fc-11e2-a34d-001a4a013f65 : slot-7, error = VDSNetworkException: VDSNetworkException:
2012-10-04 10:18:57,646 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-60) ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = 0b5eb534-08fc-11e2-a34d-001a4a013f65 : slot-7, VDS Network Error, continuing. VDSNetworkException:
2012-10-04 10:20:00,001 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-47) Autorecovering hosts is disabled, skipping
2012-10-04 10:20:00,001 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-47) Autorecovering storage domains is disabled, skipping
2012-10-04 10:21:59,673 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-74) XML RPC error in command GetCapabilitiesVDS ( HostName = slot-7 ), the error was: java.util.concurrent.TimeoutException, TimeoutException:
2012-10-04 10:21:59,674 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-74) ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = 0b5eb534-08fc-11e2-a34d-001a4a013f65 : slot-7, VDS Network Error, continuing. VDSNetworkException:
2012-10-04 10:25:00,001 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-5) Autorecovering hosts is disabled, skipping
2012-10-04 10:25:00,001 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-5) Autorecovering storage domains is disabled, skipping
2012-10-04 10:25:01,699 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-79) XML RPC error in command GetCapabilitiesVDS ( HostName = slot-7 ), the error was: java.util.concurrent.TimeoutException, TimeoutException:
2012-10-04 10:25:01,699 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-79) VDS::handleNetworkException Server failed to respond, vds_id = 0b5eb534-08fc-11e2-a34d-001a4a013f65, vds_name = slot-7, error = VDSNetworkException:
2012-10-04 10:25:01,725 INFO [org.ovirt.engine.core.bll.VdsEventListener] (pool-4-thread-50) ResourceManager::vdsNotResponding entered for Host 0b5eb534-08fc-11e2-a34d-001a4a013f65, 10.34.63.137
2012-10-04 10:25:01,800 INFO [org.ovirt.engine.core.bll.FencingExecutor] (pool-4-thread-50) Executing <Status> Power Management command, Proxy Host:srh-211, Agent:ipmilan, Target Host:slot-7, Management IP:dell-pem1000e-01-bl7.mgmt.eng.brq.redhat.com, User:root, Options:
2012-10-04 10:25:01,801 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-50) START, FenceVdsVDSCommand(HostName = srh-211, HostId = 04be734c-0d41-11e2-a7ed-001a4a013f65, targetVdsId = 0b5eb534-08fc-11e2-a34d-001a4a013f65, action = Status, ip = dell-pem1000e-01-bl7.mgmt.eng.brq.redhat.com, port = , type = ipmilan, user = root, password = ******, options = ''), log id: d93775
2012-10-04 10:25:02,065 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-50) FINISH, FenceVdsVDSCommand, return: Test Succeeded, Host Status is: on, log id: d93775
2012-10-04 10:25:02,072 INFO [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand] (pool-4-thread-50) Running command: VdsNotRespondingTreatmentCommand internal: true. Entities affected : ID: 0b5eb534-08fc-11e2-a34d-001a4a013f65 Type: VDS
2012-10-04 10:25:02,125 INFO [org.ovirt.engine.core.bll.StopVdsCommand] (pool-4-thread-50) [46f55832] Running command: StopVdsCommand internal: true. Entities affected : ID: 0b5eb534-08fc-11e2-a34d-001a4a013f65 Type: VDS
2012-10-04 10:25:02,144 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [46f55832] START, SetVdsStatusVDSCommand(HostName = slot-7, HostId = 0b5eb534-08fc-11e2-a34d-001a4a013f65, status=Reboot, nonOperationalReason=NONE), log id: 2d8753c4
2012-10-04 10:25:02,168 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-50) [46f55832] FINISH, SetVdsStatusVDSCommand, log id: 2d8753c4
2012-10-04 10:25:02,262 INFO [org.ovirt.engine.core.bll.FencingExecutor] (pool-4-thread-50) [46f55832] Executing <Stop> Power Management command, Proxy Host:srh-211, Agent:ipmilan, Target Host:slot-7, Management IP:dell-pem1000e-01-bl7.mgmt.eng.brq.redhat.com, User:root, Options:
2012-10-04 10:25:02,264 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (pool-4-thread-50) [46f55832] START, FenceVdsVDSCommand(HostName = srh-211, HostId = 04be734c-0d41-11e2-a7ed-001a4a013f65, targetVdsId = 0b5eb534-08fc-11e2-a34d-001a4a013f65, action = Stop, ip = dell-pem1000e-01-bl7.mgmt.eng.brq.redhat.com, port = , type = ipmilan, user = root, password = ******, options = ''), log id: 65f30b8c
Some findings (no conclusions yet):

1. When the hypervisor stops responding, each connection attempt should fail after the usual TCP connection timeout. In RHEL that is approximately 1 minute: the TCP stack will try to send the initial SYN packet up to 6 times, with scaling timeouts of 1, 2, 4, 8, 16 and 32 seconds, which makes a total of 63 seconds. As far as I know this can only be changed globally, for everything running on the machine, using the net.ipv4.tcp_syn_retries kernel parameter. For example, the following will reduce the maximum timeout to 31 seconds:

sysctl -w net.ipv4.tcp_syn_retries=4

I think we shouldn't mess with this setting, although in reliable LAN environments it shouldn't be a big problem.

2. The HTTP library that we use (commons-httpclient) automatically retries requests up to 3 times when it detects an I/O error before sending the request headers. That means that when the hypervisor stops responding we may have to wait up to 250 seconds before the I/O error is reported to the upper layers. It may be a good idea to reduce this number of retries so that the error is propagated earlier.

3. The engine automatically cancels requests to VDSM after 180 seconds (the default value of the "vdsTimeout" option). This means that when the hypervisor is not responding the request will be cancelled before the HTTP library stops retrying.

4. The engine considers a hypervisor failed if two consecutive requests return with a network error, so that makes a total of 360 seconds (6 minutes).

I am still trying to understand where the rest of those 8-9 minutes come from.
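For reference, the retry behaviour in item 2 is controlled by the retry handler of commons-httpclient 3.x. The sketch below is only an illustration of the library API, not the engine's actual code, and the URL is made up:

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class NoRetryExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();

        // Replace the default retry handler (3 retries) with one that does
        // not retry at all, so an I/O error is reported to the caller after
        // the first failed attempt instead of after up to 4 attempts.
        client.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler(0, false));

        PostMethod method = new PostMethod("http://hypervisor.example.com:54321/");
        try {
            client.executeMethod(method);   // fails fast on I/O errors
        } finally {
            method.releaseConnection();
        }
    }
}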
(In reply to comment #2)
> Some findings (no conclusions yet):
>
> 1. When the hypervisor stops responding, each connection attempt should fail
> after the usual TCP connection timeout. In RHEL that is approximately 1
> minute: the TCP stack will try to send the initial SYN packet up to 6 times,
> with scaling timeouts of 1, 2, 4, 8, 16 and 32 seconds, which makes a total
> of 63 seconds.

I suspect one of the 'advantages' of persistent sessions is that you'd have a shorter timeout (but I'm not sure). Packets will just be retransmitted - I'm not sure whether that takes as long as it does with the SYN packets.

> As far as I know this can only be changed globally, for everything running
> on the machine, using the net.ipv4.tcp_syn_retries kernel parameter. For
> example, the following will reduce the maximum timeout to 31 seconds:
>
> sysctl -w net.ipv4.tcp_syn_retries=4
>
> I think we shouldn't mess with this setting, although in reliable LAN
> environments it shouldn't be a big problem.
>
> 2. The HTTP library that we use (commons-httpclient) automatically retries
> requests up to 3 times when it detects an I/O error before sending the
> request headers. That means that when the hypervisor stops responding we may
> have to wait up to 250 seconds before the I/O error is reported to the upper
> layers. It may be a good idea to reduce this number of retries so that the
> error is propagated earlier.
>
> 3. The engine automatically cancels requests to VDSM after 180 seconds (the
> default value of the "vdsTimeout" option). This means that when the
> hypervisor is not responding the request will be cancelled before the HTTP
> library stops retrying.
>
> 4. The engine considers a hypervisor failed if two consecutive requests
> return with a network error, so that makes a total of 360 seconds (6
> minutes).
>
> I am still trying to understand where the rest of those 8-9 minutes come
> from.
The logic we currently use to decide whether the host has to be reset is that it has to fail at least 2 times in a row (the default value of VDSAttemptsToResetCount) and it has to have been failing for at least 30 seconds (the default value of TimeoutToResetVdsInSeconds). The first condition, combined with the timeouts in comment #2, means that it will take at least 6 minutes to reset the host if it simply starts dropping packets.

I think we should do the following:

1. Reduce the TCP connection timeout that we currently use. It is 180 seconds because it is copied from the request timeout. I think it is OK to have a 3 minute timeout per VDSM operation, but it is unreasonable to have 180 seconds for connections (in fact it is 63 seconds, as the kernel ignores anything larger than that default). As we are polling the hosts every VdsRefreshRate seconds (2 by default), I think it is reasonable to use that same parameter as the connection timeout. See the sketch after this list for how the two timeouts differ.

2. Don't let the HTTP client retry failed operations.

With these two changes the fencing takes approximately TimeoutToResetVdsInSeconds to start. Then it takes some time to perform/validate it.
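To illustrate the distinction in point 1: in commons-httpclient 3.x the connection timeout can be configured separately from the per-operation socket (read) timeout. This is a minimal sketch with example values only; it is not the actual patch, and the final defaults are decided there:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

public class TimeoutExample {
    public static void main(String[] args) {
        HttpClient client = new HttpClient();
        HttpConnectionManagerParams params =
                client.getHttpConnectionManager().getParams();

        // Time allowed for a VDSM operation to complete once the connection
        // is established (socket/read timeout), in milliseconds.
        params.setSoTimeout(180 * 1000);

        // Time allowed for the TCP connection itself to be established; a
        // host that drops packets fails this quickly instead of after the
        // kernel's ~63 second SYN retry cycle.
        params.setConnectionTimeout(2 * 1000);
    }
}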
(In reply to comment #3)
> (In reply to comment #2)
> > Some findings (no conclusions yet):
> >
> > 1. When the hypervisor stops responding, each connection attempt should
> > fail after the usual TCP connection timeout. In RHEL that is approximately
> > 1 minute: the TCP stack will try to send the initial SYN packet up to 6
> > times, with scaling timeouts of 1, 2, 4, 8, 16 and 32 seconds, which makes
> > a total of 63 seconds.
>
> I suspect one of the 'advantages' of persistent sessions is that you'd have
> a shorter timeout (but I'm not sure). Packets will just be retransmitted -
> I'm not sure whether that takes as long as it does with the SYN packets.

Almost everything about persistent connections is an advantage. But for detecting dead hosts in particular they might be worse, because the application has no way to differentiate an operation that takes long to execute from a dead host. TCP timeouts don't help much with that either, as the maximum retransmission timeout is very large, and as far as I know there is no way for the application to change it. According to "man tcp" it can be between 13 and 30 minutes (see the "tcp_retries2" parameter).

TCP keepalives could help, but I would need to check how well that is supported in Java.
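For completeness, enabling TCP keepalives from Java only requires setting SO_KEEPALIVE on the socket; the probe timing itself is controlled by the net.ipv4.tcp_keepalive_* kernel parameters, not by the standard Java API. A minimal sketch, with a made-up host name:

import java.net.InetSocketAddress;
import java.net.Socket;

public class KeepAliveExample {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket();

        // Ask the kernel to send keepalive probes on an otherwise idle
        // connection, so a dead peer is eventually detected even when no
        // request is in flight.
        socket.setKeepAlive(true);

        // Connection timeout in milliseconds; host name is just an example,
        // 54321 is the VDSM port from the reproduction steps.
        socket.connect(new InetSocketAddress("hypervisor.example.com", 54321), 2000);
        socket.close();
    }
}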
The following patch implements the changes in comment #4: http://gerrit.ovirt.org/8410
A new change has been submitted for review that changes the default connection timeout from 180 seconds to 2 seconds and the number of retries from 3 to 0 (no retries):

http://gerrit.ovirt.org/15673

The effect of this change can be tested with a build containing the change, but also with the current build by changing the parameters manually:

# engine-config -s vdsTimeout=2
# engine-config -s vdsRetries=0
# service ovirt-engine restart
I made a mistake in the instructions in comment 11: the parameter that has to be changed is vdsConnectionTimeout, not vdsTimeout. So, to do the tests, first revert the change to vdsTimeout:

# engine-config -s vdsTimeout=180

Then change vdsConnectionTimeout and restart the engine again:

# engine-config -s vdsConnectionTimeout=2
# engine-config -s vdsRetries=0
# service ovirt-engine restart
Verified on is5.
Host: vdsm-4.11.0-121.git082925a.el6

engine-config -g vdsRetries
vdsRetries: 0 version: general

engine-config -g vdsConnectionTimeout
vdsConnectionTimeout: 2 version: general

It takes the host about 1 minute to move to the Non-responsive state.
(In reply to Artyom from comment #15)
> Verified on is5.
> Host: vdsm-4.11.0-121.git082925a.el6
>
> engine-config -g vdsRetries
> vdsRetries: 0 version: general
>
> engine-config -g vdsConnectionTimeout
> vdsConnectionTimeout: 2 version: general
>
> It takes the host about 1 minute to move to the Non-responsive state.

According to the initial description it takes about 8-9 minutes for the highly available VM to be started on a different host when the host the HA VM is running on becomes non-responsive. In my testing scenario the host has already been unreachable for more than 15 minutes and the VM has still not been started on a different host.

Artyom, can you please confirm that you also verified that the VM is started on a different host?
(In reply to Vinzenz Feenstra [evilissimo] from comment #16)
> (In reply to Artyom from comment #15)
> > Verified on is5.
> > Host: vdsm-4.11.0-121.git082925a.el6
> >
> > engine-config -g vdsRetries
> > vdsRetries: 0 version: general
> >
> > engine-config -g vdsConnectionTimeout
> > vdsConnectionTimeout: 2 version: general
> >
> > It takes the host about 1 minute to move to the Non-responsive state.
>
> According to the initial description it takes about 8-9 minutes for the
> highly available VM to be started on a different host when the host the HA
> VM is running on becomes non-responsive. In my testing scenario the host has
> already been unreachable for more than 15 minutes and the VM has still not
> been started on a different host.
>
> Artyom, can you please confirm that you also verified that the VM is started
> on a different host?

Vinzenz, can you please verify the values below:

engine-config -g vdsRetries
vdsRetries: 0 version: general

engine-config -g vdsConnectionTimeout
vdsConnectionTimeout: 2 version: general
In your scenario, what caused the host to go to the Non-responsive state: a problem with the network, or with VDSM?

If the problem is with the network, the host must have working power management and there must be another host in RHEV-M to fence the problematic host; after the problematic host is fenced, the HA VM is moved to another host in the cluster. If the problem is not with the network, RHEV-M just applies soft fencing to the problematic host, so working power management is not needed and the VM stays on the same host.

Soft fencing takes about 1-2 minutes; the move to the Non-responsive state takes 1-2 minutes.
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
*** Bug 1044089 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0038.html