Created attachment 610281
engine logs

Description of problem:
When blocking the connection between engine and an HSM host using iptables (DROP), it takes about 10 minutes for the state of the host to change to non-responsive.

These 2 commands ran in quick succession:

[root@gadi-rhevm ~]# date
Thu Sep 6 13:56:39 IDT 2012
[root@gadi-rhevm ~]# iptables -A OUTPUT -d green-vdsa.qa.lab.tlv.redhat.com -j DROP

--------engine log lines:-------------------
2012-09-06 13:55:14,296 INFO [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-70) [37518ccb] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected : ID: ed2d2eb2-f7fb-11e1-a776-001a4a169705 Type: VDS
2012-09-06 13:55:14,318 INFO [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-70) [59ba097e] Running command: HandleVdsVersionCommand internal: true. Entities affected : ID: ed2d2eb2-f7fb-11e1-a776-001a4a169705 Type: VDS
2012-09-06 13:59:42,136 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-95) XML RPC error in command ListVDS ( Vds: green-vdsa.qa.lab.tlv.redhat.com ), the error was: java.util.concurrent.TimeoutException, TimeoutException:
2012-09-06 13:59:42,155 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-95) ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = ed2d2eb2-f7fb-11e1-a776-001a4a169705 : green-vdsa.qa.lab.tlv.redhat.com, VDS Network Error, continuing.
VDSNetworkException:
2012-09-06 14:00:00,000 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-75) Autorecovering hosts is disabled, skipping
2012-09-06 14:00:00,000 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-75) Autorecovering storage domains is disabled, skipping
2012-09-06 14:02:44,160 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-13) XML RPC error in command GetCapabilitiesVDS ( Vds: green-vdsa.qa.lab.tlv.redhat.com ), the error was: java.util.concurrent.TimeoutException, TimeoutException:
2012-09-06 14:02:44,160 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-13) ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = ed2d2eb2-f7fb-11e1-a776-001a4a169705 : green-vdsa.qa.lab.tlv.redhat.com, VDS Network Error, continuing.
VDSNetworkException:
2012-09-06 14:05:00,001 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-38) Autorecovering hosts is disabled, skipping
2012-09-06 14:05:00,001 INFO [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-38) Autorecovering storage domains is disabled, skipping
2012-09-06 14:05:46,165 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (QuartzScheduler_Worker-81) XML RPC error in command GetCapabilitiesVDS ( Vds: green-vdsa.qa.lab.tlv.redhat.com ), the error was: java.util.concurrent.TimeoutException, TimeoutException:
2012-09-06 14:05:46,165 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-81) VDS::handleNetworkException Server failed to respond, vds_id = ed2d2eb2-f7fb-11e1-a776-001a4a169705, vds_name = green-vdsa.qa.lab.tlv.redhat.com, error = VDSNetworkException:
2012-09-06 14:05:46,243 INFO [org.ovirt.engine.core.bll.VdsEventListener] (pool-4-thread-46) ResourceManager::vdsNotResponding entered for Host ed2d2eb2-f7fb-11e1-a776-001a4a169705, 10.35.102.10
2012-09-06 14:05:46,283 WARN [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand] (pool-4-thread-46) [3d51e8ba] CanDoAction of action VdsNotRespondingTreatment failed. Reasons:VDS_FENCING_DISABLED

Version-Release number of selected component (if applicable):
rhevm-3.1.0-15.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Block connection from engine to host in iptables (DROP)
2. Wait for engine to move host to non-responsive state

Actual results:
It takes about 10 minutes for the state change

Expected results:
Shorter timeout is expected when connection to host is blocked/not working.

Additional info:

Note that you've not 'blocked' the connection in the classic sense: you are dropping packets, not rejecting them. If you change the iptables target to REJECT, does it still take 10 minutes? Not that 10 minutes is OK anyway, but with DROP it will take a while, because the sender never gets a reply and TCP keeps retransmitting until it eventually gives up.
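To make the DROP versus REJECT distinction concrete, here is a minimal Python sketch of what a client sees in each case. It is not the engine's code; the `probe` helper and its outcome labels are hypothetical names used only for illustration. The assumption is that a rejected connection typically surfaces as an immediate "connection refused" error, while dropped packets give the caller nothing until its own timeout (or the kernel's retransmission limit) expires.

```python
import socket
import time


def probe(host, port, timeout):
    """Try one TCP connect and return (outcome, seconds elapsed).

    Hypothetical helper for illustration only.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "open"
    except ConnectionRefusedError:
        # An active refusal (closed port, or typically a REJECT rule):
        # the caller learns about the failure almost immediately.
        outcome = "refused"
    except socket.timeout:
        # No answer at all (what a DROP rule looks like to the caller):
        # the call blocks for the whole timeout.
        outcome = "timeout"
    except OSError as exc:
        # Other network errors, e.g. host or network unreachable.
        outcome = "error: %s" % exc
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Local demonstration of the fast path: bind a port, then close it,
    # so that nothing is listening there any more.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    port = srv.getsockname()[1]
    srv.close()
    print(probe("127.0.0.1", port, timeout=2.0))
```

Reproducing the slow path needs a real DROP rule like the one in the description; against such a host, `probe` would be expected to return "timeout" only after the full `timeout` has elapsed.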
(In reply to comment #1)
> Note that you've not 'blocked' the connection in the classic meaning - you
> are dropping packets, not rejecting. If you change the iptables command to
> REJECT, does it still take 10 minutes? Not that 10 minutes is OK anyway, but
> it will take a while, because drop means that TCP connections do re-transmit
> until they give up eventually.

With the iptables REJECT target, the status changes after about 1 minute:

[root@gadi-rhevm ~]# date
Sun Sep 9 09:20:34 IDT 2012
[root@gadi-rhevm ~]# iptables -A OUTPUT -d green-vdsa.qa.lab.tlv.redhat.com -j REJECT

012-09-09 09:21:23,602 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-2) [1af80802] ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = ed2d2eb2-f7fb-11e1-a776-001a4a169705 : green-vdsa.qa.lab.tlv.redhat.com, VDS Network Error, continuing.
VDSNetworkException:
2012-09-09 09:21:29,606 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-83) ResourceManager::refreshVdsRunTimeInfo::Failed to refresh VDS , vds = ed2d2eb2-f7fb-11e1-a776-001a4a169705 : green-vdsa.qa.lab.tlv.redhat.com, VDS Network Error, continuing.
VDSNetworkException:
2012-09-09 09:21:35,612 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-37) VDS::handleNetworkException Server failed to respond, vds_id = ed2d2eb2-f7fb-11e1-a776-001a4a169705, vds_name = green-vdsa.qa.lab.tlv.redhat.com, error = VDSNetworkException:
2012-09-09 09:21:35,641 INFO [org.ovirt.engine.core.bll.VdsEventListener] (pool-4-thread-47) ResourceManager::vdsNotResponding entered for Host ed2d2eb2-f7fb-11e1-a776-001a4a169705, 10.35.102.10
2012-09-09 09:21:35,685 WARN [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand] (pool-4-thread-47) [7e71d8f9] CanDoAction of action VdsNotRespondingTreatment failed. Reasons:VDS_FENCING_DISABLED

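The "about 10 minutes" and "1 minute" figures can be checked against the timestamps quoted in this bug. The sketch below is plain arithmetic on those timestamps, truncated to whole seconds. The `gaps` helper is a hypothetical name, and the year of the first REJECT log line is assumed to be 2012 because its leading digit is missing in the paste.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def gaps(start, events):
    """Return the seconds between consecutive timestamps, beginning at start."""
    points = [datetime.strptime(t, FMT) for t in [start] + events]
    return [(b - a).total_seconds() for a, b in zip(points, points[1:])]


# DROP run: `date` output, then the three failed polls from the engine log.
drop = gaps(
    "2012-09-06 13:56:39",
    ["2012-09-06 13:59:42", "2012-09-06 14:02:44", "2012-09-06 14:05:46"],
)

# REJECT run: same idea. The year of the first event is an assumption,
# since the pasted log line starts with "012-09-09".
reject = gaps(
    "2012-09-09 09:20:34",
    ["2012-09-09 09:21:23", "2012-09-09 09:21:29", "2012-09-09 09:21:35"],
)

print(drop, sum(drop))      # [183.0, 182.0, 182.0] 547.0
print(reject, sum(reject))  # [49.0, 6.0, 6.0] 61.0
```

So the DROP case takes roughly 9 minutes, made up of three failed polls about 3 minutes apart, while the REJECT case takes about 1 minute with the failures only 6 seconds apart. The even 3-minute spacing looks like each call waiting out a fixed per-call timeout before the host is marked non-responsive on the third failure, but that is a reading of the timestamps, not something the logs state.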
Closing old bugs. If this issue is still relevant/important in the current version, please re-open the bug.
In 3.3 there were some changes to make timeouts more predictable; see bug 863211.