Description of problem:
During the HA tests, hosts were marked as NonResponsive because of timeouts.

Version-Release number of selected component (if applicable):
I'm opening this bug against vdsm, but to be honest I'm not sure it's vdsm's fault.
rhevm-backend-3.5.4.2-1.3.el6ev.noarch
vdsm-4.16.26-1.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install 8 hypervisors and add them to the manager. Each hypervisor runs 1-2 VMs (it doesn't matter whether power management is configured or not).
2. Block network communication on 4 of the hypervisors.

Actual results:
All hypervisors are marked as NonResponsive, including the healthy ones.

Expected results:
Only the 4 problematic hypervisors should be marked as NonResponsive and fenced out (if configured).

Additional info:
engine.log contains many TimeoutExceptions, which is expected because half of the hosts have no network connectivity. But these are not normal:

2015-10-10 04:43:36,952 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-44) Command ListVDSCommand(HostName = rhev4, HostId = efd6bcec-37b1-41a9-9152-be5ce956ae0c, vds=Host[rhev4,efd6bcec-37b1-41a9-9152-be5ce956ae0c]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues  <== there are no corresponding records in vdsm.log

2015-10-10 04:43:44,737 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to /10.65.161.73  <== IP address of the rhev4 host; this connection attempt was logged on the host (see below)
[....]
2015-10-10 04:46:50,987 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-24) HostName = rhev4
2015-10-10 04:46:50,987 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-24) Command GetCapabilitiesVDSCommand(HostName = rhev4, HostId = efd6bcec-37b1-41a9-9152-be5ce956ae0c, vds=Host[rhev4,efd6bcec-37b1-41a9-9152-be5ce956ae0c]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

vdsm log:

Detector thread::DEBUG::2015-10-10 04:43:44,731::protocoldetector::187::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 10.65.161.80:46605  <== connection added, but the STOMP command was never parsed
Thread-27::DEBUG::2015-10-10 04:43:45,883::fileSD::262::Storage.Misc.excCmd::(getReadDelay) /bin/dd if=/rhev/data-center/mnt/10.65.161.80:_domain_iso/9e2e5b05-22fc-422a-bf17-37534e38f9bc/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-27::DEBUG::2015-10-10 04:43:45,891::fileSD::262::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n328 bytes (328 B) copied, 0.000199751 s, 1.6 MB/s\n'; <rc> = 0
Thread-864::DEBUG::2015-10-10 04:43:46,364::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
Thread-890::DEBUG::2015-10-10 04:43:46,471::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
Thread-22::DEBUG::2015-10-10 04:43:51,667::blockSD::596::Storage.Misc.excCmd::(getReadDelay) /bin/dd if=/dev/a4ef59c4-a97a-498d-84a0-a294fdc83c50/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-22::DEBUG::2015-10-10 04:43:51,676::blockSD::596::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0 records out\n4096 bytes (4.1 kB) copied, 0.000221506 s, 18.5 MB/s\n'; <rc> = 0
[.. repeated...]
Detector thread::DEBUG::2015-10-10 04:44:14,892::protocoldetector::201::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 10.65.161.80:46605
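The engine-side pattern above (timeouts for a healthy host like rhev4, with no matching records in its vdsm.log, right around a connection attempt to a blocked host) is what you would expect if a single reactor thread serves all hosts and one blocking operation stalls it. A minimal Python simulation of that failure mode, using illustrative numbers rather than actual engine internals or timeouts:

```python
# Illustrative simulation (not engine code): why one stuck connect in a
# single-threaded reactor can make healthy hosts miss their heartbeats.

HEARTBEAT_TIMEOUT = 1.0  # seconds; illustrative value, not the engine default

def run_reactor(events):
    """Single-threaded reactor: each event is handled to completion
    before the next queued event is even looked at."""
    clock = 0.0
    beats = {}  # host -> times at which its heartbeats were processed
    for kind, host, cost in events:
        clock += cost  # time spent handling this event
        if kind == "heartbeat":
            beats.setdefault(host, []).append(clock)
    return beats

events = [
    ("heartbeat", "rhev4", 0.01),  # healthy host, cheap to handle
    ("connect", "rhev1", 3.0),     # blocked host: connect hangs until timeout
    ("heartbeat", "rhev4", 0.01),  # queued behind the stuck connect
]
beats = run_reactor(events)
gap = beats["rhev4"][1] - beats["rhev4"][0]
print(gap > HEARTBEAT_TIMEOUT)  # -> True: the healthy host looks dead too
```

With 4 of 8 hosts unreachable, such stalls would queue up back to back, which matches all hosts, healthy ones included, going NonResponsive.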
Looking at the logs I can see "Heartbeat exceeded" messages, which are causing the reconnections. Is the network configured properly? Can you please also experiment with the heartbeat timeout to see whether that resolves the issue?
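For reference, if you do want to experiment with the engine-side heartbeat setting, something along these lines should work. The option name `vdsHeartbeatInSeconds` and the value below are assumptions to verify against your version's option list first:

```shell
# Assumption: the heartbeat interval is exposed via engine-config; confirm
# the exact option name and current value for your version first:
#   engine-config -l | grep -i heartbeat
engine-config -s vdsHeartbeatInSeconds=30   # illustrative value for testing
service ovirt-engine restart                # change takes effect after restart
```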
(In reply to Piotr Kliczewski from comment #14)
> Looking at the logs I can see that there is "Heartbeat exceeded" which is
> causing reconnections. Is the network configured properly and can you please
> play with heartbeat timeout in order to solve this issue?

Don't we collect network statistics that would show whether the engine or host side is seeing packet drops, re-ordering, etc.?
I am not aware of any stats that we could use to diagnose this kind of issue.
This looks related to BZ #1310563. We had a similar issue where an in-progress connection attempt blocked messages from other hosts from being processed. That issue was fixed in 3.6 and master. Do we want the fix backported to 3.5?
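For readers landing here later: the general remedy for this class of bug is to make the connection attempt non-blocking and let the event loop keep servicing other hosts while the connect is in flight. A small Python sketch of that pattern using the standard `selectors` module; this is illustrative only, not the actual org.ovirt.vdsm.jsonrpc.client fix (which is Java):

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def start_connect(host, port):
    """Begin a TCP connect without ever blocking the reactor thread."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    sock.connect_ex((host, port))  # returns at once (EINPROGRESS on Linux)
    # The selector reports EVENT_WRITE when (and if) the connect completes;
    # meanwhile the reactor keeps servicing every other registered socket.
    sel.register(sock, selectors.EVENT_WRITE, data=host)
    return sock

# Demo against a local listener standing in for a reachable host.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
sock = start_connect("127.0.0.1", server.getsockname()[1])

ready = sel.select(timeout=2.0)  # a dead host would simply return [] here
print(len(ready))  # -> 1: the connect to the live listener completed
for s in (sock, server):
    s.close()
sel.close()
```

A host that never answers simply never becomes writable and can be timed out individually, instead of freezing the whole loop and starving heartbeats from healthy hosts.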
*** This bug has been marked as a duplicate of bug 1310563 ***