Bug 1273754 - Host is marked as non-responsive because of JSON RPC timeout
Status: CLOSED DUPLICATE of bug 1310563
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.0-alpha
Target Release: ---
Assigned To: Piotr Kliczewski
QA Contact: Pavel Stehlik
Depends On:
Blocks:
 
Reported: 2015-10-21 03:54 EDT by Pavel Zhukov
Modified: 2017-03-19 12:11 EDT (History)
CC: 12 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-06 05:14:35 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Pavel Zhukov 2015-10-21 03:54:56 EDT
Description of problem:
During the HA tests, hosts were marked as NonResponsive because of timeouts.

Version-Release number of selected component (if applicable):
I'm opening this bug against vdsm, but to be honest I'm not sure it's vdsm's fault.
rhevm-backend-3.5.4.2-1.3.el6ev.noarch
vdsm-4.16.26-1.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install 8 hypervisors and add them to the manager. Each HV has 1-2 VMs running (it doesn't matter whether PM is configured or not)
2. Block network communication on 4 HVs
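Step 2 ("block network communication") can be reproduced with firewall drop rules on each of the four hypervisors. A sketch, assuming the engine address 10.65.161.80 seen in the logs below; substitute the real engine IP for your environment:

```shell
# On each of the 4 "problematic" hypervisors, drop all traffic
# to and from the engine to simulate a network outage.
iptables -I INPUT  -s 10.65.161.80 -j DROP
iptables -I OUTPUT -d 10.65.161.80 -j DROP

# To restore connectivity afterwards, delete the same rules:
iptables -D INPUT  -s 10.65.161.80 -j DROP
iptables -D OUTPUT -d 10.65.161.80 -j DROP
```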


Actual results:
All hypervisors are marked as NonResponsive, including the healthy ones

Expected results:
Only 4 problematic hypervisors should be marked as NonResponsive and fenced out (if configured)

Additional info:

engine.log contains many TimeoutExceptions, which are expected because half of the hosts have no network connectivity. But these are not normal:

2015-10-10 04:43:36,952 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-44) Command ListVDSCommand(HostName = rhev4, HostId = efd6bcec-37b1-41a9-9152-be5ce956ae0c, vds=Host[rhev4,efd6bcec-37b1-41a9-9152-be5ce956ae0c]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues <== There are not corresponding records in vdsm.log 

2015-10-10 04:43:44,737 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to /10.65.161.73 <== IP address of rhev4 host and this connection attempt was logged (see below)
[....]

2015-10-10 04:46:50,987 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-24) HostName = rhev4
2015-10-10 04:46:50,987 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-24) Command GetCapabilitiesVDSCommand(HostName = rhev4, HostId = efd6bcec-37b1-41a9-9152-be5ce956ae0c, vds=Host[rhev4,efd6bcec-37b1-41a9-9152-be5ce956ae0c]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues


vdsm log:

Detector thread::DEBUG::2015-10-10 04:43:44,731::protocoldetector::187::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 10.65.161.80:46605 <== Connection added but stomp command was not parsed
Thread-27::DEBUG::2015-10-10 04:43:45,883::fileSD::262::Storage.Misc.excCmd::(getReadDelay) /bin/dd if=/rhev/data-center/mnt/10.65.161.80:_domain_iso/9e2e5b05-22fc-422a-bf17-37534e38f9bc/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-27::DEBUG::2015-10-10 04:43:45,891::fileSD::262::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n328 bytes (328 B) copied, 0.000199751 s, 1.6 MB/s\n'; <rc> = 0
Thread-864::DEBUG::2015-10-10 04:43:46,364::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
Thread-890::DEBUG::2015-10-10 04:43:46,471::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
Thread-22::DEBUG::2015-10-10 04:43:51,667::blockSD::596::Storage.Misc.excCmd::(getReadDelay) /bin/dd if=/dev/a4ef59c4-a97a-498d-84a0-a294fdc83c50/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-22::DEBUG::2015-10-10 04:43:51,676::blockSD::596::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0 records out\n4096 bytes (4.1 kB) copied, 0.000221506 s, 18.5 MB/s\n'; <rc> = 0

[.. repeated...] 

Detector thread::DEBUG::2015-10-10 04:44:14,892::protocoldetector::201::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 10.65.161.80:46605
Comment 14 Piotr Kliczewski 2015-10-26 06:58:16 EDT
Looking at the logs I can see "Heartbeat exeeded" messages, which are causing reconnections. Is the network configured properly, and could you please tune the heartbeat timeout to see whether it solves this issue?
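For reference, the heartbeat interval referred to here is exposed on the engine side as an engine-config option. A sketch, assuming the `vdsHeartbeatInSeconds` option name used in this oVirt/RHEV generation; verify with `engine-config -l` on your installation:

```shell
# Show the current heartbeat interval the engine expects from each host.
engine-config -g vdsHeartbeatInSeconds

# Raise it, then restart the engine for the change to take effect.
engine-config -s vdsHeartbeatInSeconds=30
service ovirt-engine restart
```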
Comment 15 Yaniv Kaul 2015-10-26 07:35:18 EDT
(In reply to Piotr Kliczewski from comment #14)
> Looking at the logs I can see "Heartbeat exeeded" messages, which are
> causing reconnections. Is the network configured properly, and could you
> please tune the heartbeat timeout to see whether it solves this issue?

Don't we collect network statistics, so we can see whether either the engine or the host side is seeing packet drops, re-ordering, etc.?
Comment 16 Piotr Kliczewski 2015-10-26 08:18:48 EDT
I am not aware of any stats that we could use to diagnose this kind of issue.
Comment 46 Piotr Kliczewski 2016-04-05 08:51:48 EDT
It looks like this is related to BZ #1310563. We had a similar issue where an attempt to connect to one host blocked the processing of messages from other hosts. The issue was fixed in 3.6 and master. Do we want the fix backported to 3.5?
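The failure mode described here, a pending connection attempt to one unreachable host starving message processing for every other host, is characteristic of a blocking connect() performed on a single reactor thread. A minimal sketch of the non-blocking alternative in plain java.nio (this is illustrative only, not the actual vdsm-jsonrpc-java code):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingConnect {
    // Returns true once the connection completes, without ever blocking
    // the reactor loop inside connect() itself.
    static boolean connectWithoutBlocking() throws IOException {
        // Local listener standing in for a reachable host.
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));

        Selector selector = Selector.open();
        SocketChannel client = SocketChannel.open();
        client.configureBlocking(false);            // connect() returns immediately
        client.register(selector, SelectionKey.OP_CONNECT);
        client.connect(server.getLocalAddress());   // in progress; loop keeps running

        // The reactor loop stays free to service other channels while this
        // connection completes in the background.
        while (selector.select(1000) > 0) {
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isConnectable()
                        && ((SocketChannel) key.channel()).finishConnect()) {
                    client.close();
                    server.close();
                    return true;
                }
            }
            selector.selectedKeys().clear();
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(connectWithoutBlocking() ? "connected" : "timed out");
    }
}
```

With a blocking connect, the same loop would stall for the full TCP connect timeout on one dead host while healthy hosts' traffic queued behind it, which matches the symptom of healthy hosts being marked NonResponsive.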
Comment 50 Piotr Kliczewski 2016-04-06 05:14:35 EDT

*** This bug has been marked as a duplicate of bug 1310563 ***
