Description of problem:
During the HA tests, hosts were marked as NonResponsive because of timeouts.

Version-Release number of selected component (if applicable):
I'm opening this bug against vdsm, but to be honest I'm not sure it's vdsm's fault.
rhevm-backend-3.5.4.2-1.3.el6ev.noarch
vdsm-4.16.26-1.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install 8 hypervisors and add them to the manager. Each hypervisor runs 1-2 VMs (it doesn't matter whether power management is configured or not).
2. Block network communication on 4 of the hypervisors.

Actual results:
All hypervisors are marked as NonResponsive, including the healthy ones.

Expected results:
Only the 4 problematic hypervisors should be marked as NonResponsive and fenced out (if configured).

Additional info:
engine.log contains many TimeoutExceptions, which is expected because half of the hosts have no network connectivity. But these are not normal:

2015-10-10 04:43:36,952 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-44) Command ListVDSCommand(HostName = rhev4, HostId = efd6bcec-37b1-41a9-9152-be5ce956ae0c, vds=Host[rhev4,efd6bcec-37b1-41a9-9152-be5ce956ae0c]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues  <== there are no corresponding records in vdsm.log

2015-10-10 04:43:44,737 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) Connecting to /10.65.161.73  <== IP address of the rhev4 host; this connection attempt was logged on the host (see below)
[....]
2015-10-10 04:46:50,987 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-24) HostName = rhev4
2015-10-10 04:46:50,987 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-24) Command GetCapabilitiesVDSCommand(HostName = rhev4, HostId = efd6bcec-37b1-41a9-9152-be5ce956ae0c, vds=Host[rhev4,efd6bcec-37b1-41a9-9152-be5ce956ae0c]) execution failed. Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

vdsm log:

Detector thread::DEBUG::2015-10-10 04:43:44,731::protocoldetector::187::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 10.65.161.80:46605  <== connection added, but the STOMP command was never parsed
Thread-27::DEBUG::2015-10-10 04:43:45,883::fileSD::262::Storage.Misc.excCmd::(getReadDelay) /bin/dd if=/rhev/data-center/mnt/10.65.161.80:_domain_iso/9e2e5b05-22fc-422a-bf17-37534e38f9bc/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-27::DEBUG::2015-10-10 04:43:45,891::fileSD::262::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n328 bytes (328 B) copied, 0.000199751 s, 1.6 MB/s\n'; <rc> = 0
Thread-864::DEBUG::2015-10-10 04:43:46,364::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
Thread-890::DEBUG::2015-10-10 04:43:46,471::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
Thread-22::DEBUG::2015-10-10 04:43:51,667::blockSD::596::Storage.Misc.excCmd::(getReadDelay) /bin/dd if=/dev/a4ef59c4-a97a-498d-84a0-a294fdc83c50/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
Thread-22::DEBUG::2015-10-10 04:43:51,676::blockSD::596::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '1+0 records in\n1+0 records out\n4096 bytes (4.1 kB) copied, 0.000221506 s, 18.5 MB/s\n'; <rc> = 0
[.. repeated...]
Detector thread::DEBUG::2015-10-10 04:44:14,892::protocoldetector::201::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 10.65.161.80:46605
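The engine-side pattern above (timeouts for a healthy host like rhev4, with no matching records in its vdsm.log, right around a connection attempt to a blocked host) is what you would expect if a single reactor thread serves all hosts and one blocking operation stalls it. A minimal Python simulation of that failure mode, using illustrative numbers rather than actual engine internals or timeouts:

```python
# Illustrative simulation (not engine code): why one stuck connect in a
# single-threaded reactor can make healthy hosts miss their heartbeats.

HEARTBEAT_TIMEOUT = 1.0  # seconds; illustrative value, not the engine default

def run_reactor(events):
    """Single-threaded reactor: each event is handled to completion
    before the next queued event is even looked at."""
    clock = 0.0
    beats = {}  # host -> times at which its heartbeats were processed
    for kind, host, cost in events:
        clock += cost  # time spent handling this event
        if kind == "heartbeat":
            beats.setdefault(host, []).append(clock)
    return beats

events = [
    ("heartbeat", "rhev4", 0.01),  # healthy host, cheap to handle
    ("connect", "rhev1", 3.0),     # blocked host: connect hangs until timeout
    ("heartbeat", "rhev4", 0.01),  # queued behind the stuck connect
]
beats = run_reactor(events)
gap = beats["rhev4"][1] - beats["rhev4"][0]
print(gap > HEARTBEAT_TIMEOUT)  # -> True: the healthy host looks dead too
```

With 4 of 8 hosts unreachable, such stalls would queue up back to back, which matches all hosts, healthy ones included, going NonResponsive.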
Looking at the logs I can see "Heartbeat exceeded" messages, which are causing the reconnections. Is the network configured properly? Can you please also experiment with the heartbeat timeout to see whether that resolves the issue?
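For reference, if you do want to experiment with the engine-side heartbeat setting, something along these lines should work. The option name `vdsHeartbeatInSeconds` and the value below are assumptions to verify against your version's option list first:

```shell
# Assumption: the heartbeat interval is exposed via engine-config; confirm
# the exact option name and current value for your version first:
#   engine-config -l | grep -i heartbeat
engine-config -s vdsHeartbeatInSeconds=30   # illustrative value for testing
service ovirt-engine restart                # change takes effect after restart
```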
(In reply to Piotr Kliczewski from comment #14)
> Looking at the logs I can see that there is "Heartbeat exceeded" which is
> causing reconnections. Is the network configured properly and can you please
> play with heartbeat timeout in order to solve this issue?

Don't we collect network statistics that would show whether the engine or host side is seeing packet drops, re-ordering, etc.?
I am not aware of any stats that we could use to diagnose this kind of issue.
This looks related to BZ #1310563. We had a similar issue where an in-progress connection attempt blocked messages from other hosts from being processed. That issue was fixed in 3.6 and master. Do we want the fix backported to 3.5?
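For readers landing here later: the general remedy for this class of bug is to make the connection attempt non-blocking and let the event loop keep servicing other hosts while the connect is in flight. A small Python sketch of that pattern using the standard `selectors` module; this is illustrative only, not the actual org.ovirt.vdsm.jsonrpc.client fix (which is Java):

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def start_connect(host, port):
    """Begin a TCP connect without ever blocking the reactor thread."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    sock.connect_ex((host, port))  # returns at once (EINPROGRESS on Linux)
    # The selector reports EVENT_WRITE when (and if) the connect completes;
    # meanwhile the reactor keeps servicing every other registered socket.
    sel.register(sock, selectors.EVENT_WRITE, data=host)
    return sock

# Demo against a local listener standing in for a reachable host.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
sock = start_connect("127.0.0.1", server.getsockname()[1])

ready = sel.select(timeout=2.0)  # a dead host would simply return [] here
print(len(ready))  # -> 1: the connect to the live listener completed
for s in (sock, server):
    s.close()
sel.close()
```

A host that never answers simply never becomes writable and can be timed out individually, instead of freezing the whole loop and starving heartbeats from healthy hosts.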
*** This bug has been marked as a duplicate of bug 1310563 ***