The default setting seems unsuitable for non-ideal network conditions. Hosts are frequently flipping to NonResponsive/Connecting even though the VMs keep working fine and general connectivity between the DC and the engine host is intact. Latency is higher than usual, but there are no clear latency requirements specified anywhere.
Martin, we could increase the value, but there will still be environments where it is not enough. Michal, any suggestions on what the desired value would be?
Not sure. It is 30s currently, right? Maybe a different behavior would work better? Fire some auditlog events first?
(In reply to Michal Skrivanek from comment #4)
> Not sure. It is 30s currently, right?
Yes, it is currently 30 seconds.
> Maybe a different behavior would work better? Fire some auditlog events
> first?
Please elaborate on your suggestion; I am not sure what you mean.
I mean that there should be a clear warning in the audit log well before we move the host to Not Responding. 30s for a response is bad, but still bearable as long as the call succeeds. Perhaps have a lower threshold for warnings? E.g. a summary audit log event listing the calls in the last hour that took longer than 10s? I wonder if we can/should perhaps watch the monitoring threads' results instead, or in addition.
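For illustration only, a minimal sketch of the kind of lower-threshold warning suggested above: record calls that exceed 10s and emit one hourly summary event instead of individual warnings. The class and method names (SlowCallTracker, auditLog) are hypothetical, not the actual ovirt-engine API.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public class SlowCallTracker {
        private static final long WARN_THRESHOLD_MS = TimeUnit.SECONDS.toMillis(10);
        private final List<String> slowCalls = new ArrayList<>();

        // Called after each vdsm verb completes, with its measured duration.
        public synchronized void record(String verb, long durationMs) {
            if (durationMs > WARN_THRESHOLD_MS) {
                slowCalls.add(verb + " took " + durationMs + " ms");
            }
        }

        // Invoked hourly (e.g. from a scheduled executor) to produce a single
        // summary event rather than flooding the audit log.
        public synchronized void flushHourlySummary() {
            if (!slowCalls.isEmpty()) {
                auditLog("Slow vdsm calls in the last hour: " + slowCalls);
                slowCalls.clear();
            }
        }

        private void auditLog(String message) {
            // Placeholder for the real audit-log mechanism.
            System.out.println("AUDIT: " + message);
        }
    }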
Please note that we are dealing with networking. At the moment there is no means to notify higher engine layers about missing replies from the host. Monitoring threads are not a good place to detect such things since they are too high in the call stack. If we want to notify the user about potential network issues prior to moving the host to NonResponding, it would mean turning this bug into an RFE and thinking about how to do it well so that we do not annoy users whose network is not stable. That would also mean we should not backport it to the stable branch. I would suggest increasing the heartbeat interval as Michal suggested and opening an RFE to work on alerting. Michal, please suggest a value based on your experience.
Based on a conversation with Martin, we are not going to change the interval. Instead, we will send heartbeats more frequently from vdsm, and for now we will log the lack of messages after half the interval time.
It also needs to be mentioned that this helps only in the case of network issues where the 1st heartbeat is lost: the engine will still wait for the 2nd heartbeat, and only if both heartbeats are lost will "heartbeat exceeded" be raised. It will not help in cases like BZ1488338, where VDSM is blocked and unable to send heartbeats at all.
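A minimal sketch of the engine-side behavior described in the last two comments, assuming vdsm now sends heartbeats at half the configured interval; this is illustrative only and not the actual vdsm-jsonrpc-java code. It warns once when nothing has arrived for half the interval and declares "heartbeat exceeded" only after the full interval, i.e. after both of the more frequent heartbeats were lost.

    public class HeartbeatTracker {
        // Engine-side timeout stays unchanged at 30s; vdsm now sends more often.
        private static final long HEARTBEAT_INTERVAL_MS = 30_000;
        private volatile long lastMessageTime = System.currentTimeMillis();
        private volatile boolean warned;

        // Called whenever any message (including a heartbeat) arrives from the host.
        public void onMessageReceived() {
            lastMessageTime = System.currentTimeMillis();
            warned = false;
        }

        // Called periodically by a monitoring timer.
        public void check() {
            long silence = System.currentTimeMillis() - lastMessageTime;
            if (silence > HEARTBEAT_INTERVAL_MS) {
                // Both heartbeats lost: the host is moved to NonResponsive.
                throw new IllegalStateException("Heartbeat exceeded");
            }
            if (silence > HEARTBEAT_INTERVAL_MS / 2 && !warned) {
                warned = true;
                // Log the lack of messages at half the interval, as planned above.
                System.out.println("WARN: no messages from host for " + silence + " ms");
            }
        }
    }

As noted, this pattern cannot help when VDSM itself is blocked and stops sending heartbeats entirely (the BZ1488338 case).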
Verified on vdsm-jsonrpc-java-1.4.11-1.el7ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1516