Bug 1513360 - Frequent heartbeat exceeded, hosts not responding
Summary: Frequent heartbeat exceeded, hosts not responding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm-jsonrpc-java
Version: 4.1.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Piotr Kliczewski
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks: 1488259
Reported: 2017-11-15 09:51 UTC by Michal Skrivanek
Modified: 2020-08-03 15:27 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-15 17:56:18 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2018:1516 0 None None None 2018-05-15 17:56:46 UTC
oVirt gerrit 84985 0 'None' MERGED heartbeat: change frequency 2020-12-08 10:14:28 UTC
oVirt gerrit 85671 0 'None' MERGED jsonrpc: version update 2020-12-08 10:14:28 UTC

Description Michal Skrivanek 2017-11-15 09:51:14 UTC
The default heartbeat setting seems unsuitable for non-ideal network conditions. Hosts frequently flip to NonResponsive/Connecting even though the VMs are working fine and general connectivity between the DC and the engine host is intact.
Latency is higher than usual, but no clear requirements are specified anywhere.

Comment 3 Piotr Kliczewski 2017-11-20 09:30:08 UTC
Martin, we could increase the value, but there will still be environments where it is not enough.
Michal, any suggestion for what the desired value would be?

Comment 4 Michal Skrivanek 2017-11-20 10:43:32 UTC
Not sure. It is 30s currently, right?
Maybe a different behavior would work better? Fire some auditlog events first?

Comment 5 Piotr Kliczewski 2017-11-20 10:55:39 UTC
(In reply to Michal Skrivanek from comment #4)
> Not sure. It is 30s currently, right?

Yes, it is currently 30 seconds.

> Maybe a different behavior would work better? Fire some auditlog events
> first?

Please elaborate on your suggestion; I do not know what you mean.

Comment 6 Michal Skrivanek 2017-11-21 09:20:04 UTC
I mean that there should be a clear warning in the audit log well before we move the host to Not Responding.
30s for a response is bad, but still bearable as long as the call succeeds. Perhaps have a lower threshold for warnings? E.g. a summary audit log entry listing calls in the last hour that took longer than 10s?
I wonder if we can/should perhaps watch the monitoring threads' results instead, or in addition.
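
A minimal sketch of the "summary of slow calls" idea above, for illustration only. It is not engine code: the class name `SlowCallSummary`, its methods, and the host names are hypothetical. Only the 10s threshold and the one-hour window come from the comment. It assumes calls are recorded in chronological order.

```python
from collections import deque

WARN_THRESHOLD_S = 10.0  # warning threshold suggested in the comment
WINDOW_S = 3600.0        # "calls in the last hour"


class SlowCallSummary:
    """Hypothetical helper: remembers calls slower than a threshold."""

    def __init__(self, threshold=WARN_THRESHOLD_S, window=WINDOW_S):
        self.threshold = threshold
        self.window = window
        self._slow = deque()  # entries: (finished_at, host, duration)

    def record(self, host, duration, finished_at):
        # Only calls slower than the threshold are kept.
        if duration > self.threshold:
            self._slow.append((finished_at, host, duration))

    def summary(self, now):
        # Drop entries that fell out of the window, then count per host.
        while self._slow and self._slow[0][0] < now - self.window:
            self._slow.popleft()
        counts = {}
        for _, host, _ in self._slow:
            counts[host] = counts.get(host, 0) + 1
        return counts


s = SlowCallSummary()
s.record("host1", 12.0, finished_at=100.0)  # slow, kept
s.record("host1", 3.0, finished_at=200.0)   # fast, ignored
s.record("host2", 25.0, finished_at=300.0)  # slow, kept
print(s.summary(now=400.0))
```

A periodic job could turn a non-empty summary into one audit log warning per host, which would surface degraded latency without changing host status.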

Comment 7 Piotr Kliczewski 2017-11-27 11:19:59 UTC
Please note that we are dealing with networking. At the moment there is no means to notify higher engine layers about missing replies from the host. Monitoring threads are not a good place to figure out such things, since they are too high in the call stack.

If we want to notify the user about potential network issues prior to moving to NonResponsive, we would need to make an RFE out of this bug and think about how to do it well, so as not to annoy users when the network is unstable. This would mean that we should not backport it to the stable branch.

I would suggest increasing the heartbeat interval, as suggested by Michal, and opening an RFE to work on alerting.

Michal, please suggest the value based on your experience.

Comment 8 Piotr Kliczewski 2017-12-01 10:21:29 UTC
Based on a conversation with Martin, we are not going to change the interval, but we will send heartbeats more frequently from vdsm and, for now, log a lack of messages after half the interval has elapsed.

Comment 9 Martin Perina 2017-12-01 10:49:45 UTC
It should also be mentioned that this helps only with network issues where the 1st heartbeat is lost: the engine will still wait for the 2nd heartbeat, and heartbeat exceeded is raised only if both heartbeats are lost.

It will not help in cases like BZ1488338, where VDSM is blocked and unable to send heartbeats at all.
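
To make the behavior described in comments 8 and 9 concrete, here is a rough model, not the vdsm-jsonrpc-java implementation. It assumes a 30s engine-side interval (from the thread) and assumes that vdsm sends a heartbeat every half interval, which the thread implies but does not state. The function names and state labels are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 30.0  # engine-side limit mentioned in the thread
STATES = ["ok", "warn", "exceeded"]


def classify(silence_s, interval_s=HEARTBEAT_INTERVAL_S):
    """Map the time since the last message to a state."""
    if silence_s > interval_s:
        return "exceeded"  # engine would raise heartbeat exceeded
    if silence_s > interval_s / 2:
        return "warn"      # engine would only log the lack of messages
    return "ok"


def worst_state(arrivals, end_time, interval_s=HEARTBEAT_INTERVAL_S):
    """Worst state seen between t=0 and end_time for given arrival times."""
    last = 0.0
    worst = "ok"
    for t in sorted(arrivals) + [end_time]:
        state = classify(t - last, interval_s)
        if STATES.index(state) > STATES.index(worst):
            worst = state
        last = t
    return worst


# Heartbeats assumed to be sent at 15, 30, 45 and 60 seconds.
print(worst_state([15, 30, 45, 60], end_time=60))  # none lost
print(worst_state([15, 45, 60], end_time=60))      # one lost (t=30)
print(worst_state([15, 60], end_time=60))          # two in a row lost
print(worst_state([], end_time=60))                # VDSM blocked
```

Under these assumptions a single lost heartbeat produces only a log entry, two consecutive losses exceed the interval, and a blocked VDSM still ends in heartbeat exceeded, which matches the limitation noted for BZ1488338.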

Comment 11 Petr Matyáš 2018-01-17 15:25:04 UTC
Verified on vdsm-jsonrpc-java-1.4.11-1.el7ev.noarch

Comment 14 errata-xmlrpc 2018-05-15 17:56:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1516

Comment 15 Franta Kust 2019-05-16 13:05:55 UTC
BZ<2>Jira Resync

