Bug 1513360 - Frequent heartbeat exceeded, hosts not responding
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm-jsonrpc-java (Show other bugs)
Version: 4.1.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assigned To: Piotr Kliczewski
QA Contact: Petr Matyáš
Depends On:
Blocks: 1488259
Reported: 2017-11-15 04:51 EST by Michal Skrivanek
Modified: 2018-05-15 13:56 EDT (History)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-15 13:56:18 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
External Trackers
Tracker                 ID              Priority          Status  Summary                      Last Updated
oVirt gerrit            84985           master            MERGED  heartbeat: change frequency  2017-12-04 05:23 EST
oVirt gerrit            85671           ovirt-engine-4.1  MERGED  jsonrpc: version update      2017-12-22 02:45 EST
Red Hat Product Errata  RHEA-2018:1516  None              None    None                         2018-05-15 13:56 EDT

Description Michal Skrivanek 2017-11-15 04:51:14 EST
The default heartbeat setting seems unsuitable for non-ideal network conditions. Hosts frequently flip to NonResponsive/Connecting even though the VMs are working fine and general connectivity between the DC and the engine host is intact.
Latency is higher than usual, but no clear latency requirements are specified anywhere.
Comment 3 Piotr Kliczewski 2017-11-20 04:30:08 EST
Martin, we could increase the value, but there will still be environments where it is not enough.
Michal, any suggestions as to what the desired value would be?
Comment 4 Michal Skrivanek 2017-11-20 05:43:32 EST
Not sure. It is 30s currently, right?
Maybe a different behavior would work better? Fire some auditlog events first?
Comment 5 Piotr Kliczewski 2017-11-20 05:55:39 EST
(In reply to Michal Skrivanek from comment #4)
> Not sure. It is 30s currently, right?

Yes, it is currently 30 seconds.

> Maybe a different behavior would work better? Fire some auditlog events
> first?

Please elaborate on your suggestion; I do not know what you mean.
Comment 6 Michal Skrivanek 2017-11-21 04:20:04 EST
I mean that there should be a clear warning in the audit log well before we move the host to Not Responding.
30s for a response is bad, but still bearable as long as the call succeeds. Perhaps have a lower threshold for warnings? E.g. a summary audit log entry listing the calls in the last hour that took longer than 10s?
I wonder whether we can or should watch the monitoring threads' results instead, or in addition.
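
The summary idea above can be illustrated with a minimal Python sketch. It is not engine or vdsm-jsonrpc-java code: the class `SlowCallSummary`, its method names, and the message wording are hypothetical, and only the 10s threshold and the one-hour window come from the comment. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque
import time


class SlowCallSummary:
    """Hypothetical helper: remembers calls slower than a warning threshold."""

    def __init__(self, warn_after=10.0, window=3600.0):
        self.warn_after = warn_after  # seconds; the 10s suggested in the comment
        self.window = window          # seconds; "the last hour"
        self._slow = deque()          # (finished_at, verb, duration), oldest first

    def record(self, verb, duration, finished_at=None):
        """Store a finished call only if it was slower than the threshold."""
        now = time.monotonic() if finished_at is None else finished_at
        if duration > self.warn_after:
            self._slow.append((now, verb, duration))

    def summary(self, now=None):
        """Return one audit-log style line, or None if nothing was slow."""
        now = time.monotonic() if now is None else now
        # Drop entries that fell out of the window (entries arrive in time order).
        while self._slow and now - self._slow[0][0] > self.window:
            self._slow.popleft()
        if not self._slow:
            return None
        worst = max(duration for _, _, duration in self._slow)
        return (
            f"{len(self._slow)} call(s) took longer than {self.warn_after:g}s "
            f"in the last {self.window / 60:g} min (slowest: {worst:g}s)"
        )
```

A caller would invoke `record()` after every completed call and emit `summary()` periodically, so a slow but still working host produces a warning long before any status change.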
Comment 7 Piotr Kliczewski 2017-11-27 06:19:59 EST
Please note that we are dealing with networking. At the moment there is no way to notify the higher engine layers about missing replies from the host. Monitoring threads are not a good place to detect such things, since they are too high in the call stack.

If we want to notify the user about potential network issues before moving the host to NonResponsive, we would need to turn this bug into an RFE and think about how to do it well, so that users are not annoyed when the network is unstable. This would also mean that we should not backport it to the stable branch.

I would suggest increasing the heartbeat interval, as suggested by Michal, and opening an RFE to work on alerting.

Michal, please suggest the value based on your experience.
Comment 8 Piotr Kliczewski 2017-12-01 05:21:29 EST
Based on a conversation with Martin, we are not going to change the interval. Instead, vdsm will send heartbeats more frequently, and for now we will log the lack of messages once half the interval has elapsed.
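
A rough Python sketch of the receiving side described in this comment follows. It is an illustration only, not the vdsm-jsonrpc-java implementation: the class `HeartbeatMonitor` and the state names `"ok"`, `"warn"`, and `"exceeded"` are hypothetical, and the 30-second interval is the value confirmed in comment 5.

```python
class HeartbeatMonitor:
    """Hypothetical monitor: warns at half the interval, fails at the full interval."""

    def __init__(self, interval=30.0):
        self.interval = interval  # seconds of silence before "heartbeat exceeded"
        self.last_seen = 0.0      # time the last message or heartbeat arrived

    def on_message(self, now):
        """Any incoming message, including a heartbeat, resets the silence timer."""
        self.last_seen = now

    def check(self, now):
        """Classify the current silence period."""
        silence = now - self.last_seen
        if silence > self.interval:
            return "exceeded"     # the connection would be treated as lost
        if silence > self.interval / 2:
            return "warn"         # log only: no messages for half the interval
        return "ok"
```

With the default interval, 16 seconds of silence yields `"warn"` and 31 seconds yields `"exceeded"`; the timeout itself is unchanged, only the early log entry is new.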
Comment 9 Martin Perina 2017-12-01 05:49:45 EST
It should also be mentioned that this helps only with network issues where the first heartbeat is lost: the engine will still wait for the second heartbeat, and heartbeat exceeded will be raised only if both heartbeats are lost.

It will not help in cases like BZ1488338, where VDSM is blocked and unable to send heartbeats at all.
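
The two cases can be worked through with a small simulation. This is a sketch under stated assumptions, not the actual engine logic: the function `heartbeat_exceeded` is hypothetical, the 15-second send period (two heartbeats per 30-second interval) is inferred from the "first" and "second" heartbeat wording rather than stated in this thread, and network latency is ignored, which in a real deployment makes the single-loss case borderline.

```python
def heartbeat_exceeded(lost, interval=30.0, send_period=15.0, total=8):
    """Return True if the silence between received heartbeats exceeds the interval.

    lost: set of 1-based indices of heartbeats that never arrive.
    send_period: assumed sending period (hypothetical value, see lead-in).
    """
    last_seen = 0.0
    for i in range(1, total + 1):
        sent_at = i * send_period
        if i in lost:
            continue                      # dropped by the network or never sent
        if sent_at - last_seen > interval:
            return True                   # the gap was already too long
        last_seen = sent_at
    # Also account for silence at the end of the simulated period.
    return total * send_period - last_seen > interval


# One lost heartbeat is tolerated, two in a row are not.
print(heartbeat_exceeded({1}))                 # False
print(heartbeat_exceeded({1, 2}))              # True
# A blocked VDSM sends nothing at all, so the timeout is always hit.
print(heartbeat_exceeded(set(range(1, 9))))    # True
```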
Comment 11 Petr Matyáš 2018-01-17 10:25:04 EST
Verified on vdsm-jsonrpc-java-1.4.11-1.el7ev.noarch
Comment 14 errata-xmlrpc 2018-05-15 13:56:18 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1516
