The default setting seems unsuitable for non-ideal network conditions. Hosts are frequently flipping to NonResponsive/Connecting even though the VMs keep working fine and general connectivity between the DC and the engine host is intact. Latency is higher than usual, but there are no clear latency requirements specified anywhere.
Martin, we could increase the value, but there will still be environments where it is not enough. Michal, any suggestions on what the desired value would be?
Not sure. It is 30s currently, right? Maybe a different behavior would work better? Fire some auditlog events first?
(In reply to Michal Skrivanek from comment #4)
> Not sure. It is 30s currently, right?
Yes, it is currently 30 seconds.
> Maybe a different behavior would work better? Fire some auditlog events
> first?
Please elaborate on your suggestion; I am not sure what you mean.
I mean that there should be a clear warning in the audit log well before we move the host to Not Responding. 30s for a response is bad, but still bearable as long as the call succeeds. Perhaps have a lower threshold for warnings? E.g. a summary audit log event listing the calls in the last hour that took longer than 10s? I wonder if we can/should perhaps watch the monitoring threads' results instead, or in addition.
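For illustration only, a minimal sketch of the kind of lower-threshold warning suggested above: record calls that exceed 10s and emit one hourly summary event instead of individual warnings. The class and method names (SlowCallTracker, auditLog) are hypothetical, not the actual ovirt-engine API.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public class SlowCallTracker {
        private static final long WARN_THRESHOLD_MS = TimeUnit.SECONDS.toMillis(10);
        private final List<String> slowCalls = new ArrayList<>();

        // Called after each vdsm verb completes, with its measured duration.
        public synchronized void record(String verb, long durationMs) {
            if (durationMs > WARN_THRESHOLD_MS) {
                slowCalls.add(verb + " took " + durationMs + " ms");
            }
        }

        // Invoked hourly (e.g. from a scheduled executor) to produce a single
        // summary event rather than flooding the audit log.
        public synchronized void flushHourlySummary() {
            if (!slowCalls.isEmpty()) {
                auditLog("Slow vdsm calls in the last hour: " + slowCalls);
                slowCalls.clear();
            }
        }

        private void auditLog(String message) {
            // Placeholder for the real audit-log mechanism.
            System.out.println("AUDIT: " + message);
        }
    }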
Please note that we are dealing with networking. At the moment there is no means to notify higher engine layers about missing replies from the host. Monitoring threads are not a good place to detect such things since they are too high in the call stack. If we want to notify the user about potential network issues prior to moving the host to NonResponding, it would mean turning this bug into an RFE and thinking about how to do it well so that we do not annoy users whose network is not stable. That would also mean we should not backport it to the stable branch. I would suggest increasing the heartbeat interval as Michal suggested and opening an RFE to work on alerting. Michal, please suggest a value based on your experience.
Based on a conversation with Martin, we are not going to change the interval. Instead, we will send heartbeats more frequently from vdsm, and for now we will log the lack of messages after half the interval time.
It also needs to be mentioned that this helps only in the case of network issues where the 1st heartbeat is lost: the engine will still wait for the 2nd heartbeat, and only if both heartbeats are lost will "heartbeat exceeded" be raised. It will not help in cases like BZ1488338, where VDSM is blocked and unable to send heartbeats at all.
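A minimal sketch of the engine-side behavior described in the last two comments, assuming vdsm now sends heartbeats at half the configured interval; this is illustrative only and not the actual vdsm-jsonrpc-java code. It warns once when nothing has arrived for half the interval and declares "heartbeat exceeded" only after the full interval, i.e. after both of the more frequent heartbeats were lost.

    public class HeartbeatTracker {
        // Engine-side timeout stays unchanged at 30s; vdsm now sends more often.
        private static final long HEARTBEAT_INTERVAL_MS = 30_000;
        private volatile long lastMessageTime = System.currentTimeMillis();
        private volatile boolean warned;

        // Called whenever any message (including a heartbeat) arrives from the host.
        public void onMessageReceived() {
            lastMessageTime = System.currentTimeMillis();
            warned = false;
        }

        // Called periodically by a monitoring timer.
        public void check() {
            long silence = System.currentTimeMillis() - lastMessageTime;
            if (silence > HEARTBEAT_INTERVAL_MS) {
                // Both heartbeats lost: the host is moved to NonResponsive.
                throw new IllegalStateException("Heartbeat exceeded");
            }
            if (silence > HEARTBEAT_INTERVAL_MS / 2 && !warned) {
                warned = true;
                // Log the lack of messages at half the interval, as planned above.
                System.out.println("WARN: no messages from host for " + silence + " ms");
            }
        }
    }

As noted, this pattern cannot help when VDSM itself is blocked and stops sending heartbeats entirely (the BZ1488338 case).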
Verified on vdsm-jsonrpc-java-1.4.11-1.el7ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1516