Bug 1384197

Summary: Frequent heartbeat exceeded timeouts and hosts going non-responsive
Product: Red Hat Enterprise Virtualization Manager
Reporter: Gordon Watson <gwatson>
Component: ovirt-engine
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED NOTABUG
QA Contact: meital avital <mavital>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.0.3
CC: gklein, lsurette, mperina, pkliczew, rbalakri, Rhev-m-bugs, srevivo, ykaul
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-18 09:16:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Gordon Watson 2016-10-12 19:16:38 UTC
Description of problem:

The customer has three hosts in the only cluster in this RHEV 4.0 environment. From time to time, all three hosts encounter "Heartbeat exeeded" timeouts and, during those periods, sometimes go non-responsive.

They have 'vdsHeartbeatInSeconds' currently set to 20.

The customer has disabled Power Management to prevent hosts from getting fenced.

RHEV-M runs inside a VM within a VMware environment.


Version-Release number of selected component (if applicable):

RHEV-M 4.0.3
RHVH-4.0 with vdsm-4.18.11-1.el7


How reproducible:

Not reproducible on demand. It occurs at random.
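
Because the timeouts appear at random, one way to look for a pattern is to count them per hour in the engine log. The following is a minimal sketch, not part of the original report. It assumes each log line starts with a timestamp of the form "YYYY-MM-DD HH:MM:SS" and contains the message text exactly as quoted above; the function name is hypothetical.

```python
import re
from collections import Counter

# Assumption: every relevant log line begins with "YYYY-MM-DD HH:MM:SS".
# Only the date and hour are captured, so counts are grouped per hour.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}")

# The message is matched verbatim, spelled as it is quoted in this report.
MARKER = "Heartbeat exeeded"


def heartbeat_timeouts_per_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH' to the number of timeout lines."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:  # lines without a leading timestamp are skipped
            counts[match.group(1)] += 1
    return counts
```

Passing an open file handle for the engine log (typically /var/log/ovirt-engine/engine.log, also an assumption here) and printing the sorted items gives one row per hour, which can then be compared against load or backup windows on the hosting environment.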


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Details to follow.