Bug 1030441

Summary: Handle crash of both ha services: agent and broker.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Leonid Natapov <lnatapov>
Component: ovirt-hosted-engine-ha
Assignee: Martin Sivák <msivak>
Status: CLOSED ERRATA
QA Contact: Artyom <alukiano>
Severity: high
Docs Contact:
Priority: unspecified
Version: unspecified
CC: alukiano, daniel.helgenberger, dfediuck, ebenahar, gpadgett, juwu, mavital, msivak, rgolan, sbonazzo, scohen, sherold
Target Milestone: ovirt-3.6.1
Keywords: Triaged
Target Release: 3.6.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
With this update, systemd is configured to restart the HA services (ovirt-ha-agent and ovirt-ha-broker) if they crash. The HA services are part of the high availability solution for the Manager virtual machine and must be highly available themselves.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-09 19:48:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Leonid Natapov 2013-11-14 13:03:34 UTC
Currently we don't really handle the situation where both services (agent and broker) crash. We should handle this scenario.

Comment 1 Doron Fediuck 2013-11-14 16:45:25 UTC
HA-wise, the VM will keep running, so it should be fine.
What we need is a way to improve on this and notify the user / admin.

Comment 2 Itamar Heim 2013-11-15 06:09:36 UTC
the services don't have/need a watchdog?

Comment 4 Greg Padgett 2013-12-03 15:30:04 UTC
(In reply to Itamar Heim from comment #2)
> the services don't have/need a watchdog?

Probably need one, and don't have one yet.  Using watchdog.d and/or systemd could fill in some gaps.  We'd then need notifications, which I think we can leverage the broker's notification system for (with some self-monitoring).

Comment 6 Martin Sivák 2015-10-06 11:20:16 UTC
This might be fixed by us using systemd now without requiring any code change.

Comment 7 Artyom 2015-10-11 12:52:14 UTC
The description of the bug is very informative, so how must we handle a crash of both services?

Checked on ovirt-hosted-engine-ha-1.3.0-1.el7ev.noarch
1) Finish the deployment of hosted-engine
2) Kill both services, ovirt-ha-agent and ovirt-ha-broker
3) Wait 5 minutes
4) The services are still down

Comment 8 Martin Sivák 2015-10-12 08:15:01 UTC
Artyom: How exactly did you kill those services?

Comment 9 Artyom 2015-10-12 08:39:16 UTC
kill -9 pid_of_ovirt-ha-broker pid_of_ovirt-ha-agent

Comment 10 Sandro Bonazzola 2015-10-26 12:44:20 UTC
This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015.
Please review this bug and if not a blocker, please postpone to a later release.
All bugs not postponed on GA release will be automatically re-targeted to

- 3.6.1 if severity >= high
- 4.0 if severity < high

Comment 11 Martin Sivák 2015-10-26 13:59:18 UTC
This was already merged. However, we might have a small issue on CentOS 7, where systemd v208 does not support on-abnormal. This will be remedied once CentOS 7.2 is released with a newer systemd.
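
The version dependency mentioned above can be sketched as a small shell helper. This is only an illustration, not the packaged unit file: the function names are hypothetical, the version threshold of 215 for on-abnormal is an assumption taken from the systemd release notes and should be checked against your distribution, and falling back to on-failure on older systemd is likewise an assumption rather than something stated in this bug.

```shell
#!/bin/sh
# Hypothetical helper: pick a Restart= value that the given systemd
# version understands. ASSUMPTION: on-abnormal is available from
# systemd 215 onwards; older versions (such as 208) get on-failure,
# which is an assumed fallback and not taken from this bug.
restart_policy_for_version() {
    if [ "$1" -ge 215 ]; then
        echo "on-abnormal"
    else
        echo "on-failure"
    fi
}

# Print a [Service] drop-in fragment for the given systemd version.
# Nothing is written to disk; the caller decides where it goes.
print_restart_dropin() {
    printf '[Service]\nRestart=%s\n' "$(restart_policy_for_version "$1")"
}
```

For example, `print_restart_dropin 208` prints a fragment with `Restart=on-failure`, while `print_restart_dropin 219` prints one with `Restart=on-abnormal`. The running version can be read from the first line of `systemctl --version`.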

Comment 12 Artyom 2015-11-05 11:17:55 UTC
I checked it on ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
The problem still exists.

Comment 13 Martin Sivák 2015-11-18 09:50:36 UTC
*** Bug 1275606 has been marked as a duplicate of this bug. ***

Comment 14 Artyom 2015-11-26 15:37:05 UTC
Verified on ovirt-hosted-engine-ha-1.3.3-1.el7ev.noarch
1) pkill -9 ovirt-ha-agent ovirt-ha-broker
2) Check the services after a minute: both services are up
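
The check in step 2 could be scripted roughly as follows. This is a minimal sketch, assuming the services are managed by systemd under the unit names used in this bug; the function names `services_active` and `wait_for_services` are hypothetical and are not part of ovirt-hosted-engine-ha.

```shell
#!/bin/sh
# Return 0 only if systemd reports every named service as active.
services_active() {
    for svc in "$@"; do
        systemctl is-active --quiet "$svc" || return 1
    done
    return 0
}

# Poll once per second until all services are active or the timeout
# (in seconds, first argument) expires. Returns 0 on success.
wait_for_services() {
    timeout=$1
    shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        services_active "$@" && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # One last check once the timeout has been reached.
    services_active "$@"
}
```

After killing both processes as in step 1, running `wait_for_services 60 ovirt-ha-agent ovirt-ha-broker && echo "both services up"` would report success only if systemd brought both services back within a minute.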

Comment 16 errata-xmlrpc 2016-03-09 19:48:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0422.html