Red Hat Bugzilla – Bug 1030441
Handle crash of both ha services: agent and broker.
Last modified: 2016-03-09 14:48:13 EST
Currently we don't really handle the situation where both services (agent and broker) have crashed. We should handle such a scenario.
HA wise, the VM will keep running so it should be fine.
What we need is a way to improve it and notify the user / admin.
the services don't have/need a watchdog?
(In reply to Itamar Heim from comment #2)
> the services don't have/need a watchdog?
Probably need one, and don't have one yet. Using watchdog.d and/or systemd could fill in some gaps. We'd then need notifications, which I think we can leverage the broker's notification system for (with some self-monitoring).
This might be fixed by us using systemd now without requiring any code change.
The description of the bug is very informative, so how must we handle a crash of both services?
Checked on ovirt-hosted-engine-ha-1.3.0-1.el7ev.noarch
1) Finish deployment of hosted-engine
2) Kill both services, ovirt-ha-agent and ovirt-ha-broker
3) Wait 5 minutes
4) Services still down
Artyom: How exactly did you kill those services?
kill -9 pid_of_ovirt-ha-broker pid_of_ovirt-ha-agent
This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015.
Please review this bug and if not a blocker, please postpone to a later release.
All bugs not postponed on GA release will be automatically re-targeted to
- 3.6.1 if severity >= high
- 4.0 if severity < high
This was already merged; however, we might have a small issue on CentOS 7, where systemd v. 208 does not support on-abnormal. This will be remedied once CentOS 7.2 is released with a newer systemd.
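For illustration only, a restart policy of the kind mentioned here could be expressed as a systemd drop-in. The sketch below is not the merged change: the helper name write_restart_dropin, the file name restart.conf, the RestartSec value and the optional base-directory argument are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that enables automatic restart for a unit.
# Hypothetical helper; base_dir defaults to the standard drop-in location.
write_restart_dropin() {
    unit="$1"                                  # e.g. ovirt-ha-agent.service
    base_dir="${2:-/etc/systemd/system}"
    dropin_dir="$base_dir/$unit.d"
    mkdir -p "$dropin_dir" || return 1
    # Restart=on-abnormal is the value named in this comment; it needs a
    # systemd newer than the v. 208 shipped with early CentOS 7.
    cat > "$dropin_dir/restart.conf" <<'EOF'
[Service]
Restart=on-abnormal
RestartSec=5
EOF
}

# Usage (as root), followed by a reload so systemd picks up the change:
#   write_restart_dropin ovirt-ha-agent.service
#   write_restart_dropin ovirt-ha-broker.service
#   systemctl daemon-reload
```

Writing a drop-in rather than editing the packaged unit file keeps the override separate from files owned by the package.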
I checked it on ovirt-hosted-engine-ha-188.8.131.52-1.el7ev.noarch
Problem still exists
*** Bug 1275606 has been marked as a duplicate of this bug. ***
Verified on ovirt-hosted-engine-ha-1.3.3-1.el7ev.noarch
1) pkill -9 ovirt-ha-agent ovirt-ha-broker
2) Check the services after a minute: both services are up
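The check in step 2 can be scripted roughly as follows. This is a minimal sketch, assuming the daemons run as the systemd units ovirt-ha-agent.service and ovirt-ha-broker.service; the function name wait_until_active and the SYSTEMCTL override variable are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch: poll until a unit reports active, or give up after a timeout.
# SYSTEMCTL is a hypothetical override hook so the command can be swapped out.
wait_until_active() {
    unit="$1"
    timeout="${2:-60}"                 # seconds to wait before giving up
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # "is-active --quiet" exits 0 only when the unit is active
        if "${SYSTEMCTL:-systemctl}" is-active --quiet "$unit"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Usage after killing both daemons as in step 1:
#   wait_until_active ovirt-ha-agent.service 60 && \
#   wait_until_active ovirt-ha-broker.service 60 && echo "both services up"
```

Polling with a timeout avoids relying on a fixed sleep, since the time systemd takes to restart a killed service depends on the unit's restart settings.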
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.