Bug 1266099 - HA VMs are not restarted if hosted engine VM is on the same host and this host will crash
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 3.5.4
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3.3
Assigned To: Martin Perina
QA Contact: Artyom
Keywords: Triaged
Duplicates: 1303897
Depends On:
Blocks: Gluster-HC-1 RHEV_36_HTB
 
Reported: 2015-09-24 09:11 EDT by Martin Perina
Modified: 2016-05-23 07:03 EDT (History)
8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-11 02:23:15 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mperina: ovirt-3.6.z?
rule-engine: ovirt-4.0.0+
mgoldboi: blocker+
rule-engine: planning_ack+
oourfali: devel_ack+
mavital: testing_ack+




External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 53436 master MERGED core: Enable fencing of previous host which HE VM was running on 2016-02-17 08:52 EST
oVirt gerrit 53658 ovirt-engine-3.6 MERGED core: Enable fencing of previous host which HE VM was running on 2016-02-17 09:45 EST
oVirt gerrit 53669 ovirt-engine-3.6.3 MERGED core: Enable fencing of previous host which HE VM was running on 2016-02-18 02:38 EST

Description Martin Perina 2015-09-24 09:11:32 EDT
Description of problem:

After engine startup there is an interval during which fencing is
disabled. It is controlled by the DisableFenceAtStartupInSec option,
which defaults to 5 minutes (300 seconds). It can be changed using

   engine-config -s DisableFenceAtStartupInSec=<seconds>

but please do that with caution.

Why do we have such a timeout? It prevents a fencing storm, which could
happen during power issues affecting a whole DC: when both the engine and
the hosts are starting, large hosts may take a long time to come up and for
VDSM to start communicating with the engine. So usually the engine is up
first, and without this interval it would start fencing hosts which are
merely still booting.

Another thing: if we cannot properly fence a host, we cannot determine
whether there is merely a communication issue between the engine and the
host, so we cannot safely restart its HA VMs on another host. The only
thing we can do is offer the manual "Mark host as rebooted" option to the
administrator. If the administrator executes this option, we try to restart
the HA VMs on a different host ASAP, because the admin has taken
responsibility for validating that the VMs are really not running.
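In Python pseudocode (hypothetical names, not the engine's actual Java API), the manual-confirmation path described above amounts to:

```python
def confirm_host_has_been_rebooted(host, ha_vms, pick_target_host):
    """Sketch of the "Mark host as rebooted" path described above.

    The admin's confirmation stands in for a successful fence: it asserts
    the VMs are really not running on the failed host, so the HA VMs may
    be restarted elsewhere. All names here are hypothetical.
    """
    host["status"] = "Down"  # admin vouched that the host was rebooted
    restarted = []
    for vm in ha_vms:
        target = pick_target_host(vm)  # scheduler picks another eligible host
        if target is not None:
            restarted.append((vm, target))
    return restarted
```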


When the engine is started, the following fencing-related actions are taken:

1. Get the status of all hosts from the DB and schedule Non Responding
   Treatment to run after the DisableFenceAtStartupInSec timeout has passed

2. Try to communicate with all hosts and refresh their status


If some host becomes Non Responsive during the DisableFenceAtStartupInSec
interval, we skip fencing and the administrator will see a message in the
Events tab that the host is Non Responsive but fencing is disabled due to
the startup interval. So the administrator has to take care of such a host
manually.
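The behaviour above can be sketched as a simplified Python model (hypothetical names; the real logic lives in the engine's Java backend):

```python
STARTUP_FENCE_DISABLE_SEC = 300  # DisableFenceAtStartupInSec default (5 minutes)

def handle_non_responsive(host, seconds_since_engine_start,
                          disable_window=STARTUP_FENCE_DISABLE_SEC):
    """Decide what to do when a host turns Non Responsive after engine start.

    Inside the startup quiet period only an event is logged and the host is
    left to the administrator; afterwards Non Responding Treatment (fencing,
    then HA VM restart) proceeds normally.
    """
    if seconds_since_engine_start < disable_window:
        return f"event: host {host} Non Responsive, fencing disabled due to startup interval"
    return f"start Non Responding Treatment for {host} (fence, then restart HA VMs)"
```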


Now, what happened in your case:

 1. The hosted engine VM is running on host1 together with other VMs
 2. The status of host1 and host2 is Up
 3. You kill/shut down host1 -> the hosted engine VM is also shut down -> no
    engine is running to detect the issue with host1 and change its status
    to Non Responsive
 4. In the meantime the hosted engine VM is started on host2 -> it reads the
    host statuses from the DB, but all hosts are Up -> it tries to communicate
    with host1, but host1 is unreachable -> so it changes host1's status to
    Non Responsive and starts Non Responsive Treatment for host1 -> the Non
    Responsive Treatment is aborted because the engine is still within the
    DisableFenceAtStartupInSec interval
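The merged fix ("core: Enable fencing of previous host which HE VM was running on") addresses exactly this race by exempting the host the hosted engine VM previously ran on from the startup quiet period. A minimal sketch of that decision, with hypothetical names:

```python
from dataclasses import dataclass

STARTUP_FENCE_DISABLE_SEC = 300  # DisableFenceAtStartupInSec default

@dataclass
class Host:
    name: str
    previous_he_host: bool = False  # did the HE VM run here before failover?

def fencing_allowed(host: Host, seconds_since_engine_start: float) -> bool:
    """Sketch of the patched check: the startup quiet period no longer
    protects the host the HE VM was previously running on, so it can be
    fenced and its HA VMs restarted without waiting out the interval."""
    if seconds_since_engine_start >= STARTUP_FENCE_DISABLE_SEC:
        return True
    return host.previous_he_host
```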


Version-Release number of selected component (if applicable):

oVirt 3.5.3

How reproducible:

100%

Steps to Reproduce:
1. Install hosted engine into a 2-host environment (host1 and host2)
2. Run an HA VM and the hosted engine VM on host1
3. Crash host1

Actual results:

The hosted engine VM is restarted on host2, but the HA VMs from host1 are not, because host1 is not fenced

Expected results:

The hosted engine VM and the HA VMs are restarted on host2

Additional info:
Comment 1 Red Hat Bugzilla Rules Engine 2015-10-19 06:49:44 EDT
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Comment 2 Oved Ourfali 2016-02-03 04:36:58 EST
*** Bug 1303897 has been marked as a duplicate of this bug. ***
Comment 9 Martin Perina 2016-02-17 08:34:02 EST
We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to verify it until today :-(
Comment 10 Yaniv Lavi 2016-02-17 08:42:20 EST
(In reply to Martin Perina from comment #9)
> We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to
> verify it until today :-(

We will have a build next week probably if this is indeed urgent.
Comment 11 Martin Perina 2016-02-17 08:47:20 EST
So changing back to 3.6.3
Comment 12 Martin Perina 2016-02-17 09:58:20 EST
Not yet in 3.6.3 branch
Comment 13 Artyom 2016-02-25 10:28:17 EST
Verified on rhevm-backend-3.6.3.2-0.1.el6.noarch

Scenario 1:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host with the HE VM
3) Power off the host (host does not have PM)
4) Wait until the HE VM starts on the second host; the first host drops to Non Responsive state and the HA VMs to Unknown state
5) Confirm host reboot on the first host
6) HA VMs start on the second host
PASS

Scenario 2:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host with the HE VM
3) Power off the host (host has PM)
4) I see that the engine tries to fence the host, but it fails:
2016-02-25 15:33:34,353 ERROR [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (org.ovirt.thread.pool-6-thread-34) [] Failed to run Fence script on vds 'hosted_engine_2'.
2016-02-25 15:33:34,398 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine_2 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

I will verify this one, but I opened https://bugzilla.redhat.com/show_bug.cgi?id=1312039 for the PM issue.
