Bug 1266099

Summary: HA VMs are not restarted if the hosted engine VM is on the same host and this host crashes
Product: [oVirt] ovirt-engine
Reporter: Martin Perina <mperina>
Component: Backend.Core
Assignee: Martin Perina <mperina>
Status: CLOSED CURRENTRELEASE
QA Contact: Artyom <alukiano>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.5.4
CC: bugs, gklein, mgoldboi, mtessun, oourfali, pkliczew, sabose, ykaul
Target Milestone: ovirt-3.6.3
Keywords: Triaged
Target Release: 3.6.3.3
Flags: mperina: ovirt-3.6.z?
       rule-engine: ovirt-4.0.0+
       mgoldboi: blocker+
       rule-engine: planning_ack+
       oourfali: devel_ack+
       mavital: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-11 07:23:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1258386, 1283458    

Description Martin Perina 2015-09-24 13:11:32 UTC
Description of problem:

After startup of the engine there's an interval during which fencing is
disabled. It's called DisableFenceAtStartupInSec and by default it's
set to 5 minutes. It can be changed using

   engine-config -s DisableFenceAtStartupInSec=<seconds>

but please do that with caution.
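
To make the gate concrete, here is a minimal Java sketch of how such a
startup interval check behaves (hypothetical class and method names, not
the actual engine code):

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical gate: fencing is suppressed until the configured
    // DisableFenceAtStartupInSec interval has elapsed since engine start.
    public class FenceStartupGate {
        private final Instant engineStartTime;
        private final Duration disableFenceAtStartup;

        public FenceStartupGate(Instant engineStartTime, long disableFenceAtStartupInSec) {
            this.engineStartTime = engineStartTime;
            this.disableFenceAtStartup = Duration.ofSeconds(disableFenceAtStartupInSec);
        }

        // Fencing is allowed only after the startup interval has passed.
        public boolean isFencingAllowed(Instant now) {
            return Duration.between(engineStartTime, now).compareTo(disableFenceAtStartup) >= 0;
        }

        public static void main(String[] args) {
            FenceStartupGate gate = new FenceStartupGate(Instant.now(), 300);
            System.out.println(gate.isFencingAllowed(Instant.now())); // false right after startup
        }
    }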

Why do we have such a timeout? It prevents a fencing storm, which
could happen during power issues in the whole DC: when both the engine and
the hosts are starting, it may take a long time for huge hosts to come up
and for VDSM to start communicating with the engine. So usually the engine
is started first, and without this interval it would start fencing hosts
which are just starting ...

Another thing: if we cannot properly fence the host, we cannot determine
whether there's just a communication issue between the engine and the host,
so we cannot restart HA VMs on another host. The only thing we can do is to
offer the manual "Mark host as rebooted" option to the administrator. If the
administrator executes this option, we try to restart the HA VMs on a
different host ASAP, because the admin has taken responsibility for
validating that the VMs are really not running.
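
The decision rule above can be summarized in a small sketch (again
hypothetical names, not the engine's real API):

    // Hypothetical policy: HA VMs may be restarted elsewhere only if the
    // host was fenced successfully or the admin confirmed the reboot.
    public class HaRestartPolicy {
        public enum Action { RESTART_HA_VMS_ELSEWHERE, WAIT_FOR_ADMIN }

        public Action decide(boolean fenceSucceeded, boolean adminConfirmedReboot) {
            if (fenceSucceeded || adminConfirmedReboot) {
                return Action.RESTART_HA_VMS_ELSEWHERE;
            }
            // Could be only a network issue between engine and host;
            // restarting the VMs elsewhere might end in a split brain.
            return Action.WAIT_FOR_ADMIN;
        }

        public static void main(String[] args) {
            HaRestartPolicy policy = new HaRestartPolicy();
            System.out.println(policy.decide(false, true)); // RESTART_HA_VMS_ELSEWHERE
        }
    }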


When the engine is started, the following fencing-related actions are taken:

1. Get the status of all hosts from the DB and schedule Non Responding
   Treatment after the DisableFenceAtStartupInSec timeout has passed

2. Try to communicate with all hosts and refresh their status


If some host becomes Non Responsive during the DisableFenceAtStartupInSec
interval, we skip fencing and the administrator will see a message in the
Events tab that the host is Non Responsive, but fencing is disabled due to
the startup interval. So the administrator has to take care of such a host
manually.
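
A rough sketch of this startup sequencing (hypothetical Java, just to make
the ordering concrete; the real flow lives in the engine backend):

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical startup flow: Non Responding Treatment is deferred until
    // DisableFenceAtStartupInSec has elapsed, while host statuses are
    // refreshed immediately.
    public class FencingStartupFlow {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void onEngineStart(List<String> hostsFromDb, long disableFenceAtStartupInSec) {
            // 1. Schedule Non Responding Treatment only after the interval.
            scheduler.schedule(
                    () -> hostsFromDb.forEach(this::runNonRespondingTreatment),
                    disableFenceAtStartupInSec, TimeUnit.SECONDS);

            // 2. Try to communicate with all hosts and refresh their status.
            hostsFromDb.forEach(this::refreshHostStatus);
        }

        private void runNonRespondingTreatment(String host) {
            System.out.println("non responding treatment for: " + host);
        }

        private void refreshHostStatus(String host) {
            System.out.println("refreshing status of: " + host);
        }
    }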


Now what happened in your case:

 1. Hosted engine VM is running on host1 with other VMs
 2. Status of host1 and host2 is Up
 3. You kill/shut down host1 -> the hosted engine VM is also shut down -> no
    engine is running to detect the issue with host1 and change its status to
    Non Responsive
 4. In the meantime the hosted engine VM is started on host2 -> it reads host
    statuses from the DB, but all hosts are Up -> it tries to communicate with
    host1, but it's unreachable -> so it changes host1's status to Non
    Responsive and starts Non Responsive Treatment for host1 -> the Non
    Responsive Treatment is aborted because the engine is still within the
    DisableFenceAtStartupInSec interval
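
The root cause can be illustrated with the same gate logic as above
(hypothetical code; 300 seconds is the default mentioned in this
description):

    import java.time.Duration;
    import java.time.Instant;

    // The freshly started engine on host2 measures the startup interval from
    // its *own* start time, so fencing of the genuinely dead host1 is
    // skipped and its HA VMs stay down.
    public class StartupFencingRace {
        public static void main(String[] args) {
            Duration disableFence = Duration.ofSeconds(300);        // default: 5 minutes
            Instant engineStart = Instant.now();                    // engine restarted on host2
            Instant host1Unreachable = engineStart.plusSeconds(60); // host1 detected down soon after

            boolean fencingAllowed = Duration.between(engineStart, host1Unreachable)
                    .compareTo(disableFence) >= 0;

            System.out.println("fencing allowed: " + fencingAllowed); // false -> treatment aborted
        }
    }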


Version-Release number of selected component (if applicable):

oVirt 3.5.3

How reproducible:

100%

Steps to Reproduce:
1. Install hosted engine into a 2-host environment (host1 and host2)
2. Run an HA VM and the hosted engine VM on host1
3. Crash host1

Actual results:

The hosted engine VM is restarted on host2, but the HA VMs from host1 are not, because host1 is not fenced

Expected results:

The hosted engine VM and the HA VMs are restarted on host2

Additional info:

Comment 1 Red Hat Bugzilla Rules Engine 2015-10-19 10:49:44 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 2 Oved Ourfali 2016-02-03 09:36:58 UTC
*** Bug 1303897 has been marked as a duplicate of this bug. ***

Comment 9 Martin Perina 2016-02-17 13:34:02 UTC
We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to verify it until today :-(

Comment 10 Yaniv Lavi 2016-02-17 13:42:20 UTC
(In reply to Martin Perina from comment #9)
> We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to
> verify it until today :-(

We will probably have a build next week if this is indeed urgent.

Comment 11 Martin Perina 2016-02-17 13:47:20 UTC
So changing back to 3.6.3

Comment 12 Martin Perina 2016-02-17 14:58:20 UTC
Not yet in 3.6.3 branch

Comment 13 Artyom 2016-02-25 15:28:17 UTC
Verified on rhevm-backend-3.6.3.2-0.1.el6.noarch

Scenario 1:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host with the HE VM
3) Power off the host (host does not have PM)
4) Wait until the HE VM starts on the second host; the first host drops to
   the Non Responding state and the HA VMs to the Unknown state
5) Confirm host reboot on the first host
6) HA VMs start on the second host
PASS

Scenario 2:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host with the HE VM
3) Power off the host (host has PM)
4) I see that the engine tries to fence the host, but it fails because:
2016-02-25 15:33:34,353 ERROR [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (org.ovirt.thread.pool-6-thread-34) [] Failed to run Fence script on vds 'hosted_engine_2'.
2016-02-25 15:33:34,398 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine_2 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

I will verify this one, but I opened https://bugzilla.redhat.com/show_bug.cgi?id=1312039 for the connected PM issue.