Bug 1266099

Summary: HA VMs are not restarted if the hosted engine VM is on the same host and this host crashes
Product: [oVirt] ovirt-engine
Reporter: Martin Perina <mperina>
Component: Backend.Core
Assignee: Martin Perina <mperina>
Status: CLOSED CURRENTRELEASE
QA Contact: Artyom <alukiano>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.5.4
CC: bugs, gklein, mgoldboi, mtessun, oourfali, pkliczew, sabose, ykaul
Target Milestone: ovirt-3.6.3
Keywords: Triaged
Target Release: 3.6.3.3
Flags: mperina: ovirt-3.6.z?
       rule-engine: ovirt-4.0.0+
       mgoldboi: blocker+
       rule-engine: planning_ack+
       oourfali: devel_ack+
       mavital: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-11 07:23:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1258386, 1283458    

Description Martin Perina 2015-09-24 13:11:32 UTC
Description of problem:

After startup of the engine there's an interval during which fencing is
disabled. It's called DisableFenceAtStartupInSec and by default it's
set to 5 minutes. It can be changed using

   engine-config -s DisableFenceAtStartupInSec=<seconds>

but please do that with caution.
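
To make the gate concrete, here is a minimal Java sketch of how such a
startup interval check behaves (hypothetical class and method names, not
the actual engine code):

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical gate: fencing is suppressed until the configured
    // DisableFenceAtStartupInSec interval has elapsed since engine start.
    public class FenceStartupGate {
        private final Instant engineStartTime;
        private final Duration disableFenceAtStartup;

        public FenceStartupGate(Instant engineStartTime, long disableFenceAtStartupInSec) {
            this.engineStartTime = engineStartTime;
            this.disableFenceAtStartup = Duration.ofSeconds(disableFenceAtStartupInSec);
        }

        // Fencing is allowed only after the startup interval has passed.
        public boolean isFencingAllowed(Instant now) {
            return Duration.between(engineStartTime, now).compareTo(disableFenceAtStartup) >= 0;
        }

        public static void main(String[] args) {
            FenceStartupGate gate = new FenceStartupGate(Instant.now(), 300);
            System.out.println(gate.isFencingAllowed(Instant.now())); // false right after startup
        }
    }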

Why do we have such a timeout? It prevents a fencing storm, which
could happen during power issues in the whole DC: when both the engine and
the hosts are starting, it may take a long time for huge hosts to come up
and for VDSM to start communicating with the engine. So usually the engine
is started first, and without this interval it would start fencing hosts
which are just starting ...

Another thing: if we cannot properly fence the host, we cannot determine
whether there's just a communication issue between the engine and the host,
so we cannot restart HA VMs on another host. The only thing we can do is to
offer the manual "Mark host as rebooted" option to the administrator. If the
administrator executes this option, we try to restart the HA VMs on a
different host ASAP, because the admin has taken responsibility for
validating that the VMs are really not running.
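
The decision rule above can be summarized in a small sketch (again
hypothetical names, not the engine's real API):

    // Hypothetical policy: HA VMs may be restarted elsewhere only if the
    // host was fenced successfully or the admin confirmed the reboot.
    public class HaRestartPolicy {
        public enum Action { RESTART_HA_VMS_ELSEWHERE, WAIT_FOR_ADMIN }

        public Action decide(boolean fenceSucceeded, boolean adminConfirmedReboot) {
            if (fenceSucceeded || adminConfirmedReboot) {
                return Action.RESTART_HA_VMS_ELSEWHERE;
            }
            // Could be only a network issue between engine and host;
            // restarting the VMs elsewhere might end in a split brain.
            return Action.WAIT_FOR_ADMIN;
        }

        public static void main(String[] args) {
            HaRestartPolicy policy = new HaRestartPolicy();
            System.out.println(policy.decide(false, true)); // RESTART_HA_VMS_ELSEWHERE
        }
    }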


When the engine is started, the following fencing-related actions are taken:

1. Get the status of all hosts from the DB and schedule Non Responding
   Treatment after the DisableFenceAtStartupInSec timeout has passed

2. Try to communicate with all hosts and refresh their status


If some host becomes Non Responsive during the DisableFenceAtStartupInSec
interval, we skip fencing and the administrator will see a message in the
Events tab that the host is Non Responsive, but fencing is disabled due to
the startup interval. So the administrator has to take care of such a host
manually.
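
A rough sketch of this startup sequencing (hypothetical Java, just to make
the ordering concrete; the real flow lives in the engine backend):

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical startup flow: Non Responding Treatment is deferred until
    // DisableFenceAtStartupInSec has elapsed, while host statuses are
    // refreshed immediately.
    public class FencingStartupFlow {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void onEngineStart(List<String> hostsFromDb, long disableFenceAtStartupInSec) {
            // 1. Schedule Non Responding Treatment only after the interval.
            scheduler.schedule(
                    () -> hostsFromDb.forEach(this::runNonRespondingTreatment),
                    disableFenceAtStartupInSec, TimeUnit.SECONDS);

            // 2. Try to communicate with all hosts and refresh their status.
            hostsFromDb.forEach(this::refreshHostStatus);
        }

        private void runNonRespondingTreatment(String host) {
            System.out.println("non responding treatment for: " + host);
        }

        private void refreshHostStatus(String host) {
            System.out.println("refreshing status of: " + host);
        }
    }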


Now what happened in your case:

 1. Hosted engine VM is running on host1 with other VMs
 2. Status of host1 and host2 is Up
 3. You kill/shut down host1 -> the hosted engine VM is also shut down -> no
    engine is running to detect the issue with host1 and change its status to
    Non Responsive
 4. In the meantime the hosted engine VM is started on host2 -> it reads host
    statuses from the DB, but all hosts are Up -> it tries to communicate with
    host1, but it's unreachable -> so it changes host1's status to Non
    Responsive and starts Non Responsive Treatment for host1 -> the Non
    Responsive Treatment is aborted because the engine is still within the
    DisableFenceAtStartupInSec interval
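
The root cause can be illustrated with the same gate logic as above
(hypothetical code; 300 seconds is the default mentioned in this
description):

    import java.time.Duration;
    import java.time.Instant;

    // The freshly started engine on host2 measures the startup interval from
    // its *own* start time, so fencing of the genuinely dead host1 is
    // skipped and its HA VMs stay down.
    public class StartupFencingRace {
        public static void main(String[] args) {
            Duration disableFence = Duration.ofSeconds(300);        // default: 5 minutes
            Instant engineStart = Instant.now();                    // engine restarted on host2
            Instant host1Unreachable = engineStart.plusSeconds(60); // host1 detected down soon after

            boolean fencingAllowed = Duration.between(engineStart, host1Unreachable)
                    .compareTo(disableFence) >= 0;

            System.out.println("fencing allowed: " + fencingAllowed); // false -> treatment aborted
        }
    }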


Version-Release number of selected component (if applicable):

oVirt 3.5.3

How reproducible:

100%

Steps to Reproduce:
1. Install hosted engine into a 2-host environment (host1 and host2)
2. Run an HA VM and the hosted engine VM on host1
3. Crash host1

Actual results:

The hosted engine VM is restarted on host2, but the HA VMs from host1 are not, because host1 is not fenced

Expected results:

The hosted engine VM and the HA VMs are restarted on host2

Additional info:

Comment 1 Red Hat Bugzilla Rules Engine 2015-10-19 10:49:44 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 2 Oved Ourfali 2016-02-03 09:36:58 UTC
*** Bug 1303897 has been marked as a duplicate of this bug. ***

Comment 9 Martin Perina 2016-02-17 13:34:02 UTC
We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to verify it until today :-(

Comment 10 Yaniv Lavi 2016-02-17 13:42:20 UTC
(In reply to Martin Perina from comment #9)
> We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to
> verify it until today :-(

We will probably have a build next week if this is indeed urgent.

Comment 11 Martin Perina 2016-02-17 13:47:20 UTC
So changing back to 3.6.3

Comment 12 Martin Perina 2016-02-17 14:58:20 UTC
Not yet in 3.6.3 branch

Comment 13 Artyom 2016-02-25 15:28:17 UTC
Verified on rhevm-backend-3.6.3.2-0.1.el6.noarch

Scenario 1:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host with the HE VM
3) Power off the host (host does not have PM)
4) Wait until the HE VM starts on the second host; the first host drops to
   the Non Responding state and the HA VMs to the Unknown state
5) Confirm host reboot on the first host
6) HA VMs start on the second host
PASS

Scenario 2:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host with the HE VM
3) Power off the host (host has PM)
4) I see that the engine tries to fence the host, but it fails because:
2016-02-25 15:33:34,353 ERROR [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (org.ovirt.thread.pool-6-thread-34) [] Failed to run Fence script on vds 'hosted_engine_2'.
2016-02-25 15:33:34,398 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine_2 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

I will verify this one, but I opened https://bugzilla.redhat.com/show_bug.cgi?id=1312039 for the connected PM issue.