Bug 1266099 - HA VMs are not restarted if hosted engine VM is on the same host and this host will crash
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 3.5.4
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3.3
Assigned To: Martin Perina
QA Contact: Artyom
Keywords: Triaged
Duplicates: 1303897
Depends On:
Blocks: Gluster-HC-1 RHEV_36_HTB
 
Reported: 2015-09-24 09:11 EDT by Martin Perina
Modified: 2016-05-23 07:03 EDT (History)
8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-11 02:23:15 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mperina: ovirt-3.6.z?
rule-engine: ovirt-4.0.0+
mgoldboi: blocker+
rule-engine: planning_ack+
oourfali: devel_ack+
mavital: testing_ack+




External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 53436 master MERGED core: Enable fencing of previous host which HE VM was running on 2016-02-17 08:52 EST
oVirt gerrit 53658 ovirt-engine-3.6 MERGED core: Enable fencing of previous host which HE VM was running on 2016-02-17 09:45 EST
oVirt gerrit 53669 ovirt-engine-3.6.3 MERGED core: Enable fencing of previous host which HE VM was running on 2016-02-18 02:38 EST

Description Martin Perina 2015-09-24 09:11:32 EDT
Description of problem:

After engine startup there is an interval during which fencing is
disabled. It is controlled by the DisableFenceAtStartupInSec option,
which defaults to 5 minutes (300 seconds). It can be changed using

   engine-config -s DisableFenceAtStartupInSec=<seconds>

but please do that with caution.

Why do we have such a timeout? It prevents a fencing storm, which could
happen during power issues affecting a whole DC: when both the engine and
the hosts are starting, large hosts may take a long time to come up and for
VDSM to start communicating with the engine. So usually the engine is up
first, and without this interval it would start fencing hosts which are
merely still booting.

Another thing: if we cannot properly fence a host, we cannot determine
whether there is merely a communication issue between the engine and the
host, so we cannot safely restart its HA VMs on another host. The only
thing we can do is offer the manual "Mark host as rebooted" option to the
administrator. If the administrator executes this option, we try to restart
the HA VMs on a different host ASAP, because the admin has taken
responsibility for validating that the VMs are really not running.
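In Python pseudocode (hypothetical names, not the engine's actual Java API), the manual-confirmation path described above amounts to:

```python
def confirm_host_has_been_rebooted(host, ha_vms, pick_target_host):
    """Sketch of the "Mark host as rebooted" path described above.

    The admin's confirmation stands in for a successful fence: it asserts
    the VMs are really not running on the failed host, so the HA VMs may
    be restarted elsewhere. All names here are hypothetical.
    """
    host["status"] = "Down"  # admin vouched that the host was rebooted
    restarted = []
    for vm in ha_vms:
        target = pick_target_host(vm)  # scheduler picks another eligible host
        if target is not None:
            restarted.append((vm, target))
    return restarted
```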


When the engine is started, the following fencing-related actions are taken:

1. Get the status of all hosts from the DB and schedule Non Responding
   Treatment to run after the DisableFenceAtStartupInSec timeout has passed

2. Try to communicate with all hosts and refresh their status


If some host becomes Non Responsive during the DisableFenceAtStartupInSec
interval, we skip fencing and the administrator will see a message in the
Events tab that the host is Non Responsive but fencing is disabled due to
the startup interval. So the administrator has to take care of such a host
manually.
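The behaviour above can be sketched as a simplified Python model (hypothetical names; the real logic lives in the engine's Java backend):

```python
STARTUP_FENCE_DISABLE_SEC = 300  # DisableFenceAtStartupInSec default (5 minutes)

def handle_non_responsive(host, seconds_since_engine_start,
                          disable_window=STARTUP_FENCE_DISABLE_SEC):
    """Decide what to do when a host turns Non Responsive after engine start.

    Inside the startup quiet period only an event is logged and the host is
    left to the administrator; afterwards Non Responding Treatment (fencing,
    then HA VM restart) proceeds normally.
    """
    if seconds_since_engine_start < disable_window:
        return f"event: host {host} Non Responsive, fencing disabled due to startup interval"
    return f"start Non Responding Treatment for {host} (fence, then restart HA VMs)"
```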


Now, what happened in your case:

 1. The hosted engine VM is running on host1 together with other VMs
 2. The status of host1 and host2 is Up
 3. You kill/shut down host1 -> the hosted engine VM is also shut down -> no
    engine is running to detect the issue with host1 and change its status
    to Non Responsive
 4. In the meantime the hosted engine VM is started on host2 -> it reads the
    host statuses from the DB, but all hosts are Up -> it tries to communicate
    with host1, but host1 is unreachable -> so it changes host1's status to
    Non Responsive and starts Non Responsive Treatment for host1 -> the Non
    Responsive Treatment is aborted because the engine is still within the
    DisableFenceAtStartupInSec interval
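The merged fix ("core: Enable fencing of previous host which HE VM was running on") addresses exactly this race by exempting the host the hosted engine VM previously ran on from the startup quiet period. A minimal sketch of that decision, with hypothetical names:

```python
from dataclasses import dataclass

STARTUP_FENCE_DISABLE_SEC = 300  # DisableFenceAtStartupInSec default

@dataclass
class Host:
    name: str
    previous_he_host: bool = False  # did the HE VM run here before failover?

def fencing_allowed(host: Host, seconds_since_engine_start: float) -> bool:
    """Sketch of the patched check: the startup quiet period no longer
    protects the host the HE VM was previously running on, so it can be
    fenced and its HA VMs restarted without waiting out the interval."""
    if seconds_since_engine_start >= STARTUP_FENCE_DISABLE_SEC:
        return True
    return host.previous_he_host
```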


Version-Release number of selected component (if applicable):

oVirt 3.5.3

How reproducible:

100%

Steps to Reproduce:
1. Install hosted engine into a 2-host environment (host1 and host2)
2. Run an HA VM and the hosted engine VM on host1
3. Crash host1

Actual results:

The hosted engine VM is restarted on host2, but the HA VMs from host1 are not, because host1 is not fenced

Expected results:

The hosted engine VM and the HA VMs are restarted on host2

Additional info:
Comment 1 Red Hat Bugzilla Rules Engine 2015-10-19 06:49:44 EDT
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Comment 2 Oved Ourfali 2016-02-03 04:36:58 EST
*** Bug 1303897 has been marked as a duplicate of this bug. ***
Comment 9 Martin Perina 2016-02-17 08:34:02 EST
We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to verify it until today :-(
Comment 10 Yaniv Lavi 2016-02-17 08:42:20 EST
(In reply to Martin Perina from comment #9)
> We didn't, we wanted to get this merged into 3.6.3, but I wasn't able to
> verify it until today :-(

We will have a build next week probably if this is indeed urgent.
Comment 11 Martin Perina 2016-02-17 08:47:20 EST
So changing back to 3.6.3
Comment 12 Martin Perina 2016-02-17 09:58:20 EST
Not yet in 3.6.3 branch
Comment 13 Artyom 2016-02-25 10:28:17 EST
Verified on rhevm-backend-3.6.3.2-0.1.el6.noarch

Scenario 1:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host with the HE VM
3) Power off the host (host does not have PM)
4) Wait until the HE VM starts on the second host; the first host drops to Non Responsive state and the HA VMs to Unknown state
5) Confirm host reboot on the first host
6) HA VMs start on the second host
PASS

Scenario 2:
==========
1) Deploy HE on two hosts
2) Start two HA VMs on the host with the HE VM
3) Power off the host (host has PM)
4) I see that the engine tries to fence the host, but it fails:
2016-02-25 15:33:34,353 ERROR [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (org.ovirt.thread.pool-6-thread-34) [] Failed to run Fence script on vds 'hosted_engine_2'.
2016-02-25 15:33:34,398 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-34) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine_2 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

I will verify this one, but I opened https://bugzilla.redhat.com/show_bug.cgi?id=1312039 for the PM issue.
