Bug 1150600 - Unexpected Migration of HostedEngine
Summary: Unexpected Migration of HostedEngine
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-hosted-engine-ha
Version: 3.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: 3.5.3
Assignee: Martin Sivák
QA Contact: Artyom
URL:
Whiteboard: sla
Duplicates: 1223555
Depends On:
Blocks: 1221053
Reported: 2014-10-08 13:44 UTC by Zordrak
Modified: 2016-02-10 19:42 UTC

Fixed In Version: ovirt-hosted-engine-ha-1.2.6-2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-06-15 08:39:32 UTC
oVirt Team: SLA
Embargoed:


Attachments
Source Host agent.log (277.99 KB, text/plain)
2014-10-08 13:44 UTC, Zordrak
Source Host broker.log (1.10 MB, text/plain)
2014-10-08 13:45 UTC, Zordrak


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 34152 | 0 | master | ABANDONED | respect the timeout when checking the engine health | Never
oVirt gerrit | 34857 | 0 | master | MERGED | respect the timeout in EngineUpBadHealth state (better fix) | Never
oVirt gerrit | 40820 | 0 | ovirt-hosted-engine-ha-1.2 | MERGED | respect the timeout in EngineUpBadHealth state (better fix) | Never

Description Zordrak 2014-10-08 13:44:01 UTC
Created attachment 945018
Source Host agent.log

HA HostedEngine unexpectedly migrated to other host in cluster.

Engine UI reports Engine as shutting down. No obvious problems with function. Engine restarts on alternate host. Appears to be HA score-based migration. No obvious reason for score dropping.

ovirt-hosted-engine-ha/agent.log attached

Comment 1 Zordrak 2014-10-08 13:45:27 UTC
Created attachment 945019
Source Host broker.log

Comment 2 Martin Sivák 2014-10-08 13:55:46 UTC
This is caused by the logic in the EngineBadHealth state. It starts a grace period timer. Unfortunately, it also sets the score to 0, and when the state machine sees other hosts with a higher score (and everything is higher than 0), it immediately moves to EngineStop in an attempt to migrate the VM to a "better" host.
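
To make the failure mode above concrete, here is a minimal Python sketch of the transition as described in this comment. It is an illustration only, not the ovirt-hosted-engine-ha source: the `Host` class, the `flawed_transition` function, the `EngineUp` label for the healthy state, and the score value 1000 are all hypothetical.

```python
# Illustrative sketch only; not the ovirt-hosted-engine-ha source code.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Host:
    name: str   # hypothetical host label
    score: int  # HA score the host would publish when healthy


def flawed_transition(local: Host, others: List[Host],
                      engine_healthy: bool) -> Tuple[str, int]:
    """Return (next_state, published_score) following the described bug."""
    if engine_healthy:
        # "EngineUp" is an assumed name for the healthy state.
        return "EngineUp", local.score
    # The grace timer has only just started, yet the score is already 0.
    published = 0
    # Any peer with a positive score now looks "better", so the timer
    # never gets a chance to run.
    if any(other.score > published for other in others):
        return "EngineStop", published
    return "EngineBadHealth", published


if __name__ == "__main__":
    local = Host("host-a", 1000)
    peer = Host("host-b", 1000)
    # A single failed liveness check is enough to stop the engine VM.
    print(flawed_transition(local, [peer], engine_healthy=False))
    # prints ('EngineStop', 0)
```

In this sketch the peer is no healthier than the local host, but because the local score is published as 0 straight away, the comparison always favours the peer.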

We got to EngineBadHealth because a webadmin liveness check failed. That could be caused by high CPU load or just by the engine taking its time to respond. This is the reason we have the grace timer, so we should give the engine time to recover.


The solution might be to split the state into two: report the real score while in the EngineDeterminingHealth state and move to EngineBadHealth (score: 0) only when the grace timer runs out. The rest will behave the same.
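
The proposed split could look roughly like the following Python sketch. It is a sketch under stated assumptions rather than the merged patch: `split_transition`, its parameters, and the `EngineUp` label are hypothetical, and the 300-second default only mirrors the 5 minutes mentioned in the verification comment on this bug.

```python
# Illustrative sketch only; not the merged ovirt-hosted-engine-ha change.
from typing import Iterable, Tuple

GRACE_PERIOD_SECONDS = 300.0  # assumed default, for illustration


def split_transition(real_score: int,
                     other_scores: Iterable[int],
                     engine_healthy: bool,
                     seconds_unhealthy: float,
                     grace: float = GRACE_PERIOD_SECONDS) -> Tuple[str, int]:
    """Return (next_state, published_score) with the state split in two."""
    if engine_healthy:
        # "EngineUp" is an assumed name for the healthy state.
        return "EngineUp", real_score
    if seconds_unhealthy < grace:
        # Still inside the grace period: keep publishing the real score,
        # so peers do not look better merely because a check failed.
        return "EngineDeterminingHealth", real_score
    # Grace period expired: from here on the behaviour is unchanged.
    published = 0
    if any(score > published for score in other_scores):
        return "EngineStop", published
    return "EngineBadHealth", published


if __name__ == "__main__":
    # 30 s after the first failed check, the real score is still reported.
    print(split_transition(1000, [1000], False, seconds_unhealthy=30.0))
    # Once the grace period has run out, the VM is stopped for a better host.
    print(split_transition(1000, [1000], False, seconds_unhealthy=300.0))
```

The design point is that the zero score becomes a consequence of the timer expiring rather than of the first failed check.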

Comment 3 Doron Fediuck 2014-10-14 12:22:01 UTC
If we got into EngineBadHealth it means that something went wrong with
the guest or host, and if we have a better host we should move there.
So the end result is not unexpected, but actually worked as needed.

What we need to improve here is the scoring to make sure the grace period
is honored.

Comment 4 Sandro Bonazzola 2015-01-21 16:13:44 UTC
oVirt 3.5.1 has been released, and since this bug is targeted to 3.5.1 and is in modified state, it should be included in this release.
Please re-target and move back to modified if this assumption is not valid for this bug.

Comment 5 Martin Sivák 2015-05-26 16:22:43 UTC
*** Bug 1223555 has been marked as a duplicate of this bug. ***

Comment 6 Artyom 2015-06-04 12:49:29 UTC
Verified on ovirt-hosted-engine-ha-1.2.6-2.el7ev.noarch
Now, if a host passes to state EngineUpBadHealth, the score is not dropped to zero, and we have 5 minutes until the HA agent moves the host to state EngineStop.

Comment 7 Sandro Bonazzola 2015-06-15 08:39:32 UTC
This is an automated message.
oVirt 3.5.3 has been released on June 15th 2015 and should include the fix for this BZ. Moving to closed current release.

