Created attachment 945018: Source Host agent.log. The HA HostedEngine VM unexpectedly migrated to another host in the cluster. The Engine UI reported the Engine as shutting down. No obvious functional problems were observed. The Engine restarted on the alternate host. This appears to be an HA score-based migration, but there is no obvious reason for the score to drop. ovirt-hosted-engine-ha/agent.log is attached.
Created attachment 945019: Source Host broker.log
This is caused by the logic in the EngineBadHealth state. It starts a grace period timer, but it also sets the score to 0. When the state machine then sees other hosts with a higher score (and any nonzero score is higher than 0), it immediately moves to EngineStop in an attempt to migrate the VM to a "better" host. We got to EngineBadHealth because a webadmin liveness check failed. That could be caused by high CPU load or just by the engine taking its time to respond. This is exactly why we have the grace timer, so we should give the engine time to recover. The solution might be to split the state into two: report the real score while in an EngineDeterminingHealth state, and move to EngineBadHealth (score: 0) only when the grace timer runs out. The rest will behave the same.
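The proposed split can be illustrated with a minimal sketch. This is not the ovirt-hosted-engine-ha code: the class names, the `step` method, and the `HostView` fields are hypothetical, and the 300-second grace period is an assumption taken from the 5 minutes mentioned in the verification comment. Only the state names and the "score 0 only after the grace timer expires" rule come from this report.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed value; the verification comment mentions a 5 minute window.
GRACE_PERIOD_SECONDS = 300


@dataclass
class HostView:
    """Hypothetical snapshot of what the agent sees in one monitoring cycle."""
    local_score: int        # score computed from this host's health
    best_remote_score: int  # highest score published by any other host
    engine_healthy: bool    # result of the engine liveness check


class SketchAgent:
    """Hypothetical state machine illustrating the two-state split."""

    def __init__(self) -> None:
        self.state = "EngineUp"
        self.bad_since: Optional[float] = None

    def published_score(self, view: HostView) -> int:
        # The score is zeroed only in EngineBadHealth, i.e. after the
        # grace period has run out. While determining health, the real
        # score is still reported.
        return 0 if self.state == "EngineBadHealth" else view.local_score

    def step(self, view: HostView, now: float) -> str:
        if self.state == "EngineUp":
            if not view.engine_healthy:
                # Start the grace timer without touching the score.
                self.state = "EngineDeterminingHealth"
                self.bad_since = now
        elif self.state == "EngineDeterminingHealth":
            if view.engine_healthy:
                # The engine recovered within the grace period.
                self.state = "EngineUp"
                self.bad_since = None
            elif now - self.bad_since >= GRACE_PERIOD_SECONDS:
                self.state = "EngineBadHealth"
        elif self.state == "EngineBadHealth":
            # Same rule as before: stop the VM if another host scores higher.
            if view.best_remote_score > self.published_score(view):
                self.state = "EngineStop"
        return self.state
```

With this shape, a single failed liveness check moves the host to EngineDeterminingHealth while it keeps publishing its real score, so other hosts do not look "better" until the grace timer has actually expired. Other transitions of the real agent are deliberately left out of the sketch.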
If we got into EngineBadHealth, it means that something went wrong with the guest or the host, and if we have a better host we should move there. So the end result is not unexpected; the migration actually worked as intended. What we need to improve here is the scoring, to make sure the grace period is honored.
oVirt 3.5.1 has been released, and since this bug is targeted to 3.5.1 and is in modified state, it should be included in this release. Please re-target and move back to modified if this assumption is not valid for this bug.
*** Bug 1223555 has been marked as a duplicate of this bug. ***
Verified on ovirt-hosted-engine-ha-1.2.6-2.el7ev.noarch. Now, when a host enters state=EngineUpBadHealth, its score is not dropped to zero, and there are 5 minutes before the HA agent moves the host to the EngineStop state.
This is an automated message. oVirt 3.5.3 has been released on June 15th 2015 and should include the fix for this BZ. Moving to closed current release.