Bug 1504032

Summary: Engine-health monitor should expect new sanlock error message
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Andrej Krejcir <akrejcir>
Component: Broker
Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED CURRENTRELEASE
QA Contact: Nikolai Sednev <nsednev>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: ---
CC: akrejcir, alukiano, bugs
Target Milestone: ovirt-4.2.0
Keywords: Triaged, ZStream
Target Release: ---
Flags: rule-engine: ovirt-4.2+
  rule-engine: planning_ack+
  msivak: devel_ack+
  alukiano: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text: undefined
Story Points: ---
Clone Of:
: 1504150
Environment:
Last Closed: 2017-12-20 10:56:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1504150

Description Andrej Krejcir 2017-10-19 10:59:53 UTC
Description of problem:

VDSM now uses a newer version of libvirt, which reports a different error message when it fails to acquire the sanlock lease. The engine-health submonitor has to be changed to recognize this new error message.
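A minimal sketch of the kind of check this change implies, assuming the submonitor simply looks for known substrings in the VM start error: if the error matches either the old or the new libvirt message, the agent should treat it as "another host holds the lock" and move to EngineForceStop instead of EngineMaybeAway. The marker strings and the function names `lock_held_by_other_host` and `next_state` are hypothetical placeholders; the real libvirt message texts are not quoted in this report, and this is not the actual ovirt-hosted-engine-ha code.

```python
# Hypothetical placeholder substrings: the real libvirt/VDSM error texts are
# not quoted in this bug, so these stand in for the "old" and "new" messages.
LOCK_FAILURE_MARKERS = (
    "OLD_SANLOCK_ERROR_TEXT",  # hypothetical: message from the older libvirt
    "NEW_SANLOCK_ERROR_TEXT",  # hypothetical: message from the newer libvirt
)


def lock_held_by_other_host(vm_error):
    """Return True if a VM start error looks like a sanlock acquisition failure."""
    if not vm_error:
        return False
    # Accept either message format, so older and newer libvirt both match.
    return any(marker in vm_error for marker in LOCK_FAILURE_MARKERS)


def next_state(vm_error):
    """Pick the state the agent leaves EngineStarting for (simplified sketch)."""
    if lock_held_by_other_host(vm_error):
        # Another host holds the VM lease, so this host gives up its own start.
        return "EngineForceStop"
    # Unrecognized error: fall back to the state seen in the actual results.
    return "EngineMaybeAway"


if __name__ == "__main__":
    print(next_state("start failed: NEW_SANLOCK_ERROR_TEXT"))
```

Keeping the old marker alongside the new one lets the check keep working on hosts that still run the older libvirt.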

Version-Release number of selected component (if applicable):
VDSM: v4.19.35

How reproducible:
Always

Steps to Reproduce:
1. Deploy the hosted engine on 2 hosts
2. Set global maintenance mode and shut down the engine VM
3. Cancel global maintenance mode
4. Wait for the engine VM to start

Actual results:
On the host that does not run the VM, the agent moves from the EngineStarting state to EngineMaybeAway.

Expected results:
The agent moves from EngineStarting to EngineForceStop.
There is an INFO line in the agent log: "Another host already took over.."

Comment 2 Nikolai Sednev 2017-10-29 15:52:04 UTC
On the host that is not running the SHE-VM, I see a state transition EngineStarting => EngineDown.
I never see the state change to "EngineForceStop", though.
Andrej, please provide your input.

Comment 3 Nikolai Sednev 2017-10-29 15:55:30 UTC
I also see the following line in the logs on the host that reports "state=EngineDown":
MainThread::INFO::2017-10-29 17:42:31,827::states::771::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Another host already took over..

Comment 4 Andrej Krejcir 2017-10-30 11:26:16 UTC
Strange: in the code, there is no direct transition from the EngineStarting state to EngineDown.

If the agent log shows "Another host already took over..", then the right code path was chosen, so the patch works. After that line, there should be a debug line containing "Processing engine state EngineForceStop".

Comment 5 Nikolai Sednev 2017-10-30 12:23:42 UTC
I followed the reproduction steps from the description with ovirt-hosted-engine-setup-2.2.0-0.0.master.20171016160008.git55723e8.el7.centos.noarch, deployed over Gluster storage 3.12 on a pair of hosts, and saw the "EngineForceStop" state for a split second on the host that was not running (starting) the SHE-VM.

The VM had been running on host A. After global maintenance mode was cancelled, the VM was started on host B instead of host A, so I saw "EngineForceStop" on host A, which then changed to "EngineDown".

Moving to verified.

Comment 6 Sandro Bonazzola 2017-12-20 10:56:28 UTC
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in the oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.