Bug 1794089 - Ha-broker fails to recognize when HE VM's storage is already locked - QEMU error message has changed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: 2.3.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-4.4.0
Target Release: 2.4.2
Assignee: Andrej Krejcir
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-22 15:59 UTC by Andrej Krejcir
Modified: 2020-05-20 20:04 UTC (History)
2 users (show)

Fixed In Version: ovirt-hosted-engine-ha-2.4.2
Clone Of:
Environment:
Last Closed: 2020-05-20 20:04:23 UTC
oVirt Team: Integration
Embargoed:
sbonazzo: ovirt-4.4?
mtessun: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 106435 0 master MERGED agent: Recognize new error message from QEMU 2020-10-15 06:31:25 UTC

Description Andrej Krejcir 2020-01-22 15:59:20 UTC
Description of problem:
When the HE VM fails to start on a host because it is already running elsewhere, QEMU returns a specific error message.
The ha-broker compares this error message against a few hard-coded strings to determine whether the VM failed to start because its storage is locked or for some other reason.

Since qemu-kvm-ev version 2.12.0-33.1, the error message has changed, so the ha-broker no longer recognizes this failure case.
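A minimal sketch of the substring-matching approach described above, and how matching on stable fragments makes it tolerant of wording changes. The pattern strings and function name below are illustrative assumptions, not the exact strings or API used in the ovirt-hosted-engine-ha source:

```python
# Illustrative sketch: detect "storage already locked" from a VM exit message.
# Matching on short, stable fragments (rather than one full hard-coded
# message) tolerates the kind of wording change introduced in
# qemu-kvm-ev 2.12.0-33.1. Patterns here are examples, not the agent's
# actual strings.
STORAGE_LOCKED_PATTERNS = (
    'Failed to acquire lock',
    'Is another process using the image',
)

def is_storage_locked(exit_message):
    """Return True if the exit message indicates the HE VM's storage
    is already locked, i.e. the VM is running on another host."""
    return any(p in exit_message for p in STORAGE_LOCKED_PATTERNS)
```

With this shape, the new QEMU message still matches as long as it keeps the "another process using the image" fragment, which is the essence of the merged fix.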


Version-Release number of selected component (if applicable):
qemu-kvm-ev   10:2.12.0-33.1.el7_7.4 
ovirt-hosted-engine-ha   2.3.6


Steps to Reproduce:
1. Have HE deployed on 2 hosts.
2. Set global maintenance mode.
3. Shut down the HE VM.
4. Once the VM is down, clear global maintenance mode.
5. Wait for the VM to start on one of the hosts.
6. Check the agent.log on the other one.

Actual results:
The agent misinterprets the reason why the VM failed to start and moves to EngineUnexpectedlyDown state.

Expected results:
The agent correctly detects that VM is running elsewhere, and moves to EngineForceStop state.
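The actual-vs-expected behavior above can be sketched as a small state decision. The function and state names mirror the states named in this report, but the signature is hypothetical, not the real ovirt_hosted_engine_ha.agent.states API:

```python
# Hypothetical sketch of the decision this bug is about: when a VM start
# fails, the agent should go to EngineForceStop only if the failure was
# "storage locked" (VM running elsewhere); otherwise it assumes the engine
# went down unexpectedly. The bug: an unrecognized QEMU message made
# storage_locked False, so the agent took the wrong branch.
def next_state(vm_start_failed, storage_locked):
    if not vm_start_failed:
        return 'EngineStarting'
    if storage_locked:
        return 'EngineForceStop'      # expected: VM is running elsewhere
    return 'EngineUnexpectedlyDown'   # actual (buggy) outcome
```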

Comment 1 Nikolai Sednev 2020-04-06 19:10:07 UTC
The engine started without any issues while I followed the exact reproduction steps from https://bugzilla.redhat.com/show_bug.cgi?id=1794089#c0.
The engine initially ran on puma18; after the reproduction steps it started on the secondary ha-host, puma19.

MainThread::INFO::2020-04-06 21:57:47,687::states::736::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2020-04-06 21:57:47,825::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2020-04-06 21:57:57,853::states::736::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2020-04-06 21:57:58,002::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2020-04-06 21:58:08,031::states::736::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2020-04-06 21:58:08,147::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2020-04-06 21:58:18,180::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
MainThread::INFO::2020-04-06 21:58:18,218::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? ignored
MainThread::INFO::2020-04-06 21:58:18,331::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2020-04-06 21:58:27,356::states::401::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm running on localhost
MainThread::INFO::2020-04-06 21:58:27,474::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)




[root@puma18 ~]# hosted-engine --vm-status


!! Cluster is in GLOBAL MAINTENANCE mode !!



--== Host puma18.scl.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 4342
Score                              : 3400
Engine status                      : {"vm": "up", "health": "good", "detail": "Up"}
Hostname                           : puma18.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : bbf43f4c
conf_on_shared_storage             : True
local_conf_timestamp               : 4342
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=4342 (Mon Apr  6 21:55:06 2020)
        host-id=1
        score=3400
        vm_conf_refresh_time=4342 (Mon Apr  6 21:55:06 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host puma19.scl.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 11337
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : puma19.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 8909896b
conf_on_shared_storage             : True
local_conf_timestamp               : 11337
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=11337 (Mon Apr  6 21:54:57 2020)
        host-id=2
        score=3400
        vm_conf_refresh_time=11337 (Mon Apr  6 21:54:57 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False


!! Cluster is in GLOBAL MAINTENANCE mode !!
[root@puma18 ~]# hosted-engine --vm-shutdown


--== Host puma18.scl.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 4362
Score                              : 3400
Engine status                      : {"vm": "up", "health": "bad", "detail": "Powering down", "reason": "failed liveliness check"}
Hostname                           : puma18.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 02c3a731
conf_on_shared_storage             : True
local_conf_timestamp               : 4362
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=4362 (Mon Apr  6 21:55:27 2020)
        host-id=1
        score=3400
        vm_conf_refresh_time=4362 (Mon Apr  6 21:55:27 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False


--== Host puma19.scl.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 11357
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : puma19.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : e401dbf5
conf_on_shared_storage             : True
local_conf_timestamp               : 11357
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=11357 (Mon Apr  6 21:55:17 2020)
        host-id=2
        score=3400
        vm_conf_refresh_time=11357 (Mon Apr  6 21:55:17 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False


!! Cluster is in GLOBAL MAINTENANCE mode !!

[root@puma18 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

[root@puma18 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

[root@puma18 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

[root@puma18 ~]# hosted-engine --vm-status


!! Cluster is in GLOBAL MAINTENANCE mode !!

--== Host puma18.scl.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 4442
Score                              : 3400
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : puma18.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 0da71d4c
conf_on_shared_storage             : True
local_conf_timestamp               : 4442
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=4442 (Mon Apr  6 21:56:47 2020)
        host-id=1
        score=3400
        vm_conf_refresh_time=4442 (Mon Apr  6 21:56:47 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False


--== Host puma19.scl.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 11437
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : puma19.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 75cf755d
conf_on_shared_storage             : True
local_conf_timestamp               : 11437
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=11437 (Mon Apr  6 21:56:37 2020)
        host-id=2
        score=3400
        vm_conf_refresh_time=11437 (Mon Apr  6 21:56:37 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

--== Host puma18.scl.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 4462
Score                              : 3400
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : puma18.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 4ee64603
conf_on_shared_storage             : True
local_conf_timestamp               : 4462
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=4462 (Mon Apr  6 21:57:07 2020)
        host-id=1
        score=3400
        vm_conf_refresh_time=4462 (Mon Apr  6 21:57:07 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False


--== Host puma19.scl.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 11467
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : puma19.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 2e430e8c
conf_on_shared_storage             : True
local_conf_timestamp               : 11467
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=11467 (Mon Apr  6 21:57:07 2020)
        host-id=2
        score=3400
        vm_conf_refresh_time=11467 (Mon Apr  6 21:57:07 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStart
        stopped=False


Tested on:
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
rhvm-appliance.x86_64 2:4.4-20200403.0.el8ev
Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)

Comment 2 Sandro Bonazzola 2020-05-20 20:04:23 UTC
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th, 2020.

Since the problem described in this bug report should be
resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
