Bug 1815589 - The HA agent trying to start HE VM in source host after successful HE migration
Summary: The HA agent trying to start HE VM in source host after successful HE migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.3.8
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.5
Target Release: 4.4.5
Assignee: Yedidyah Bar David
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-20 16:22 UTC by nijin ashok
Modified: 2024-03-25 15:45 UTC
CC: 11 users

Fixed In Version: ovirt-hosted-engine-ha-2.4.6-1.el8ev
Doc Type: Bug Fix
Doc Text:
Previously, following a successful migration on the Self-hosted Engine, the HA agent on the source host immediately moved to the state EngineDown, and shortly thereafter tried to start the engine locally if the destination host did not update the shared storage quickly enough to mark the Manager virtual machine as up. As a result, starting the virtual machine failed due to a shared lock held by the destination host. This also resulted in false alarms and notifications. In this release, the HA agent first moves to the state EngineMaybeAway, providing the destination host more time to update the shared storage with the updated state. As a result, no notifications or false alarms are generated. Note: in scenarios where the virtual machine needs to be started on the source host, this fix slightly increases the time it takes the Manager virtual machine on the source host to start.
Clone Of:
Environment:
Last Closed: 2021-04-14 11:38:43 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4925501 0 None None None 2020-03-24 17:21:32 UTC
Red Hat Product Errata RHSA-2021:1184 0 None None None 2021-04-14 11:39:25 UTC
oVirt gerrit 113196 0 master MERGED agent: On EngineUp and vm down, go to EngineMaybeAway 2021-02-19 18:00:08 UTC

Description nijin ashok 2020-03-20 16:22:04 UTC
Description of problem:

Because of a race condition in monitoring, the HA agent tries to start the HE VM on the source host after a successful migration.

The migration was completed successfully on 2020-03-19 10:55:11.

===
MainThread::DEBUG::2020-03-19 22:10:45,809::brokerlink::106::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status {"health": "good", "vm": "up", "detail": "Migration Source"}

MainThread::INFO::2020-03-19 22:10:46,053::hosted_engine::512::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)

MainThread::DEBUG::2020-03-19 22:10:55,074::brokerlink::106::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(get_monitor_status) Success, status {"reason": "VM migrated away successfully", "health": "bad", "vm": "down", "detail": "Down"}
===

I have added two extra log statements in my test environment to see the best_engine, lm and new_data values, as below.

===
state_machine.py
    136         # Compare the best engine remote values with the local state
    137         lm = stats.local
    138         self.logger.info("custom_log: lm status is %s",lm)
    139 
    140         if (
    141             not alive_hosts or engine_status_score(lm["engine-health"]) >=
    142             engine_status_score(best_engine["engine-status"])
    143         ):
    144             new_data["best_engine_status"] = lm["engine-health"]
    145             new_data["best_engine_host_id"] = self.hosted_engine.host_id
    146         else:
    147             new_data["best_engine_status"] = best_engine["engine-status"]
    148             new_data["best_engine_host_id"] = best_engine["host-id"]
    149             self.logger.info("custom_log: best_engine is %s",best_engine)
.............
.............
.............
    194             # The rest of changes as computed by the code above
    195             **new_data)
    196         self.logger.info("custom_log: new_data is %s",new_data)
    197         return new_data
====


As per that, even after a successful migration, the agent selected the source host as the best engine host.

===
MainThread::INFO::2020-03-19 22:10:55,081::state_machine::138::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) custom_log:lm status is {'engine-health': {'reason': 'VM migrated away successfully', 'health': 'bad', 'vm': 'down', 'detail': 'Down'}, 'bridge': True, 'network': 1.0, 'mem-free': 1414.0, 'maintenance': False, 'cpu-load': 0.0, 'storage-domain': True}

MainThread::INFO::2020-03-19 22:10:55,082::state_machine::195::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) custom_log: new_data is HostedEngineData(host_id=2, history=(StatsData(metadata_too_new=False, cluster={'maintenance': False}, host_id=2, hosts={1: {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=105619 (Thu Mar 19 22:10:45 2020)\nhost-id=1\nscore=3000\nvm_conf_refresh_time=105619 (Thu Mar 19 22:10:45 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineDown\nstopped=False\n', 'hostname': 'vm255-50.example.com', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'down', 'detail': 'Migration Destination'}, ....

best_engine_status={'reason': 'VM migrated away successfully', 'health': 'bad', 'vm': 'down', 'detail': 'Down'}, best_engine_host_id=2, best
_score_host={'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=105619 (Thu Mar 19 22:10:45 2020)\nhost-id=1\nscore=3000\nvm_conf_refresh_time=105619 (Thu Mar 19 22:10:45 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineDown\nstopped=False\n', 'hostname': 'vm255-50.example.com', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'down', 'detail': 'Migration Destination'}, 'score': 3000, 'stopped': False, 'maintenance': False, 'crc32': '4a5808c6', 'local_conf_timestamp': 105619, 'host-ts': 105619}, alive_hosts=[1])

MainThread::DEBUG::2020-03-19 22:10:55,101::hosted_engine::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Processing engine state <ovirt_hosted_engine_ha.agent.states.EngineDown object at 0x7f9863db8460>
MainThread::INFO::2020-03-19 22:10:55,332::brokerlink::71::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineDown) sent? sent

MainThread::INFO::2020-03-19 22:10:55,563::hosted_engine::512::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3000)

MainThread::INFO::2020-03-19 22:11:05,587::states::467::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3000), attempting to start engine VM

MainThread::DEBUG::2020-03-19 22:11:05,595::hosted_engine::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Processing engine state <ovirt_hosted_engine_ha.agent.states.EngineStart object at 0x7f985adeeeb0>
MainThread::INFO::2020-03-19 22:11:05,731::brokerlink::71::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? sent

MainThread::INFO::2020-03-19 22:11:06,231::hosted_engine::847::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
====

Since "best_engine_status" points to the source host, where the VM status is down, the agent tries to start the HE VM in the "EngineDown" transition. Sanlock blocks the VM startup operation, as expected.

The race is that, after the successful migration, the source host's monitoring thread reads the metadata before the destination host writes its status to the storage, so it assumes the VM is down on both hosts.
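
To illustrate the effect: below is a minimal, self-contained Python sketch (not the actual ovirt-hosted-engine-ha code, and with an assumed, simplified engine_status_score ranking) of the comparison from state_machine.py quoted above. When both the local status and the best remote status report the VM as down, the ">=" favors the local host, so best_engine_host_id stays the source host and the agent concludes the engine is down everywhere.

===
def engine_status_score(status):
    # Simplified ranking, assumed only for illustration: a healthy "up"
    # VM outranks an unhealthy "up" VM, which outranks "down".
    if status.get("vm") == "up" and status.get("health") == "good":
        return 3
    if status.get("vm") == "up":
        return 2
    return 1  # vm down, regardless of the reason

local_host_id = 2
lm_engine_health = {"vm": "down", "health": "bad",
                    "reason": "VM migrated away successfully"}

# What the destination host (id 1) last wrote to the shared storage;
# it has not refreshed it yet after the migration finished.
best_engine = {"host-id": 1,
               "engine-status": {"vm": "down", "health": "bad",
                                 "detail": "Migration Destination"}}

if engine_status_score(lm_engine_health) >= \
        engine_status_score(best_engine["engine-status"]):
    best_engine_host_id = local_host_id      # tie -> local (source) host wins
else:
    best_engine_host_id = best_engine["host-id"]

print(best_engine_host_id)  # 2: the source host, exactly as in the logs below
===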

=====
MainThread::INFO::2020-03-19 22:10:45,610::state_machine::138::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) custom_log: lm status is {'engine-health': {'reason': 'bad vm status', 'health': 'bad', 'vm': 'down', 'detail': 'Migration Destination'}, 'bridge': True, 'network': 1.0, 'mem-free': 1485.0, 'maintenance': False, 'cpu-load': 0.0995, 'storage-domain': True}

MainThread::INFO::2020-03-19 22:10:45,839::hosted_engine::512::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3000)
MainThread::DEBUG::2020-03-19 22:10:45,840::hosted_engine::720::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_generate_local_blocks) Generated 2 blocks:
1|1|105619|1|3000|{"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "Migration Destination"}|vm255-50.example.com|0|0|4a5808c6|1|105619
<\0 padding>
metadata_parse_version=1
metadata_feature_version=1
timestamp=105619 (Thu Mar 19 22:10:45 2020)
host-id=1
score=3000
vm_conf_refresh_time=105619 (Thu Mar 19 22:10:45 2020)
conf_on_shared_storage=True
maintenance=False
state=EngineDown
stopped=False
MainThread::DEBUG::2020-03-19 22:10:45,840::brokerlink::114::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(put_stats_on_storage) Storing blocks on storage

MainThread::INFO::2020-03-19 22:10:55,996::state_machine::138::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) custom_log: lm status is {'engine-health': {'health': 'good', 'vm': 'up', 'detail': 'Up'}, 'bridge': True, 'network': 1.0, 'mem-free': 1501.0, 'maintenance': False, 'cpu-load': 0.0995, 'storage-domain': True}
====

At 22:10:45, the migration was still not complete, so the destination host wrote the status as "health": "bad", "vm": "down", which was read by the source host at 2020-03-19 22:10:55. The destination host only writes the "up" status to the metadata after the successful migration, at 22:10:57.

===
MainThread::INFO::2020-03-19 22:10:57,058::hosted_engine::512::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3000)
MainThread::DEBUG::2020-03-19 22:10:57,059::hosted_engine::720::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_generate_local_blocks) Generated 2 blocks:
1|1|105629|1|3000|{"health": "good", "vm": "up", "detail": "Up"}|vm255-50.example.com|0|0|765a69b3|1|105630
<\0 padding>
metadata_parse_version=1
metadata_feature_version=1
timestamp=105629 (Thu Mar 19 22:10:55 2020)
host-id=1
score=3000
vm_conf_refresh_time=105630 (Thu Mar 19 22:10:56 2020)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
====

I think we might have to add an extra check for manual migration in the EngineDown transition, by checking whether the reason is "VM migrated away successfully".
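
Purely as an illustration of that idea (this is NOT the fix that was eventually merged, which moves to EngineMaybeAway instead), a hypothetical check in the EngineDown logic could look like this:

===
MIGRATED_AWAY = "VM migrated away successfully"

def should_delay_local_start(local_engine_health):
    """Hypothetical helper: return True when the local 'down' state is the
    result of a successful outgoing migration, so the agent should wait for
    the destination host to publish its status instead of starting the VM
    locally."""
    return (local_engine_health.get("vm") == "down"
            and local_engine_health.get("reason") == MIGRATED_AWAY)

# Example with the status from the logs above:
status = {"reason": "VM migrated away successfully", "health": "bad",
          "vm": "down", "detail": "Down"}
print(should_delay_local_start(status))  # True -> don't attempt a local start yet
===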

Version-Release number of selected component (if applicable):

vdsm-4.30.40-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch

How reproducible:

I am able to reproduce it 2 out of 5 times when manually migrating the HE VM.

Steps to Reproduce:

1. Manually migrate the HE VM and check if the source host falls into a bad state, showing the HA score as "0" when polled with hosted-engine --vm-status. In the RHVM Web UI, this is seen as a grey crown with a red exclamation mark on top, next to the hostname, indicating that the host is no longer capable of running the HE VM.

Actual results:

The HA agent tries to start the HE VM on the source host after a successful HE migration.

Expected results:

It should not attempt to start the HE VM on the source host since the migration was successful.

Additional info:

Comment 7 Yedidyah Bar David 2021-01-27 12:32:30 UTC
I see several different issues here:

1. Needlessly, even if temporarily, having score 0, red exclamation marks in the UI, and perhaps associated monitoring notifications.

2. The risk that sanlock won't do its job and we'll run into a split brain.

Can you clarify exactly what you want to fix, and what the proposed fix is, if at all?

Also:

Currently, the state machine maintained by ha-agent [1] has no notion of migration whatsoever. "migrat" does not appear there.
One might claim that the proper solution is to add one or more states involving migration there - perhaps EngineMigratingAway and EngineMigratingIn. We would then have to carefully decide both when to enter these states and where to go from them (the latter is easier, I guess). In the past, we did have something similar, and it was removed by [2], as part of the decision to stop supporting migration initiated by the hosts themselves - a feature that caused various problems in the past, see the linked bugs - in favor of requiring all migrations to be initiated (and monitored) by the engine itself (with the help of vdsm/libvirt, but with no involvement of the HA agent/broker). I asked Simone in private about partially reverting [2], restoring only the monitoring code, and he said this wasn't that robust either. I am also considering simply moving from EngineUp not directly to EngineDown but to ReinitializeFSM or EngineUnexpectedlyDown, which would give the other host some more time to update the status.

[1] https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/agent/states.py

[2] https://gerrit.ovirt.org/q/I42c810987d24e05ec2002ec38ecf2bff1f134290
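
For illustration only, here is a schematic sketch (not the real states.py) of that last direction, which is also the one the merged patch eventually took (gerrit 113196, "On EngineUp and vm down, go to EngineMaybeAway"): when the local VM is reported down while the agent is in EngineUp, transition to an intermediate state first, which gives the destination host extra monitoring cycles to publish its "up" state before the source host concludes the engine is down everywhere.

===
def next_state_before_fix(vm_down_locally):
    # Behaviour described in this bug: go straight to EngineDown,
    # which may immediately attempt a local start.
    return "EngineDown" if vm_down_locally else "EngineUp"

def next_state_after_fix(vm_down_locally):
    # Go through an intermediate state instead, waiting for the other
    # host's metadata to catch up before deciding the engine is down.
    return "EngineMaybeAway" if vm_down_locally else "EngineUp"

print(next_state_before_fix(True))   # EngineDown
print(next_state_after_fix(True))    # EngineMaybeAway
===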

Comment 8 Yedidyah Bar David 2021-02-01 06:27:47 UTC
QE: I only tested the positive case. I ran a loop putting the active host into maintenance, thus causing the VM to migrate. It's enough to run a loop of migrations; no need for maintenance.

In a reproduction (meaning, with older versions), you see this in agent.log:

    Another host already took over

With a fixed version, you should see no such lines.

Please check how this affects the negative case - e.g. try to kill the VM in the middle of migration (kill both source and destination qemu processes) and see how long it takes for one of the hosts to start the VM (if at all). Thanks.

Comment 10 Nikolai Sednev 2021-02-09 18:36:58 UTC
The first state on the source host is state=EngineMaybeAway, then it moves to state=EngineDown, as changed by design.
I also didn't see the score being dropped to 0 on the source host during migration.
I did see "Another host already took over.." though, but only once, on the alma04 secondary ha-host:
MainThread::INFO::2021-02-09 19:10:15,189::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine
::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2021-02-09 19:10:15,189::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine
::(_monitoring_loop) Best remote host alma03.qa.lab.tlv.redhat.com (id: 1, score: 3400)
MainThread::INFO::2021-02-09 19:10:25,213::states::745::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(cons
ume) Another host already took over..
MainThread::INFO::2021-02-09 19:10:25,214::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngi
ne::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class
 'ovirt_hosted_engine_ha.agent.states.EngineForceStop'>
MainThread::INFO::2021-02-09 19:10:25,242::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) 
Success, was notification of state_transition (EngineStarting-EngineForceStop) sent? ignored

When killing the qemu process on both hosts during migration of the HE VM from alma04 to alma03, the kill doesn't prevent the VM's restart; it comes up right away, but the score drops after the failure at 20:04:16,556 and only returns to normal at 20:13:49,020, hence I'd say the cooldown is 10-11 minutes:

After killing the qemu process on both hosts during migration, they both receive a score of 0 and then return to normal:
alma04:MainThread::INFO::2021-02-09 20:04:16,556::states::379::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm was unexpectedly shut down
alma03:MainThread::INFO::2021-02-09 20:05:02,419::states::783::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(cons
ume) Engine vm was unexpectedly shut down



alma04 ~]# hosted-engine --vm-status


--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 10291
Score                              : 0
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : alma03.qa.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : bd4c58b5
conf_on_shared_storage             : True
local_conf_timestamp               : 10291
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=10291 (Tue Feb  9 20:06:02 2021)
        host-id=1
        score=0
        vm_conf_refresh_time=10291 (Tue Feb  9 20:06:02 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Thu Jan  1 05:00:01 1970


--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 10181
Score                              : 0
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : alma04.qa.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 647baf44
conf_on_shared_storage             : True
local_conf_timestamp               : 10181
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=10181 (Tue Feb  9 20:06:07 2021)
        host-id=2
        score=0
        vm_conf_refresh_time=10181 (Tue Feb  9 20:06:07 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Thu Jan  1 04:57:20 1970

alma03 ~]# hosted-engine --vm-status


--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 11194
Score                              : 3400
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : alma03.qa.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 2e67ee71
conf_on_shared_storage             : True
local_conf_timestamp               : 11194
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=11194 (Tue Feb  9 20:21:05 2021)
        host-id=1
        score=3400
        vm_conf_refresh_time=11194 (Tue Feb  9 20:21:05 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False


--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 11074
Score                              : 3400
Engine status                      : {"vm": "up", "health": "good", "detail": "Up"}
Hostname                           : alma04.qa.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 0c288621
conf_on_shared_storage             : True
local_conf_timestamp               : 11075
Status up-to-date                  : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=11074 (Tue Feb  9 20:21:00 2021)
        host-id=2
        score=3400
        vm_conf_refresh_time=11075 (Tue Feb  9 20:21:00 2021)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False

alma04:MainThread::INFO::2021-02-09 20:13:38,839::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUnexpectedlyDown (score: 0)
MainThread::INFO::2021-02-09 20:13:48,900::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUnexpectedlyDown-EngineDown) sent? ignored
MainThread::INFO::2021-02-09 20:13:49,020::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)

Moving to verified, as there is no longer an exclamation mark seen via the RHVM Web UI during manual engine migration, and the score is not dropped to 0 in the CLI.

Tested on:
ovirt-hosted-engine-ha-2.4.6-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.9-4.el8ev.noarch
vdsm-4.40.50.4-1.el8ev.x86_64
ansible-2.9.17-1.el8ae.noarch
ovirt-ansible-collection-1.3.0-1.el8ev.noarch
rhvm-appliance-4.4-20201117.0.el8ev.x86_64
Linux 4.18.0-240.15.1.el8_3.x86_64 #1 SMP Wed Feb 3 03:12:15 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.3 (Ootpa)

Comment 11 Yedidyah Bar David 2021-02-15 11:03:26 UTC
(In reply to Nikolai Sednev from comment #10)
> First state on source host is state=EngineMaybeAway, then its moved to
> state=EngineDown as changed by design.
> I also didn't seen score being dropped to 0 on source host during migration.
> I saw though " Another host already took over.." once, but only once, on
> alma04 secondary ha-host:
> MainThread::INFO::2021-02-09
> 19:10:15,189::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine
> ::(_monitoring_loop) Current state EngineStarting (score: 3400)
> MainThread::INFO::2021-02-09
> 19:10:15,189::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine
> ::(_monitoring_loop) Best remote host alma03.qa.lab.tlv.redhat.com (id: 1,
> score: 3400)
> MainThread::INFO::2021-02-09
> 19:10:25,213::states::745::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine::(cons
> ume) Another host already took over..

I hoped this won't happen even once. Can you please attach all relevant logs
from all hosts? Thanks.

Comment 12 Yedidyah Bar David 2021-02-15 11:14:01 UTC
(In reply to Yedidyah Bar David from comment #11)
> (In reply to Nikolai Sednev from comment #10)
> > First state on source host is state=EngineMaybeAway, then its moved to
> > state=EngineDown as changed by design.
> > I also didn't seen score being dropped to 0 on source host during migration.
> > I saw though " Another host already took over.." once, but only once, on
> > alma04 secondary ha-host:
> > MainThread::INFO::2021-02-09
> > 19:10:15,189::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.
> > HostedEngine
> > ::(_monitoring_loop) Current state EngineStarting (score: 3400)
> > MainThread::INFO::2021-02-09
> > 19:10:15,189::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.
> > HostedEngine
> > ::(_monitoring_loop) Best remote host alma03.qa.lab.tlv.redhat.com (id: 1,
> > score: 3400)
> > MainThread::INFO::2021-02-09
> > 19:10:25,213::states::745::ovirt_hosted_engine_ha.agent.hosted_engine.
> > HostedEngine::(cons
> > ume) Another host already took over..
> 
> I hoped this won't happen even once. Can you please attach all relevant logs
> from all hosts? Thanks.

To clarify: This does make sense in other flows. In particular, during the
initial hosted-engine deployment, the local HE VM is shut down, and then we
let the HA agents start it "by themselves". It does make sense that they'll
both try to start it and only one will win. I consider the fix for the current bug
to apply mainly to the flow for which it was designed, which is migration,
although nothing in the patch is specific to this case.

Comment 13 Nikolai Sednev 2021-02-15 11:27:33 UTC
(In reply to Yedidyah Bar David from comment #11)
> (In reply to Nikolai Sednev from comment #10)
> > First state on source host is state=EngineMaybeAway, then its moved to
> > state=EngineDown as changed by design.
> > I also didn't seen score being dropped to 0 on source host during migration.
> > I saw though " Another host already took over.." once, but only once, on
> > alma04 secondary ha-host:
> > MainThread::INFO::2021-02-09
> > 19:10:15,189::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.
> > HostedEngine
> > ::(_monitoring_loop) Current state EngineStarting (score: 3400)
> > MainThread::INFO::2021-02-09
> > 19:10:15,189::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.
> > HostedEngine
> > ::(_monitoring_loop) Best remote host alma03.qa.lab.tlv.redhat.com (id: 1,
> > score: 3400)
> > MainThread::INFO::2021-02-09
> > 19:10:25,213::states::745::ovirt_hosted_engine_ha.agent.hosted_engine.
> > HostedEngine::(cons
> > ume) Another host already took over..
> 
> I hoped this won't happen even once. Can you please attach all relevant logs
> from all hosts? Thanks.

Not at the moment, the servers are down for maintenance, and this issue appears once the second ha-host is added for the first time.

Comment 14 Yedidyah Bar David 2021-02-15 12:38:03 UTC
(In reply to Yedidyah Bar David from comment #12)
> To clarify: This does make sense on other flows. In particular, during the
> initial hosted-engine deployment, the local HE VM it shut down, and then we
> let the HA agents start it "by themselves". It does make sense that they'll
> both try to start it and only one will win. I consider the fix for current
> bug
> to apply mainly for the flow for which it was designed, which is for
> migration,
> although nothing in the patch is specific for this case.

Sorry for the noise - above is irrelevant, as there is only one host then.

(In reply to Nikolai Sednev from comment #13)

> Not at the moment, servers are down for maintenance and this issue appears
> once the second ha-host being added for the first time.

Yes, I saw that too, and it's ok IMO - and anyway, it is not the flow of the
current bug.

It happened for me as well in this flow - right after adding another host.
Still not sure about the exact cause. It's a different flow. Feel free to
open another bug if you think it's worth it. I do think, though, that the
logs are not detailed enough to allow a clear understanding of why this
happens, which might be a worthy bug as well.

Comment 16 Eli Marcus 2021-03-25 17:44:01 UTC
Hi Didi, please review this updated doc text: 

Previously, following a successful migration on the Self-hosted Engine, the HA agent on the source host immediately moved to the state EngineDown, and shortly thereafter tried to start the engine locally if the destination host did not update the shared storage quickly enough to mark the Manager virtual machine as up.
As a result, starting the virtual machine failed due to a shared lock held by the destination host. This also resulted in generating false alarms and notifications.

In this release, the HA agent first moves to the state EngineMaybeAway, providing the destination host more time to update the shared storage with the updated state. As a result, no notifications or false alarms are generated. Note that this slightly increases the time it takes the Manager virtual machine on the source host to start.

Comment 17 Yedidyah Bar David 2021-04-04 07:57:35 UTC
Mostly ok, but your last sentence is missing an important conditional:

"Note that this slightly increases the time it takes the Manager virtual machine on the source host to start."

In the common case, the VM does not need to start at all on the source host, so it simply never starts.

Copying my own last statement from previous version:

"Please note, that this does increase slightly the time it would take for the engine VM to be started on the source host, if this actually is needed - if the migration seems to have succeeded from the source hosts's POV, but later proved to have failed or being problematic somehow.".

Please amend as applicable.

To clarify: The flow in which this sentence applies, is:

1. A migration of the engine VM from host S to host T is started.
2. The migration finishes, from host S's POV (meaning, qemu/libvirt/vdsm think it finished), successfully.
3. On host T, migration actually failed, for some reason.
4. Eventually, host S should notice that the engine is down and try starting the engine vm on itself.

Without the patch for current bug, the time between (3.) and (4.) is shorter, than with the patch.

Comment 18 Eli Marcus 2021-04-04 13:01:25 UTC
How is this phrasing for the additional note?

"Note: in scenarios where the virtual machine needs to be started on the source host, this fix slightly increases the time it takes the Manager virtual
machine on the source host to start."


(In reply to Yedidyah Bar David from comment #17)
> Mostly ok, but your last sentence is missing an important conditional:
> 
> "Note that this slightly increases the time it takes the Manager virtual
> machine on the source host to start."
> 
> On the common case, the vm does not need to start at all on the source host,
> so it simply never starts.
> 
> Copying my own last statement from previous version:
> 
> "Please note, that this does increase slightly the time it would take for
> the engine VM to be started on the source host, if this actually is needed -
> if the migration seems to have succeeded from the source hosts's POV, but
> later proved to have failed or being problematic somehow.".
> 
> Please amend as applicable.
> 
> To clarify: The flow in which this sentence applies, is:
> 
> 1. A migration of the engine VM from host S to host T is started.
> 2. The migration finishes, from host S's POV (meaning, qemu/libvirt/vdsm
> think it finished), successfully.
> 3. On host T, migration actually failed, for some reason.
> 4. Eventually, host S should notice that the engine is down and try starting
> the engine vm on itself.
> 
> Without the patch for current bug, the time between (3.) and (4.) is
> shorter, than with the patch.

Comment 19 Yedidyah Bar David 2021-04-04 13:23:33 UTC
(In reply to Eli Marcus from comment #18)
> How is this phrasing for the additional note?
> 
> "Note: in scenarios where the virtual machine needs to be started on the
> source host, this fix slightly increases the time it takes the Manager
> virtual
> machine on the source host to start."

OK for me. Thanks!

Comment 24 errata-xmlrpc 2021-04-14 11:38:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1184

