Bug 1629454 - Host stuck in preparing to maintenance after agent restart
Summary: Host stuck in preparing to maintenance after agent restart
Keywords:
Status: CLOSED DUPLICATE of bug 1665934
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.2.14
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.3.2
: ---
Assignee: Simone Tiraboschi
QA Contact: Liran Rotenberg
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-09-16 13:28 UTC by Liran Rotenberg
Modified: 2019-02-20 08:49 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-20 08:49:57 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.3+


Attachments (Terms of Use)

Description Liran Rotenberg 2018-09-16 13:28:06 UTC
Description of problem:
During upgrade of a cluster, host 1 succeed to upgrade. HE VM and SPM passed to host 2.
Then, host 2 started to upgrade, therefore trying to enter into maintenance mode.
SPM passed to host 3(agent is down on this host), but the agent on host 1 was in ReinitializeFSM state.
Then, host 2 agent entered to local maintenance mode:

MainThread::INFO::2018-09-06 01:34:01,982::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::WARNING::2018-09-06 01:34:02,411::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:34:02,411::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0) 

When host 1 agent came back everything was "ok":
MainThread::INFO::2018-09-06 01:34:58,154::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineForceStop-ReinitializeFSM) sent? sent
MainThread::INFO::2018-09-06 01:34:58,168::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2018-09-06 01:35:08,327::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineDown) sent? sent
MainThread::INFO::2018-09-06 01:35:08,776::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-09-06 01:35:18,802::states::486::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host lynx02.lab.eng.tlv2.redhat.com (id 2)
MainThread::INFO::2018-09-06 01:35:19,241::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)

While on host 2 agent keep being in local maintenance.

Host 2 stayed on "preparing to maintenance" since then.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
vdsm-4.20.39-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.8.x86_64
ovirt-engine-4.2.6.4-0.1.el7ev.noarch

How reproducible:
Unknown

Steps to Reproduce:
1. Set only 2 hosts with HA.
2. Set host 1 with HE VM and SPM.
3. Upgrade host 1, reboot and verify.
4. Upgrade host 2.

Actual results:
Upgrade failed on host 2, due to unsuccessful enter into maintenance mode. 
Host 2 stuck since then in preparing to maintenance (unusable).

Expected results:
HE VM should migrate away to host 1, then host 2 will successfully enter into maintenance mode and the upgrade.

Comment 2 Simone Tiraboschi 2018-09-17 09:47:43 UTC
2018-09-06 01:33:11,457 host 2 triggered a EngineUp-LocalMaintenanceMigrateVm but it didn't migrated.

MainThread::INFO::2018-09-06 01:33:00,909::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2018-09-06 01:33:10,935::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::INFO::2018-09-06 01:33:11,457::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-LocalMaintenanceMigrateVm) sent? sent
MainThread::WARNING::2018-09-06 01:33:11,879::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:11,880::states::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to local maintenance mode
MainThread::INFO::2018-09-06 01:33:11,880::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
MainThread::INFO::2018-09-06 01:33:11,981::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-ReinitializeFSM) sent? sent
MainThread::WARNING::2018-09-06 01:33:11,982::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:11,982::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2018-09-06 01:33:22,007::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::INFO::2018-09-06 01:33:22,191::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-LocalMaintenance) sent? sent
MainThread::WARNING::2018-09-06 01:33:22,596::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:22,597::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2018-09-06 01:33:31,619::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::WARNING::2018-09-06 01:33:32,044::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:32,045::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2018-09-06 01:33:42,072::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::WARNING::2018-09-06 01:33:42,505::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:42,505::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2018-09-06 01:33:51,527::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::WARNING::2018-09-06 01:33:51,953::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:51,954::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2018-09-06 01:34:01,980::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2018-09-06 01:34:01,981::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host lynx01.lab.eng.tlv2.redhat.com (id 1): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=136 (Thu Sep  6 01:33:45 2018)\nhost-id=1\nscore=0\nvm_conf_refresh_time=138 (Thu Sep  6 01:33:46 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=ReinitializeFSM\nstopped=False\n', 'hostname': 'lynx01.lab.eng.tlv2.redhat.com', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 0, 'stopped': False, 'maintenance': False, 'crc32': '540785a1', 'local_conf_timestamp': 138, 'host-ts': 136}
MainThread::INFO::2018-09-06 01:34:01,981::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host lynx03.lab.eng.tlv2.redhat.com (id 3): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=60183 (Wed Sep  5 17:00:37 2018)\nhost-id=3\nscore=0\nvm_conf_refresh_time=60181 (Wed Sep  5 17:00:35 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=AgentStopped\nstopped=True\n', 'hostname': 'lynx03.lab.eng.tlv2.redhat.com', 'alive': False, 'host-id': 3, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 0, 'stopped': True, 'maintenance': False, 'crc32': 'c7049afc', 'local_conf_timestamp': 60181, 'host-ts': 60183}
MainThread::INFO::2018-09-06 01:34:01,981::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'health': 'good', 'vm': 'up', 'detail': 'Up'}, 'bridge': True, 'mem-free': 17186.0, 'maintenance': True, 'cpu-load': 0.0489, 'gateway': 1.0, 'storage-domain': True}

Comment 3 Simone Tiraboschi 2018-09-17 11:49:01 UTC
The issue is that we had a 

MainThread::INFO::2018-09-06 01:33:11,880::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
MainThread::INFO::2018-09-06 01:33:11,981::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-ReinitializeFSM) sent? sent


instead of a 
MainThread::INFO::2018-09-05 16:12:25,904::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
MainThread::INFO::2018-09-05 16:12:26,015::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-EngineMigratingAway) sent? sent


as expected.

Comment 4 Michal Skrivanek 2018-09-17 12:02:21 UTC
seems related to HE agent only, reassigning

Comment 5 Sandro Bonazzola 2018-12-10 11:39:08 UTC
Are you able to rreproduce this one with latest build?

Comment 6 Sandro Bonazzola 2018-12-13 12:53:33 UTC
Postponing to 4.3.0 not being identified as blocker for 4.2.8

Comment 7 Liran Rotenberg 2019-01-14 13:58:05 UTC
I tried to reproduce it but encounter a new bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1665934

IMO, this bug exists, with a low or even a very low reproduce rate.

Comment 8 Sandro Bonazzola 2019-02-18 07:54:52 UTC
Moving to 4.3.2 not being identified as blocker for 4.3.1.

Comment 9 Sandro Bonazzola 2019-02-20 08:49:57 UTC
Marking as duplicate of 1665934, probably same root cause.

*** This bug has been marked as a duplicate of bug 1665934 ***


Note You need to log in before you can comment on or make changes to this bug.