Description of problem:

During an upgrade of a cluster, host 1 upgraded successfully. The HE VM and the SPM role moved to host 2. Host 2 then started its upgrade and therefore tried to enter maintenance mode. The SPM role moved to host 3 (the agent is down on that host), but the agent on host 1 was in the ReinitializeFSM state. The agent on host 2 then entered local maintenance mode:

MainThread::INFO::2018-09-06 01:34:01,982::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::WARNING::2018-09-06 01:34:02,411::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:34:02,411::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)

When the agent on host 1 came back, everything looked "ok":

MainThread::INFO::2018-09-06 01:34:58,154::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineForceStop-ReinitializeFSM) sent? sent
MainThread::INFO::2018-09-06 01:34:58,168::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2018-09-06 01:35:08,327::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineDown) sent? sent
MainThread::INFO::2018-09-06 01:35:08,776::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-09-06 01:35:18,802::states::486::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host lynx02.lab.eng.tlv2.redhat.com (id 2)
MainThread::INFO::2018-09-06 01:35:19,241::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)

Meanwhile, the agent on host 2 remained in local maintenance, and host 2 has been stuck on "Preparing for maintenance" ever since.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
vdsm-4.20.39-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.8.x86_64
ovirt-engine-4.2.6.4-0.1.el7ev.noarch

How reproducible:
Unknown

Steps to Reproduce:
1. Set up only 2 hosts with HA.
2. Set host 1 with the HE VM and SPM.
3. Upgrade host 1, reboot, and verify.
4. Upgrade host 2.

Actual results:
The upgrade failed on host 2 because it could not enter maintenance mode. Host 2 has been stuck in "Preparing for maintenance" since then (unusable).

Expected results:
The HE VM should migrate away to host 1; host 2 should then successfully enter maintenance mode and complete the upgrade.
At 2018-09-06 01:33:11,457, host 2 triggered an EngineUp-LocalMaintenanceMigrateVm transition, but the VM didn't migrate:

MainThread::INFO::2018-09-06 01:33:00,909::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2018-09-06 01:33:10,935::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::INFO::2018-09-06 01:33:11,457::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-LocalMaintenanceMigrateVm) sent? sent
MainThread::WARNING::2018-09-06 01:33:11,879::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:11,880::states::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to local maintenance mode
MainThread::INFO::2018-09-06 01:33:11,880::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
MainThread::INFO::2018-09-06 01:33:11,981::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-ReinitializeFSM) sent? sent
MainThread::WARNING::2018-09-06 01:33:11,982::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:11,982::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2018-09-06 01:33:22,007::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::INFO::2018-09-06 01:33:22,191::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-LocalMaintenance) sent? sent
MainThread::WARNING::2018-09-06 01:33:22,596::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:22,597::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2018-09-06 01:33:31,619::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::WARNING::2018-09-06 01:33:32,044::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:32,045::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2018-09-06 01:33:42,072::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::WARNING::2018-09-06 01:33:42,505::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:42,505::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2018-09-06 01:33:51,527::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::WARNING::2018-09-06 01:33:51,953::hosted_engine::601::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-09-06 01:33:51,954::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2018-09-06 01:34:01,980::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2018-09-06 01:34:01,981::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host lynx01.lab.eng.tlv2.redhat.com (id 1): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=136 (Thu Sep 6 01:33:45 2018)\nhost-id=1\nscore=0\nvm_conf_refresh_time=138 (Thu Sep 6 01:33:46 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=ReinitializeFSM\nstopped=False\n', 'hostname': 'lynx01.lab.eng.tlv2.redhat.com', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 0, 'stopped': False, 'maintenance': False, 'crc32': '540785a1', 'local_conf_timestamp': 138, 'host-ts': 136}
MainThread::INFO::2018-09-06 01:34:01,981::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host lynx03.lab.eng.tlv2.redhat.com (id 3): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=60183 (Wed Sep 5 17:00:37 2018)\nhost-id=3\nscore=0\nvm_conf_refresh_time=60181 (Wed Sep 5 17:00:35 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=AgentStopped\nstopped=True\n', 'hostname': 'lynx03.lab.eng.tlv2.redhat.com', 'alive': False, 'host-id': 3, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 0, 'stopped': True, 'maintenance': False, 'crc32': 'c7049afc', 'local_conf_timestamp': 60181, 'host-ts': 60183}
MainThread::INFO::2018-09-06 01:34:01,981::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'health': 'good', 'vm': 'up', 'detail': 'Up'}, 'bridge': True, 'mem-free': 17186.0, 'maintenance': True, 'cpu-load': 0.0489, 'gateway': 1.0, 'storage-domain': True}
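The metadata dump above already suggests why no migration happened: both remote hosts report score=0 (host 1 in ReinitializeFSM, host 3 with the agent stopped), so there was no viable destination for the HE VM. A minimal sketch of that destination-selection logic, assuming a simplified host dict shaped like the refresh output above (hypothetical helper name, not the actual ovirt-hosted-engine-ha code):

```python
# Hypothetical illustration of migration-destination selection; NOT the
# actual ovirt-hosted-engine-ha implementation. A host is only a viable
# destination if its agent is alive and its score is positive.

def pick_migration_destination(hosts, local_host_id):
    """Return the hostname with the best positive score, or None."""
    candidates = [
        h for h in hosts
        if h["host-id"] != local_host_id
        and h["alive"]
        and h["score"] > 0
    ]
    if not candidates:
        return None  # nowhere to migrate the HE VM
    return max(candidates, key=lambda h: h["score"])["hostname"]

# Cluster state as reported in the metadata dump: both remote hosts score 0.
hosts = [
    {"host-id": 1, "hostname": "lynx01.lab.eng.tlv2.redhat.com",
     "alive": True, "score": 0},    # state=ReinitializeFSM
    {"host-id": 3, "hostname": "lynx03.lab.eng.tlv2.redhat.com",
     "alive": False, "score": 0},   # state=AgentStopped
]
print(pick_migration_destination(hosts, local_host_id=2))  # -> None
```

Under these assumptions, the selection returns None, matching the observed behavior of host 2 having no host to migrate to while host 1 was reinitializing.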
The issue is that we had:

MainThread::INFO::2018-09-06 01:33:11,880::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
MainThread::INFO::2018-09-06 01:33:11,981::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-ReinitializeFSM) sent? sent

instead of the expected:

MainThread::INFO::2018-09-05 16:12:25,904::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
MainThread::INFO::2018-09-05 16:12:26,015::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-EngineMigratingAway) sent? sent
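The two transitions compared above can be sketched as a single branch out of the LocalMaintenanceMigrateVm state: if a migration was actually started, the agent should move to EngineMigratingAway; otherwise it falls back to ReinitializeFSM, which is the path observed in this bug. This is a hypothetical illustration of that choice (the function name and the migration_started flag are assumptions, not the real FSM code):

```python
# Hypothetical sketch of the state choice discussed above; NOT the actual
# ovirt-hosted-engine-ha FSM. migration_started stands for "a destination
# was found and the migrate call succeeded".

def next_state_from_migrate_vm(migration_started):
    """Transition out of LocalMaintenanceMigrateVm."""
    if migration_started:
        return "EngineMigratingAway"   # expected path (2018-09-05 logs)
    return "ReinitializeFSM"           # observed path in this bug
```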
Seems related to the HE agent only; reassigning.
Are you able to reproduce this one with the latest build?
Postponing to 4.3.0; not identified as a blocker for 4.2.8.
I tried to reproduce it but encountered a new bug: https://bugzilla.redhat.com/show_bug.cgi?id=1665934. IMO this bug still exists, but with a low or even very low reproduction rate.
Moving to 4.3.2; not identified as a blocker for 4.3.1.
Marking as a duplicate of bug 1665934; probably the same root cause. *** This bug has been marked as a duplicate of bug 1665934 ***