Bug 1567615

Summary: Hosted engine - engine restarted during hosted engine VM migration -ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost
Product: [oVirt] ovirt-hosted-engine-ha Reporter: Avihai <aefrat>
Component: AgentAssignee: Andrej Krejcir <akrejcir>
Status: CLOSED CURRENTRELEASE QA Contact: Nikolai Sednev <nsednev>
Severity: high Docs Contact:
Priority: high    
Version: 2.2.10CC: aefrat, akrejcir, bugs, michal.skrivanek, msivak, ratamir, stirabos, ylavi
Target Milestone: ovirt-4.2.3Keywords: Automation, Triaged
Target Release: ---Flags: ylavi: ovirt-4.2+
ylavi: blocker+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-hosted-engine-ha-2.2.11-1.el7ev Doc Type: Bug Fix
Doc Text:
Cause: In the hosted engine agent, the transition to global or local maintenance state happened before clearing a timeout. Consequence: If the agent was in a state that used timeout, for example 'EngineStarting' state, and a maintenance mode was set, the timeout would never be cleared. If the maintenance mode would be active for a longer time, the timeout would be reached. As a result, when the maintenance mode is cleared, the HE VM would not start. The agent would not wait for the VM to start and immediately destroy it, because the timeout is reached. Fix: Clearing a timeout happens before transition to maintenance state. Result: The VM can be started.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-10 06:30:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Integration RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1458711    
Attachments:
Description Flags
relevant engine , vdsm logs, agent log
none
host3 vdsm.log.13.xz none

Description Avihai 2018-04-15 14:03:30 UTC
Created attachment 1422193 [details]
relevant engine , vdsm logs, agent log

Description of problem:
Automation test TestCase9854 move each of the 3 hosts , one by one ,to maintenance & then back to active again .

host_mixed_3 as the active hosted engine VM .
When maintanance is called to this host it starts to migrate the hosted engine VM to host_mixed_2 but something goes wrong as the VM/Engine goes down at host_mixed_2 , see logs below.

****Engine log- Hosted engine VM goes down & engine restarted ****
2018-04-14 19:40:35,852+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-12) [] START, DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='a6c12094-82b1-433c-b0b5-117d1de7ddc6', vmId='5eeb9463-2eb7-488e-b7e0-dd08130d91f4', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 403c2ff
2018-04-14 19:40:37,222+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-12) [] FINISH, DestroyVDSCommand, log id: 403c2ff
2018-04-14 19:40:37,222+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM '5eeb9463-2eb7-488e-b7e0-dd08130d91f4'(HostedEngine) moved from 'MigratingFrom' --> 'Down'
2018-04-14 19:40:37,223+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] Handing over VM '5eeb9463-2eb7-488e-b7e0-dd08130d91f4'(HostedEngine) to Host 'null'. Setting VM to status 'MigratingTo'
2018-04-14 19:40:37,378+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] VM '5eeb9463-2eb7-488e-b7e0-dd08130d91f4'(HostedEngine) was unexpectedly detected as 'Up' on VDS 'e2c751be-92b5-425a-8728-59fbb5fa5cda'(host_mixed_2) (expected on 'null')
2018-04-14 19:40:39,885+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [] Updated host status from 'Preparing for Maintenance' to 'Maintenance' in database, host 'host_mixed_3'(a6c12094-82b1-433c-b0b5-117d1de7ddc6)
2018-04-14 19:40:39,917+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-7101) [] Clearing cache of pool: 'dd6607aa-3f27-11e8-9d4b-00163e7be006' for problematic entities of VDS: 'host_mixed_3'.


******VDSM host3 vdsm.log.13.xz log**********************
2018-04-14 19:40:37,355+0300 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:683)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
    v.onLibvirtLifecycleEvent(event, detail, None)
AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'

*** VDSM host2 log ******************
2018-04-14 19:35:58,864+0300 ERROR (jsonrpc/2) [virt.vm] (vmId='5eeb9463-2eb7-488e-b7e0-dd08130d91f4') Alias not found for device type lease during migration at destination host (vm:5538)

2018-04-14 19:41:17,056+0300 ERROR (periodic/56) [virt.vmstats] VM metrics collection failed (vmstats:264)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 197, in send_metrics
    data[prefix + '.cpu.usage'] = stat['cpuUsage']
KeyError: 'cpuUsage'

*********Host2 agent.log.2018-04-13 ******
MainThread::INFO::2018-04-14 19:40:31,143::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-04-14 19:40:40,159::states::471::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm unexpectedly running locally, monitoring vm
MainThread::INFO::2018-04-14 19:40:41,127::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStarting) sent? sent
MainThread::INFO::2018-04-14 19:40:41,585::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-04-14 19:40:41,694::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineStop) sent? sent
MainThread::INFO::2018-04-14 19:40:41,713::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStop (score: 3400)
MainThread::INFO::2018-04-14 19:40:50,737::hosted_engine::1007::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2018-04-14 19:40:51,085::hosted_engine::1012::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout: 
MainThread::INFO::2018-04-14 19:40:51,086::hosted_engine::1013::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr: 
MainThread::ERROR::2018-04-14 19:40:51,086::hosted_engine::1021::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost



Version-Release number of selected component (if applicable):
ovirt-hosted_engine-ha-2.2.10-1
ovirt-engine-4.2.3-0.1.el7.noarch
vdsm-4.20.25-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.2.x86_64
qemu-img-rhev-2.10.0-21.el7_5.1.x86_64

How reproducible:
Happened once during Tier2 tests run so far on TestCase9854

Steps to Reproduce(Taken from automation logs):
19:33:16 art/tests/rhevmtests/storage/storage_resize_storage_domain/test_resize_data_domain_luns.py Storage manager: netapp-cluster3.mgmt.lab.eng.tlv2.redhat.com, type: ISCSI
19:33:17 2018-04-14 19:33:17,150 INFO ================================================================================
19:33:17 2018-04-14 19:33:17,150 INFO       Test Setup   1: Creating a new LUN in storage server netapp
19:33:23 2018-04-14 19:33:23,338 INFO       Test Setup   2: Deactivating and activating hosts ['host_mixed_1', 'host_mixed_2', 'host_mixed_3'] after LUN creation
19:33:25 2018-04-14 19:33:25,544 INFO       Test Setup   3: [class] Deactivate host host_mixed_1 if it's not in maintenance with {'host_resource': 'lynx01.lab.eng.tlv2.redhat.com'}
19:33:25 2018-04-14 19:33:25,615 INFO       Test Setup   4: [class] Deactivate the host host_mixed_1. with {'expected_status': 'maintenance', 'host_resource': 'lynx01.lab.eng.tlv2.redhat.com'}
19:33:36 2018-04-14 19:33:36,166 INFO       Test Setup   5: [class] Activate the host host_mixed_1 if the host host_mixed_1 is not up with {'host_resource': 'lynx01.lab.eng.tlv2.redhat.com'}
19:33:36 2018-04-14 19:33:36,232 INFO       Test Setup   6: [class] Activate host host_mixed_1 (set status to UP) with {'host_resource': 'lynx01.lab.eng.tlv2.redhat.com', 'wait': True}
19:33:56 2018-04-14 19:33:56,095 INFO       Test Setup   7: [class] Deactivate host host_mixed_2 if it's not in maintenance with {'host_resource': 'lynx02.lab.eng.tlv2.redhat.com'}
19:33:56 2018-04-14 19:33:56,167 INFO       Test Setup   8: [class] Deactivate the host host_mixed_2. with {'expected_status': 'maintenance', 'host_resource': 'lynx02.lab.eng.tlv2.redhat.com'}
19:34:21 2018-04-14 19:34:20,923 INFO       Test Setup   9: [class] Activate the host host_mixed_2 if the host host_mixed_2 is not up with {'host_resource': 'lynx02.lab.eng.tlv2.redhat.com'}
19:34:21 2018-04-14 19:34:20,989 INFO       Test Setup  10: [class] Activate host host_mixed_2 (set status to UP) with {'host_resource': 'lynx02.lab.eng.tlv2.redhat.com', 'wait': True}
19:35:43 2018-04-14 19:35:43,252 INFO       Test Setup  11: [class] Deactivate host host_mixed_3 if it's not in maintenance with {'host_resource': 'lynx03.lab.eng.tlv2.redhat.com'}
19:35:43 2018-04-14 19:35:43,325 INFO       Test Setup  12: [class] Deactivate the host host_mixed_3. with {'expected_status': 'maintenance', 'host_resource': 'lynx03.lab.eng.tlv2.redhat.com'}
19:40:56 2018-04-14 19:40:55,982 INFO       Test Setup  13: [class] Activate the host host_mixed_3 if the host host_mixed_3 is not up with {'host_resource': 'lynx03.lab.eng.tlv2.redhat.com'}
19:41:14 2018-04-14 19:41:14,181 INFO 1741: storage/rhevmtests.storage.storage_resize_storage_domain.test_resize_data_domain_luns.TestCase9854.test_basic_lun_extend[iscsi]
19:41:14 2018-04-14 19:41:14,182 INFO STORAGE: ISCSI
19:41:14 2018-04-14 19:41:14,182 ERROR  NOTE: Test failed on setup phase!
19:41:14 2018-04-14 19:41:14,182 ERROR Result: FAILED

Actual results:
Hosted engine VM migrate from host_mixed_3 to host_mixed_2 but it unexpectedly goes down -> Engine restarts

Expected results:
Hosted engine VM should not go down

Additional info:

Comment 1 Michal Skrivanek 2018-04-16 08:21:03 UTC
the source host log is from wrong period. Please include correct log from host_2.

Comment 2 Martin Sivák 2018-04-16 08:30:06 UTC
MainThread::INFO::2018-04-14 19:40:41,127::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStarting) sent? sent
MainThread::INFO::2018-04-14 19:40:41,585::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-04-14 19:40:41,694::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineStop) sent? sent

This is really weird as the only reason for transition to EngineStop is timeout and that is definitely longer than 500ms or so.

Comment 3 Avihai 2018-04-16 08:43:38 UTC
(In reply to Michal Skrivanek from comment #1)
> the source host log is from wrong period. Please include correct log from
> host_2.

I don't follow.
The issue was seen at 2018-04-14 19:35:58 .
The logs in host2 directory have the right period of time, see below details.

1) vdsm.log.14.xz:
- start time = 2018-04-14 19:01:01
- end time = 2018-04-14 20:01:05


2)agent.log.2018-04-13:
- start time = 2018-04-13 19:21:45
- end time = 2018-04-14 23:03:26

If I am missing something please specify the specific log name & what time is missing.

Comment 4 Michal Skrivanek 2018-04-16 08:45:15 UTC
(In reply to Avihai from comment #3)
> (In reply to Michal Skrivanek from comment #1)
> > the source host log is from wrong period. Please include correct log from
> > host_2.
> 
> I don't follow.
> The issue was seen at 2018-04-14 19:35:58 .
> The logs in host2 directory have the right period of time, see below details.

sorry, I meant host_3. that's covering one hour earlier only

Comment 5 Avihai 2018-04-16 10:05:54 UTC
(In reply to Michal Skrivanek from comment #4)
> (In reply to Avihai from comment #3)
> > (In reply to Michal Skrivanek from comment #1)
> > > the source host log is from wrong period. Please include correct log from
> > > host_2.
> > 
> > I don't follow.
> > The issue was seen at 2018-04-14 19:35:58 .
> > The logs in host2 directory have the right period of time, see below details.
> 
> sorry, I meant host_3. that's covering one hour earlier only

Got it , indeed host3/vdsm.log.14.xz is one hour earlier .
I attached correct host3 vdsm.log.13.xz log .

Comment 6 Avihai 2018-04-16 10:06:31 UTC
Created attachment 1422394 [details]
host3 vdsm.log.13.xz

Comment 7 Michal Skrivanek 2018-04-16 10:33:10 UTC
nothing wrong from virt perspective - migration started at 19:36, finished successfully at 19:40:36, and the VM was shut down by API call at 19:40:51

Comment 8 Simone Tiraboschi 2018-04-16 16:33:29 UTC
(In reply to Michal Skrivanek from comment #7)
> nothing wrong from virt perspective - migration started at 19:36, finished
> successfully at 19:40:36, and the VM was shut down by API call at 19:40:51

So should I close it as NOTABUG?

Comment 9 Raz Tamir 2018-04-16 17:41:24 UTC
Hi Simone,
Please don't close bugs if it was reported by QE.

We might be able to provide more information if needed

Comment 10 Michal Skrivanek 2018-04-16 18:52:55 UTC
(In reply to Simone Tiraboschi from comment #8)
> (In reply to Michal Skrivanek from comment #7)
> > nothing wrong from virt perspective - migration started at 19:36, finished
> > successfully at 19:40:36, and the VM was shut down by API call at 19:40:51
> 
> So should I close it as NOTABUG?

No, I thonk Martin needs to review hosted-engine-agent behavior a bit more deeply. IIUC it shouldn’t have issued the shutdown. Martin?

Comment 11 Martin Sivák 2018-04-18 12:07:21 UTC
We shouldn't have, the timeout is supposed to be 10 minutes. We haven't changed the code there for ages (it looks the same as in 4.0).

Comment 12 Andrej Krejcir 2018-04-20 12:58:02 UTC
Looking at the agent logs, the timeout is set a day before the error, here:

MainThread::INFO::2018-04-13 23:10:51,374::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Fri Apr 13 23:20:11 2018 while transitioning <class 'ovirt_hosted_
engine_ha.agent.states.EngineUnexpectedlyDown'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>

And it is never cleared, because the state is moved to LocalMaintenance, which is a bug.

Then later at 2018-04-14 19:40:41,585, when the EngineStarting state checks the timeout, 
it is still set to 2018-04-13 23:20:51, so the state immediately moves to EngineStop and destroys the VM.

Comment 13 Nikolai Sednev 2018-04-30 15:03:00 UTC
I think that the issue arose from https://bugzilla.redhat.com/show_bug.cgi?id=1399766#c44.
There must be minimum delay before setting ha-hosts in to maintenance and activating/deactivating them, so there will be enough time for ha-agent to properly migrate HE-VM to another ha-host and back if required.

Comment 14 Nikolai Sednev 2018-05-02 09:17:38 UTC
Had not been reproduced on latest components:
ovirt-engine-4.2.3.3-0.1.el7.noarch
rhvm-appliance-4.2-20180427.0.el7.noarch
ovirt-hosted-engine-setup-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch
Linux 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

1.I deployed node 0 on pair ha-hosts over gluster and iS
CSI.
2.I moved ha-host 1 that was hosting HE-VM and was SPM host in to local maintenance from UI.
3.HE-VM got migrated to ha-host 2.
4.I activated ha-host 1.
5.I set ha-host 2 in to local maintenance.
6.HE-VM got migrated to ha-host 1.
7.I activated ha-host 2.
8.I repeated steps 2-7 10 iterations and did not seen any issues.

I suspect that during automation runs, there was insufficient delay between all the steps that I've performed, please make sure that in CLI you can clearly see that ha-score on ha-host had been restored from 0 to 3400, before you try to set in to local maintenace ha-host that is hosting HE-VM, otherwise it won't migrate to anywhere, as there will be no destination ha-host with positive ha-score.
I did not see HE-VM being restarted during my tests.

Moving to verified.
Feel free to reopen if being reproduced and attach all relevant reproduction steps and logs.

Comment 15 Sandro Bonazzola 2018-05-10 06:30:52 UTC
This bugzilla is included in oVirt 4.2.3 release, published on May 4th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.