Bug 1567615
Summary: Hosted engine - engine restarted during hosted engine VM migration - ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost

Product: [oVirt] ovirt-hosted-engine-ha
Component: Agent
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 2.2.10
Target Milestone: ovirt-4.2.3
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Avihai <aefrat>
Assignee: Andrej Krejcir <akrejcir>
QA Contact: Nikolai Sednev <nsednev>
CC: aefrat, akrejcir, bugs, michal.skrivanek, msivak, ratamir, stirabos, ylavi
Keywords: Automation, Triaged
Flags: ylavi: ovirt-4.2+, ylavi: blocker+
Fixed In Version: ovirt-hosted-engine-ha-2.2.11-1.el7ev
Doc Type: Bug Fix
Doc Text:
Cause:
In the hosted-engine agent, the transition to the global or local maintenance state happened before the state timeout was cleared.
Consequence:
If the agent was in a state that uses a timeout (for example the 'EngineStarting' state) when a maintenance mode was set, the timeout was never cleared. If the maintenance mode stayed active long enough, the timeout deadline passed. As a result, when the maintenance mode was later cleared, the HE VM would not start: because the timeout had already been reached, the agent did not wait for the VM to start and immediately destroyed it.
Fix:
The timeout is now cleared before the transition to the maintenance state.
Result:
The VM can be started after maintenance mode is cleared.
Last Closed: 2018-05-10 06:30:52 UTC
Type: Bug
oVirt Team: Integration
Bug Blocks: 1458711
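The ordering bug described in the Doc Text can be sketched in a few lines. This is a minimal illustration, not the real ovirt-hosted-engine-ha code: the class, method, and state names here are assumed for the example, and a negative timeout is used to simulate a deadline that has already passed.

```python
# Minimal sketch (assumed names, NOT the real ovirt-hosted-engine-ha API)
# of the ordering bug: the agent cleared a state timeout only on normal
# transitions, so entering maintenance left a stale deadline armed.

import time


class Agent:
    def __init__(self):
        self.state = "EngineStarting"
        self.timeout_deadline = None  # set when a timed state is entered

    def set_timeout(self, seconds):
        self.timeout_deadline = time.time() + seconds

    def enter_maintenance_buggy(self):
        # Bug: transition first; the timeout is never cleared.
        self.state = "LocalMaintenance"

    def enter_maintenance_fixed(self):
        # Fix: clear the timeout before transitioning to maintenance.
        self.timeout_deadline = None
        self.state = "LocalMaintenance"

    def timeout_reached(self):
        return (self.timeout_deadline is not None
                and time.time() >= self.timeout_deadline)


# Buggy ordering: a deadline set in a timed state survives maintenance,
# so after maintenance ends the stale timeout fires immediately.
a = Agent()
a.set_timeout(-1)            # simulate a deadline that has already passed
a.enter_maintenance_buggy()
print(a.timeout_reached())   # True: stale deadline still armed

b = Agent()
b.set_timeout(-1)
b.enter_maintenance_fixed()
print(b.timeout_reached())   # False: deadline cleared before transition
```

With the buggy ordering, a VM-start state that re-checks the old deadline concludes the wait has already expired and tears the VM down at once, which matches the behavior reported below.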
the source host log is from the wrong period. Please include the correct log from host_2.

MainThread::INFO::2018-04-14 19:40:41,127::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStarting) sent? sent
MainThread::INFO::2018-04-14 19:40:41,585::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-04-14 19:40:41,694::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineStop) sent? sent

This is really weird, as the only reason for a transition to EngineStop is a timeout, and that is definitely longer than the ~500 ms seen here.

(In reply to Michal Skrivanek from comment #1)
> the source host log is from wrong period. Please include correct log from
> host_2.

I don't follow. The issue was seen at 2018-04-14 19:35:58. The logs in the host2 directory cover the right period of time; see the details below.

1) vdsm.log.14.xz:
   - start time = 2018-04-14 19:01:01
   - end time = 2018-04-14 20:01:05
2) agent.log.2018-04-13:
   - start time = 2018-04-13 19:21:45
   - end time = 2018-04-14 23:03:26

If I am missing something, please specify the specific log name and which time range is missing.

(In reply to Avihai from comment #3)
> I don't follow.
> The issue was seen at 2018-04-14 19:35:58 .
> The logs in host2 directory have the right period of time, see below details.

sorry, I meant host_3. That one covers only one hour earlier.

(In reply to Michal Skrivanek from comment #4)
> sorry, I meant host_3. that's covering one hour earlier only

Got it, indeed host3/vdsm.log.14.xz is one hour earlier. I attached the correct host3 vdsm.log.13.xz log.

Created attachment 1422394 [details]
host3 vdsm.log.13.xz
nothing wrong from the virt perspective - migration started at 19:36, finished successfully at 19:40:36, and the VM was shut down by an API call at 19:40:51

(In reply to Michal Skrivanek from comment #7)
> nothing wrong from virt perspective - migration started at 19:36, finished
> successfully at 19:40:36, and the VM was shut down by API call at 19:40:51

So should I close it as NOTABUG?

Hi Simone, please don't close bugs that were reported by QE. We might be able to provide more information if needed.

(In reply to Simone Tiraboschi from comment #8)
> So should I close it as NOTABUG?

No, I think Martin needs to review hosted-engine-agent behavior a bit more deeply. IIUC it shouldn't have issued the shutdown. Martin?

We shouldn't have; the timeout is supposed to be 10 minutes. We haven't changed the code there for ages (it looks the same as in 4.0).

Looking at the agent logs, the timeout is set a day before the error, here:

MainThread::INFO::2018-04-13 23:10:51,374::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Fri Apr 13 23:20:11 2018 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>

And it is never cleared, because the state is moved to LocalMaintenance, which is a bug. Then later, at 2018-04-14 19:40:41,585, when the EngineStarting state checks the timeout, it is still set to 2018-04-13 23:20:51, so the state immediately moves to EngineStop and destroys the VM. I think the issue arose from https://bugzilla.redhat.com/show_bug.cgi?id=1399766#c44.
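The timestamps quoted in the analysis above can be checked directly: by the time the EngineStarting state re-checked the deadline, it had been expired for roughly 20 hours, so the timeout check fired at once. This is only a worked illustration of that arithmetic, not code from the agent.

```python
# Check, from the timestamps quoted in the agent log, that the stale
# deadline explains the instant EngineStarting -> EngineStop transition.

from datetime import datetime

# "Timeout set to Fri Apr 13 23:20:11 2018" (from the state_decorators log line)
deadline = datetime(2018, 4, 13, 23, 20, 11)
# EngineStarting state check at 2018-04-14 19:40:41
check = datetime(2018, 4, 14, 19, 40, 41)

expired_for = check - deadline
print(expired_for)        # 20:20:30 -> the deadline was long past
print(check >= deadline)  # True: the timeout is treated as already reached
```

Since the deadline predates the check by over 20 hours, the agent never waits for the VM to come up and immediately schedules the stop, matching the ~500 ms EngineStarting-to-EngineStop hop seen in the logs.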
There must be a minimum delay between setting ha-hosts into maintenance and activating/deactivating them, so that the ha-agent has enough time to properly migrate the HE-VM to another ha-host and back if required.

Not reproduced on the latest components:
ovirt-engine-4.2.3.3-0.1.el7.noarch
rhvm-appliance-4.2-20180427.0.el7.noarch
ovirt-hosted-engine-setup-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch
Linux 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

1. I deployed node 0 on a pair of ha-hosts over gluster and iSCSI.
2. I moved ha-host 1, which was hosting the HE-VM and was the SPM host, into local maintenance from the UI.
3. The HE-VM got migrated to ha-host 2.
4. I activated ha-host 1.
5. I set ha-host 2 into local maintenance.
6. The HE-VM got migrated to ha-host 1.
7. I activated ha-host 2.
8. I repeated steps 2-7 for 10 iterations and did not see any issues.

I suspect that during the automation runs there was insufficient delay between the steps I performed. Please make sure that in the CLI you can clearly see the ha-score on the ha-host restored from 0 to 3400 before you set the ha-host that is hosting the HE-VM into local maintenance; otherwise the VM won't migrate anywhere, as there will be no destination ha-host with a positive ha-score. I did not see the HE-VM being restarted during my tests. Moving to verified. Feel free to reopen if it is reproduced, and attach all relevant reproduction steps and logs.

This bugzilla is included in the oVirt 4.2.3 release, published on May 4th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.3 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
Created attachment 1422193 [details]
relevant engine, vdsm logs, agent log

Description of problem:
Automation test TestCase9854 moves each of the 3 hosts, one by one, to maintenance and then back to active again, with host_mixed_3 as the host running the hosted engine VM. When maintenance is called on this host it starts to migrate the hosted engine VM to host_mixed_2, but something goes wrong and the VM/engine goes down on host_mixed_2; see the logs below.

**** Engine log - hosted engine VM goes down & engine restarted ****

2018-04-14 19:40:35,852+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-12) [] START, DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='a6c12094-82b1-433c-b0b5-117d1de7ddc6', vmId='5eeb9463-2eb7-488e-b7e0-dd08130d91f4', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 403c2ff
2018-04-14 19:40:37,222+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-12) [] FINISH, DestroyVDSCommand, log id: 403c2ff
2018-04-14 19:40:37,222+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM '5eeb9463-2eb7-488e-b7e0-dd08130d91f4'(HostedEngine) moved from 'MigratingFrom' --> 'Down'
2018-04-14 19:40:37,223+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] Handing over VM '5eeb9463-2eb7-488e-b7e0-dd08130d91f4'(HostedEngine) to Host 'null'. Setting VM to status 'MigratingTo'
2018-04-14 19:40:37,378+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] VM '5eeb9463-2eb7-488e-b7e0-dd08130d91f4'(HostedEngine) was unexpectedly detected as 'Up' on VDS 'e2c751be-92b5-425a-8728-59fbb5fa5cda'(host_mixed_2) (expected on 'null')
2018-04-14 19:40:39,885+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [] Updated host status from 'Preparing for Maintenance' to 'Maintenance' in database, host 'host_mixed_3'(a6c12094-82b1-433c-b0b5-117d1de7ddc6)
2018-04-14 19:40:39,917+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-7101) [] Clearing cache of pool: 'dd6607aa-3f27-11e8-9d4b-00163e7be006' for problematic entities of VDS: 'host_mixed_3'.

****** VDSM host3 vdsm.log.13.xz log **********************

2018-04-14 19:40:37,355+0300 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:683)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
    v.onLibvirtLifecycleEvent(event, detail, None)
AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'

*** VDSM host2 log ******************

2018-04-14 19:35:58,864+0300 ERROR (jsonrpc/2) [virt.vm] (vmId='5eeb9463-2eb7-488e-b7e0-dd08130d91f4') Alias not found for device type lease during migration at destination host (vm:5538)
2018-04-14 19:41:17,056+0300 ERROR (periodic/56) [virt.vmstats] VM metrics collection failed (vmstats:264)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 197, in send_metrics
    data[prefix + '.cpu.usage'] = stat['cpuUsage']
KeyError: 'cpuUsage'

********* Host2 agent.log.2018-04-13 ******

MainThread::INFO::2018-04-14 19:40:31,143::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-04-14 19:40:40,159::states::471::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm unexpectedly running locally, monitoring vm
MainThread::INFO::2018-04-14 19:40:41,127::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStarting) sent? sent
MainThread::INFO::2018-04-14 19:40:41,585::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-04-14 19:40:41,694::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineStop) sent? sent
MainThread::INFO::2018-04-14 19:40:41,713::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStop (score: 3400)
MainThread::INFO::2018-04-14 19:40:50,737::hosted_engine::1007::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2018-04-14 19:40:51,085::hosted_engine::1012::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout:
MainThread::INFO::2018-04-14 19:40:51,086::hosted_engine::1013::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr:
MainThread::ERROR::2018-04-14 19:40:51,086::hosted_engine::1021::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost

Version-Release number of selected component (if applicable):
ovirt-hosted_engine-ha-2.2.10-1
ovirt-engine-4.2.3-0.1.el7.noarch
vdsm-4.20.25-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.2.x86_64
qemu-img-rhev-2.10.0-21.el7_5.1.x86_64

How reproducible:
Happened once so far during a Tier2 test run on TestCase9854.

Steps to Reproduce (taken from automation logs):
19:33:16 art/tests/rhevmtests/storage/storage_resize_storage_domain/test_resize_data_domain_luns.py Storage manager: netapp-cluster3.mgmt.lab.eng.tlv2.redhat.com, type: ISCSI
19:33:17 2018-04-14 19:33:17,150 INFO ================================================================================
19:33:17 2018-04-14 19:33:17,150 INFO Test Setup 1: Creating a new LUN in storage server netapp
19:33:23 2018-04-14 19:33:23,338 INFO Test Setup 2: Deactivating and activating hosts ['host_mixed_1', 'host_mixed_2', 'host_mixed_3'] after LUN creation
19:33:25 2018-04-14 19:33:25,544 INFO Test Setup 3: [class] Deactivate host host_mixed_1 if it's not in maintenance with {'host_resource': 'lynx01.lab.eng.tlv2.redhat.com'}
19:33:25 2018-04-14 19:33:25,615 INFO Test Setup 4: [class] Deactivate the host host_mixed_1. with {'expected_status': 'maintenance', 'host_resource': 'lynx01.lab.eng.tlv2.redhat.com'}
19:33:36 2018-04-14 19:33:36,166 INFO Test Setup 5: [class] Activate the host host_mixed_1 if the host host_mixed_1 is not up with {'host_resource': 'lynx01.lab.eng.tlv2.redhat.com'}
19:33:36 2018-04-14 19:33:36,232 INFO Test Setup 6: [class] Activate host host_mixed_1 (set status to UP) with {'host_resource': 'lynx01.lab.eng.tlv2.redhat.com', 'wait': True}
19:33:56 2018-04-14 19:33:56,095 INFO Test Setup 7: [class] Deactivate host host_mixed_2 if it's not in maintenance with {'host_resource': 'lynx02.lab.eng.tlv2.redhat.com'}
19:33:56 2018-04-14 19:33:56,167 INFO Test Setup 8: [class] Deactivate the host host_mixed_2. with {'expected_status': 'maintenance', 'host_resource': 'lynx02.lab.eng.tlv2.redhat.com'}
19:34:21 2018-04-14 19:34:20,923 INFO Test Setup 9: [class] Activate the host host_mixed_2 if the host host_mixed_2 is not up with {'host_resource': 'lynx02.lab.eng.tlv2.redhat.com'}
19:34:21 2018-04-14 19:34:20,989 INFO Test Setup 10: [class] Activate host host_mixed_2 (set status to UP) with {'host_resource': 'lynx02.lab.eng.tlv2.redhat.com', 'wait': True}
19:35:43 2018-04-14 19:35:43,252 INFO Test Setup 11: [class] Deactivate host host_mixed_3 if it's not in maintenance with {'host_resource': 'lynx03.lab.eng.tlv2.redhat.com'}
19:35:43 2018-04-14 19:35:43,325 INFO Test Setup 12: [class] Deactivate the host host_mixed_3. with {'expected_status': 'maintenance', 'host_resource': 'lynx03.lab.eng.tlv2.redhat.com'}
19:40:56 2018-04-14 19:40:55,982 INFO Test Setup 13: [class] Activate the host host_mixed_3 if the host host_mixed_3 is not up with {'host_resource': 'lynx03.lab.eng.tlv2.redhat.com'}
19:41:14 2018-04-14 19:41:14,181 INFO 1741: storage/rhevmtests.storage.storage_resize_storage_domain.test_resize_data_domain_luns.TestCase9854.test_basic_lun_extend[iscsi]
19:41:14 2018-04-14 19:41:14,182 INFO STORAGE: ISCSI
19:41:14 2018-04-14 19:41:14,182 ERROR NOTE: Test failed on setup phase!
19:41:14 2018-04-14 19:41:14,182 ERROR Result: FAILED

Actual results:
The hosted engine VM migrates from host_mixed_3 to host_mixed_2 but unexpectedly goes down -> the engine restarts.

Expected results:
The hosted engine VM should not go down.

Additional info: