Bug 1419326 - Migration of the HE VM via engine will drop source host to the status 'EngineUnexpectedlyDown'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.1.0.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.1.2
Target Release: 2.1.0.6
Assignee: Andrej Krejcir
QA Contact: Artyom
URL:
Whiteboard:
Depends On:
Blocks: 1399766 1411319 1436002 1446586
 
Reported: 2017-02-05 13:21 UTC by Artyom
Modified: 2017-05-23 08:17 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A migration of the HE VM can be triggered through the engine. The HA agent on each host periodically wakes up to check the state of the migration, but if the whole migration finishes within the interval while the agent sleeps, the agent misses it.
Consequence: The agent on the host from which the VM was migrated lost the VM without observing the migration, while at the same time the shared storage had not yet been updated by the host that received the VM. The source agent thinks the VM is down, so it switches to the EngineUnexpectedlyDown state.
Fix: Instead of switching to EngineUnexpectedlyDown, the agent goes to a new state, EngineMaybeAway, where it waits a while for the VM to appear on a different host.
Result: Missing a migration is now handled correctly.
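The race described in the Doc Text can be sketched as a tiny state machine. This is a hypothetical simplification, not the actual ovirt-hosted-engine-ha code: the function name, the 3-poll grace budget, and the observation tuples are invented for illustration.

```python
# Hypothetical sketch of the polling race: the agent samples VM state once
# per monitoring interval, so a migration that starts and finishes between
# two samples is invisible to the source host's agent.

from enum import Enum

class State(Enum):
    ENGINE_UP = "EngineUp"
    ENGINE_MAYBE_AWAY = "EngineMaybeAway"
    ENGINE_UNEXPECTEDLY_DOWN = "EngineUnexpectedlyDown"
    ENGINE_DOWN = "EngineDown"

def next_state(vm_running_locally, vm_seen_on_other_host, state, grace_polls_left):
    """One agent wake-up. Returns (new_state, grace_polls_left).

    Old behavior: EngineUp + VM gone -> EngineUnexpectedlyDown (score 0).
    Fixed behavior: EngineUp + VM gone -> EngineMaybeAway, which waits a few
    polls for the VM to show up on another host before penalizing.
    """
    if state == State.ENGINE_UP:
        if vm_running_locally:
            return State.ENGINE_UP, grace_polls_left
        return State.ENGINE_MAYBE_AWAY, 3  # hypothetical grace budget
    if state == State.ENGINE_MAYBE_AWAY:
        if vm_seen_on_other_host:
            return State.ENGINE_DOWN, 0  # migration observed late: benign
        if grace_polls_left > 0:
            return State.ENGINE_MAYBE_AWAY, grace_polls_left - 1
        return State.ENGINE_UNEXPECTEDLY_DOWN, 0  # VM really vanished
    return state, grace_polls_left

# Simulate the bug scenario: the VM disappears locally, and the
# shared-storage metadata from the destination host catches up two polls
# later. Each tuple is (vm_running_locally, vm_seen_on_other_host).
state, grace = State.ENGINE_UP, 0
observations = [(True, False), (False, False), (False, False), (False, True)]
for local, remote in observations:
    state, grace = next_state(local, remote, state, grace)
print(state.value)  # -> EngineDown, not EngineUnexpectedlyDown
```

With the old behavior (no grace state) the second observation would have dropped the source host straight to EngineUnexpectedlyDown and score 0, which is exactly the symptom reported here.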
Clone Of:
Environment:
Last Closed: 2017-05-23 08:17:36 UTC
oVirt Team: SLA
rule-engine: ovirt-4.1+
rule-engine: exception+


Attachments
agent and engine logs (6.70 MB, application/zip)
2017-02-05 13:21 UTC, Artyom
no flags Details
sosreport-nsednev-he-4.scl.lab.tlv.redhat.com-20170309110107.tar.xz (8.82 MB, application/x-xz)
2017-03-09 09:24 UTC, Nikolai Sednev
no flags Details
sosreport-puma18.scl.lab.tlv.redhat.com-20170309110039.tar.xz (14.13 MB, application/x-xz)
2017-03-09 09:26 UTC, Nikolai Sednev
no flags Details
sosreport-puma19.scl.lab.tlv.redhat.com-20170309110051.tar.xz (14.43 MB, application/x-xz)
2017-03-09 09:27 UTC, Nikolai Sednev
no flags Details
screencast-2017-03-09_10.58.20.mkv (14.38 MB, application/octet-stream)
2017-03-09 09:29 UTC, Nikolai Sednev
no flags Details
agent log (1.56 MB, text/plain)
2017-03-23 16:46 UTC, Artyom
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 73863 0 master MERGED Add EngineMaybeAway state to wait before EngineUnexpectedlyDown state 2020-12-14 22:04:29 UTC
oVirt gerrit 74153 0 v2.1.z MERGED Add EngineMaybeAway state to wait before EngineUnexpectedlyDown state 2020-12-14 22:04:27 UTC
oVirt gerrit 74656 0 master MERGED Remove wait time when transitioning to EngineStarting 2020-12-14 22:04:58 UTC
oVirt gerrit 74870 0 v2.1.z MERGED Remove wait time when transitioning to EngineStarting 2020-12-14 22:04:29 UTC

Description Artyom 2017-02-05 13:21:08 UTC
Created attachment 1247816 [details]
agent and engine logs

Description of problem:
Migration of the HE VM via engine will drop source host to the status 'EngineUnexpectedlyDown'

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.1.0.1-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE on two hosts
2. Add master storage domain to the engine(to initiate auto-import process)
3. Choose HE VM via the engine and migrate it to the second host

Actual results:
VM migrated, but the source host has HE state 'EngineUnexpectedlyDown'(score 0)

Expected results:
VM migrated, and the source host has state EngineDown and score 3400
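The expected end state can be checked mechanically. Below is a hypothetical helper, assuming the `hosted-engine --vm-status` output format shown later in this bug (comment 14); the function name and the embedded sample are invented for illustration.

```python
# Hypothetical verification helper: parse `hosted-engine --vm-status`
# output and confirm the source host settled in EngineDown with the full
# score of 3400 after the migration.

import re

def host_states(vm_status_text):
    """Map hostname -> (state, score), taken from the extra-metadata blocks."""
    hosts = {}
    for block in re.split(r"--== Host \d+ status ==--", vm_status_text)[1:]:
        hostname = re.search(r"Hostname\s+:\s+(\S+)", block).group(1)
        score = int(re.search(r"score=(\d+)", block).group(1))
        state = re.search(r"state=(\w+)", block).group(1)
        hosts[hostname] = (state, score)
    return hosts

# Trimmed sample in the same shape as the real command output.
sample = """
--== Host 1 status ==--
Hostname                           : alma03
Extra metadata (valid at timestamp):
        score=3400
        state=EngineDown

--== Host 2 status ==--
Hostname                           : alma04
Extra metadata (valid at timestamp):
        score=3400
        state=EngineUp
"""

print(host_states(sample))
# -> {'alma03': ('EngineDown', 3400), 'alma04': ('EngineUp', 3400)}
```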

Additional info:
If I remember correctly, we only blocked migration of the HE VM via the engine in recent versions, so this is not a regression.

You can start looking at the engine log from:
2017-02-05 08:02:21,883-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-26) [76bce85b-a3a3-44bc-b289-7f77b9f8655c] EVENT_ID: VM_MIGRATION_START(62), Correlation ID: 76bce85b-a3a3-44bc-b289-7f77b9f8655c, Job ID: 9930bb37-be69-4a3e-892c-54fb9e1481bb, Call Stack: null, Custom Event ID: -1, Message: Migration started (VM: HostedEngine, Source: host_mixed_1, Destination: host_mixed_3, User: admin@internal-authz).

and agent.log from:
MainThread::DEBUG::2017-02-05 15:04:47,732::state_machine::162::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Cleared retry status
MainThread::INFO::2017-02-05 15:04:47,737::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>

Comment 1 Denis Chaplygin 2017-02-13 09:00:16 UTC
Neither I nor Artyom can reproduce this bug any more on our testing environments.

Comment 2 Nikolai Sednev 2017-03-09 09:23:26 UTC
Fully reproduced on a clean and fresh environment with these components on the hosts:
rhvm-appliance-4.1.20170221.0-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.4-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-host-deploy-1.6.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.4-1.el7ev.noarch
vdsm-4.19.7-1.el7ev.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild@x86-030.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-doc-4.1.0-2.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhev-guest-tools-iso-4.1-4.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-4.1.1.3-0.1.el7.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild@x86-030.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Please see the attached logs from both hosts and the engine, and the screencast of the reproduction.

Comment 3 Nikolai Sednev 2017-03-09 09:24:26 UTC
Created attachment 1261491 [details]
sosreport-nsednev-he-4.scl.lab.tlv.redhat.com-20170309110107.tar.xz

Comment 4 Nikolai Sednev 2017-03-09 09:26:02 UTC
Created attachment 1261492 [details]
sosreport-puma18.scl.lab.tlv.redhat.com-20170309110039.tar.xz

Comment 5 Nikolai Sednev 2017-03-09 09:27:39 UTC
Created attachment 1261494 [details]
sosreport-puma19.scl.lab.tlv.redhat.com-20170309110051.tar.xz

Comment 6 Nikolai Sednev 2017-03-09 09:29:11 UTC
Created attachment 1261495 [details]
screencast-2017-03-09_10.58.20.mkv

Comment 7 Artyom 2017-03-23 16:45:41 UTC
Checked on ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch

I can still encounter this problem when I migrate the HE VM more than once from host to host.
I believe this is because the fix does not cover the state "EngineStarting".
MainThread::INFO::2017-03-23 18:30:49,787::config::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Got vm.conf from OVF_STORE
MainThread::INFO::2017-03-23 18:30:49,791::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
MainThread::INFO::2017-03-23 18:30:49,795::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1490286649.8 type=state_transition detail=EngineStarting-EngineUnexpectedlyDown hostname='rose05.qa.lab.tlv.redhat.com'
MainThread::INFO::2017-03-23 18:30:49,851::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUnexpectedlyDown) sent? sent

I do not see why a user would need to migrate the HE VM from host to host within a short time interval, but I would prefer that we fix all corner cases.
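Transitions like the EngineStarting -> EngineUnexpectedlyDown one quoted above can be pulled out of a large agent.log mechanically. This is a hypothetical one-off filter (the regex and function name are invented for illustration, not a shipped tool):

```python
# Hypothetical filter for agent.log: extract the state transitions so races
# like EngineStarting -> EngineUnexpectedlyDown are easy to spot without
# scrolling through the full log.

import re

TRANSITION = re.compile(
    r"Timeout cleared while transitioning .*states\.(\w+)'> -> "
    r".*states\.(\w+)'>"
)

def transitions(log_lines):
    """Return (from_state, to_state) pairs found in the given log lines."""
    out = []
    for line in log_lines:
        m = TRANSITION.search(line)
        if m:
            out.append((m.group(1), m.group(2)))
    return out

# One real-shaped line from this bug's agent.log.
sample = [
    "MainThread::INFO::2017-03-23 18:30:49,791::state_decorators::88::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) "
    "Timeout cleared while transitioning "
    "<class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> "
    "<class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>",
]
print(transitions(sample))  # -> [('EngineStarting', 'EngineUnexpectedlyDown')]
```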

Comment 8 Red Hat Bugzilla Rules Engine 2017-03-23 16:45:48 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED status, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 9 Artyom 2017-03-23 16:46:13 UTC
Created attachment 1265835 [details]
agent log

Comment 10 Artyom 2017-04-27 14:18:30 UTC
I see that the latest build, BUILD VERSION: 4.1.2-1, still includes the old version of the ovirt-hosted-engine-ha package (ovirt-hosted-engine-ha-2.1.0.5-2.el7ev.noarch). Can you please move the bug back to MODIFIED?

Comment 11 Nikolai Sednev 2017-04-27 14:25:34 UTC
I also see the same on a cleanly deployed environment:
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch

Hosts:
vdsm-4.19.11-1.el7ev.x86_64
ovirt-host-deploy-1.6.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild@x86-039.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-doc-4.1.2-1.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhevm-4.1.2-0.1.el7.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild@x86-039.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Comment 14 Nikolai Sednev 2017-04-30 08:51:41 UTC
The issue still reproduces even with ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch:

After waiting about 15 minutes, I got a stabilized environment:
alma03 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 3834cc35
local_conf_timestamp               : 2012
Host timestamp                     : 1996
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1996 (Sun Apr 30 11:29:07 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=2012 (Sun Apr 30 11:29:23 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 5fa1a806
local_conf_timestamp               : 2026
Host timestamp                     : 2010
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2010 (Sun Apr 30 11:29:18 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=2026 (Sun Apr 30 11:29:33 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False


Several HE VM migrations between the hosts reproduced the issue: the migration source host dropped its score to zero, as shown below:

MainThread::INFO::2017-04-30 11:45:15,960::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state ReinitializeFSM (score: 0)


alma03 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "migration source"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : bb4b2267
local_conf_timestamp               : 2962
Host timestamp                     : 2941
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2941 (Sun Apr 30 11:44:52 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=2962 (Sun Apr 30 11:45:13 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=ReinitializeFSM
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : cf395267
local_conf_timestamp               : 2974
Host timestamp                     : 2957
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2957 (Sun Apr 30 11:45:05 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=2974 (Sun Apr 30 11:45:21 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStarting
        stopped=False


Components on both hosts:
vdsm-4.19.11-1.el7ev.x86_64
ovirt-host-deploy-1.6.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild@x86-039.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Comment 16 Nikolai Sednev 2017-04-30 09:00:39 UTC
I did not observe the source host getting the status 'EngineUnexpectedlyDown' during the reproduction, but I clearly saw that the source host dropped its score to zero, as shown in https://bugzilla.redhat.com/show_bug.cgi?id=1419326#c14.

Comment 17 Artyom 2017-05-11 10:16:24 UTC
Verified on ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch
But be aware of the bug - https://bugzilla.redhat.com/show_bug.cgi?id=1448699

