Bug 1419326

Summary: Migration of the HE VM via engine will drop source host to the status 'EngineUnexpectedlyDown'
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Artyom <alukiano>
Component: Agent
Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED CURRENTRELEASE
QA Contact: Artyom <alukiano>
Severity: medium
Priority: medium
Docs Contact:
Version: 2.1.0.1
CC: akrejcir, bugs, dchaplyg, nsednev, sbonazzo, ylavi
Target Milestone: ovirt-4.1.2
Keywords: Reopened, Triaged
Target Release: 2.1.0.6
Flags: rule-engine: ovirt-4.1+, rule-engine: exception+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A migration of the HE VM can be triggered through the engine. The HA agent on the hosts periodically wakes up to check the state of the migration, but if the whole migration finishes within the interval in which the agent sleeps, the agent misses the migration.
Consequence: The agent on the source host lost the VM without having observed the migration, while the shared storage had not yet been updated by the host that received the VM. The source agent therefore considered the VM down and switched to the EngineUnexpectedlyDown state.
Fix: Instead of switching to EngineUnexpectedlyDown, the agent goes to a new state, EngineMaybeAway, where it waits a while for the VM to appear on a different host (a minimal sketch of this behavior follows the attachment list below).
Result: Missing a migration is now handled correctly.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-23 08:17:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1399766, 1411319, 1436002, 1446586    
Attachments:
agent and engine logs (flags: none)
sosreport-nsednev-he-4.scl.lab.tlv.redhat.com-20170309110107.tar.xz (flags: none)
sosreport-puma18.scl.lab.tlv.redhat.com-20170309110039.tar.xz (flags: none)
sosreport-puma19.scl.lab.tlv.redhat.com-20170309110051.tar.xz (flags: none)
screencast-2017-03-09_10.58.20.mkv (flags: none)
agent log (flags: none)
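
The state-machine change described in the Doc Text can be illustrated with a minimal Python sketch. This is purely illustrative: the names ENGINE_MAYBE_AWAY_TIMEOUT, vm_is_running_locally and vm_seen_on_other_host are assumptions and are not taken from the actual ovirt-hosted-engine-ha source.

import time

# Hypothetical sketch only; NOT the real ovirt-hosted-engine-ha implementation.
ENGINE_MAYBE_AWAY_TIMEOUT = 150  # assumed grace period in seconds

def next_state(vm_is_running_locally, vm_seen_on_other_host, entered_at):
    """Decide the next agent state after the local HE VM disappeared."""
    if vm_is_running_locally:
        return "EngineUp"
    if vm_seen_on_other_host:
        # Another host already reports the VM via shared storage:
        # the migration simply finished while this agent was sleeping.
        return "EngineDown"
    if time.time() - entered_at < ENGINE_MAYBE_AWAY_TIMEOUT:
        # New behavior: wait in EngineMaybeAway instead of immediately
        # dropping the score with EngineUnexpectedlyDown.
        return "EngineMaybeAway"
    # The VM never showed up anywhere: fall back to the old behavior.
    return "EngineUnexpectedlyDown"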

Description Artyom 2017-02-05 13:21:08 UTC
Created attachment 1247816 [details]
agent and engine logs

Description of problem:
Migration of the HE VM via engine will drop source host to the status 'EngineUnexpectedlyDown'

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.1.0.1-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE on two hosts
2. Add a master storage domain to the engine (to initiate the auto-import process)
3. Choose the HE VM in the engine and migrate it to the second host

Actual results:
The VM migrated, but the source host has the HE state 'EngineUnexpectedlyDown' (score 0)

Expected results:
The VM migrated, and the source host has the state 'EngineDown' and a score of 3400

Additional info:
If I remember correctly, migration of the HE VM via the engine was blocked in recent versions, so this is not a regression.

You can start looking at the engine log from:
2017-02-05 08:02:21,883-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-26) [76bce85b-a3a3-44bc-b289-7f77b9f8655c] EVENT_ID: VM_MIGRATION_START(62), Correlation ID: 76bce85b-a3a3-44bc-b289-7f77b9f8655c, Job ID: 9930bb37-be69-4a3e-892c-54fb9e1481bb, Call Stack: null, Custom Event ID: -1, Message: Migration started (VM: HostedEngine, Source: host_mixed_1, Destination: host_mixed_3, User: admin@internal-authz).

and at the agent.log from:
MainThread::DEBUG::2017-02-05 15:04:47,732::state_machine::162::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Cleared retry status
MainThread::INFO::2017-02-05 15:04:47,737::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>

Comment 1 Denis Chaplygin 2017-02-13 09:00:16 UTC
Neither Artyom nor I can reproduce this bug any more in our testing environments.

Comment 2 Nikolai Sednev 2017-03-09 09:23:26 UTC
Fully reproduced on a clean and fresh environment with these components on the hosts:
rhvm-appliance-4.1.20170221.0-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.4-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-host-deploy-1.6.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.4-1.el7ev.noarch
vdsm-4.19.7-1.el7ev.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-doc-4.1.0-2.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhev-guest-tools-iso-4.1-4.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-4.1.1.3-0.1.el7.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Please see the attached logs from both hosts and the engine, and the screencast of the reproduction.

Comment 3 Nikolai Sednev 2017-03-09 09:24:26 UTC
Created attachment 1261491 [details]
sosreport-nsednev-he-4.scl.lab.tlv.redhat.com-20170309110107.tar.xz

Comment 4 Nikolai Sednev 2017-03-09 09:26:02 UTC
Created attachment 1261492 [details]
sosreport-puma18.scl.lab.tlv.redhat.com-20170309110039.tar.xz

Comment 5 Nikolai Sednev 2017-03-09 09:27:39 UTC
Created attachment 1261494 [details]
sosreport-puma19.scl.lab.tlv.redhat.com-20170309110051.tar.xz

Comment 6 Nikolai Sednev 2017-03-09 09:29:11 UTC
Created attachment 1261495 [details]
screencast-2017-03-09_10.58.20.mkv

Comment 7 Artyom 2017-03-23 16:45:41 UTC
Checked on ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch

I can still encounter this problem when I migrate the HE VM more than once from host to host.
I believe this is because the fix does not cover the 'EngineStarting' state.
MainThread::INFO::2017-03-23 18:30:49,787::config::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Got vm.conf from OVF_STORE
MainThread::INFO::2017-03-23 18:30:49,791::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
MainThread::INFO::2017-03-23 18:30:49,795::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1490286649.8 type=state_transition detail=EngineStarting-EngineUnexpectedlyDown hostname='rose05.qa.lab.tlv.redhat.com'
MainThread::INFO::2017-03-23 18:30:49,851::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUnexpectedlyDown) sent? sent

I do not see why a user would need to migrate the HE VM from host to host within a short time interval, but I would prefer that we fix all corner cases.
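
A hedged sketch of what this comment asks for: applying the same "did the VM show up elsewhere?" check when the EngineStarting timeout expires, instead of going straight to EngineUnexpectedlyDown. The function name and dictionary keys below are assumptions for illustration, not the real ovirt-hosted-engine-ha API.

# Hypothetical illustration only; names are assumed, not taken from the
# real ovirt-hosted-engine-ha metadata format.
def on_engine_starting_timeout(local_vm_up, other_hosts_stats):
    """Pick the target state when EngineStarting times out without the
    engine becoming healthy on this host."""
    if local_vm_up:
        # The VM is still running here; keep waiting for the engine.
        return "EngineStarting"
    vm_up_elsewhere = any(
        stats.get("vm") == "up" for stats in other_hosts_stats.values()
    )
    if vm_up_elsewhere:
        # The VM was migrated away during the sleep interval.
        return "EngineDown"
    # Give the other hosts a chance to publish fresh metadata first.
    return "EngineMaybeAway"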

Comment 8 Red Hat Bugzilla Rules Engine 2017-03-23 16:45:48 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 9 Artyom 2017-03-23 16:46:13 UTC
Created attachment 1265835 [details]
agent log

Comment 10 Artyom 2017-04-27 14:18:30 UTC
I see that the last build, BUILD VERSION: 4.1.2-1, still includes the old version of the ovirt-hosted-engine-ha package (ovirt-hosted-engine-ha-2.1.0.5-2.el7ev.noarch). Can you please move it back to MODIFY?

Comment 11 Nikolai Sednev 2017-04-27 14:25:34 UTC
I also see the same on a cleanly deployed environment:
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch

Hosts:
vdsm-4.19.11-1.el7ev.x86_64
ovirt-host-deploy-1.6.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-doc-4.1.2-1.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhevm-4.1.2-0.1.el7.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Comment 14 Nikolai Sednev 2017-04-30 08:51:41 UTC
The issue is still reproduced even with ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch:

After waiting about 15 minutes, I got a stabilized environment:
alma03 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 3834cc35
local_conf_timestamp               : 2012
Host timestamp                     : 1996
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1996 (Sun Apr 30 11:29:07 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=2012 (Sun Apr 30 11:29:23 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 5fa1a806
local_conf_timestamp               : 2026
Host timestamp                     : 2010
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2010 (Sun Apr 30 11:29:18 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=2026 (Sun Apr 30 11:29:33 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False


Several migrations of the HE VM between the hosts reproduced the source host dropping its score to zero, as appears below:

MainThread::INFO::2017-04-30 11:45:15,960::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state ReinitializeFSM (score: 0)


alma03 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "migration source"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : bb4b2267
local_conf_timestamp               : 2962
Host timestamp                     : 2941
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2941 (Sun Apr 30 11:44:52 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=2962 (Sun Apr 30 11:45:13 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=ReinitializeFSM
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : cf395267
local_conf_timestamp               : 2974
Host timestamp                     : 2957
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2957 (Sun Apr 30 11:45:05 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=2974 (Sun Apr 30 11:45:21 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStarting
        stopped=False


Components on both hosts:
vdsm-4.19.11-1.el7ev.x86_64
ovirt-host-deploy-1.6.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Comment 16 Nikolai Sednev 2017-04-30 09:00:39 UTC
I did not observe the source host getting the status 'EngineUnexpectedlyDown' during the reproduction, but I clearly saw that the source host dropped its score to zero, as shown in https://bugzilla.redhat.com/show_bug.cgi?id=1419326#c14.

Comment 17 Artyom 2017-05-11 10:16:24 UTC
Verified on ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch
But be aware of bug https://bugzilla.redhat.com/show_bug.cgi?id=1448699.