Created attachment 1247816 [details]
agent and engine logs

Description of problem:
Migrating the HE VM via the engine drops the source host to the status 'EngineUnexpectedlyDown'.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.1.0.1-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy HE on two hosts
2. Add a master storage domain to the engine (to initiate the auto-import process)
3. Choose the HE VM in the engine and migrate it to the second host (a Python SDK sketch of this step follows the logs below)

Actual results:
The VM migrates, but the source host has HE state 'EngineUnexpectedlyDown' (score 0).

Expected results:
The VM migrates, and the source host has state EngineDown and score 3400.

Additional info:
If I remember correctly, migration of the HE VM via the engine was blocked in recent versions, so this is not a regression.

You can start looking at the engine log from:
2017-02-05 08:02:21,883-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-26) [76bce85b-a3a3-44bc-b289-7f77b9f8655c] EVENT_ID: VM_MIGRATION_START(62), Correlation ID: 76bce85b-a3a3-44bc-b289-7f77b9f8655c, Job ID: 9930bb37-be69-4a3e-892c-54fb9e1481bb, Call Stack: null, Custom Event ID: -1, Message: Migration started (VM: HostedEngine, Source: host_mixed_1, Destination: host_mixed_3, User: admin@internal-authz).

and the agent.log from:
MainThread::DEBUG::2017-02-05 15:04:47,732::state_machine::162::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Cleared retry status
MainThread::INFO::2017-02-05 15:04:47,737::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
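For reference, step 3 can also be driven through the Python SDK installed on the hosts (ovirt-engine-sdk-python 3.x appears in the package lists below). This is only a minimal sketch: the engine URL, credentials, and destination host name are placeholders for this environment, and exact call names may differ between SDK versions.

    # Minimal sketch: migrate the HostedEngine VM via the engine using oVirt SDK 3.x.
    # URL, credentials and host name are placeholders, not values from the logs.
    from ovirtsdk.api import API
    from ovirtsdk.xml import params

    api = API(url='https://ENGINE_FQDN/ovirt-engine/api',
              username='admin@internal',
              password='PASSWORD',
              insecure=True)  # lab setup; use a CA file in production
    try:
        vm = api.vms.get(name='HostedEngine')
        # Ask the engine to migrate the HE VM to the second host.
        vm.migrate(params.Action(host=params.Host(name='host_mixed_3')))
    finally:
        api.disconnect()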
Neither Artyom nor I am able to reproduce that bug anymore on our testing environments.
Fully reproduced on a clean, fresh environment with these components.

Hosts:
rhvm-appliance-4.1.20170221.0-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.4.x86_64
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.4-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-host-deploy-1.6.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.4-1.el7ev.noarch
vdsm-4.19.7-1.el7ev.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-doc-4.1.0-2.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhev-guest-tools-iso-4.1-4.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-4.1.1.3-0.1.el7.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
Linux version 3.10.0-514.6.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Please see the attached logs from both hosts and the engine, and the screencast of the reproduction.
Created attachment 1261491 [details] sosreport-nsednev-he-4.scl.lab.tlv.redhat.com-20170309110107.tar.xz
Created attachment 1261492 [details] sosreport-puma18.scl.lab.tlv.redhat.com-20170309110039.tar.xz
Created attachment 1261494 [details] sosreport-puma19.scl.lab.tlv.redhat.com-20170309110051.tar.xz
Created attachment 1261495 [details] screencast-2017-03-09_10.58.20.mkv
Checked on ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch.

I can still hit this problem when I migrate the HE VM more than once from host to host. I believe this is because the fix does not cover the "EngineStarting" state.

MainThread::INFO::2017-03-23 18:30:49,787::config::436::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Got vm.conf from OVF_STORE
MainThread::INFO::2017-03-23 18:30:49,791::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
MainThread::INFO::2017-03-23 18:30:49,795::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1490286649.8 type=state_transition detail=EngineStarting-EngineUnexpectedlyDown hostname='rose05.qa.lab.tlv.redhat.com'
MainThread::INFO::2017-03-23 18:30:49,851::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUnexpectedlyDown) sent? sent

I do not see why a user would need to migrate the HE VM from host to host within such a short time interval, but I would prefer that we fix all the corner cases.
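To illustrate the corner case above: while the agent is still in EngineStarting, a second migration takes the VM away again, and if the "the VM left because it was migrated" condition is not checked in that state, the agent falls through to EngineUnexpectedlyDown. The following is only a simplified, hypothetical sketch of that idea; the function and field names are illustrative assumptions, not the actual ovirt_hosted_engine_ha.agent.states code.

    # Hypothetical sketch of the per-state check being discussed; not real agent code.
    def next_state(current_state, vm_on_this_host, migrated_away):
        """Pick the next HA state when the engine VM is no longer running locally."""
        if vm_on_this_host:
            return "EngineUp"
        if migrated_away:
            # Expected: the engine moved the VM elsewhere, keep the score.
            return "EngineDown"
        # The corner case from this comment: if the migration condition is only
        # handled in EngineUp and not in EngineStarting, a quick second migration
        # lands here and the host is penalized with score 0.
        if current_state in ("EngineUp", "EngineStarting"):
            return "EngineUnexpectedlyDown"
        return current_state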
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Created attachment 1265835 [details] agent log
I see that the latest build, BUILD VERSION: 4.1.2-1, still includes the old version of the ovirt-hosted-engine-ha package (ovirt-hosted-engine-ha-2.1.0.5-2.el7ev.noarch). Can you please move it back to MODIFIED?
I also see the same on a cleanly deployed environment with ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch.

Hosts:
vdsm-4.19.11-1.el7ev.x86_64
ovirt-host-deploy-1.6.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Engine:
rhevm-doc-4.1.2-1.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-branding-rhev-4.1.0-1.el7ev.noarch
rhevm-setup-plugins-4.1.1-1.el7ev.noarch
rhevm-4.1.2-0.1.el7.noarch
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Logs from alma04: https://drive.google.com/open?id=0B85BEaDBcF88YzNOR3ZTS0FZc28
Logs from alma03: https://drive.google.com/open?id=0B85BEaDBcF88NmR4U1VDNm5BbHc
The issue is still reproduced, even with ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch.

After waiting about 15 minutes, I got a stabilized environment:

alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma03
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 3834cc35
local_conf_timestamp : 2012
Host timestamp : 1996
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1996 (Sun Apr 30 11:29:07 2017)
    host-id=1
    score=3400
    vm_conf_refresh_time=2012 (Sun Apr 30 11:29:23 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineUp
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma04
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 5fa1a806
local_conf_timestamp : 2026
Host timestamp : 2010
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=2010 (Sun Apr 30 11:29:18 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=2026 (Sun Apr 30 11:29:33 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineDown
    stopped=False

Several HE VM migrations between the hosts reproduced the issue: the migration source host drops its score to zero, as shown below:

MainThread::INFO::2017-04-30 11:45:15,960::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state ReinitializeFSM (score: 0)

alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma03
Host ID : 1
Engine status : {"health": "good", "vm": "up", "detail": "migration source"}
Score : 0
stopped : False
Local maintenance : False
crc32 : bb4b2267
local_conf_timestamp : 2962
Host timestamp : 2941
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=2941 (Sun Apr 30 11:44:52 2017)
    host-id=1
    score=0
    vm_conf_refresh_time=2962 (Sun Apr 30 11:45:13 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=ReinitializeFSM
    stopped=False

--== Host 2 status ==--

conf_on_shared_storage : True
Status up-to-date : True
Hostname : alma04
Host ID : 2
Engine status : {"health": "good", "vm": "up", "detail": "up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : cf395267
local_conf_timestamp : 2974
Host timestamp : 2957
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=2957 (Sun Apr 30 11:45:05 2017)
    host-id=2
    score=3400
    vm_conf_refresh_time=2974 (Sun Apr 30 11:45:21 2017)
    conf_on_shared_storage=True
    maintenance=False
    state=EngineStarting
    stopped=False

Components on both hosts:
vdsm-4.19.11-1.el7ev.x86_64
ovirt-host-deploy-1.6.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
sanlock-3.4.0-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-hosted-engine-setup-2.1.0.5-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch
ovirt-setup-lib-1.1.0-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
libvirt-client-2.0.0-10.el7_3.5.x86_64
Linux version 3.10.0-514.16.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Fri Mar 10 13:12:32 EST 2017
Linux 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.3 (Maipo)
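For anyone re-running this scenario, the score recovery on the source host can be checked without watching `hosted-engine --vm-status` by hand. A minimal sketch follows; the 3400 threshold and the ~15-minute wait mirror the values reported above, and the regex is a naive assumption about the plain-text output format shown in this comment.

    #!/usr/bin/env python
    # Minimal sketch: poll `hosted-engine --vm-status` after a migration and report
    # when every host advertises the full score (3400) again.
    import re
    import subprocess
    import time

    EXPECTED_SCORE = 3400   # full score expected once the hosts stabilize
    POLL_SECONDS = 30
    MAX_POLLS = 30          # roughly 15 minutes, matching the wait reported above

    for attempt in range(MAX_POLLS):
        out = subprocess.check_output(['hosted-engine', '--vm-status']).decode()
        scores = [int(s) for s in re.findall(r'^Score\s*:\s*(\d+)', out, re.MULTILINE)]
        print('attempt %d: scores=%s' % (attempt, scores))
        if scores and all(s == EXPECTED_SCORE for s in scores):
            print('all hosts are back to the full score')
            break
        time.sleep(POLL_SECONDS)
    else:
        print('at least one host did not recover its score')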
alma04 logs: https://drive.google.com/open?id=0B85BEaDBcF88Sm5oUVVNZGc2bVk
alma03 logs: https://drive.google.com/open?id=0B85BEaDBcF88UnlESl8yNDQzams
I did not observe the source host getting the status 'EngineUnexpectedlyDown' during the reproduction, but I clearly saw that the source host dropped its score to zero, as shown in https://bugzilla.redhat.com/show_bug.cgi?id=1419326#c14.
Verified on ovirt-hosted-engine-ha-2.1.0.6-1.el7ev.noarch.

But be aware of bug https://bugzilla.redhat.com/show_bug.cgi?id=1448699.