Description of problem:
MainThread::INFO::2017-07-02 14:00:27,390::states::203::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 50 due to 1 engine vm retry attempts
MainThread::INFO::2017-07-02 14:00:27,392::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 3350)

I clearly see that the destination SPM host is penalized by 50 points for a few moments, due to 1 engine VM retry attempt, during a normal HE-VM migration. The score drops from 3400 to 3350.

Version-Release number of selected component (if applicable):
Host's components:
qemu-kvm-rhev-2.9.0-14.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.3-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.20-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.4-1.el7ev.noarch
libvirt-client-3.2.0-14.el7.x86_64
ovirt-hosted-engine-setup-2.1.3.2-1.el7ev.noarch
sanlock-3.5.0-1.el7.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-663.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-14) (GCC) ) #1 SMP Tue May 2 16:00:29 EDT 2017
Linux 3.10.0-663.el7.x86_64 #1 SMP Tue May 2 16:00:29 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Engine's components:
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhevm-doc-4.1.3-1.el7ev.noarch
rhevm-branding-rhev-4.1.0-2.el7ev.noarch
rhevm-4.1.3.5-0.1.el7.noarch
rhevm-setup-plugins-4.1.2-1.el7ev.noarch
Linux version 3.10.0-514.21.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sun May 28 17:08:21 EDT 2017
Linux 3.10.0-514.21.2.el7.x86_64 #1 SMP Sun May 28 17:08:21 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over NFS on a pair of RHEL 7.4 hosts and add two NFS data storage domains.
2. Assuming that host1 is the SPM, migrate the HE-VM to host2.
3. Once the migration succeeds, wait a few moments until the CLI also shows the migration as successful.
4. Migrate the HE-VM back from host2 to host1.
5. Host1's score is penalized by 50 points during the migration.

Actual results:
Host1's score is penalized by 50 points during the migration.

Expected results:
Migrating the HE-VM should not lower the positive score on the destination SPM host.

Additional info:
Logs from both hosts and the engine are attached, together with the screencast.
Forgot to mention that the score is raised back to normal on the destination SPM host after some time.
Created attachment 1293603: sosreport from the engine
Created attachment 1293604: sosreport from host1 (the SPM host)
Created attachment 1293605: sosreport from host2
Screencast is available from here: https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88SVBPdk16TWRTanc/view?usp=sharing
Seems like a temporary sync delay/issue (maybe due to the heavier duties of the SPM host).
So the explanation may be that the score is decreased by 50 after the host fails to start the HE VM; once it gets the lock, the VM starts and the migration is eventually successful.

See in the SPM host log:

MainThread::INFO::2017-07-02 13:51:23,880::hosted_engine::1119::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2017-07-02 13:51:29,005::hosted_engine::1125::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout:
MainThread::INFO::2017-07-02 13:51:29,006::hosted_engine::1126::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: Virtual machine does not exist
Virtual machine already exists

MainThread::INFO::2017-07-02 13:51:29,006::hosted_engine::1148::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Failed to start engine VM: 'Virtual machine does not exist
Virtual machine already exists
'. Please check the vdsm logs. The possible reason: the engine has been already started on a different host so this one has failed to acquire the lock and it will sync in a while. For more information please visit: http://www.ovirt.org/Hosted_Engine_Howto#EngineUnexpectedlyDown
MainThread::INFO::2017-07-02 13:51:29,010::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1498992689.01 type=state_transition detail=EngineStart-EngineDown hostname='puma18.scl.lab.tlv.redhat.com'
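To correlate the failed start with the penalty across the two hosts' agent logs, something like the following could help. It is only a rough sketch: the field layout (thread::LEVEL::timestamp::module::lineno::logger::(function) message) is inferred from the lines quoted in this bug, and the names parse_line and find_events are made up for illustration, they are not part of ovirt-hosted-engine-ha.

```python
import re

# Layout inferred from the log lines quoted in this bug (an assumption):
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>\w+)::(?P<lineno>\d+)::"
    r"(?P<logger>[\w.]+)::\((?P<func>\w+)\) ?(?P<msg>.*)$"
)

# Message text copied from the "Penalizing score" line in the description.
PENALTY_RE = re.compile(
    r"Penalizing score by (\d+) due to (\d+) engine vm retry attempts"
)


def parse_line(line):
    """Return the fields of one agent log line, or None for continuation lines."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def find_events(lines):
    """Collect (timestamp, kind, penalty, retries) for penalty and failed-start lines."""
    events = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # e.g. the second line of a multi-line stderr dump
        penalty = PENALTY_RE.search(record["msg"])
        if penalty:
            events.append(
                (record["ts"], "penalty", int(penalty.group(1)), int(penalty.group(2)))
            )
        elif record["msg"].startswith("Failed to start engine VM"):
            events.append((record["ts"], "start_failed", None, None))
    return events
```

Feeding it the lines of the agent log from the SPM host's sosreport should list each failed start next to the penalty that follows it, which makes the timing easier to compare with the migration.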
(In reply to Yanir Quinn from comment #6)
> Seems like a temporary sync delay/issue (maybe due to more heavy duty tasks
> of the SPM host)
> So the explanation can be that the score is decreased by 50 after failing to
> start the HE VM and after getting the lock it starts and migration is
> eventually successful.
>
> see in the spm host log :
>
> MainThread::INFO::2017-07-02
> 13:51:23,880::hosted_engine::1119::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using
> `/usr/sbin/hosted-engine --vm-start`
> MainThread::INFO::2017-07-02
> 13:51:29,005::hosted_engine::1125::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(_start_engine_vm) stdout:
> MainThread::INFO::2017-07-02
> 13:51:29,006::hosted_engine::1126::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(_start_engine_vm) stderr: Virtual machine does
> not exist
> Virtual machine already exists
>
> MainThread::INFO::2017-07-02
> 13:51:29,006::hosted_engine::1148::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(_start_engine_vm) Failed to start engine VM:
> 'Virtual machine does not exist
> Virtual machine already exists
> '. Please check the vdsm logs. The possible reason: the engine has been
> already started on a different host so this one has failed to acquire the
> lock and it will sync in a while. For more information please visit:
> http://www.ovirt.org/Hosted_Engine_Howto#EngineUnexpectedlyDown
> MainThread::INFO::2017-07-02
> 13:51:29,010::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.
> BrokerLink::(notify) Trying: notify time=1498992689.01 type=state_transition
> detail=EngineStart-EngineDown hostname='puma18.scl.lab.tlv.redhat.com'

Which heavy tasks? This is a pair of hosts with a single SHE-VM... The engine was not running any load/performance or stress tests, only a very basic migration from the SPM to the non-SPM host and then back.
This is not a new feature. Maybe it is not properly documented, but it has been part of the code since the beginning. See this file from version 1.0.0 (Mar 2014): https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/agent/hosted_engine.py;h=8ab210808780c7289a33135083a1ea2cb609039f;hb=85fde3305ea11ebd367f63dfed7911ffcd265d74#l594
Is this on track for 4.1.5?
The 50-point penalty is by design (and will be better documented). For now I'm closing this issue. If there is a specific problem with starting a VM on the SPM host, which may be a real issue, please open a separate bz with the relevant information.
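For anyone reading this later, the arithmetic seen in the logs can be sketched as below. This is an illustration only, not the agent's code: the base of 3400 and the 50-point step come from this report, while the linear scaling with the retry count and the clamp at zero are assumptions (the logs here only show the 1-retry case), and penalized_score is a made-up name.

```python
BASE_SCORE = 3400     # full score reported for the host in this bug
RETRY_PENALTY = 50    # from "Penalizing score by 50 due to 1 engine vm retry attempts"


def penalized_score(retry_count, base=BASE_SCORE, per_retry=RETRY_PENALTY):
    """Score after subtracting the engine VM retry penalty.

    Assumption: the penalty grows linearly with the retry count and the
    result never goes below zero; neither is confirmed by the logs in this bug.
    """
    return max(0, base - per_retry * retry_count)


print(penalized_score(0))  # no retries: score stays at the base value
print(penalized_score(1))  # one retry: the case reported in this bug
```

With one retry this gives 3350, matching the "Current state EngineDown (score: 3350)" line, and with the retry count back at zero the score returns to 3400, which fits the observation that it recovers after some time.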