Bug 1467063

Summary: Destination host's score is penalized by 50 points due to 1 engine VM retry attempt during a normal HE-VM migration to the SPM host.
Product: [oVirt] ovirt-engine
Reporter: Nikolai Sednev <nsednev>
Component: BLL.HostedEngine
Assignee: Yanir Quinn <yquinn>
Status: CLOSED NOTABUG
QA Contact: Nikolai Sednev <nsednev>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.1.3.5
CC: bugs, dfediuck, mgoldboi, msivak
Target Milestone: ovirt-4.1.5
Flags: rule-engine: ovirt-4.1?
dfediuck: ovirt-4.2?
mgoldboi: planning_ack+
dfediuck: devel_ack+
nsednev: testing_ack?
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1478848
Environment:
Last Closed: 2017-08-06 10:25:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1478848    
Attachments:
Description Flags
sosreport from the engine none
sosreport from host1 (the SPM host) none
sosreport from host2 none

Description Nikolai Sednev 2017-07-02 11:15:40 UTC
Description of problem:
MainThread::INFO::2017-07-02 14:00:27,390::states::203::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 50 due to 1 engine vm retry attempts
MainThread::INFO::2017-07-02 14:00:27,392::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 3350)

The SPM destination host is penalized by 50 points for a few moments due to 1 engine VM retry attempt during a normal HE-VM migration.
Its score drops from 3400 to 3350.
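
For anyone triaging similar reports, the two agent log lines above can be checked mechanically. The following is only a minimal sketch: the regular expressions are written against the exact message text quoted in this report, and the helper name `summarize` is hypothetical, not part of ovirt-hosted-engine-ha.

```python
import re

# Patterns written against the message text quoted in this report.
PENALTY_RE = re.compile(
    r"Penalizing score by (?P<penalty>\d+) due to (?P<retries>\d+) engine vm retry attempts"
)
STATE_RE = re.compile(r"Current state (?P<state>\w+) \(score: (?P<score>\d+)\)")


def summarize(log_lines):
    """Return (penalty, retries, state, score); a field is None if no line matched."""
    penalty = retries = state = score = None
    for line in log_lines:
        m = PENALTY_RE.search(line)
        if m:
            penalty, retries = int(m["penalty"]), int(m["retries"])
        m = STATE_RE.search(line)
        if m:
            state, score = m["state"], int(m["score"])
    return penalty, retries, state, score


lines = [
    "MainThread::INFO::2017-07-02 14:00:27,390::states::203::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) "
    "Penalizing score by 50 due to 1 engine vm retry attempts",
    "MainThread::INFO::2017-07-02 14:00:27,392::hosted_engine::453::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) "
    "Current state EngineDown (score: 3350)",
]
penalty, retries, state, score = summarize(lines)
print(penalty, retries, state, score)
# Adding the penalty back gives the unpenalized score mentioned in the report.
print(score + penalty)
```

Adding the parsed penalty back to the parsed score reproduces the 3400 that the host normally reports.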

Version-Release number of selected component (if applicable):
Host's components:
qemu-kvm-rhev-2.9.0-14.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.3-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.20-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.4-1.el7ev.noarch
libvirt-client-3.2.0-14.el7.x86_64
ovirt-hosted-engine-setup-2.1.3.2-1.el7ev.noarch
sanlock-3.5.0-1.el7.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-663.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-14) (GCC) ) #1 SMP Tue May 2 16:00:29 EDT 2017
Linux 3.10.0-663.el7.x86_64 #1 SMP Tue May 2 16:00:29 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Engine's components:
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhevm-doc-4.1.3-1.el7ev.noarch
rhevm-branding-rhev-4.1.0-2.el7ev.noarch
rhevm-4.1.3.5-0.1.el7.noarch
rhevm-setup-plugins-4.1.2-1.el7ev.noarch
Linux version 3.10.0-514.21.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sun May 28 17:08:21 EDT 2017
Linux 3.10.0-514.21.2.el7.x86_64 #1 SMP Sun May 28 17:08:21 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over NFS on a pair of RHEL 7.4 hosts and add two NFS data storage domains.
2. Assuming that host1 is the SPM, migrate the HE-VM to host2.
3. Once the migration succeeds, wait a few moments until the CLI also shows it as successful.
4. Migrate the HE-VM back from host2 to host1.
5. Host1's score is penalized by 50 points during the migration.
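
To watch the scores while repeating these steps, the per-host score could be read from the CLI status output. This is a rough sketch that assumes the `--json` form of `hosted-engine --vm-status` is available in this version and that each host entry carries `hostname` and `score` keys; the sample payload and the helper names are made up for illustration.

```python
import json


def host_scores(vm_status_json):
    """Map hostname -> score from a `hosted-engine --vm-status --json` style payload."""
    data = json.loads(vm_status_json)
    scores = {}
    for value in data.values():
        # Skip non-host entries such as a global maintenance flag.
        if isinstance(value, dict) and "hostname" in value and "score" in value:
            scores[value["hostname"]] = int(value["score"])
    return scores


def penalized_hosts(scores, full_score=3400):
    """Return hosts scoring below the full score seen in this report (3400)."""
    return sorted(h for h, s in scores.items() if s < full_score)


# Made-up sample payload; the real output has more keys per host.
sample = json.dumps({
    "1": {"hostname": "host1", "score": 3350},
    "2": {"hostname": "host2", "score": 3400},
    "global_maintenance": False,
})
scores = host_scores(sample)
print(scores)
print(penalized_hosts(scores))
```

On a real host the JSON string would come from running the command (for example via `subprocess.run`), polled every few seconds during step 4.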

Actual results:
Host1's score is penalized by 50 points during the migration.

Expected results:
The HE-VM migration should not lower the score of the destination SPM host.

Additional info:
Logs from both hosts and the engine are attached, together with a screencast.

Comment 1 Nikolai Sednev 2017-07-02 11:18:56 UTC
Forgot to mention that the score returns to normal on the destination SPM host after some time.

Comment 2 Nikolai Sednev 2017-07-02 11:23:06 UTC
Created attachment 1293603 [details]
sosreport from the engine

Comment 3 Nikolai Sednev 2017-07-02 11:24:51 UTC
Created attachment 1293604 [details]
sosreport from host1 (the SPM host)

Comment 4 Nikolai Sednev 2017-07-02 11:27:45 UTC
Created attachment 1293605 [details]
sosreport from host2

Comment 5 Nikolai Sednev 2017-07-02 11:49:52 UTC
Screencast is available from here:
https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88SVBPdk16TWRTanc/view?usp=sharing

Comment 6 Yanir Quinn 2017-07-05 12:42:21 UTC
Seems like a temporary sync delay (maybe due to the heavier tasks of the SPM host).
A possible explanation: the score is decreased by 50 after the host fails to start the HE VM; once it gets the lock, the VM starts and the migration eventually succeeds.

See the SPM host log:

MainThread::INFO::2017-07-02 13:51:23,880::hosted_engine::1119::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2017-07-02 13:51:29,005::hosted_engine::1125::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: 
MainThread::INFO::2017-07-02 13:51:29,006::hosted_engine::1126::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: Virtual machine does not exist
Virtual machine already exists

MainThread::INFO::2017-07-02 13:51:29,006::hosted_engine::1148::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Failed to start engine VM: 'Virtual machine does not exist
Virtual machine already exists
'. Please check the vdsm logs. The possible reason: the engine has been already started on a different host so this one has failed to acquire the lock and it will sync in a while. For more information please visit: http://www.ovirt.org/Hosted_Engine_Howto#EngineUnexpectedlyDown
MainThread::INFO::2017-07-02 13:51:29,010::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1498992689.01 type=state_transition detail=EngineStart-EngineDown hostname='puma18.scl.lab.tlv.redhat.com'
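
The state transition in the last line can be pulled out of the agent log to line it up with the failed start attempt. This is only a sketch: the pattern follows the `notify` line quoted above, and `parse_transition` is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# Pattern written against the broker notify line quoted above.
NOTIFY_RE = re.compile(
    r"notify time=(?P<time>[\d.]+) type=state_transition "
    r"detail=(?P<old>\w+)-(?P<new>\w+) hostname='(?P<host>[^']+)'"
)


def parse_transition(line):
    """Return (old_state, new_state, hostname, utc_datetime), or None if no match."""
    m = NOTIFY_RE.search(line)
    if m is None:
        return None
    when = datetime.fromtimestamp(float(m["time"]), tz=timezone.utc)
    return m["old"], m["new"], m["host"], when


line = (
    "MainThread::INFO::2017-07-02 13:51:29,010::brokerlink::111::"
    "ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify "
    "time=1498992689.01 type=state_transition detail=EngineStart-EngineDown "
    "hostname='puma18.scl.lab.tlv.redhat.com'"
)
old, new, host, when = parse_transition(line)
print(old, "->", new, "on", host, "at", when.isoformat())
```

The `time=` field is a Unix timestamp, so it converts to UTC, while the log prefix is in the host's local time.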

Comment 7 Nikolai Sednev 2017-07-05 13:17:45 UTC
(In reply to Yanir Quinn from comment #6)

Which heavy tasks? There is a pair of hosts with a single SHE-VM. The engine was not running any load, performance, or stress tests, only a very basic migration from the SPM to the non-SPM host and then back.

Comment 9 Martin Sivák 2017-07-10 11:18:26 UTC
This is not a new feature. It may not be properly documented, but it has been part of the code since the beginning.

See this file from version 1.0.0 (Mar 2014):

https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/agent/hosted_engine.py;h=8ab210808780c7289a33135083a1ea2cb609039f;hb=85fde3305ea11ebd367f63dfed7911ffcd265d74#l594

Comment 11 Yaniv Kaul 2017-08-06 07:43:34 UTC
Is this on track for 4.1.5?

Comment 12 Doron Fediuck 2017-08-06 10:25:21 UTC
The 50-point penalty is by design (and will be better documented).
I'm closing this issue for now. If there is a specific problem with starting a VM on the SPM host, which may be a real issue, please open a separate bz with that information.
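
As a rough illustration of the behaviour described in this bug, the toy model below subtracts a fixed penalty per failed engine VM start and restores the score once the retry counter is cleared. It is not the agent's actual code: the base score (3400) and the 50-point penalty come from this report, while the linear scaling, the reset mechanism, and the `HostScore` class are assumptions.

```python
from dataclasses import dataclass

BASE_SCORE = 3400    # full score seen in this report
RETRY_PENALTY = 50   # penalty for 1 retry, as logged by the agent


@dataclass
class HostScore:
    retries: int = 0  # failed engine VM start attempts not yet cleared

    def record_failed_start(self):
        self.retries += 1

    def clear_retries(self):
        # Assumption: the counter is reset some time after the failure,
        # matching the observation that the score returns to normal.
        self.retries = 0

    @property
    def score(self):
        # Assumption: the penalty scales linearly with the retry count.
        return max(0, BASE_SCORE - RETRY_PENALTY * self.retries)


host1 = HostScore()
host1.record_failed_start()
print(host1.score)   # penalized while the retry is counted
host1.clear_retries()
print(host1.score)   # back to the full score
```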