Bug 1467063 - Destination host's score being penalized for 50 points due to 1 engine vm retry attempts during normal HE-VM's migration to SPM host.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.HostedEngine
Version: 4.1.3.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.1.5
Target Release: ---
Assignee: Yanir Quinn
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 1478848
Reported: 2017-07-02 11:15 UTC by Nikolai Sednev
Modified: 2017-08-07 10:10 UTC

Fixed In Version:
Clone Of:
Clones: 1478848
Environment:
Last Closed: 2017-08-06 10:25:21 UTC
oVirt Team: SLA
Embargoed:
rule-engine: ovirt-4.1?
dfediuck: ovirt-4.2?
mgoldboi: planning_ack+
dfediuck: devel_ack+
nsednev: testing_ack?


Attachments
sosreport from the engine (9.63 MB, application/x-xz)
2017-07-02 11:23 UTC, Nikolai Sednev
sosreport from host1 (the SPM host) (11.03 MB, application/x-xz)
2017-07-02 11:24 UTC, Nikolai Sednev
sosreport from host2 (10.58 MB, application/x-xz)
2017-07-02 11:27 UTC, Nikolai Sednev

Description Nikolai Sednev 2017-07-02 11:15:40 UTC
Description of problem:
MainThread::INFO::2017-07-02 14:00:27,390::states::203::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 50 due to 1 engine vm retry attempts
MainThread::INFO::2017-07-02 14:00:27,392::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 3350)

The destination host (the SPM) is penalized by 50 points for a few moments because of 1 engine VM retry attempt during a normal HE-VM migration.
Its score drops from 3400 to 3350.
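
For anyone who wants to spot this event in the agent log, a minimal sketch follows. It assumes the two message formats shown above; the helper name `summarize` and both regular expressions are hypothetical and are not part of ovirt-hosted-engine-ha.

```python
import re

# Hypothetical patterns, written only against the two log messages quoted in this report.
PENALTY_RE = re.compile(r"Penalizing score by (\d+) due to (\d+) engine vm retry attempts")
SCORE_RE = re.compile(r"Current state (\w+) \(score: (\d+)\)")


def summarize(lines):
    """Return (state, score, penalty) for each 'Current state' line.

    penalty is the value from the closest preceding 'Penalizing score'
    line, or None if no such line was seen since the last state line.
    """
    penalty = None
    result = []
    for line in lines:
        match = PENALTY_RE.search(line)
        if match:
            penalty = int(match.group(1))
            continue
        match = SCORE_RE.search(line)
        if match:
            result.append((match.group(1), int(match.group(2)), penalty))
            penalty = None
    return result


sample = [
    "MainThread::INFO::2017-07-02 14:00:27,390::states::203::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) "
    "Penalizing score by 50 due to 1 engine vm retry attempts",
    "MainThread::INFO::2017-07-02 14:00:27,392::hosted_engine::453::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) "
    "Current state EngineDown (score: 3350)",
]

print(summarize(sample))
```

Feeding the agent log from the attached sosreport through such a helper should list every state report together with the penalty logged just before it, if any.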

Version-Release number of selected component (if applicable):
Host's components:
qemu-kvm-rhev-2.9.0-14.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.3-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.20-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.4-1.el7ev.noarch
libvirt-client-3.2.0-14.el7.x86_64
ovirt-hosted-engine-setup-2.1.3.2-1.el7ev.noarch
sanlock-3.5.0-1.el7.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-663.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-14) (GCC) ) #1 SMP Tue May 2 16:00:29 EDT 2017
Linux 3.10.0-663.el7.x86_64 #1 SMP Tue May 2 16:00:29 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Engine's components:
rhev-guest-tools-iso-4.1-5.el7ev.noarch
rhevm-dependencies-4.1.1-1.el7ev.noarch
rhevm-doc-4.1.3-1.el7ev.noarch
rhevm-branding-rhev-4.1.0-2.el7ev.noarch
rhevm-4.1.3.5-0.1.el7.noarch
rhevm-setup-plugins-4.1.2-1.el7ev.noarch
Linux version 3.10.0-514.21.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sun May 28 17:08:21 EDT 2017
Linux 3.10.0-514.21.2.el7.x86_64 #1 SMP Sun May 28 17:08:21 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE over NFS on a pair of RHEL 7.4 hosts and add two NFS data storage domains.
2. Assuming host1 is the SPM, migrate the HE-VM to host2.
3. Once the migration succeeds, wait a few moments until the CLI also reports it as successful.
4. Migrate the HE-VM back from host2 to host1.
5. Host1's score is penalized by 50 points during the migration.

Actual results:
Host1's score is penalized by 50 points during the migration.

Expected results:
An HE-VM migration should not lower the score of the destination SPM host.

Additional info:
Logs from both hosts and the engine are attached, together with a screencast.

Comment 1 Nikolai Sednev 2017-07-02 11:18:56 UTC
Forgot to mention that the score on the destination SPM host returns to normal after some time.

Comment 2 Nikolai Sednev 2017-07-02 11:23:06 UTC
Created attachment 1293603 [details]
sosreport from the engine

Comment 3 Nikolai Sednev 2017-07-02 11:24:51 UTC
Created attachment 1293604 [details]
sosreport from host1 (the SPM host)

Comment 4 Nikolai Sednev 2017-07-02 11:27:45 UTC
Created attachment 1293605 [details]
sosreport from host2

Comment 5 Nikolai Sednev 2017-07-02 11:49:52 UTC
Screencast is available from here:
https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88SVBPdk16TWRTanc/view?usp=sharing

Comment 6 Yanir Quinn 2017-07-05 12:42:21 UTC
This looks like a temporary sync delay (possibly because the SPM host carries heavier tasks).
A possible explanation: the score is decreased by 50 after the host fails to start the HE VM; once it gets the lock, the VM starts and the migration eventually succeeds.

See the SPM host log:

MainThread::INFO::2017-07-02 13:51:23,880::hosted_engine::1119::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2017-07-02 13:51:29,005::hosted_engine::1125::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: 
MainThread::INFO::2017-07-02 13:51:29,006::hosted_engine::1126::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: Virtual machine does not exist
Virtual machine already exists

MainThread::INFO::2017-07-02 13:51:29,006::hosted_engine::1148::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Failed to start engine VM: 'Virtual machine does not exist
Virtual machine already exists
'. Please check the vdsm logs. The possible reason: the engine has been already started on a different host so this one has failed to acquire the lock and it will sync in a while. For more information please visit: http://www.ovirt.org/Hosted_Engine_Howto#EngineUnexpectedlyDown
MainThread::INFO::2017-07-02 13:51:29,010::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1498992689.01 type=state_transition detail=EngineStart-EngineDown hostname='puma18.scl.lab.tlv.redhat.com'
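
The arithmetic behind the explanation above can be sketched as follows. This is only an illustration: the base score of 3400 and the 50-point penalty for 1 retry come from this report, while the linear scaling with the retry count, the floor at zero, and the name `penalized_score` are assumptions, not the agent's actual code.

```python
BASE_SCORE = 3400     # full score reported for host1 in this bug
RETRY_PENALTY = 50    # penalty logged for 1 retry; scaling per retry is an assumption


def penalized_score(base, retries, per_retry=RETRY_PENALTY):
    """Hypothetical model: subtract a fixed penalty per failed engine VM start."""
    return max(0, base - per_retry * retries)


print(penalized_score(BASE_SCORE, 1))
```

With one failed start attempt this model gives 3350, which matches the score logged by the agent.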

Comment 7 Nikolai Sednev 2017-07-05 13:17:45 UTC
(In reply to Yanir Quinn from comment #6)
> Seems like a temporary sync delay/issue (maybe due to more heavy duty tasks
> of the SPM host)
> So the explanation can be that the score is decreased by 50 after failing to
> start the HE VM  and after getting the lock it starts and migration is
> eventually successful.
> 
> see in the spm host log : 
> 
> MainThread::INFO::2017-07-02
> 13:51:23,880::hosted_engine::1119::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using
> `/usr/sbin/hosted-engine --vm-start`
> MainThread::INFO::2017-07-02
> 13:51:29,005::hosted_engine::1125::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(_start_engine_vm) stdout: 
> MainThread::INFO::2017-07-02
> 13:51:29,006::hosted_engine::1126::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(_start_engine_vm) stderr: Virtual machine does
> not exist
> Virtual machine already exists
> 
> MainThread::INFO::2017-07-02
> 13:51:29,006::hosted_engine::1148::ovirt_hosted_engine_ha.agent.
> hosted_engine.HostedEngine::(_start_engine_vm) Failed to start engine VM:
> 'Virtual machine does not exist
> Virtual machine already exists
> '. Please check the vdsm logs. The possible reason: the engine has been
> already started on a different host so this one has failed to acquire the
> lock and it will sync in a while. For more information please visit:
> http://www.ovirt.org/Hosted_Engine_Howto#EngineUnexpectedlyDown
> MainThread::INFO::2017-07-02
> 13:51:29,010::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.
> BrokerLink::(notify) Trying: notify time=1498992689.01 type=state_transition
> detail=EngineStart-EngineDown hostname='puma18.scl.lab.tlv.redhat.com'

Which heavy tasks? This is a pair of hosts with a single SHE-VM. The engine was not running any load, performance, or stress tests, only a very basic migration from the SPM to the non-SPM host and then back.

Comment 9 Martin Sivák 2017-07-10 11:18:26 UTC
This is not a new feature. It may not be properly documented, but it has been part of the code since the beginning.

See this file from version 1.0.0 (Mar 2014):

https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/agent/hosted_engine.py;h=8ab210808780c7289a33135083a1ea2cb609039f;hb=85fde3305ea11ebd367f63dfed7911ffcd265d74#l594

Comment 11 Yaniv Kaul 2017-08-06 07:43:34 UTC
Is this on track for 4.1.5?

Comment 12 Doron Fediuck 2017-08-06 10:25:21 UTC
The 50-point penalty is by design (and will be better documented).
I'm closing this issue for now. If there is a specific problem around the SPM, please open a separate BZ with information on starting a VM there, which may be a real issue.

