Bug 1207634 - HE VM not powered up on second host | ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown
Summary: HE VM not powered up on second host | ovirt_hosted_engine_ha.agent.hosted_eng...
Keywords:
Status: CLOSED DUPLICATE of bug 1150087
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.5.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.6.0
Assignee: Roman Mohr
QA Contact: Nikolai Sednev
URL:
Whiteboard: sla
Depends On:
Blocks:
Reported: 2015-03-31 11:19 UTC by Nikolai Sednev
Modified: 2016-02-10 20:17 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-09-02 08:16:17 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:


Attachments
agent.log (3.57 MB, text/plain)
2015-03-31 11:22 UTC, Nikolai Sednev
broker.log (15.90 MB, text/plain)
2015-03-31 11:24 UTC, Nikolai Sednev
alma03 logs (1.85 MB, application/x-gzip)
2015-03-31 11:28 UTC, Nikolai Sednev
alma03 logs (1.85 MB, application/x-gzip)
2015-03-31 11:29 UTC, Nikolai Sednev


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 45592 0 master MERGED states, broker: Check for already starting VMs on other hosts 2021-02-13 12:59:11 UTC

Description Nikolai Sednev 2015-03-31 11:19:21 UTC
Description of problem:
The HE VM was not powered up on host alma03 after the ovirt-ha-broker service was stopped on alma04.

ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown.

[root@alma03 ~]# hosted-engine --vm-status                                         


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1                           
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400                                                                                         
Local maintenance                  : False                                                                                        
Host timestamp                     : 66481                                                                                        
Extra metadata (valid at timestamp):                                                                                              
        metadata_parse_version=1                                                                                                  
        metadata_feature_version=1                                                                                                
        timestamp=66481 (Tue Mar 31 10:45:29 2015)                                                                                
        host-id=1                                                                                                                 
        score=2400                                                                                                                
        maintenance=False                                                                                                         
        state=EngineDown                                                                                                          


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2                           
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 2400                                          
Local maintenance                  : False                                         
Host timestamp                     : 66442                                         
Extra metadata (valid at timestamp):                                               
        metadata_parse_version=1                                                   
        metadata_feature_version=1                                                 
        timestamp=66442 (Tue Mar 31 10:44:59 2015)                                 
        host-id=2                                                                  
        score=2400                                                                 
        maintenance=False                                                          
        state=EngineUp                                                             
[root@alma03 ~]# hosted-engine --vm-status                                         


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1                           
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0                                                                                            
Local maintenance                  : False                                                                                        
Host timestamp                     : 66600                                                                                        
Extra metadata (valid at timestamp):                                                                                              
        metadata_parse_version=1                                                                                                  
        metadata_feature_version=1                                                                                                
        timestamp=66600 (Tue Mar 31 10:47:28 2015)                                                                                
        host-id=1                                                                                                                 
        score=0                                                                                                                   
        maintenance=False                                                                                                         
        state=EngineUnexpectedlyDown                                                                                              
        timeout=Thu Jan  1 18:39:07 1970                                                                                          


--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2                           
Engine status                      : unknown stale-data          
Score                              : 2400                        
Local maintenance                  : False                       
Host timestamp                     : 66442                       
Extra metadata (valid at timestamp):                             
        metadata_parse_version=1                                 
        metadata_feature_version=1                               
        timestamp=66442 (Tue Mar 31 10:44:59 2015)               
        host-id=2                                                
        score=2400                                               
        maintenance=False                                        
        state=EngineUp                                  
Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Deploy HE on two RHEVH 6.6 hosts (20150304.0.el6ev).
2. Stop the ovirt-ha-broker service on the host that is currently running the HE VM.
3. Wait for the HE VM to start on the other host and observe that this host's score changes to 0 for an unknown reason.

Actual results:
The HE VM is not started on the second host after the ovirt-ha-broker service is stopped on the host that is running the HE VM.

Expected results:
The HE VM should be started on the second host, and that host's score should not be zero.

Additional info:
logs attached.
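
The symptom visible in the status output above (score 0 with state EngineUnexpectedlyDown on one host, stale data on the other) can be checked for automatically. The following is only a minimal sketch, assuming the text layout of `hosted-engine --vm-status` as pasted in this report; the function names are hypothetical and are not part of ovirt-hosted-engine-ha.

```python
import re

# Matches the per-host banner, e.g. "--== Host 1 status ==--".
HOST_HEADER = re.compile(r"^--== Host (\d+) status ==--$")
# Matches the indented "extra metadata" lines, e.g. "state=EngineUp".
METADATA = re.compile(r"^([A-Za-z_-]+)=(.*)$")


def parse_vm_status(text):
    """Parse one status snapshot into {host_id: {field_name: value}}.

    Assumes the layout shown in this bug report; all values stay strings.
    """
    hosts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HOST_HEADER.match(line)
        if header:
            current = hosts.setdefault(int(header.group(1)), {})
            continue
        if current is None or not line:
            continue
        meta = METADATA.match(line)
        if meta:
            # Metadata is checked first because its values may contain ':'.
            current[meta.group(1)] = meta.group(2).strip()
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            current[key.strip()] = value.strip()
    return hosts


def hosts_unexpectedly_down(hosts):
    """Host IDs reporting score=0 together with state=EngineUnexpectedlyDown."""
    return sorted(
        host_id
        for host_id, fields in hosts.items()
        if fields.get("score") == "0"
        and fields.get("state") == "EngineUnexpectedlyDown"
    )


def hosts_with_stale_data(hosts):
    """Host IDs whose 'Status up-to-date' field is False."""
    return sorted(
        host_id
        for host_id, fields in hosts.items()
        if fields.get("Status up-to-date") == "False"
    )
```

Fed the second snapshot from the description, `hosts_unexpectedly_down` would be expected to flag host 1 and `hosts_with_stale_data` host 2. The parser handles a single snapshot at a time; two outputs pasted back to back would need to be split first.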

Comment 1 Nikolai Sednev 2015-03-31 11:22:32 UTC
Created attachment 1008955 [details]
agent.log

Comment 2 Nikolai Sednev 2015-03-31 11:24:22 UTC
Created attachment 1008956 [details]
broker.log

Comment 3 Nikolai Sednev 2015-03-31 11:28:22 UTC
Created attachment 1008957 [details]
alma03 logs

Comment 4 Nikolai Sednev 2015-03-31 11:29:00 UTC
Created attachment 1008958 [details]
alma03 logs

Comment 5 Nikolai Sednev 2015-03-31 13:52:48 UTC
Components that were used on Red Hat Enterprise Virtualization Hypervisor 6.6 (20150304.0.el6ev):
sanlock-2.8-1.el6.x86_64
mom-0.4.1-4.el6ev.noarch
ovirt-node-selinux-3.2.1-9.el6.noarch
ovirt-host-deploy-offline-1.3.0-3.el6ev.x86_64
ovirt-node-plugin-vdsm-0.2.0-19.el6ev.noarch
ovirt-host-deploy-1.3.0-2.el6ev.noarch
libvirt-client-0.10.2-46.el6_6.3.x86_64
ovirt-node-plugin-rhn-3.2.1-9.el6.noarch
ovirt-node-3.2.1-9.el6.noarch
vdsm-4.16.8.1-7.el6ev.x86_64
ovirt-hosted-engine-ha-1.2.5-1.el6ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-9.0.el6ev.x86_64
ovirt-node-plugin-cim-3.2.1-9.el6.noarch
ovirt-node-branding-rhev-3.2.1-9.el6.noarch
qemu-kvm-rhev-0.12.1.2-2.446.el6.x86_64
ovirt-hosted-engine-setup-1.2.2-1.el6ev.noarch
ovirt-node-plugin-snmp-3.2.1-9.el6.noarch

On the engine, Red Hat Enterprise Linux Server release 6.6 (Santiago):
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
rhevm-3.5.1-0.2.el6ev.noarch

Comment 6 Martin Sivák 2015-04-07 07:45:18 UTC
This is not urgent at all, because I have never seen it in production.

The second host tries to start the engine when you stop the broker on the first host, because it is no longer getting any updates and assumes that host is dead. But the engine VM is still running, so sanlock prevents the VM from starting on the second host. That puts the second host into the EngineUnexpectedlyDown state for ten minutes, and its score is reduced to 0 while it is in that state.

There is one known issue here: we do not know the reason for the VM crash, so we cannot distinguish sanlock protection from a real crash.
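
To make the sequence above concrete, here is a rough sketch of the behaviour as described in this comment. It is not the actual ovirt_hosted_engine_ha agent code: the class and method names are hypothetical, the ten-minute timeout comes from the description above, the base score of 2400 is the value seen in the status output in this report, and the return to EngineDown once the timeout expires is an assumption.

```python
UNEXPECTED_DOWN_TIMEOUT = 10 * 60  # seconds; "ten minutes" as described above
BASE_SCORE = 2400                  # score seen in the --vm-status output


class HostSketch:
    """Hypothetical model of one HA host; not the real agent state machine."""

    def __init__(self):
        self.state = "EngineDown"
        self.down_since = None

    def on_start_attempt(self, vm_started, now):
        """Record the outcome of an attempt to start the engine VM.

        Only a boolean is available here, which mirrors the known issue:
        a start blocked by sanlock looks the same as a real crash.
        """
        if vm_started:
            self.state = "EngineUp"
            self.down_since = None
        else:
            self.state = "EngineUnexpectedlyDown"
            self.down_since = now

    def tick(self, now):
        """Leave EngineUnexpectedlyDown once the timeout has elapsed."""
        if (
            self.state == "EngineUnexpectedlyDown"
            and now - self.down_since >= UNEXPECTED_DOWN_TIMEOUT
        ):
            # Assumption: the host goes back to EngineDown after the penalty.
            self.state = "EngineDown"
            self.down_since = None

    def score(self):
        """Score is 0 for as long as the host is in EngineUnexpectedlyDown."""
        return 0 if self.state == "EngineUnexpectedlyDown" else BASE_SCORE
```

In this model, a failed start at time 0 leaves the host with score 0 until `tick` is called with a time of at least 600 seconds, which matches the window in which the second host does not retry.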

Comment 8 Martin Sivák 2015-09-02 08:16:17 UTC

*** This bug has been marked as a duplicate of bug 1150087 ***

