Bug 1130173 - can't start hosted engine VM in cluster with 3+ hosts
Summary: can't start hosted engine VM in cluster with 3+ hosts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Doron Fediuck
QA Contact: Nikolai Sednev
URL:
Whiteboard: sla
Depends On:
Blocks:
Reported: 2014-08-14 13:41 UTC by Jiri Moskovcak
Modified: 2016-02-10 20:18 UTC

Fixed In Version: ovirt-hosted-engine-ha-1.2.2-1.el6ev
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1147411
Environment:
Last Closed: 2015-02-11 21:09:07 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2015:0194 | 0 | normal | SHIPPED_LIVE | ovirt-hosted-engine-ha bug fix and enhancement update | 2015-02-12 01:35:33 UTC
oVirt gerrit | 31519 | 0 | master | MERGED | don't expect the engine to be up right after starting it's guest | 2020-07-22 19:40:59 UTC
oVirt gerrit | 33760 | 0 | ovirt-hosted-engine-ha-1.2 | MERGED | don't expect the engine to be up right after starting it's guest | 2020-07-22 19:40:59 UTC

Description Jiri Moskovcak 2014-08-14 13:41:17 UTC
Description of problem:
When trying to start the hosted engine VM, either manually or by waiting for the HA agent, the start fails. The agent's state machine is put directly into the EngineUp state, which expects a fully operational engine and does not wait for the VM to start. The agent therefore falls into the EngineUpBadHealth state and kills the VM. This situation then repeats on the other hosts.
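To illustrate the transition problem described above, here is a minimal sketch of a state machine that waits in an intermediate state instead of jumping straight to EngineUp. It is not the ovirt-hosted-engine-ha code: the function `next_state`, the `EngineStart` state name, the `EngineStatus` type and the 600 second grace period are all assumptions made for the example. Only the state names EngineStarting, EngineUp, EngineUpBadHealth and EngineStop appear in this report.

```python
from dataclasses import dataclass

# Assumed value for illustration only; the real agent may use a different timeout.
STARTUP_GRACE_SECONDS = 600


@dataclass
class EngineStatus:
    """Hypothetical container mirroring the "vm" and "health" keys of the status output."""
    vm: str      # "up" or "down"
    health: str  # "good" or "bad"


def next_state(state, status, seconds_in_state):
    """Return the next state name for one monitoring cycle (illustrative only)."""
    if state == "EngineStart":  # hypothetical name for "VM start was just issued"
        # Do not go to EngineUp here: the guest has not booted yet, so a
        # health check would fail and the VM would be killed.
        return "EngineStarting"
    if state == "EngineStarting":
        if status.vm == "up" and status.health == "good":
            return "EngineUp"
        if seconds_in_state > STARTUP_GRACE_SECONDS:
            # The engine never became healthy within the grace period.
            return "EngineStop"
        return "EngineStarting"
    if state == "EngineUp":
        if status.health != "good":
            return "EngineUpBadHealth"
        return "EngineUp"
    # States not modelled in this sketch are left unchanged.
    return state
```

With this shape, a VM that is still powering up keeps the agent in EngineStarting, and bad health is only treated as a failure once the engine has been seen healthy or the grace period has expired.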

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.2.1-0.2.master.20140805072346.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. install HE on a cluster with 3+ hosts
2. kill the HE guest (halt -p)
3. run hosted-engine --vm-start

Actual results:
the VM is killed while powering up

Expected results:
the engine is up and running after a while

Additional info:

Comment 2 Sandro Bonazzola 2014-10-03 09:43:45 UTC
Missing merge on 1.2 branch

Comment 4 Nikolai Sednev 2014-10-22 11:33:29 UTC
Same behaviour as in bug 1147411, here even with only two hosts: on the host where the VM is started manually, the VM is first shown as powering up, but then goes to powering down:

[root@brown-vdsd ~]# hosted-engine --vm-status                                                                        


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1           
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400                                                                                         
Local maintenance                  : False                                                                                        
Host timestamp                     : 96751                                                                                        
Extra metadata (valid at timestamp):                                                                                              
        metadata_parse_version=1                                                                                                  
        metadata_feature_version=1                                                                                                
        timestamp=96751 (Wed Oct 22 11:10:48 2014)                                                                                
        host-id=1                                                                                                                 
        score=2400                                                                                                                
        maintenance=False                                                                                                         
        state=EngineDown                                                                                                          


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3           
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"}
Score                              : 2400                                                                             
Local maintenance                  : False                                                                            
Host timestamp                     : 77390                                                                            
Extra metadata (valid at timestamp):                                                                                  
        metadata_parse_version=1                                                                                      
        metadata_feature_version=1                                                                                    
        timestamp=77390 (Wed Oct 22 08:10:47 2014)                                                                    
        host-id=3                                                                                                     
        score=2400                                                                                                    
        maintenance=False                                                                                             
        state=EngineStop                                                                                              
        timeout=Thu Jan  1 23:34:27 1970                                                                              
[root@brown-vdsd ~]# hosted-engine --vm-status                               




[root@brown-vdsd ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 96961
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=96961 (Wed Oct 22 11:14:18 2014)
        host-id=1
        score=2400
        maintenance=False
        state=EngineDown


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"health": "good", "vm": "up", "detail": "powering down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77593
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=77593 (Wed Oct 22 08:14:10 2014)
        host-id=3
        score=2400
        maintenance=False
        state=EngineStop
        timeout=Thu Jan  1 23:34:27 1970



The engine is actually up, and in the GUI the HE VM is shown as being powered down, then powered up, and then it stays up. After some time the engine is shown as UP again in the GUI, and the CLI shows the following:
[root@brown-vdsd ~]# hosted-engine --vm-status                                                


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1           
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 2400                                                                        
Local maintenance                  : False                                                                       
Host timestamp                     : 97097                                                                       
Extra metadata (valid at timestamp):                                                                             
        metadata_parse_version=1                                                                                 
        metadata_feature_version=1                                                                               
        timestamp=97097 (Wed Oct 22 11:16:34 2014)                                                               
        host-id=1                                                                                                
        score=2400                                                                                               
        maintenance=False                                                                                        
        state=EngineStarting                                                                                     


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3           
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"}
Score                              : 2400                                                                             
Local maintenance                  : False                                                                            
Host timestamp                     : 77732                                                                            
Extra metadata (valid at timestamp):                                                                                  
        metadata_parse_version=1                                                                                      
        metadata_feature_version=1                                                                                    
        timestamp=77732 (Wed Oct 22 08:16:28 2014)                                                                    
        host-id=3
        score=2400
        maintenance=False
        state=EngineStarting
You have new mail in /var/spool/mail/root
[root@brown-vdsd ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 97131
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=97131 (Wed Oct 22 11:17:08 2014)
        host-id=1
        score=2400
        maintenance=False
        state=EngineStarting


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77765
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=77765 (Wed Oct 22 08:17:02 2014)
        host-id=3
        score=2400
        maintenance=False
        state=EngineStarting




Behaviour should be stable; the VM should be powered up.
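For comparing the per-host results across the runs above, a small helper along these lines could extract the "Engine status" value for each host. This is only a sketch: `parse_engine_status` is a made-up name, and it assumes the output keeps the exact layout pasted in this comment (a `--== Host N status ==--` header followed by an `Engine status : {json}` line).

```python
import json
import re

# Patterns match the layout shown in the pasted output; other versions may differ.
_HOST_RE = re.compile(r"^--== Host (\d+) status ==--\s*$")
_ENGINE_RE = re.compile(r"^Engine status\s*:\s*(\{.*\})\s*$")


def parse_engine_status(text):
    """Map each host ID to the decoded "Engine status" dictionary."""
    result = {}
    host_id = None
    for line in text.splitlines():
        match = _HOST_RE.match(line)
        if match:
            host_id = int(match.group(1))
            continue
        match = _ENGINE_RE.match(line)
        if match and host_id is not None:
            result[host_id] = json.loads(match.group(1))
    return result
```

Feeding it the last output above would give host 3 the reason "failed liveliness check" with "vm" set to "up", which is the case discussed in the following comment.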

Comment 5 Jiri Moskovcak 2014-11-18 07:56:29 UTC
The behaviour is not the same. In this case the liveliness check fails, which means that the agent fails to communicate with the engine (it cannot access the health status page), so my guess is that your network is somehow broken or the VM running the engine is overloaded. Either way, this is expected behaviour, and you should wait a while to see whether it comes back to 'up' with health 'good'. When you reproduce this again, please run the following command to test the accessibility of the engine status page, and check whether it fetches the page correctly:

curl http://{fqdn}/ovirt-engine/services/health

also please not
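The same reachability test can be scripted. The sketch below uses only the Python standard library and the URL path from the curl command above. The function names and the `opener` parameter are assumptions for the example, and treating HTTP 200 as "reachable" is a simplification, not necessarily how the agent's liveliness check decides.

```python
import urllib.error
import urllib.request


def health_url(fqdn):
    """Build the engine health page URL used in the curl command."""
    return "http://%s/ovirt-engine/services/health" % fqdn


def engine_page_reachable(fqdn, timeout=10, opener=urllib.request.urlopen):
    """Return True if the health page answers with HTTP 200, False otherwise.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(health_url(fqdn), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, DNS failure, timeouts and HTTP error codes.
        return False
```

Calling `engine_page_reachable("engine.example.com")` with the real engine FQDN in place of the placeholder host name gives a quick yes/no answer from the host running the agent.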

Comment 6 Nikolai Sednev 2014-12-16 14:15:21 UTC
Works for me on these components:
ovirt-host-deploy-1.3.0-2.el6ev.noarch
ovirt-hosted-engine-setup-1.2.1-8.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
mom-0.4.1-4.el6ev.noarch
sanlock-2.8-1.el6.x86_64
vdsm-4.16.8.1-3.el6ev.x86_64
libvirt-0.10.2-46.el6_6.2.x86_64
ovirt-hosted-engine-ha-1.2.4-3.el6ev.noarch
rhevm-3.5.0-0.25.el6ev.noarch

Comment 10 errata-xmlrpc 2015-02-11 21:09:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0194.html

