Bug 1130173

Summary: can't start hosted engine VM in cluster with 3+ hosts
Product: Red Hat Enterprise Virtualization Manager Reporter: Jiri Moskovcak <jmoskovc>
Component: ovirt-hosted-engine-ha Assignee: Doron Fediuck <dfediuck>
Status: CLOSED ERRATA QA Contact: Nikolai Sednev <nsednev>
Severity: high Docs Contact:
Priority: high    
Version: 3.5.0 CC: dfediuck, ecohen, gklein, iheim, lsurette, mavital, rbalakri, sbonazzo, yeylon
Target Milestone: --- Keywords: Triaged, ZStream
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: sla
Fixed In Version: ovirt-hosted-engine-ha-1.2.2-1.el6ev Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1147411 (view as bug list) Environment:
Last Closed: 2015-02-11 21:09:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jiri Moskovcak 2014-08-14 13:41:17 UTC
Description of problem:
When trying to start the hosted engine VM, either manually or by waiting for the HA agent, the start fails. The agent's state machine is put directly into the EngineUp state, which expects a fully operational engine and does not wait for the VM to start. The agent therefore falls to the EngineUpBadHealth state and kills the VM. This situation then repeats on the other hosts.
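
The failure mode described above can be illustrated with a minimal sketch. This is not the ovirt-hosted-engine-ha code: the state names EngineStarting, EngineUp, EngineUpBadHealth, EngineStop and EngineDown appear in this report, while the functions `next_state` and `simulate`, the "kill-vm" action, and the simplified transition rules (no timeouts, no scoring) are hypothetical and exist only to show why entering EngineUp directly kills a VM that is still booting.

```python
# Illustrative sketch only; NOT the actual ovirt-hosted-engine-ha implementation.
# State names come from this bug report; the transition rules are simplified
# assumptions made for the example.

def next_state(state, engine_status):
    """Return (new_state, action) for one monitoring cycle."""
    vm = engine_status["vm"]
    health = engine_status["health"]

    if state == "EngineStarting":
        # A starting state tolerates bad health while the VM is booting.
        if vm == "up" and health == "good":
            return "EngineUp", None
        if vm == "up":
            return "EngineStarting", None
        return "EngineDown", None

    if state == "EngineUp":
        # EngineUp expects a fully operational engine.
        if health == "good":
            return "EngineUp", None
        return "EngineUpBadHealth", None

    if state == "EngineUpBadHealth":
        if health == "good":
            return "EngineUp", None
        # Still unhealthy: stop the VM (hypothetical action name).
        return "EngineStop", "kill-vm"

    # Other states are not modelled in this sketch.
    return state, None


def simulate(initial_state, statuses):
    """Feed a sequence of engine statuses through the sketch state machine."""
    state, actions = initial_state, []
    for status in statuses:
        state, action = next_state(state, status)
        if action is not None:
            actions.append(action)
    return state, actions


# A VM that needs a few cycles to boot before the engine becomes healthy.
BOOTING = [
    {"vm": "up", "health": "bad"},
    {"vm": "up", "health": "bad"},
    {"vm": "up", "health": "good"},
]
```

With these assumptions, `simulate("EngineUp", BOOTING)` issues the kill action during boot, whereas `simulate("EngineStarting", BOOTING)` reaches EngineUp without touching the VM.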

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.2.1-0.2.master.20140805072346.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. install HE on a cluster with 3+ hosts
2. kill the HE guest (halt -p)
3. run hosted-engine --vm-start

Actual results:
the VM is killed while powering up

Expected results:
engine up and running after a while...

Additional info:

Comment 2 Sandro Bonazzola 2014-10-03 09:43:45 UTC
Missing merge on 1.2 branch

Comment 4 Nikolai Sednev 2014-10-22 11:33:29 UTC
Same behaviour as bug 1147411, here even with only two hosts: on the host where the VM is started manually, the VM first shows as powering up, but then goes to powering down:

[root@brown-vdsd ~]# hosted-engine --vm-status                                                                        


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1           
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400                                                                                         
Local maintenance                  : False                                                                                        
Host timestamp                     : 96751                                                                                        
Extra metadata (valid at timestamp):                                                                                              
        metadata_parse_version=1                                                                                                  
        metadata_feature_version=1                                                                                                
        timestamp=96751 (Wed Oct 22 11:10:48 2014)                                                                                
        host-id=1                                                                                                                 
        score=2400                                                                                                                
        maintenance=False                                                                                                         
        state=EngineDown                                                                                                          


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3           
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"}
Score                              : 2400                                                                             
Local maintenance                  : False                                                                            
Host timestamp                     : 77390                                                                            
Extra metadata (valid at timestamp):                                                                                  
        metadata_parse_version=1                                                                                      
        metadata_feature_version=1                                                                                    
        timestamp=77390 (Wed Oct 22 08:10:47 2014)                                                                    
        host-id=3                                                                                                     
        score=2400                                                                                                    
        maintenance=False                                                                                             
        state=EngineStop                                                                                              
        timeout=Thu Jan  1 23:34:27 1970                                                                              
[root@brown-vdsd ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 96961
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=96961 (Wed Oct 22 11:14:18 2014)
        host-id=1
        score=2400
        maintenance=False
        state=EngineDown


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"health": "good", "vm": "up", "detail": "powering down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77593
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=77593 (Wed Oct 22 08:14:10 2014)
        host-id=3
        score=2400
        maintenance=False
        state=EngineStop
        timeout=Thu Jan  1 23:34:27 1970



The engine is actually up; in the GUI the HE VM is shown as powered down, then powered up, and it stays up. After some time the engine goes to UP again in the GUI, and the CLI shows the following:
[root@brown-vdsd ~]# hosted-engine --vm-status                                                


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1           
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 2400                                                                        
Local maintenance                  : False                                                                       
Host timestamp                     : 97097                                                                       
Extra metadata (valid at timestamp):                                                                             
        metadata_parse_version=1                                                                                 
        metadata_feature_version=1                                                                               
        timestamp=97097 (Wed Oct 22 11:16:34 2014)                                                               
        host-id=1                                                                                                
        score=2400                                                                                               
        maintenance=False                                                                                        
        state=EngineStarting                                                                                     


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3           
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"}
Score                              : 2400                                                                             
Local maintenance                  : False                                                                            
Host timestamp                     : 77732                                                                            
Extra metadata (valid at timestamp):                                                                                  
        metadata_parse_version=1                                                                                      
        metadata_feature_version=1                                                                                    
        timestamp=77732 (Wed Oct 22 08:16:28 2014)                                                                    
        host-id=3
        score=2400
        maintenance=False
        state=EngineStarting
[root@brown-vdsd ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 97131
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=97131 (Wed Oct 22 11:17:08 2014)
        host-id=1
        score=2400
        maintenance=False
        state=EngineStarting


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77765
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=77765 (Wed Oct 22 08:17:02 2014)
        host-id=3
        score=2400
        maintenance=False
        state=EngineStarting




Behaviour should be stable; the VM should be powered up.
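
For comparing listings like the ones above across runs, the text can be turned into per-host dictionaries. The following is a rough sketch that assumes the output layout shown in this comment (a "--== Host N status ==--" header, "Key : value" fields, a JSON "Engine status" value, and "key=value" metadata lines); the function name `parse_vm_status` and the "metadata" dictionary key are made up for this example and are not part of any hosted-engine tool.

```python
import json
import re

# Patterns assume the layout of the listings pasted in this comment.
HOST_HEADER = re.compile(r"^--== Host (\d+) status ==--$")
FIELD = re.compile(r"^(.+?)\s+:\s+(.*)$")


def parse_vm_status(text):
    """Parse one `hosted-engine --vm-status` listing into {host_id: fields}."""
    hosts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HOST_HEADER.match(line)
        if header:
            current = {"metadata": {}}
            hosts[int(header.group(1))] = current
            continue
        if current is None or not line:
            continue  # skip prompts and blank lines outside a host block
        field = FIELD.match(line)
        if field:
            key, value = field.group(1), field.group(2)
            if key == "Engine status":
                value = json.loads(value)  # the value is a JSON object
            current[key] = value
        elif "=" in line:
            # Indented "key=value" lines under "Extra metadata"
            key, value = line.split("=", 1)
            current["metadata"][key] = value
    return hosts
```

For the second listing above, `parse_vm_status(output)[3]["Engine status"]["detail"]` would give "powering down" and `parse_vm_status(output)[3]["metadata"]["state"]` would give "EngineStop".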

Comment 5 Jiri Moskovcak 2014-11-18 07:56:29 UTC
The behaviour is not the same. In this case the liveliness check fails, which means that the agent fails to communicate with the engine (by accessing the health status page), so my guess is that your network is somehow broken or the VM running the engine is overloaded. Either way, this is expected behaviour, and you should wait a while to see whether it comes back to 'up' with health 'good'. When you reproduce this again, please run the following command to test the accessibility of the engine status page:

curl http://{fqdn}/ovirt-engine/services/health

Does it fetch the page correctly? Also please not
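
The curl check above can also be scripted, for example to repeat it while the VM boots. This is only a sketch under stated assumptions: it uses Python 3's standard library, it treats an HTTP 200 response as "page fetched correctly" (the report does not say what the agent itself checks), and the names `health_url` and `engine_health_reachable` are invented for this example. Only the URL path comes from the command above.

```python
import urllib.error
import urllib.request

# Path taken from the curl command in this comment.
HEALTH_PATH = "/ovirt-engine/services/health"


def health_url(fqdn):
    """Build the health page URL for the given engine FQDN."""
    return "http://{0}{1}".format(fqdn, HEALTH_PATH)


def engine_health_reachable(fqdn, timeout=5.0):
    """Return True if the health page answers with HTTP 200 (assumed criterion)."""
    try:
        with urllib.request.urlopen(health_url(fqdn), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx HTTP status.
        return False
```

Calling `engine_health_reachable("engine.example.com")` (a placeholder FQDN) returns False on network errors and HTTP error statuses, so it can be polled in a loop without extra error handling.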

Comment 6 Nikolai Sednev 2014-12-16 14:15:21 UTC
Works for me on these components:
ovirt-host-deploy-1.3.0-2.el6ev.noarch
ovirt-hosted-engine-setup-1.2.1-8.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
mom-0.4.1-4.el6ev.noarch
sanlock-2.8-1.el6.x86_64
vdsm-4.16.8.1-3.el6ev.x86_64
libvirt-0.10.2-46.el6_6.2.x86_64
ovirt-hosted-engine-ha-1.2.4-3.el6ev.noarch
rhevm-3.5.0-0.25.el6ev.noarch

Comment 10 errata-xmlrpc 2015-02-11 21:09:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0194.html