Description of problem:
When starting the hosted engine VM, either manually or by waiting for the HA agent, the start fails: the agent's state machine is put directly into the EngineUp state, which expects a fully operational engine and does not wait for the VM to finish booting. The agent therefore falls into the EngineUpBadHealth state and kills the VM. The same sequence then repeats on the other hosts.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.2.1-0.2.master.20140805072346.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. install HE on a cluster with 3+ hosts
2. kill the HE guest (halt -p)
3. run hosted-engine --vm-start

Actual results:
the VM is killed while powering up

Expected results:
engine up and running after a while

Additional info:
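For illustration only, here is a minimal Python sketch of the behaviour described above. This is hypothetical code, not the agent's actual state machine; the state names are taken from this report, and the EngineStarting case is shown only to contrast with going straight to EngineUp:

# Minimal illustrative sketch, NOT the actual ovirt-hosted-engine-ha agent
# code; the transition logic is deliberately simplified.
def next_state(current, vm_up, engine_health_good):
    """Return the next agent state for a simplified two-state comparison."""
    if current == "EngineUp":
        # EngineUp expects a fully operational engine, so a VM that is still
        # powering up fails the health check and the state degrades.
        return "EngineUp" if (vm_up and engine_health_good) else "EngineUpBadHealth"
    if current == "EngineStarting":
        # A dedicated "starting" state tolerates "vm up, health bad" while
        # the engine boots (until some timeout, omitted here).
        return "EngineUp" if (vm_up and engine_health_good) else "EngineStarting"
    return current

# Right after "hosted-engine --vm-start": the VM is up, the engine is not yet healthy.
print(next_state("EngineUp", True, False))        # EngineUpBadHealth -> leads to the VM being killed
print(next_state("EngineStarting", True, False))  # EngineStarting -> the VM is left to finish booting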
Missing merge on 1.2 branch
Same behaviour as bug 1147411, here even with only two hosts. The host on which the VM is started manually first reports it as powering up, but then it goes to powering down:

[root@brown-vdsd ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 96751
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=96751 (Wed Oct 22 11:10:48 2014)
	host-id=1
	score=2400
	maintenance=False
	state=EngineDown

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77390
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=77390 (Wed Oct 22 08:10:47 2014)
	host-id=3
	score=2400
	maintenance=False
	state=EngineStop
	timeout=Thu Jan 1 23:34:27 1970

[root@brown-vdsd ~]# hosted-engine --vm-status
[root@brown-vdsd ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 96961
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=96961 (Wed Oct 22 11:14:18 2014)
	host-id=1
	score=2400
	maintenance=False
	state=EngineDown

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"health": "good", "vm": "up", "detail": "powering down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77593
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=77593 (Wed Oct 22 08:14:10 2014)
	host-id=3
	score=2400
	maintenance=False
	state=EngineStop
	timeout=Thu Jan 1 23:34:27 1970

The engine is actually up, while the HE VM is shown in the GUI as powering down, then powering up, and then it stays up.
After some time the engine goes back to UP in the GUI, and the CLI shows the following:

[root@brown-vdsd ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 97097
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=97097 (Wed Oct 22 11:16:34 2014)
	host-id=1
	score=2400
	maintenance=False
	state=EngineStarting

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77732
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=77732 (Wed Oct 22 08:16:28 2014)
	host-id=3
	score=2400
	maintenance=False
	state=EngineStarting

You have new mail in /var/spool/mail/root
[root@brown-vdsd ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 97131
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=97131 (Wed Oct 22 11:17:08 2014)
	host-id=1
	score=2400
	maintenance=False
	state=EngineStarting

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77765
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=77765 (Wed Oct 22 08:17:02 2014)
	host-id=3
	score=2400
	maintenance=False
	state=EngineStarting

The behaviour should be stable and the VM should stay powered up.
The behaviour is not the same: in this bug the liveliness check fails, which means the agent cannot communicate with the engine (it cannot access the health status page). My guess is that your network is somehow broken or the VM running the engine is overloaded. Either way, this is expected behaviour, and you should wait a while to see whether it comes back to 'up' with health 'good'. When you reproduce this again, please run this command to test the accessibility of the engine status page and check whether it fetches the page correctly:

curl http://{fqdn}/ovirt-engine/services/health

also please not
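For convenience, here is a rough Python 2 (EL6) sketch of the same probe the curl command above performs. This is an assumption for illustration, not the agent's actual liveliness-check implementation; the 10-second timeout and the engine_alive helper name are made up here, only the URL pattern comes from this comment:

# Hedged sketch: fetch the engine health status page and treat anything
# other than an HTTP 200 response as a failed liveliness check.
import sys
import urllib2  # the hosts in this report run EL6, i.e. Python 2


def engine_alive(fqdn, timeout=10):
    url = "http://%s/ovirt-engine/services/health" % fqdn
    try:
        response = urllib2.urlopen(url, timeout=timeout)
        return response.getcode() == 200
    except Exception as exc:
        sys.stderr.write("health check failed: %s\n" % exc)
        return False


if __name__ == "__main__":
    # Usage: python check_health.py engine.example.com
    print(engine_alive(sys.argv[1]))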
Works for me on these components:

ovirt-host-deploy-1.3.0-2.el6ev.noarch
ovirt-hosted-engine-setup-1.2.1-8.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
mom-0.4.1-4.el6ev.noarch
sanlock-2.8-1.el6.x86_64
vdsm-4.16.8.1-3.el6ev.x86_64
libvirt-0.10.2-46.el6_6.2.x86_64
ovirt-hosted-engine-ha-1.2.4-3.el6ev.noarch
rhevm-3.5.0-0.25.el6ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0194.html