Bug 1627958 - engine-retry-score-penalty could be better used
Summary: engine-retry-score-penalty could be better used
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 4.2.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.3.4
Target Release: 4.3.1
Assignee: Steven Rosenberg
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1627957
Depends On:
Blocks: gss_rhv_4_3_4 1709969
 
Reported: 2018-09-12 00:48 UTC by Germano Veit Michel
Modified: 2020-08-03 15:35 UTC
CC: 15 users

Fixed In Version: ovirt-hosted-engine-setup-2.3.8
Doc Type: Bug Fix
Doc Text:
The self-hosted engine was not detecting exceptions or failures when the ‘--vm-start’ command was initiated. As a result, the retry count was not incremented and therefore the score of the host was not reduced for rescheduling. In this release, the self-hosted engine setup fails when the create VM functionality fails. As a result, the self-hosted engine High Availability agent can adjust its retry count and reduce the score of the host, giving other hosts a higher priority.
Clone Of:
Environment:
Last Closed: 2019-06-20 14:48:16 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2019:1563 0 None None None 2019-06-20 14:48:19 UTC
oVirt gerrit 99171 0 ovirt-hosted-engine-setup-2.3 MERGED network: Exit with Return Code when VM start Fails 2020-07-15 20:25:57 UTC
oVirt gerrit 99175 0 ovirt-hosted-engine-setup-2.3 MERGED network: Exit with Return Code when VM start Fails 2020-07-15 20:25:57 UTC

Description Germano Veit Michel 2018-09-12 00:48:51 UTC
Description of problem:

In the ha-agent, there is logic that reduces the host score by engine-retry-score-penalty (50 points) whenever the Hosted-Engine fails to start on the host.

This depends on engine_vm_retry_count increasing, and the count is only incremented when START_VM() returns False.

        if fsm.actions.START_VM():
            new_data = new_data._replace(
                engine_vm_retry_time=dtime(new_data),
                engine_vm_retry_count=0)
            return EngineStarting(new_data)
        else:
            new_data = new_data._replace(
                engine_vm_retry_time=dtime(new_data),
                engine_vm_retry_count=retry_count + 1)
            return EngineDown(new_data)
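
For context, this is roughly how the retry count is meant to feed into the score (a minimal sketch; the 3400 base score is the agent default, and the exact formula used by the scoring code may differ):

~~~
# Sketch only: BASE_SCORE and the linear formula are assumptions for
# illustration; the real computation lives in the ha-agent scoring logic.
BASE_SCORE = 3400
ENGINE_RETRY_SCORE_PENALTY = 50

def effective_score(engine_vm_retry_count):
    # Each failed start attempt should cost the host 50 points, so other
    # hosts eventually outrank it for the next start attempt.
    return max(0, BASE_SCORE - ENGINE_RETRY_SCORE_PENALTY * engine_vm_retry_count)

# 0 retries -> 3400, 3 retries -> 3250, and so on. The problem described
# here is that retry_count almost never increments.
~~~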

So it means _start_engine_vm() needs to return False. But it returns False only under very limited circumstances.

~~~
    def _start_engine_vm(self):
        try:
            self._config.refresh_vm_conf()

            # Ensure there isn't any stale VDSM state from a prior VM lifecycle
            self._clean_vdsm_state()

            self._log.info("Starting vm using `%s --vm-start`",
                           constants.HOSTED_ENGINE_BINARY)
            p = subprocess.Popen([constants.HOSTED_ENGINE_BINARY,
                                  '--vm-start'],
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE)
            output = p.communicate()
            self._log.info("stdout: %s", output[0])
            self._log.info("stderr: %s", output[1])
            if p.returncode != 0:
                # FIXME consider removing, we can get vm status from sanlock,
                # if still an issue then the alternative tracking the time we
                # started the engine might be better than parsing this output
                if output[0].startswith("Virtual machine already exists"):
                    self._log.warning("Failed to start engine VM,"
                                      " already running according to VDSM")
                    return True

                raise Exception(output[1])

            self._log.info("Engine VM started on localhost")
            return True
        except Exception as e:
            self._log.info("Failed to start engine VM: '%s'. Please check the"
                           " vdsm logs. The possible reason: the engine has"
                           " been already started on a different host so this"
                           " one has failed to acquire the lock and it will"
                           " sync in a while."
                           " For more information please visit: "
                           "http://www.ovirt.org/Hosted_Engine_Howto"
                           "#EngineUnexpectedlyDown", str(e))
            return False
~~~

So unless we get an Exception inside the function above, it pretty much always returns True. And in most cases the HE start fails without throwing any exception there.

For example, the most common cases are the Hosted-Engine failing to start due to some bad configuration on the host (libvirt/qemu/vdsm). None of these errors are caught, so the score is not lowered. The hosted-engine --vm-start subprocess runs without raising any exception even if the command fails, and looking at the code, the vdsm_helper returns 0 in most cases too. refresh_vm_conf() and _clean_vdsm_state() are also mostly clean and unlikely to fail. So on most failures the host score is not lowered.

Basically the logic is there and makes a lot of sense, but it does not cover most failure cases.
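
To illustrate why nothing is caught: Popen raises no exception when the child exits, and if the helper exits 0 even on failure, the p.returncode check above never fires. A minimal sketch (the sh command is a hypothetical stand-in for hosted-engine --vm-start):

~~~
import subprocess

# Stand-in for `hosted-engine --vm-start`: reports an error on stderr but
# still exits 0, which is what the vdsm_helper does in most failure cases.
p = subprocess.Popen(['sh', '-c', 'echo "VM failed to start" >&2; exit 0'],
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)
output = p.communicate()

# No exception is raised and returncode is 0, so the agent's
# `if p.returncode != 0` branch never runs: the failure is treated as
# success and engine_vm_retry_count is never incremented.
print(p.returncode)  # -> 0
~~~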

For example, if VDSM fails to start the VM, the helper still returns 0 and the host is not penalized.
This can be confirmed by editing the 50_hostedengine hook and raising an exception to make the HE start fail.

2018-09-12 10:25:58,685+1000 INFO  (vm/6a0f1af5) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=1 err=Traceback (most recent call last):
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 88, in <module>
    HostedEngineHook().main()
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 84, in main
    raise ValueError('Ohh no, Hosted-Engine start failed.')
ValueError: Ohh no, Hosted-Engine start failed.
 (hooks:110)

The host is not penalized by 50 points.
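
For reference, the hook edit amounts to something like this (hypothetical excerpt reconstructed from the traceback above, not the full hook source):

~~~
# /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine (test edit):
# raising inside main() makes every HE start attempt fail in VDSM.
class HostedEngineHook(object):
    def main(self):
        raise ValueError('Ohh no, Hosted-Engine start failed.')

if __name__ == '__main__':
    HostedEngineHook().main()
~~~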

I think this feature could be better used by catching more failure conditions, or, for example, by asking vdsm whether the VM actually started.
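
Something along these lines could be used to confirm the VM state with vdsm (a sketch assuming vdsm-client is available and the HE VM UUID is known; the exact shape of the JSON output is an assumption):

~~~
import json
import subprocess

def vm_is_running(vm_uuid):
    # Ask VDSM directly which VMs it is running, instead of trusting the
    # exit status of `hosted-engine --vm-start`.
    out = subprocess.check_output(['vdsm-client', 'Host', 'getVMList'])
    return vm_uuid in json.loads(out)
~~~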

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
Just make HostedEngine start fail within VDSM/libvirt/qemu.

Actual results:
Host is not penalized

Expected results:
Host penalized.

Comment 1 Germano Veit Michel 2018-09-12 00:49:40 UTC
*** Bug 1627957 has been marked as a duplicate of this bug. ***

Comment 2 Marina Kalinin 2018-10-05 14:57:33 UTC
Should this be more of an RFE?
Or get at least medium priority?
What is the actual impact on the customer? Say we have 3 hosts and one of them has a problem: does that mean the HE VM will next be started on one of the other 2 hosts, or does it just keep trying to start on this host and failing?

Comment 3 Germano Veit Michel 2018-10-07 23:17:56 UTC
(In reply to Marina from comment #2)
> Should this be more of an RFE?
> Or get at least medium priority? 
> What is the actual impact on the customer? Say we have 3 hosts and one of
> them has a problem: does that mean the HE VM will next be started on one of
> the other 2 hosts, or does it just keep trying to start on this host and
> failing?

It can affect some specific scenarios, e.g.:

3 hosts:
- Host A score 3400 but fails to start the VM for some reason
- Host B and C score < 3400

Due to this BZ, A may not lower its score until it zeroes it after exhausting retry_count, so B and C won't attempt to start the HE VM right away. The way I see it, HA will take more time to recover from a failure, but it shouldn't cause any outages; it will just take more time to bring the HE up.
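
As a toy model of the scenario (numbers made up):

~~~
# Host A keeps winning the election even though it keeps failing, because
# its score is never reduced. With the penalty applied, B would outrank A
# after two failed attempts.
scores = {'A': 3400, 'B': 3350, 'C': 3300}

for attempt in range(3):
    best = max(scores, key=scores.get)
    print('attempt %d: starting HE on host %s' % (attempt, best))
    scores[best] -= 50  # what *should* happen on a failed start
~~~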

Comment 12 Nikolai Sednev 2019-05-26 15:00:41 UTC
Software Version: 4.3.4.1-0.1.el7
I powered off the engine via CLI and also tried to stop libvirt multiple times, and did not see any penalties to the score, except that one host's score was set to zero once I powered off the engine on it.
Tested on these components:
Software Version:4.3.4.1-0.1.el7
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch
Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Please provide more reproduction steps.

Comment 13 Steven Rosenberg 2019-05-26 16:16:27 UTC
(In reply to Nikolai Sednev from comment #12)
> Software Version: 4.3.4.1-0.1.el7
> I powered off the engine via CLI and also tried to stop libvirt multiple
> times, and did not see any penalties to the score, except that one host's
> score was set to zero once I powered off the engine on it.
> Tested on these components:
> Software Version:4.3.4.1-0.1.el7
> ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
> ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch
> Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64
> x86_64 x86_64 GNU/Linux
> Red Hat Enterprise Linux Server release 7.6 (Maipo)
> 
> Please provide more reproduction steps.

Hi Nikolai,

You need to test it with the "hosted-engine --vm-start" command and have the VM creation fail. This should be due to a policy constraint such as insufficient resources or unsupported CPU types. Then see how the score is affected.
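
With the fix, a failed VM create should surface as a non-zero exit status of the helper, which is what the agent's existing returncode check needs. A quick way to check it (sketch, assuming hosted-engine is on PATH):

~~~
import subprocess

# After ovirt-hosted-engine-setup-2.3.8, a failed VM create is expected to
# make `hosted-engine --vm-start` exit non-zero.
rc = subprocess.call(['hosted-engine', '--vm-start'])
print('vm-start exit status: %d' % rc)  # non-zero on failure
~~~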

Comment 14 Nikolai Sednev 2019-05-27 13:18:46 UTC
Works for me on these components:
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch
Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

I overloaded the CPU on the target ha-host and powered off the engine's VM on the source ha-host. The engine's VM attempted to start on the overloaded ha-host and, due to the overloaded CPU, failed to start. While it was trying to start on the overloaded ha-host, I checked that the score was lowered as designed:
[root@alma03 ~]# hosted-engine --vm-status


--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : b590cf18
local_conf_timestamp               : 95331
Host timestamp                     : 95331
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=95331 (Mon May 27 15:03:21 2019)
        host-id=1
        score=0
        vm_conf_refresh_time=95331 (Mon May 27 15:03:21 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Fri Jan  2 04:34:30 1970


--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 412bc2c9
local_conf_timestamp               : 369386
Host timestamp                     : 369385
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=369385 (Mon May 27 15:03:18 2019)
        host-id=2
        score=3400
        vm_conf_refresh_time=369386 (Mon May 27 15:03:18 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStarting
        stopped=False
[root@alma03 ~]# hosted-engine --vm-status


--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : c49f4d8f
local_conf_timestamp               : 95421
Host timestamp                     : 95421
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=95421 (Mon May 27 15:04:51 2019)
        host-id=1
        score=0
        vm_conf_refresh_time=95421 (Mon May 27 15:04:51 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Fri Jan  2 04:34:30 1970


--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3357
stopped                            : False
Local maintenance                  : False
crc32                              : 7051f835
local_conf_timestamp               : 369467
Host timestamp                     : 369465
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=369465 (Mon May 27 15:04:38 2019)
        host-id=2
        score=3357
        vm_conf_refresh_time=369467 (Mon May 27 15:04:40 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False
[root@alma03 ~]# hosted-engine --vm-status


--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 354d0425
local_conf_timestamp               : 95431
Host timestamp                     : 95431
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=95431 (Mon May 27 15:05:01 2019)
        host-id=1
        score=0
        vm_conf_refresh_time=95431 (Mon May 27 15:05:02 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Fri Jan  2 04:34:30 1970


--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3338
stopped                            : False
Local maintenance                  : False
crc32                              : 4f9e67df
local_conf_timestamp               : 369479
Host timestamp                     : 369475
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=369475 (Mon May 27 15:04:48 2019)
        host-id=2
        score=3338
        vm_conf_refresh_time=369479 (Mon May 27 15:04:52 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False
[root@alma03 ~]# hosted-engine --vm-status


--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 8de1f2b0
local_conf_timestamp               : 95441
Host timestamp                     : 95441
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=95441 (Mon May 27 15:05:12 2019)
        host-id=1
        score=0
        vm_conf_refresh_time=95441 (Mon May 27 15:05:12 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Fri Jan  2 04:34:30 1970


--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3317
stopped                            : False
Local maintenance                  : False
crc32                              : 090da97f
local_conf_timestamp               : 369492
Host timestamp                     : 369486
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=369486 (Mon May 27 15:04:59 2019)
        host-id=2
        score=3317
        vm_conf_refresh_time=369492 (Mon May 27 15:05:05 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False
[root@alma03 ~]# hosted-engine --vm-status


--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : e53aeaa0
local_conf_timestamp               : 95452
Host timestamp                     : 95451
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=95451 (Mon May 27 15:05:22 2019)
        host-id=1
        score=0
        vm_conf_refresh_time=95452 (Mon May 27 15:05:22 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Fri Jan  2 04:34:30 1970


--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3296
stopped                            : False
Local maintenance                  : False
crc32                              : 8e127803
local_conf_timestamp               : 369502
Host timestamp                     : 369497
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=369497 (Mon May 27 15:05:10 2019)
        host-id=2
        score=3296
        vm_conf_refresh_time=369502 (Mon May 27 15:05:15 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


In the end, the overloaded host alma04 got down to a score of 2400 and the engine got started on alma03 with a score of 3400.

Moving to verified.

Comment 19 errata-xmlrpc 2019-06-20 14:48:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1563

Comment 20 Daniel Gur 2019-08-28 13:13:21 UTC
sync2jira

Comment 21 Daniel Gur 2019-08-28 13:17:34 UTC
sync2jira

