Bug 1627957 - engine-retry-score-penalty could be better used.
Summary: engine-retry-score-penalty could be better used.
Keywords:
Status: CLOSED DUPLICATE of bug 1627958
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.2.6
Hardware: x86_64
OS: Linux
unspecified
low
Target Milestone: ---
: ---
Assignee: Martin Sivák
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-09-12 00:37 UTC by Germano Veit Michel
Modified: 2020-08-03 15:30 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-12 00:49:40 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments (Terms of Use)

Description Germano Veit Michel 2018-09-12 00:37:29 UTC
Description of problem:

In the ha-agent, there is implemented logic that subtracts the host score by engine-retry-score-penalty (50 points) in case the Hosted-Engine fails to start on the host.

This depends on engine_vm_retry_count increasing. To increase it START_VM() must return False.

        if fsm.actions.START_VM():                                                                                                                                                             
            new_data = new_data._replace(                                                                                                                                                      
                engine_vm_retry_time=dtime(new_data),                                                                                                                                          
                engine_vm_retry_count=0)                                                                                                                                                       
            return EngineStarting(new_data)                                                                                                                                                    
        else:                                                                                                                                                                                  
            new_data = new_data._replace(                                                                                                                                                      
                engine_vm_retry_time=dtime(new_data),                                                                                                                                          
                engine_vm_retry_count=retry_count + 1)                                                                                                                                         
            return EngineDown(new_data)

So it means _start_engine_vm() needs to return False. But it returns false on very limited circumstances.

~~~
    def _start_engine_vm(self):                                                                                                                                                                
        try:                                                                                                                                                                                   
            self._config.refresh_vm_conf()                                                                                                                                                     
                                                                                                                                                                                               
            # Ensure there isn't any stale VDSM state from a prior VM lifecycle                                                                                                                
            self._clean_vdsm_state()                                                                                                                                                           
                                                                                                                                                                                               
            self._log.info("Starting vm using `%s --vm-start`",                                                                                                                                
                           constants.HOSTED_ENGINE_BINARY)                                                                                                                                     
            p = subprocess.Popen([constants.HOSTED_ENGINE_BINARY,                                                                                                                              
                                  '--vm-start'],                                                                                                                                               
                                 stdout=subprocess.PIPE,                                                                                                                                       
                                 stderr=subprocess.PIPE)                                                                                                                                       
            output = p.communicate()                                                                                                                                                           
            self._log.info("stdout: %s", output[0])                                                                                                                                            
            self._log.info("stderr: %s", output[1])                                                                                                                                            
            if p.returncode != 0:                                                                                                                                                              
                # FIXME consider removing, we can get vm status from sanlock,                                                                                                                  
                # if still an issue then the alternative tracking the time we                                                                                                                  
                # started the engine might be better than parsing this output                                                                                                                  
                if output[0].startswith("Virtual machine already exists"):                                                                                                                     
                    self._log.warning("Failed to start engine VM,"                                                                                                                             
                                      " already running according to VDSM")                                                                                                                    
                    return True                                                                                                                                                                
                                                                                                                                                                                               
                raise Exception(output[1])                                                                                                                                                     
                                                                                                                                                                                               
            self._log.info("Engine VM started on localhost")                                                                                                                                   
            return True                              
2018-09-12 10:25:58,685+1000 INFO  (vm/6a0f1af5) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=1 err=Traceback (most recent call last):
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 88, in <module>
    HostedEngineHook().main()
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 84, in main
    raise ValueError('Ohh no, Hosted-Engine start failed.')
ValueError: Ohh no, Hosted-Engine start failed.
 (hooks:110)

The host is not penalized by 50 points.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
                                                                                                                                          
        except Exception as e:                                                                                                                                                                 
            self._log.info("Failed to start engine VM: '%s'. Please check the"                                                                                                                 
                           " vdsm logs. The possible reason: the engine has"                                                                                                                   
                           " been already started on a different host so this"                                                                                                                 
                           " one has failed to acquire the lock and it will"                                                                                                                   
                           " sync in a while."                                                                                                                                                 
                           " For more information please visit: "                                                                                                                              
                           "http://www.ovirt.org/Hosted_Engine_Howto"                                                                                                                          
                           "#EngineUnexpectedlyDown", str(e))                                                                                                                                  
            return False 
~~~

So unless we get an Exception during the above function, it pretty much always returns True. And on most cases, the HE start fails without throwing any exceptions there.

For example, the most common cases would be the Hosted-Engine failing to start due to some bad configuration on the host (libvirt/qemu/vdsm). All these errors are not catched and the score is not lowered. hosted-engine --vm-start subprocess runs without raising any exception even if the command fails. And looking at the code the vdsm_helper returns 0 on most cases too!). refresh_vm_conf() and _clean_vdsm_state() are also mostly clean and unlikely to fail. So on most failures the host score is not lowered.

Basically the logic is there and makes a lot of sense, but it is not covering most use cases.

For example, if VDSM fails to start the VM, this still returns 0 and the host is not penalized.
It can be confirmed by editing the 50_hostedengine hook and raising an exception to fail the HE start.

2018-09-12 10:25:58,685+1000 INFO  (vm/6a0f1af5) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=1 err=Traceback (most recent call last):
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 88, in <module>
    HostedEngineHook().main()
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 84, in main
    raise ValueError('Ohh no, Hosted-Engine start failed.')
ValueError: Ohh no, Hosted-Engine start failed.
 (hooks:110)

The host is not penalized by 50 points.

I think this feature could be better used by catching more return conditions. Or for example, check if the VM started by talking to vdsm?

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
Just make HostedEngine start fail within VDSM/libvirt/qemu.

Actual results:
Host is not penalized

Expected results:
Host penalized.

Comment 1 Germano Veit Michel 2018-09-12 00:46:35 UTC
It seems I made some weird copy paste above, please apologize for it. I think its readable anyway.

Comment 2 Germano Veit Michel 2018-09-12 00:49:40 UTC
Cleaned up that copy paste error on a new bug, closing this as dup. 
Sorry

*** This bug has been marked as a duplicate of bug 1627958 ***


Note You need to log in before you can comment on or make changes to this bug.