Description of problem: In the ha-agent, there is implemented logic that subtracts the host score by engine-retry-score-penalty (50 points) in case the Hosted-Engine fails to start on the host. This depends on engine_vm_retry_count increasing. To increase it START_VM() must return False. if fsm.actions.START_VM(): new_data = new_data._replace( engine_vm_retry_time=dtime(new_data), engine_vm_retry_count=0) return EngineStarting(new_data) else: new_data = new_data._replace( engine_vm_retry_time=dtime(new_data), engine_vm_retry_count=retry_count + 1) return EngineDown(new_data) So it means _start_engine_vm() needs to return False. But it returns false on very limited circumstances. ~~~ def _start_engine_vm(self): try: self._config.refresh_vm_conf() # Ensure there isn't any stale VDSM state from a prior VM lifecycle self._clean_vdsm_state() self._log.info("Starting vm using `%s --vm-start`", constants.HOSTED_ENGINE_BINARY) p = subprocess.Popen([constants.HOSTED_ENGINE_BINARY, '--vm-start'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) output = p.communicate() self._log.info("stdout: %s", output[0]) self._log.info("stderr: %s", output[1]) if p.returncode != 0: # FIXME consider removing, we can get vm status from sanlock, # if still an issue then the alternative tracking the time we # started the engine might be better than parsing this output if output[0].startswith("Virtual machine already exists"): self._log.warning("Failed to start engine VM," " already running according to VDSM") return True raise Exception(output[1]) self._log.info("Engine VM started on localhost") return True 2018-09-12 10:25:58,685+1000 INFO (vm/6a0f1af5) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=1 err=Traceback (most recent call last): File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 88, in <module> HostedEngineHook().main() File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 84, in main raise ValueError('Ohh no, Hosted-Engine start failed.') ValueError: Ohh no, Hosted-Engine start failed. (hooks:110) The host is not penalized by 50 points. Version-Release number of selected component (if applicable): ovirt-hosted-engine-ha-2.2.16-1.el7.noarch except Exception as e: self._log.info("Failed to start engine VM: '%s'. Please check the" " vdsm logs. The possible reason: the engine has" " been already started on a different host so this" " one has failed to acquire the lock and it will" " sync in a while." " For more information please visit: " "http://www.ovirt.org/Hosted_Engine_Howto" "#EngineUnexpectedlyDown", str(e)) return False ~~~ So unless we get an Exception during the above function, it pretty much always returns True. And on most cases, the HE start fails without throwing any exceptions there. For example, the most common cases would be the Hosted-Engine failing to start due to some bad configuration on the host (libvirt/qemu/vdsm). All these errors are not catched and the score is not lowered. hosted-engine --vm-start subprocess runs without raising any exception even if the command fails. And looking at the code the vdsm_helper returns 0 on most cases too!). refresh_vm_conf() and _clean_vdsm_state() are also mostly clean and unlikely to fail. So on most failures the host score is not lowered. Basically the logic is there and makes a lot of sense, but it is not covering most use cases. For example, if VDSM fails to start the VM, this still returns 0 and the host is not penalized. It can be confirmed by editing the 50_hostedengine hook and raising an exception to fail the HE start. 2018-09-12 10:25:58,685+1000 INFO (vm/6a0f1af5) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=1 err=Traceback (most recent call last): File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 88, in <module> HostedEngineHook().main() File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 84, in main raise ValueError('Ohh no, Hosted-Engine start failed.') ValueError: Ohh no, Hosted-Engine start failed. (hooks:110) The host is not penalized by 50 points. I think this feature could be better used by catching more return conditions. Or for example, check if the VM started by talking to vdsm? Version-Release number of selected component (if applicable): ovirt-hosted-engine-ha-2.2.16-1.el7.noarch How reproducible: 100% Steps to Reproduce: Just make HostedEngine start fail within VDSM/libvirt/qemu. Actual results: Host is not penalized Expected results: Host penalized.
It seems I made some weird copy paste above, please apologize for it. I think its readable anyway.
Cleaned up that copy paste error on a new bug, closing this as dup. Sorry *** This bug has been marked as a duplicate of bug 1627958 ***