Description of problem:

The ha-agent implements logic that subtracts engine-retry-score-penalty (50 points) from the host score when the Hosted-Engine VM fails to start on that host. This depends on engine_vm_retry_count increasing, and for that to happen, START_VM() must return False:

~~~
if fsm.actions.START_VM():
    new_data = new_data._replace(
        engine_vm_retry_time=dtime(new_data),
        engine_vm_retry_count=0)
    return EngineStarting(new_data)
else:
    new_data = new_data._replace(
        engine_vm_retry_time=dtime(new_data),
        engine_vm_retry_count=retry_count + 1)
    return EngineDown(new_data)
~~~

So _start_engine_vm() needs to return False. But it only returns False under very limited circumstances:

~~~
def _start_engine_vm(self):
    try:
        self._config.refresh_vm_conf()

        # Ensure there isn't any stale VDSM state from a prior VM lifecycle
        self._clean_vdsm_state()

        self._log.info("Starting vm using `%s --vm-start`",
                       constants.HOSTED_ENGINE_BINARY)
        p = subprocess.Popen([constants.HOSTED_ENGINE_BINARY, '--vm-start'],
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
        output = p.communicate()
        self._log.info("stdout: %s", output[0])
        self._log.info("stderr: %s", output[1])
        if p.returncode != 0:
            # FIXME consider removing, we can get vm status from sanlock,
            # if still an issue then the alternative tracking the time we
            # started the engine might be better than parsing this output
            if output[0].startswith("Virtual machine already exists"):
                self._log.warning("Failed to start engine VM,"
                                  " already running according to VDSM")
                return True
            raise Exception(output[1])

        self._log.info("Engine VM started on localhost")
        return True
    except Exception as e:
        self._log.info("Failed to start engine VM: '%s'. Please check the"
                       " vdsm logs. The possible reason: the engine has"
                       " been already started on a different host so this"
                       " one has failed to acquire the lock and it will"
                       " sync in a while."
                       " For more information please visit: "
                       "http://www.ovirt.org/Hosted_Engine_Howto"
                       "#EngineUnexpectedlyDown",
                       str(e))
        return False
~~~

So unless we get an exception inside this function, it pretty much always returns True, and in most cases the HE start fails without any exception being thrown there. For example, the most common cases are the Hosted-Engine failing to start due to some bad configuration on the host (libvirt/qemu/vdsm). None of these errors are caught, so the score is not lowered. The `hosted-engine --vm-start` subprocess runs without raising any exception even when the command fails, and looking at the code, the vdsm_helper returns 0 in most cases too. refresh_vm_conf() and _clean_vdsm_state() are also mostly clean and unlikely to fail. So on most failures the host score is not lowered.

Basically the logic is there and makes a lot of sense, but it does not cover most use cases. For example, if VDSM fails to start the VM, `hosted-engine --vm-start` still returns 0 and the host is not penalized. This can be confirmed by editing the 50_hostedengine hook and raising an exception to fail the HE start:

~~~
2018-09-12 10:25:58,685+1000 INFO  (vm/6a0f1af5) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=1 err=Traceback (most recent call last):
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 88, in <module>
    HostedEngineHook().main()
  File "/usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine", line 84, in main
    raise ValueError('Ohh no, Hosted-Engine start failed.')
ValueError: Ohh no, Hosted-Engine start failed. (hooks:110)
~~~

Even then, the host is not penalized by 50 points.

I think this feature could be better used by catching more failure conditions, or, for example, by checking whether the VM actually started by talking to vdsm.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
Just make the HostedEngine start fail within VDSM/libvirt/qemu.
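The description's suggestion (confirm the start by asking vdsm rather than trusting the exit code) can be sketched as follows. This is a minimal illustration only, not the agent's actual code: `run_cmd` and `query_vm_state` are hypothetical callables standing in for the `hosted-engine --vm-start` subprocess and a VDSM status query respectively.

```python
def start_engine_vm_strict(run_cmd, query_vm_state):
    """Hypothetical stricter start logic: treat the start as failed
    unless VDSM reports the VM as actually running."""
    rc, out, err = run_cmd()
    if rc != 0:
        return False
    # rc == 0 does not guarantee the VM came up (see the bug description:
    # vdsm_helper returns 0 in most failure cases), so confirm the state
    # with a status query instead of trusting the exit code alone.
    return query_vm_state() in ("WaitForLaunch", "Powering up", "Up")

# Simulated VDSM responses, for illustration only:
ok = start_engine_vm_strict(lambda: (0, "", ""), lambda: "Up")
bad = start_engine_vm_strict(lambda: (0, "", ""), lambda: "Down")
```

With this shape, the `bad` case would return False and therefore increment engine_vm_retry_count, which is exactly the path the current code never takes on an rc=0 failure.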
Actual results:
Host is not penalized.

Expected results:
Host is penalized.
*** Bug 1627957 has been marked as a duplicate of this bug. ***
Should this be more of an RFE? Or get at least medium priority? What is the actual impact on the customer? Say we have 3 hosts and one of them has a problem: does that mean the HE VM will be started on the other 2 hosts next time, or does it just keep trying to start on the problematic host and failing?
(In reply to Marina from comment #2)
> Should this be more of an RFE?
> Or get at least medium priority?
> What is the actual impact on the customer? Let's see we have 3 hosts, and
> one of 3 hosts has a problem, that means, that next time HE VM will be
> started on other 2 hosts, otherwise it just continues starting on this host
> and fail?

It can affect some specific scenarios, e.g. with 3 hosts:

- Host A has score 3400 but fails to start the VM for some reason
- Hosts B and C have scores < 3400

Due to this BZ, Host A may not lower its score until it is zeroed after retry_count is exceeded, so B and C won't attempt to start the HE VM right away. The way I see it, HA will take more time to recover from a failure, but it shouldn't cause any outages; it will just take longer to bring the HE up.
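The impact on this scenario can be quantified with simple arithmetic over the 50-point engine-retry-score-penalty from the description. The base score of 3400 and the other hosts' score of 3300 are illustrative values only:

```python
BASE_SCORE = 3400
RETRY_PENALTY = 50  # engine-retry-score-penalty, per the bug description

def effective_score(retry_count):
    # Host A's score after repeated failed start attempts, ignoring
    # any other penalties; scores never go below zero.
    return max(BASE_SCORE - RETRY_PENALTY * retry_count, 0)

# If hosts B and C sit at, say, 3300, host A must accumulate enough
# failed retries for its score to drop below theirs before either
# of them becomes the preferred host for the HE VM:
other_hosts_score = 3300
retries_needed = next(n for n in range(100)
                      if effective_score(n) < other_hosts_score)
# retries_needed == 3, since 3400 - 3 * 50 = 3250 < 3300
```

Without the penalty being applied (this BZ), host A keeps the top score for the whole retry window, which is why failover is delayed rather than prevented.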
I powered off the engine via CLI, and also tried to stop libvirt multiple times, and did not see any penalties to the score, except that one host's score was set to zero once I powered off the engine on it.

Tested on these components:
Software Version: 4.3.4.1-0.1.el7
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch
Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Please provide more reproduction steps.
(In reply to Nikolai Sednev from comment #12)
> Software Version:4.3.4.1-0.1.el7
> I powered-off the engine via CLI, also tried to stop libvirt multiple times
> and did not seen any penalties to score, except for one of the host's score
> was set to zero, once I powered-off the engine on it.
> Tested on these components:
> Software Version:4.3.4.1-0.1.el7
> ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
> ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch
> Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019
> x86_64 x86_64 x86_64 GNU/Linux
> Red Hat Enterprise Linux Server release 7.6 (Maipo)
>
> Please provide more reproduction steps.

Hi Nikolai,

You need to test this with the "hosted-engine --vm-start" command and have the VM creation fail. The failure should come from a policy constraint, such as insufficient resources or an unsupported CPU type. Then see how the score is affected.
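Alternatively, the original description forces the failure by making the before_vm_start hook raise. A minimal sketch of that approach (the real hook at /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine wraps its logic in a HostedEngineHook class; this stand-alone `main` is only an illustration):

```python
def main():
    # ... the original hook logic would run here ...
    # Injected failure, as in the traceback quoted in the description:
    raise ValueError('Ohh no, Hosted-Engine start failed.')

# VDSM treats an exception from a before_vm_start hook as a failed
# VM start; the question in this bug is whether the ha-agent then
# lowers the host score.
try:
    main()
    failure = None
except ValueError as e:
    failure = str(e)
```

Either reproduction route (policy constraint or failing hook) should exercise the same code path: `hosted-engine --vm-start` returning 0 despite the VM never coming up.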
Works for me on these components:
ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch
Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

I overloaded the CPU on the target ha-host and powered off the engine's VM on the source ha-host. The engine's VM then attempted to start on the overloaded ha-host and, due to the overloaded CPU, failed to start. While it was trying to start on the overloaded ha-host, I checked that the score was lowered as designed:

~~~
alma03 ~]# hosted-engine --vm-status

--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : b590cf18
local_conf_timestamp               : 95331
Host timestamp                     : 95331
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=95331 (Mon May 27 15:03:21 2019)
	host-id=1
	score=0
	vm_conf_refresh_time=95331 (Mon May 27 15:03:21 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUnexpectedlyDown
	stopped=False
	timeout=Fri Jan  2 04:34:30 1970

--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 412bc2c9
local_conf_timestamp               : 369386
Host timestamp                     : 369385
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=369385 (Mon May 27 15:03:18 2019)
	host-id=2
	score=3400
	vm_conf_refresh_time=369386 (Mon May 27 15:03:18 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineStarting
	stopped=False
~~~
~~~
[root@alma03 ~]# hosted-engine --vm-status

--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : c49f4d8f
local_conf_timestamp               : 95421
Host timestamp                     : 95421
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=95421 (Mon May 27 15:04:51 2019)
	host-id=1
	score=0
	vm_conf_refresh_time=95421 (Mon May 27 15:04:51 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUnexpectedlyDown
	stopped=False
	timeout=Fri Jan  2 04:34:30 1970

--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3357
stopped                            : False
Local maintenance                  : False
crc32                              : 7051f835
local_conf_timestamp               : 369467
Host timestamp                     : 369465
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=369465 (Mon May 27 15:04:38 2019)
	host-id=2
	score=3357
	vm_conf_refresh_time=369467 (Mon May 27 15:04:40 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False

[root@alma03 ~]# hosted-engine --vm-status

--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 354d0425
local_conf_timestamp               : 95431
Host timestamp                     : 95431
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=95431 (Mon May 27 15:05:01 2019)
	host-id=1
	score=0
	vm_conf_refresh_time=95431 (Mon May 27 15:05:02 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUnexpectedlyDown
	stopped=False
	timeout=Fri Jan  2 04:34:30 1970

--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3338
stopped                            : False
Local maintenance                  : False
crc32                              : 4f9e67df
local_conf_timestamp               : 369479
Host timestamp                     : 369475
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=369475 (Mon May 27 15:04:48 2019)
	host-id=2
	score=3338
	vm_conf_refresh_time=369479 (Mon May 27 15:04:52 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False

[root@alma03 ~]# hosted-engine --vm-status

--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 8de1f2b0
local_conf_timestamp               : 95441
Host timestamp                     : 95441
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=95441 (Mon May 27 15:05:12 2019)
	host-id=1
	score=0
	vm_conf_refresh_time=95441 (Mon May 27 15:05:12 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUnexpectedlyDown
	stopped=False
	timeout=Fri Jan  2 04:34:30 1970

--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3317
stopped                            : False
Local maintenance                  : False
crc32                              : 090da97f
local_conf_timestamp               : 369492
Host timestamp                     : 369486
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=369486 (Mon May 27 15:04:59 2019)
	host-id=2
	score=3317
	vm_conf_refresh_time=369492 (Mon May 27 15:05:05 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False

[root@alma03 ~]# hosted-engine --vm-status

--== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : e53aeaa0
local_conf_timestamp               : 95452
Host timestamp                     : 95451
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=95451 (Mon May 27 15:05:22 2019)
	host-id=1
	score=0
	vm_conf_refresh_time=95452 (Mon May 27 15:05:22 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUnexpectedlyDown
	stopped=False
	timeout=Fri Jan  2 04:34:30 1970

--== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3296
stopped                            : False
Local maintenance                  : False
crc32                              : 8e127803
local_conf_timestamp               : 369502
Host timestamp                     : 369497
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=369497 (Mon May 27 15:05:10 2019)
	host-id=2
	score=3296
	vm_conf_refresh_time=369502 (Mon May 27 15:05:15 2019)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False
~~~

In the end the overloaded host alma04 got down to score 2400 and the engine got started on alma03 with score 3400. Moving to verified.
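For reference, the successive Score values reported for alma04 in the polls above can be summarized with simple arithmetic over the values quoted in the dumps (the list below copies those values; no other data is assumed):

```python
# Score values for alma04 taken from the consecutive --vm-status polls:
scores = [3400, 3357, 3338, 3317, 3296]

# Per-poll drop in score while the overloaded host kept failing:
drops = [prev - cur for prev, cur in zip(scores, scores[1:])]
# drops == [43, 19, 21, 21]
```

The score decreases on every poll, which is the behavior this bug asked for: a host that cannot start the engine VM steadily loses its score until another host outranks it.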
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1563