Bug 1627958
Summary: | engine-retry-score-penalty could be better used | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
Component: | ovirt-hosted-engine-setup | Assignee: | Steven Rosenberg <srosenbe> |
Status: | CLOSED ERRATA | QA Contact: | Nikolai Sednev <nsednev> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.2.6 | CC: | bcholler, dfodor, emarcus, fkust, gveitmic, lsurette, mavital, michal.skrivanek, mkalinin, mtessun, myamazak, nsednev, rbarry, srosenbe, stirabos |
Target Milestone: | ovirt-4.3.4 | Keywords: | Triaged, ZStream |
Target Release: | 4.3.1 | Flags: | lsvaty:
testing_plan_complete-
|
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | ovirt-hosted-engine-setup-2.3.8 | Doc Type: | Bug Fix |
Doc Text: |
The self-hosted engine was not detecting exceptions or failures when the ‘--vm-start’ command was initiated. As a result, the retry count was not incremented and therefore the score of the host was not reduced for rescheduling.
In this release, the self-hosted engine setup fails when the create VM functionality fails. As a result, the self-hosted engine High Availability agent can adjust its retry count and reduce the score of the host, giving other hosts a higher priority.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-06-20 14:48:16 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1703275, 1709969 |
Description
Germano Veit Michel
2018-09-12 00:48:51 UTC
*** Bug 1627957 has been marked as a duplicate of this bug. *** Should this be more of an RFE? Or get at least medium priority? What is the actual impact on the customer? Let's see we have 3 hosts, and one of 3 hosts has a problem, that means, that next time HE VM will be started on other 2 hosts, otherwise it just continues starting on this host and fail? (In reply to Marina from comment #2) > Should this be more of an RFE? > Or get at least medium priority? > What is the actual impact on the customer? Let's see we have 3 hosts, and > one of 3 hosts has a problem, that means, that next time HE VM will be > started on other 2 hosts, otherwise it just continues starting on this host > and fail? It can affect some specific scenarios. i.e.: 3 hosts: - Host A score 3400 but fails to start the VM for some reason - Host B and C score < 3400 A may not lower its score due to this BZ until zeroing it after retry_count, so B and C wont attempt to start the HE VM right away. The way I see it HA will take more time to recover from a failure, but I shouldn't cause any outages, will just take more time to bring the HE up. Software Version:4.3.4.1-0.1.el7 I powered-off the engine via CLI, also tried to stop libvirt multiple times and did not seen any penalties to score, except for one of the host's score was set to zero, once I powered-off the engine on it. Tested on these components: Software Version:4.3.4.1-0.1.el7 ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.6 (Maipo) Please provide more reproduction steps. (In reply to Nikolai Sednev from comment #12) > Software Version:4.3.4.1-0.1.el7 > I powered-off the engine via CLI, also tried to stop libvirt multiple times > and did not seen any penalties to score, except for one of the host's score > was set to zero, once I powered-off the engine on it. > Tested on these components: > Software Version:4.3.4.1-0.1.el7 > ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch > ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch > Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 > x86_64 x86_64 GNU/Linux > Red Hat Enterprise Linux Server release 7.6 (Maipo) > > Please provide more reproduction steps. Hi Nikolai, You need to test it with the: "hosted-engine --vm-start" command and have the VM create fail. This should be due to a policy constraint such as lacking resources or CPU Types not supported. Then see if how the score is affected. Works for me on these components: ovirt-hosted-engine-ha-2.3.1-1.el7ev.noarch ovirt-hosted-engine-setup-2.3.8-1.el7ev.noarch Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Red Hat Enterprise Linux Server release 7.6 (Maipo) I overloaded CPU on target ha-host and powered-off engine's VM on source ha-host. engine's VM got started on overloaded ha-host and due to overloaded CPU failed to get started, during trying to get started on overloaded ha-host, I checked that score was lowered as designed: alma03 ~]# hosted-engine --vm-status --== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : b590cf18 local_conf_timestamp : 95331 Host timestamp : 95331 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=95331 (Mon May 27 15:03:21 2019) host-id=1 score=0 vm_conf_refresh_time=95331 (Mon May 27 15:03:21 2019) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Fri Jan 2 04:34:30 1970 --== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04.qa.lab.tlv.redhat.com Host ID : 2 Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"} Score : 3400 stopped : False Local maintenance : False crc32 : 412bc2c9 local_conf_timestamp : 369386 Host timestamp : 369385 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=369385 (Mon May 27 15:03:18 2019) host-id=2 score=3400 vm_conf_refresh_time=369386 (Mon May 27 15:03:18 2019) conf_on_shared_storage=True maintenance=False state=EngineStarting stopped=False [root@alma03 ~]# hosted-engine --vm-status --== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : c49f4d8f local_conf_timestamp : 95421 Host timestamp : 95421 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=95421 (Mon May 27 15:04:51 2019) host-id=1 score=0 vm_conf_refresh_time=95421 (Mon May 27 15:04:51 2019) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Fri Jan 2 04:34:30 1970 --== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04.qa.lab.tlv.redhat.com Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "Up"} Score : 3357 stopped : False Local maintenance : False crc32 : 7051f835 local_conf_timestamp : 369467 Host timestamp : 369465 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=369465 (Mon May 27 15:04:38 2019) host-id=2 score=3357 vm_conf_refresh_time=369467 (Mon May 27 15:04:40 2019) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False [root@alma03 ~]# hosted-engine --vm-status --== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : 354d0425 local_conf_timestamp : 95431 Host timestamp : 95431 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=95431 (Mon May 27 15:05:01 2019) host-id=1 score=0 vm_conf_refresh_time=95431 (Mon May 27 15:05:02 2019) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Fri Jan 2 04:34:30 1970 --== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04.qa.lab.tlv.redhat.com Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "Up"} Score : 3338 stopped : False Local maintenance : False crc32 : 4f9e67df local_conf_timestamp : 369479 Host timestamp : 369475 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=369475 (Mon May 27 15:04:48 2019) host-id=2 score=3338 vm_conf_refresh_time=369479 (Mon May 27 15:04:52 2019) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False [root@alma03 ~]# hosted-engine --vm-status --== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : 354d0425 local_conf_timestamp : 95431 Host timestamp : 95431 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=95431 (Mon May 27 15:05:01 2019) host-id=1 score=0 vm_conf_refresh_time=95431 (Mon May 27 15:05:02 2019) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Fri Jan 2 04:34:30 1970 --== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04.qa.lab.tlv.redhat.com Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "Up"} Score : 3338 stopped : False Local maintenance : False crc32 : 4f9e67df local_conf_timestamp : 369479 Host timestamp : 369475 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=369475 (Mon May 27 15:04:48 2019) host-id=2 score=3338 vm_conf_refresh_time=369479 (Mon May 27 15:04:52 2019) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False [root@alma03 ~]# hosted-engine --vm-status --== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : 8de1f2b0 local_conf_timestamp : 95441 Host timestamp : 95441 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=95441 (Mon May 27 15:05:12 2019) host-id=1 score=0 vm_conf_refresh_time=95441 (Mon May 27 15:05:12 2019) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Fri Jan 2 04:34:30 1970 --== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04.qa.lab.tlv.redhat.com Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "Up"} Score : 3317 stopped : False Local maintenance : False crc32 : 090da97f local_conf_timestamp : 369492 Host timestamp : 369486 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=369486 (Mon May 27 15:04:59 2019) host-id=2 score=3317 vm_conf_refresh_time=369492 (Mon May 27 15:05:05 2019) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False [root@alma03 ~]# hosted-engine --vm-status --== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : 8de1f2b0 local_conf_timestamp : 95441 Host timestamp : 95441 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=95441 (Mon May 27 15:05:12 2019) host-id=1 score=0 vm_conf_refresh_time=95441 (Mon May 27 15:05:12 2019) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Fri Jan 2 04:34:30 1970 --== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04.qa.lab.tlv.redhat.com Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "Up"} Score : 3317 stopped : False Local maintenance : False crc32 : 090da97f local_conf_timestamp : 369492 Host timestamp : 369486 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=369486 (Mon May 27 15:04:59 2019) host-id=2 score=3317 vm_conf_refresh_time=369492 (Mon May 27 15:05:05 2019) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False [root@alma03 ~]# hosted-engine --vm-status --== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : 8de1f2b0 local_conf_timestamp : 95441 Host timestamp : 95441 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=95441 (Mon May 27 15:05:12 2019) host-id=1 score=0 vm_conf_refresh_time=95441 (Mon May 27 15:05:12 2019) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Fri Jan 2 04:34:30 1970 --== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04.qa.lab.tlv.redhat.com Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "Up"} Score : 3317 stopped : False Local maintenance : False crc32 : 090da97f local_conf_timestamp : 369492 Host timestamp : 369486 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=369486 (Mon May 27 15:04:59 2019) host-id=2 score=3317 vm_conf_refresh_time=369492 (Mon May 27 15:05:05 2019) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False [root@alma03 ~]# hosted-engine --vm-status --== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : e53aeaa0 local_conf_timestamp : 95452 Host timestamp : 95451 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=95451 (Mon May 27 15:05:22 2019) host-id=1 score=0 vm_conf_refresh_time=95452 (Mon May 27 15:05:22 2019) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Fri Jan 2 04:34:30 1970 --== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04.qa.lab.tlv.redhat.com Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "Up"} Score : 3296 stopped : False Local maintenance : False crc32 : 8e127803 local_conf_timestamp : 369502 Host timestamp : 369497 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=369497 (Mon May 27 15:05:10 2019) host-id=2 score=3296 vm_conf_refresh_time=369502 (Mon May 27 15:05:15 2019) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False [root@alma03 ~]# hosted-engine --vm-status --== Host alma03.qa.lab.tlv.redhat.com (id: 1) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma03.qa.lab.tlv.redhat.com Host ID : 1 Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_missing", "detail": "unknown"} Score : 0 stopped : False Local maintenance : False crc32 : e53aeaa0 local_conf_timestamp : 95452 Host timestamp : 95451 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=95451 (Mon May 27 15:05:22 2019) host-id=1 score=0 vm_conf_refresh_time=95452 (Mon May 27 15:05:22 2019) conf_on_shared_storage=True maintenance=False state=EngineUnexpectedlyDown stopped=False timeout=Fri Jan 2 04:34:30 1970 --== Host alma04.qa.lab.tlv.redhat.com (id: 2) status ==-- conf_on_shared_storage : True Status up-to-date : True Hostname : alma04.qa.lab.tlv.redhat.com Host ID : 2 Engine status : {"health": "good", "vm": "up", "detail": "Up"} Score : 3296 stopped : False Local maintenance : False crc32 : 8e127803 local_conf_timestamp : 369502 Host timestamp : 369497 Extra metadata (valid at timestamp): metadata_parse_version=1 metadata_feature_version=1 timestamp=369497 (Mon May 27 15:05:10 2019) host-id=2 score=3296 vm_conf_refresh_time=369502 (Mon May 27 15:05:15 2019) conf_on_shared_storage=True maintenance=False state=EngineUp stopped=False In the end overloaded host alma04 got to score 2400 and engine got started on alma03 with score 3400. Moving to verified. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:1563 sync2jira sync2jira |