Bug 1147411
Summary:            can't start hosted engine VM in cluster with 3+ hosts

Product:            Red Hat Enterprise Virtualization Manager
Component:          ovirt-hosted-engine-ha
Status:             CLOSED ERRATA
Severity:           high
Priority:           high
Version:            3.5.0
Target Milestone:   ---
Target Release:     3.4.3
Hardware:           Unspecified
OS:                 Unspecified
Whiteboard:         sla
Fixed In Version:   ovirt-hosted-engine-ha-1.1.6-1.el6ev
Reporter:           rhev-integ
Assignee:           Jiri Moskovcak <jmoskovc>
QA Contact:         Nikolai Sednev <nsednev>
Docs Contact:
CC:                 dfediuck, ecohen, gklein, iheim, jmoskovc, juwu, lsurette, mavital, rbalakri, sbonazzo, yeylon
Keywords:           ZStream
Doc Type:           Bug Fix
Doc Text:
Cause:
The HA agent expected the engine virtual machine to be in the "up" state immediately after it was started, without giving it enough time to actually boot and start the engine.
Consequence:
The agent wrongly determined the state of the engine and penalized the host with a score of 0. Other hosts with a higher score then became better targets for running the engine virtual machine, so the VM was killed on the current host and started on a host with a better score, where the situation repeated.
Fix:
The logic was changed to take the "powering up" phase into consideration when checking the engine state (see the sketch after the metadata fields below): the host is not penalized while the engine is powering up, and the agent waits until it is fully started.
Result:
The engine is properly started and the host score is not penalized while the engine VM is powering up.
Story Points:       ---
Clone Of:           1130173
Environment:
Last Closed:        2014-10-27 22:47:09 UTC
Type:               ---
Regression:         ---
Mount Type:         ---
Documentation:      ---
CRM:
Verified Versions:
Category:           ---
oVirt Team:         SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:    ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:         1097767
Attachments:
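The Doc Text fix above can be illustrated with a short sketch. This is not the actual ovirt-hosted-engine-ha code: the names EngineStatus, POWERUP_GRACE_SECS, and evaluate_host_score are hypothetical, and the real agent derives the VM state from VDSM. The point is only the shape of the fixed logic: "powering up" is no longer treated as a failure.

import time
from enum import Enum

class EngineStatus(Enum):
    # Hypothetical states; the real agent reads the VM state from VDSM.
    UP = "up"
    POWERING_UP = "powering up"
    DOWN = "down"

BASE_SCORE = 2400          # the full score, as seen in the status output below
POWERUP_GRACE_SECS = 300   # hypothetical grace period for the engine to boot

def evaluate_host_score(status, vm_started_at):
    # Before the fix: anything other than UP was penalized immediately,
    # so a still-booting engine VM dropped the host score to 0 and the VM
    # was killed and restarted on a host with a better score, repeatedly.
    if status is EngineStatus.UP:
        return BASE_SCORE
    if (status is EngineStatus.POWERING_UP
            and time.time() - vm_started_at < POWERUP_GRACE_SECS):
        # After the fix: the engine VM is still booting, so the host
        # keeps its score and the agent waits for the engine to come up.
        return BASE_SCORE
    # Engine is down or failed to boot within the grace period.
    return 0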
Comment 2
Nikolai Sednev
2014-10-13 11:51:16 UTC
Created attachment 946350 [details]
answers.conf
Components:
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
ovirt-hosted-engine-setup-1.2.1-1.el6ev.noarch
libvirt-0.10.2-46.el6.x86_64
sanlock-2.8-1.el6.x86_64
vdsm-4.16.6-1.el6ev.x86_64
ovirt-hosted-engine-ha-1.2.2-2.el6ev.noarch

Created attachment 946351 [details]
vdsm and supervdsm logs
The above failure is due to a deployment issue and has nothing to do with this BZ. Moving to ON_QA.

Nikolai Sednev

After putting the HE VM into power-off via halt -p and then running the command hosted-engine --vm-start on the same host on which it ran before, the engine doesn't start on that particular host, but on a third host, which is seen as stale from the host on which the VM start was attempted:

--== Host 4 status ==--

Status up-to-date                  : False
Hostname                           : 10.35.117.26
Host ID                            : 4
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 1413953568
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1413953568 (Wed Oct 22 07:52:48 2014)
        host-id=4
        score=2400
        maintenance=False
        state=EngineUp

When logging in to the host on which the VM is running (the same one reported as unknown stale-data, 10.35.117.26), the VM is shown as running on it:

--== Host 4 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.117.26
Host ID                            : 4
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 1413953494
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1413953494 (Wed Oct 22 07:51:34 2014)
        host-id=4
        score=2400
        maintenance=False
        state=EngineUp

The issue is not that the HE VM doesn't start at all; it does start, but not on the requested host, and the third host is incorrectly shown as stale.

Checked using these components:
libvirt-0.10.2-46.el6.x86_64
ovirt-hosted-engine-ha-1.1.6-3.el6ev.noarch
ovirt-host-deploy-1.2.3-1.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
vdsm-4.14.17-1.el6ev.x86_64
ovirt-hosted-engine-setup-1.1.5-1.el6ev.noarch
sanlock-2.8-1.el6.x86_64
rhevm-3.4.3-1.2.el6ev.noarch

Comment 9
Jiri Moskovcak

It's expected behavior: when you killed the engine with halt -p, the host running the engine VM got score 0 because of that unexpected shutdown, so when you tried to start it on the same host, the agent detected that there were hosts with a better score and immediately restarted the engine VM on the host with the better score. And even if this were a problem, it's definitely not connected with this bug, so I don't understand why you marked it as FailedQA.

Comment 10
Nikolai Sednev

(In reply to Jiri Moskovcak from comment #9)
> It's expected behavior: when you killed the engine with halt -p, the host
> running the engine VM got score 0 because of that unexpected shutdown, so
> when you tried to start it on the same host, the agent detected that there
> were hosts with a better score and immediately restarted the engine VM on
> the host with the better score. And even if this were a problem, it's
> definitely not connected with this bug, so I don't understand why you
> marked it as FailedQA.

The reason I re-opened is that the host on which the VM was eventually powered up was seen by the two others as being in a stale state, although it was running the VM. Additionally, the VM was first started on one host, then brought down and then up again, instead of being started once. I'll verify this one and open two more bugs on this issue, as the root cause was fixed by you.

Jiri Moskovcak

(In reply to Nikolai Sednev from comment #10)
> (In reply to Jiri Moskovcak from comment #9)
> > It's expected behavior: when you killed the engine with halt -p, the host
> > running the engine VM got score 0 because of that unexpected shutdown, so
> > when you tried to start it on the same host, the agent detected that
> > there were hosts with a better score and immediately restarted the engine
> > VM on the host with the better score. And even if this were a problem,
> > it's definitely not connected with this bug, so I don't understand why
> > you marked it as FailedQA.
>
> The reason I re-opened is that the host on which the VM was eventually
> powered up was seen by the two others as being in a stale state, although
> it was running the VM. Additionally, the VM was first started on one host,
> then brought down and then up again, instead of being started once. I'll
> verify this one and open two more bugs on this issue, as the root cause
> was fixed by you.

Is this test run by some script? The stale data might just mean that the agents on the other hosts hadn't been running long enough; it takes time to synchronize.
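Jiri's synchronization point can be made concrete: each agent periodically publishes the metadata shown in the status blocks above (timestamp, score, state) to shared storage, and a peer whose timestamp has not been refreshed recently is reported as unknown stale-data. Below is a minimal illustrative sketch of that staleness check in Python; the helper names (parse_extra_metadata, host_report) and the STALE_SECS threshold are hypothetical, not the actual ovirt-hosted-engine-ha code.

import time

STALE_SECS = 60  # hypothetical threshold; the real agent uses its own timeout

def parse_extra_metadata(text):
    # Parse the key=value lines of an "Extra metadata" block into a dict.
    meta = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition("=")
        meta[key] = value.split()[0]  # drop the "(Wed Oct 22 ...)" suffix
    return meta

def host_report(meta, now=None):
    # Classify a peer host from the metadata it last published.
    now = time.time() if now is None else now
    if now - int(meta["timestamp"]) > STALE_SECS:
        return "unknown stale-data"  # no recent refresh from that agent
    return meta["state"]             # e.g. "EngineUp"

# Example using the metadata from the first status block above:
sample = """metadata_parse_version=1
metadata_feature_version=1
timestamp=1413953568 (Wed Oct 22 07:52:48 2014)
host-id=4
score=2400
maintenance=False
state=EngineUp"""

meta = parse_extra_metadata(sample)
print(host_report(meta, now=1413953568 + 300))  # unknown stale-data
print(host_report(meta, now=1413953568 + 10))   # EngineUp

Under this model, a freshly restarted agent looks stale to its peers until its next metadata refresh lands, which matches the behavior Nikolai observed.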
Hi Jiri,

Please provide the doc text or set the require_doc_text flag to "-".

Many thanks,
Julie

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2014-1722.html