Bug 1859922
| Summary: | Local Maintenance by cli and restarting ovirt-ha-agent.service causes 0 score of the host with HE vm on it | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-setup | Reporter: | Qin Yuan <qiyuan> | ||||
| Component: | Tools | Assignee: | Yedidyah Bar David <didi> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Qin Yuan <qiyuan> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 2.4.1 | CC: | arachman, bugs, didi, mzamazal, sbonazzo | ||||
| Target Milestone: | ovirt-4.4.3 | Keywords: | Automation | ||||
| Target Release: | --- | Flags: | qiyuan: needinfo-, sbonazzo: ovirt-4.4? | ||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | ovirt-hosted-engine-setup-2.4.7 | Doc Type: | No Doc Update | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2020-11-11 06:42:31 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: | |||||||
Created attachment 1703062 [details]
agent log
Please clarify the flow. Thanks.
It's prevented (in 4.4, see the bugs linked to by the patches [1]) to set local maintenance using:
hosted-engine --set-maintenance --mode=local
On the host running the engine VM.
I now tried 'systemctl restart ovirt-ha-agent' on such a host, and indeed after it finished starting, score was 0, but after about half a minute it went back to 3400. IMO that's by design. --vm-status never showed 'local maintenance'.
[1] https://gerrit.ovirt.org/#/q/Ia06b9bc6e65a7937e6d6462c001b59572369fe66,n,z
(In reply to Yedidyah Bar David from comment #5)
> Please clarify the flow. Thanks.
>
> It's prevented (in 4.4, see the bugs linked to by the patches [1]) to set
> local maintenance using:
>
> hosted-engine --set-maintenance --mode=local
>
> On the host running the engine VM.
>
> I now tried 'systemctl restart ovirt-ha-agent' on such a host, and indeed
> after it finished starting, score was 0, but after about half a minute it
> went back to 3400. IMO that's by design. --vm-status never showed 'local
> maintenance'.
>
> [1]
> https://gerrit.ovirt.org/#/q/Ia06b9bc6e65a7937e6d6462c001b59572369fe66,n,z

I tried the following two steps again on a host with the HE VM and got the same result: the score changed to 0 and did not return to 3400, and --vm-status showed 'state=LocalMaintenance':

1) hosted-engine --set-maintenance --mode=local (this first step is required)
2) systemctl restart ovirt-ha-agent.service

I now spent some time looking at this. I did manage to reproduce. There are two separate issues here:
1. Why does 'hosted-engine --set-maintenance --mode=local' work, despite the decision (and code) to prevent that, if the engine vm is on this host?
As far as I can tell, it might be due to [1]. When I run this command, I see in vdsm.log:
2020-09-15 09:30:46,038+0300 INFO (jsonrpc/6) [api.host] START getVMList(fullStatus=False, vmList=[], onlyUUID=True) from=::1,58176 (api:48)
2020-09-15 09:30:46,038+0300 INFO (jsonrpc/6) [api.host] FINISH getVMList return={'status': {'code': 0, 'message': 'Done'}, 'vmList': [{'vmId': '5894f840-fff1-4318-8455-ea67ab1716c0', 'status': 'Up', 'statusTime': '4554382906'}]} from=::1,58176 (api:54)
The relevant code in hosted-engine does:
vm_id = config.Config().get(config.ENGINE, const.HEVMID)
cli = ohautil.connect_vdsm_json_rpc()
try:
    vm_list = cli.Host.getVMList()
*   sys.stderr.write('vm_id: %s\n' % vm_id)
*   sys.stderr.write('vm_list: %s\n' % vm_list)
except ServerError as e:
    sys.stderr.write(
        _("Failed communicating with VDSM: {e}").format(e=e)
    )
    return False
if vm_id in vm_list:
    sys.stderr.write(_(
        "Unable to enter local maintenance mode: "
        "the engine VM is running on the current host, "
        "please migrate it before entering local "
        "maintenance mode.\n"
    ))
    return False
The two lines marked with '*' were added by me, for debugging. They are not in the merged code.
So this code relies on getVMList() returning a list of IDs ("if vm_id in vm_list"). But as both vdsm.log and the output of the two debug lines show, it gets a list of dicts. A string ID is never equal to a dict, so the membership test is never true: we do not return, but continue and set local maintenance.
Patch [1] was merged by Milan. Milan - what should we do? Simply assume that going forward we'll always get a list of dicts? I am pushing a patch, feel free to reply/review there instead.
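To illustrate why the membership test is never true, here is a minimal sketch using the VM entry from the vdsm.log excerpt above. The helper name engine_vm_is_local is hypothetical and this is not the merged patch; it only shows one way to compare against the 'vmId' key of each dict, while still tolerating a plain list of ID strings.

```python
def engine_vm_is_local(vm_id, vm_list):
    """Return True if vm_id appears in a getVMList()-style result.

    Hypothetical helper, not the merged fix. It accepts both a list of
    dicts carrying a 'vmId' key and a plain list of ID strings.
    """
    for entry in vm_list:
        if isinstance(entry, dict):
            if entry.get('vmId') == vm_id:
                return True
        elif entry == vm_id:
            return True
    return False


# Sample data copied from the vdsm.log excerpt in this comment.
vm_id = '5894f840-fff1-4318-8455-ea67ab1716c0'
vm_list = [{'vmId': vm_id, 'status': 'Up', 'statusTime': '4554382906'}]

# The original test compares a str against dicts, so it never matches.
print(vm_id in vm_list)
# Looking at the 'vmId' key finds the engine VM.
print(engine_vm_is_local(vm_id, vm_list))
```

If only the dict format needs to be supported, the loop reduces to a single any() over entry['vmId'].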
2. Why does it take effect only after an ha-agent restart? Because when we added the code preventing local maintenance [2], we also removed the code doing the migration manually [3], along with other related, rather significant changes (see also the bugs linked from these patches and the other patches linked from these bugs). So the only thing that remains is that the agent saves this fact (local maintenance) in its conf (/var/lib/ovirt-hosted-engine-ha/ha.conf) and re-reads it upon restart. In principle this is harmless, as it does not affect anything else in the system other than the output of '--vm-status'. So I am not going to touch this, and will only fix the test.
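As a rough illustration of the "saved in conf, re-read upon restart" behaviour, the sketch below parses key=value text for a persisted maintenance flag. Both the key name local_maintenance and the simple key=value layout are assumptions made for this example, and the function is hypothetical; it is not the agent's actual code.

```python
def read_local_maintenance(conf_text):
    """Return the persisted local-maintenance flag from key=value text.

    Assumption: the conf uses 'key=value' lines and a key named
    'local_maintenance'. Missing key is treated as False.
    """
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, sep, value = line.partition('=')
        if sep and key.strip() == 'local_maintenance':
            return value.strip().lower() == 'true'
    return False


# A flag written before the restart is picked up again after it.
saved_before_restart = "local_maintenance=True\n"
print(read_local_maintenance(saved_before_restart))
```

Under these assumptions, a flag stored by the CLI would stay invisible until the next restart, which matches the two-step reproduction in this bug.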
[1] https://gerrit.ovirt.org/100368
[2] https://gerrit.ovirt.org/#/q/Ia06b9bc6e65a7937e6d6462c001b59572369fe66,n,z
[3] https://gerrit.ovirt.org/#/q/I42c810987d24e05ec2002ec38ecf2bff1f134290
(In reply to Yedidyah Bar David from comment #7)
> Patch [1] was merged by Milan. Milan - what should we do? Simply assume that
> going forward we'll always get a list of dicts?

Yes, support for the obsolete response format was removed; please use the new one, which means dicts.

Moving to -setup, where the patch was added.

Verified with:
ovirt-hosted-engine-setup-2.4.7-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.5-1.el8ev.noarch
ovirt-engine-4.4.3.6-0.13.el8ev.noarch

Steps:
1. Run `hosted-engine --set-maintenance --mode=local` on the host with HE vm
2. Run `systemctl restart ovirt-ha-agent.service`
3. Check HE status by running `hosted-engine --vm-status`

Results:
1. Running `hosted-engine --set-maintenance --mode=local` on the host with HE vm returns a message saying "Unable to enter local maintenance mode...", see below:

[root@lynx01 ~]# hosted-engine --set-maintenance --mode=local
Unable to enter local maintenance mode: the engine VM is running on the current host, please migrate it before entering local maintenance mode.

2. After restarting ovirt-ha-agent.service, the maintenance state of the host with HE vm is False and the score is still 3400.

This bugzilla is included in oVirt 4.4.3 release, published on November 10th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
Created attachment 1702208 [details]
automation logs

Description of problem:
If `hosted-engine --set-maintenance --mode=local` is run on the host with the HE VM and ovirt-ha-agent.service is then restarted, the host is set to local maintenance and its score is set to 0, even though the engine vm is still up:

[root@ocelot03 ~]# hosted-engine --vm-status

--== Host ocelot03.qa.lab.tlv.redhat.com (id: 3) status ==--

Host ID : 3
Host timestamp : 672290
Score : 0
Engine status : {"vm": "up", "health": "good", "detail": "Up"}
Hostname : ocelot03.qa.lab.tlv.redhat.com
Local maintenance : True
stopped : False
crc32 : f03b810e
conf_on_shared_storage : True
local_conf_timestamp : 672290
Status up-to-date : True
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=672290 (Fri Jul 31 10:48:39 2020)
    host-id=3
    score=0
    vm_conf_refresh_time=672290 (Fri Jul 31 10:48:39 2020)
    conf_on_shared_storage=True
    maintenance=True
    state=LocalMaintenance
    stopped=False

On the engine side, the host is still up; moving the host to maintenance and activating it again does not bring the score back to 3400. Running `hosted-engine --set-maintenance --mode=none` on the host brings its status back to normal.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
vdsm-4.40.22-1.el8ev.x86_64
ovirt-engine-4.4.1.8-0.7.el8ev.noarch
libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run `hosted-engine --set-maintenance --mode=local` on the host with HE vm
2. Run `systemctl restart ovirt-ha-agent.service`
3. Check HE status by running `hosted-engine --vm-status`

Actual results:
1. The host is in local maintenance state, its score is 0, but engine vm is still running on it.

Expected results:
1. The host with HE vm shouldn't be set to local maintenance state, its score should remain 3400.

Additional info:
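For automation, the check in step 3 could be scripted roughly as below. This is only a sketch that parses a shortened copy of the `hosted-engine --vm-status` text shown in this report; the function name parse_status_fields is hypothetical, and the parsing rules are assumptions based on that sample rather than on a documented output format.

```python
import json


def parse_status_fields(text):
    """Collect 'Key : value' and 'key=value' pairs from --vm-status text.

    Hypothetical helper; the rules are inferred from the sample output
    in this report. Banner lines starting with '--' are skipped.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if ' : ' in line:
            key, _, value = line.partition(' : ')
        elif '=' in line and not line.startswith('--'):
            key, _, value = line.partition('=')
        else:
            continue
        fields[key.strip()] = value.strip()
    return fields


# Shortened copy of the output from the description of this bug.
sample = """\
--== Host ocelot03.qa.lab.tlv.redhat.com (id: 3) status ==--
Score : 0
Engine status : {"vm": "up", "health": "good", "detail": "Up"}
Local maintenance : True
score=0
maintenance=True
state=LocalMaintenance
"""

fields = parse_status_fields(sample)
engine = json.loads(fields['Engine status'])

# The state reported here: engine VM up on a host in local maintenance.
inconsistent = (fields['state'] == 'LocalMaintenance'
                and engine['vm'] == 'up')
print(fields['Score'], inconsistent)
```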