Bug 1859922

Summary: Local Maintenance by cli and restarting ovirt-ha-agent.service causes 0 score of the host with HE vm on it
Product: [oVirt] ovirt-hosted-engine-setup
Reporter: Qin Yuan <qiyuan>
Component: Tools
Assignee: Yedidyah Bar David <didi>
Status: CLOSED CURRENTRELEASE
QA Contact: Qin Yuan <qiyuan>
Severity: medium
Docs Contact:
Priority: high
Version: 2.4.1
CC: arachman, bugs, didi, mzamazal, sbonazzo
Target Milestone: ovirt-4.4.3
Keywords: Automation
Target Release: ---
Flags: qiyuan: needinfo-, sbonazzo: ovirt-4.4?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-hosted-engine-setup-2.4.7
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-11 06:42:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: agent log (Flags: none)

Description Qin Yuan 2020-07-23 10:16:31 UTC
Created attachment 1702208: automation logs

Description of problem:
If you run `hosted-engine --set-maintenance --mode=local` on the host running the HE VM and then restart ovirt-ha-agent.service, the host is set to local maintenance and its score is set to 0, even though the engine VM is still up:

[root@ocelot03 ~]# hosted-engine --vm-status
--== Host ocelot03.qa.lab.tlv.redhat.com (id: 3) status ==--
Host ID                            : 3
Host timestamp                     : 672290
Score                              : 0
Engine status                      : {"vm": "up", "health": "good", "detail": "Up"}
Hostname                           : ocelot03.qa.lab.tlv.redhat.com
Local maintenance                  : True
stopped                            : False
crc32                              : f03b810e
conf_on_shared_storage             : True
local_conf_timestamp               : 672290
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=672290 (Fri Jul 31 10:48:39 2020)
	host-id=3
	score=0
	vm_conf_refresh_time=672290 (Fri Jul 31 10:48:39 2020)
	conf_on_shared_storage=True
	maintenance=True
	state=LocalMaintenance
	stopped=False

On the engine side, the host is still up, and moving the host to maintenance and activating it again does not bring the score back to 3400.
Running `hosted-engine --set-maintenance --mode=none` on the host brings its status back to normal.


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
vdsm-4.40.22-1.el8ev.x86_64
ovirt-engine-4.4.1.8-0.7.el8ev.noarch
libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run `hosted-engine --set-maintenance --mode=local` on the host with HE vm
2. Run `systemctl restart ovirt-ha-agent.service`
3. Check HE status by running `hosted-engine --vm-status`

Actual results:
1. The host is in local maintenance state and its score is 0, but the engine VM is still running on it.

Expected results:
1. The host with the HE VM should not be set to local maintenance state, and its score should remain 3400.


Additional info:
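The check in step 3 can be scripted. The following is a minimal sketch, not part of hosted-engine: `parse_vm_status` is a hypothetical helper, and it assumes the `key : value` layout of the `--vm-status` output shown above. It flags the inconsistent state reported here (engine VM up, local maintenance True, score 0).

```python
import json


def parse_vm_status(text):
    """Hypothetical helper: collect 'key : value' lines of one host block."""
    fields = {}
    for line in text.splitlines():
        # Skip the '--== Host ... ==--' banner and lines without a separator
        # (for example the tab-indented 'key=value' extra metadata).
        if line.startswith("--==") or " : " not in line:
            continue
        key, _, value = line.partition(" : ")
        fields[key.strip()] = value.strip()
    return fields


# Excerpt of the output pasted in the description above.
SAMPLE = """\
--== Host ocelot03.qa.lab.tlv.redhat.com (id: 3) status ==--
Host ID                            : 3
Score                              : 0
Engine status                      : {"vm": "up", "health": "good", "detail": "Up"}
Local maintenance                  : True
"""

status = parse_vm_status(SAMPLE)
engine = json.loads(status["Engine status"])

# The buggy state: the engine VM runs here, yet the host is in local
# maintenance with a score of 0.
inconsistent = (
    engine["vm"] == "up"
    and status["Local maintenance"] == "True"
    and int(status["Score"]) == 0
)
print("inconsistent state:", inconsistent)
```

On a live host the text would come from running `hosted-engine --vm-status` instead of the pasted sample.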

Comment 2 Qin Yuan 2020-07-31 08:13:44 UTC
Created attachment 1703062: agent log

Comment 5 Yedidyah Bar David 2020-09-14 13:23:11 UTC
Please clarify the flow. Thanks.

In 4.4 (see the bugs linked to by the patches [1]), setting local maintenance using:

    hosted-engine --set-maintenance --mode=local

is prevented on the host running the engine VM.

I now tried 'systemctl restart ovirt-ha-agent' on such a host, and indeed after it finished starting, score was 0, but after about half a minute it went back to 3400. IMO that's by design. --vm-status never showed 'local maintenance'.

[1] https://gerrit.ovirt.org/#/q/Ia06b9bc6e65a7937e6d6462c001b59572369fe66,n,z

Comment 6 Qin Yuan 2020-09-15 04:23:46 UTC
(In reply to Yedidyah Bar David from comment #5)
> Please clarify the flow. Thanks.
> 
> It's prevented (in 4.4, see the bugs linked to by the patches [1]) to set
> local maintenance using:
> 
>     hosted-engine --set-maintenance --mode=local
> 
> On the host running the engine VM.
> 
> I now tried 'systemctl restart ovirt-ha-agent' on such a host, and indeed
> after it finished starting, score was 0, but after about half a minute it
> went back to 3400. IMO that's by design. --vm-status never showed 'local
> maintenance'.
> 
> [1]
> https://gerrit.ovirt.org/#/q/Ia06b9bc6e65a7937e6d6462c001b59572369fe66,n,z

I tried the following two steps again on a host with the HE VM and got the same result: the score changed to 0 and did not return to 3400, and --vm-status showed 'state=LocalMaintenance':
1) hosted-engine --set-maintenance --mode=local   (this first step is required)
2) systemctl restart ovirt-ha-agent.service

Comment 7 Yedidyah Bar David 2020-09-15 08:35:46 UTC
I now spent some time looking at this. I did manage to reproduce. There are two separate issues here:

1. Why does 'hosted-engine --set-maintenance --mode=local' work, despite the decision (and code) to prevent that, if the engine vm is on this host?

As far as I can tell, it might be due to [1]. When I run this command, I see in vdsm.log:

2020-09-15 09:30:46,038+0300 INFO  (jsonrpc/6) [api.host] START getVMList(fullStatus=False, vmList=[], onlyUUID=True) from=::1,58176 (api:48)                                                
2020-09-15 09:30:46,038+0300 INFO  (jsonrpc/6) [api.host] FINISH getVMList return={'status': {'code': 0, 'message': 'Done'}, 'vmList': [{'vmId': '5894f840-fff1-4318-8455-ea67ab1716c0', 'status': 'Up', 'statusTime': '4554382906'}]} from=::1,58176 (api:54)

The relevant code in hosted-engine does:

            vm_id = config.Config().get(config.ENGINE, const.HEVMID)
            cli = ohautil.connect_vdsm_json_rpc()
            try:
                vm_list = cli.Host.getVMList()
*               sys.stderr.write('vm_id: %s\n' % vm_id)
*               sys.stderr.write('vm_list: %s\n' % vm_list)
            except ServerError as e:
                sys.stderr.write(
                    _("Failed communicating with VDSM: {e}").format(e=e)
                )
                return False
            if vm_id in vm_list:
                sys.stderr.write(_(
                    "Unable to enter local maintenance mode: "
                    "the engine VM is running on the current host, "
                    "please migrate it before entering local "
                    "maintenance mode.\n"
                ))
                return False

The two lines marked with '*' were added by me, for debugging. They are not in the merged code.

So this code relies on getVMList() returning a list of IDs ("if vm_id in vm_list"). However, as can be seen both in vdsm.log and, with the above two lines, in the debug output on stderr, it gets a list of dicts, so the membership test is never true, and we do not return but continue and set local maintenance.
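The failing test can be reproduced outside hosted-engine. This is a minimal sketch, not the merged patch: the list mirrors the getVMList response logged in vdsm.log above, it assumes `vm_list` is the bare list of per-VM dicts, and `engine_vm_is_local` is a hypothetical helper name.

```python
# Shape of the response as logged in vdsm.log above: one dict per VM.
vm_list = [
    {
        "vmId": "5894f840-fff1-4318-8455-ea67ab1716c0",
        "status": "Up",
        "statusTime": "4554382906",
    }
]
vm_id = "5894f840-fff1-4318-8455-ea67ab1716c0"

# The existing check compares a string against dicts, so it never matches.
old_check = vm_id in vm_list


def engine_vm_is_local(vm_id, vm_list):
    """Hypothetical helper: compare vm_id with the 'vmId' key of each dict."""
    return any(vm.get("vmId") == vm_id for vm in vm_list)


new_check = engine_vm_is_local(vm_id, vm_list)
print("old check:", old_check)
print("new check:", new_check)
```

With the dict-based response, only a comparison against the 'vmId' key detects that the engine VM is on this host.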

Patch [1] was merged by Milan. Milan - what should we do? Simply assume that going forward we'll always get a list of dicts? I am pushing a patch, feel free to reply/review there instead.

2. Why does it take effect only after an ha-agent restart? Because when we added the code preventing local maintenance [2], we also removed the code doing the migration manually [3], and made other relevant, rather significant changes (see also the bugs linked from these patches and the other patches linked from these bugs). So the only thing that remains is that the agent saves this fact (local maintenance) in its conf (in /var/lib/ovirt-hosted-engine-ha/ha.conf) and re-reads it upon restart. In principle, this is harmless, as it does not affect anything in the system other than the output of '--vm-status'. So I am not going to touch this, and will only fix the test.

[1] https://gerrit.ovirt.org/100368
[2] https://gerrit.ovirt.org/#/q/Ia06b9bc6e65a7937e6d6462c001b59572369fe66,n,z
[3] https://gerrit.ovirt.org/#/q/I42c810987d24e05ec2002ec38ecf2bff1f134290

Comment 8 Milan Zamazal 2020-09-21 16:17:37 UTC
(In reply to Yedidyah Bar David from comment #7)

> Patch [1] was merged by Milan. Milan - what should we do? Simply assume that
> going forward we'll always get a list of dicts?

Yes, support for the obsolete response format was removed. Please use the new one, that is, dicts.

Comment 9 Yedidyah Bar David 2020-09-22 07:05:12 UTC
Moving to -setup, where the patch was added.

Comment 10 Qin Yuan 2020-10-13 03:13:14 UTC
Verified with:
ovirt-hosted-engine-setup-2.4.7-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.5-1.el8ev.noarch
ovirt-engine-4.4.3.6-0.13.el8ev.noarch

Steps:
1. Run `hosted-engine --set-maintenance --mode=local` on the host with HE vm
2. Run `systemctl restart ovirt-ha-agent.service`
3. Check HE status by running `hosted-engine --vm-status`

Results:
1. Running `hosted-engine --set-maintenance --mode=local` on the host with the HE VM returns a message saying "Unable to enter local maintenance mode...", see below:
[root@lynx01 ~]# hosted-engine --set-maintenance --mode=local
Unable to enter local maintenance mode: the engine VM is running on the current host, please migrate it before entering local maintenance mode.

2. After restarting ovirt-ha-agent.service, the maintenance state of the host with the HE VM is False and the score is still 3400.

Comment 11 Sandro Bonazzola 2020-11-11 06:42:31 UTC
This bugzilla is included in oVirt 4.4.3 release, published on November 10th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.