Bug 1993957 - Engine VM might be shut down after the score wrongly being penalized due to cpu load
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: 2.4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.9
Target Release: 2.4.9
Assignee: Yedidyah Bar David
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 2002945
Reported: 2021-08-16 13:01 UTC by Yedidyah Bar David
Modified: 2021-10-21 07:27 UTC (History)

Fixed In Version: ovirt-hosted-engine-ha-2.4.9
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-21 07:27:14 UTC
oVirt Team: Integration
Embargoed:
sbonazzo: ovirt-4.4+
sbonazzo: devel_ack+




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | RHV-43037 | 0 | None | None | None | 2021-08-16 13:02:53 UTC
oVirt gerrit | 116145 | 0 | master | MERGED | cpu_load_no_engine: Ignore cpu stats if cpuUsage is '0.00' | 2021-08-16 13:31:45 UTC

Description Yedidyah Bar David 2021-08-16 13:01:19 UTC
Description of problem:

ovirt-ha-broker routinely monitors, among other things, the CPU load on the hosts. It does so by measuring the total load and subtracting the load caused by the engine VM, if it is running, as reported by VDSM. ovirt-ha-agent uses this data to penalize the host score if the load is high, while ignoring load caused by the engine VM itself. If the score drops significantly below that of the best host, the agent shuts down the engine VM so that it can be started on a "better" host (the one with the highest score).

VDSM gets this data from libvirt, using getAllDomainStats or domainListGetStats.

Under certain conditions, VDSM fails to get correct CPU usage statistics from libvirt and reports both cpuUser and cpuSys as '0.00'. HA then attributes the entire, potentially high, load to something other than the engine VM and penalizes the score, which can eventually lead to the engine VM being shut down.
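
The following is a minimal sketch of the calculation described above, not the actual broker code. It assumes that cpuUser and cpuSys are percentages of a single CPU and that the total host load is a fraction between 0.0 and 1.0. The function names, the host_cpus parameter and the last_good fallback are hypothetical and exist only for illustration. Note also that the merged patch title refers to cpuUsage, while this sketch follows the cpuUser/cpuSys wording of this description.

```python
from typing import Mapping, Optional


def engine_vm_load(vm_stats: Mapping[str, str], host_cpus: int) -> Optional[float]:
    """Return the engine VM's share of host CPU (0.0-1.0), or None if unusable.

    Assumption: cpuUser/cpuSys are strings holding percentages of one CPU.
    """
    try:
        user = float(vm_stats["cpuUser"])
        system = float(vm_stats["cpuSys"])
    except (KeyError, ValueError):
        return None
    if user == 0.0 and system == 0.0:
        # A running VM reporting exactly zero usage is treated as "no data",
        # not as proof that the VM is idle.
        return None
    return (user + system) / 100.0 / host_cpus


def non_engine_load(
    total_load: float,
    vm_stats: Mapping[str, str],
    host_cpus: int,
    last_good: Optional[float] = None,
) -> Optional[float]:
    """Return the host load not caused by the engine VM.

    If the VM stats are unusable, fall back to the last known good value
    (hypothetical behaviour) instead of blaming the whole load on the host.
    """
    vm_load = engine_vm_load(vm_stats, host_cpus)
    if vm_load is None:
        return last_good
    return max(0.0, total_load - vm_load)
```

With bogus '0.00' stats, non_engine_load() returns the previous value rather than the full host load, so a single bad sample would not by itself lower the score.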

This happened recently a few times on CI:

https://lists.ovirt.org/archives/list/devel@ovirt.org/thread/7HNIFCW4NENG4ADZ5ROT43TCDXDURRJB/

Version-Release number of selected component (if applicable):
Current master. Might be related to a recent libvirt update or something similar, not sure.

How reproducible:
Not sure, happened a few times on CI

Steps to Reproduce:
1. Deploy hosted-engine on two hosts
2. Set global maintenance
3. Cleanly restart the engine VM
4. Immediately after the engine is up, exit global maintenance

Actual results:
In certain cases, shortly after exiting global maintenance, the engine VM is shut down.

Expected results:
The engine VM stays up, or is at least shut down only after a longer delay once a high CPU load is noticed that is not clearly caused by non-engine-VM tasks.

Additional info:
I spent quite some time trying to reproduce this locally and failed. I already wrote a patch and verified it in a somewhat artificial environment, by also patching vdsm to always report wrong CPU stats as described. With this patch, it should take around 5 minutes until the engine VM is shut down.

Comment 1 Yedidyah Bar David 2021-10-03 06:34:51 UTC
QE: As I wrote here and on gerrit, I failed to reproduce this issue.

For verifying the patch, I also patched vdsm locally to force reproduction. If you want to try this yourself, you can find my patch at [1].

Otherwise, I suggest doing some sanity testing based on the steps from comment 0, and also running:

# grep -i cpu /var/log/ovirt-hosted-engine-ha/broker.log

on both broken and fixed versions, to have a chance to see some differences.

[1] https://gerrit.ovirt.org/c/vdsm/+/116915

Comment 2 Nikolai Sednev 2021-10-19 12:16:25 UTC
Unable to reproduce; tested on:
ovirt-engine-4.4.9.2-0.6.el8ev.noarch
ovirt-hosted-engine-setup-2.5.4-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.9-1.el8ev.noarch
Red Hat Enterprise Linux release 8.5 (Ootpa)
Linux 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Moving to verified.
In case this bug is still reproduced, please attach logs and detailed reproduction steps.

Comment 3 Sandro Bonazzola 2021-10-21 07:27:14 UTC
This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

