Description of problem:

ovirt-ha-broker routinely monitors, among other things, the CPU load on the hosts. It does this by checking the total load and subtracting from it the load caused by the engine VM, if running, as reported by VDSM. ovirt-ha-agent uses this data to penalize the host score when the load is high, while ignoring load caused by the engine VM itself. If the score drops significantly below that of the best host, the agent shuts down the engine VM, to let it be started on a "better" host (the one with the highest score).

VDSM gets this data from libvirt, using getAllDomainStats or domainListGetStats. Under certain conditions, VDSM fails to get correct CPU usage statistics from libvirt, and so reports both cpuUser and cpuSys as '0.00'. HA then considers the entire, potentially high, load to be unrelated to the engine VM, penalizes the score, and may eventually shut down the engine VM.

This happened recently a few times on CI:
https://lists.ovirt.org/archives/list/devel@ovirt.org/thread/7HNIFCW4NENG4ADZ5ROT43TCDXDURRJB/

Version-Release number of selected component (if applicable):
Current master. Might be related to a recent libvirt update or something related; not sure.

How reproducible:
Not sure; happened a few times on CI.

Steps to Reproduce:
1. Deploy hosted-engine on two hosts
2. Set global maintenance
3. Cleanly restart the engine VM
4. Immediately after the engine is up, exit global maintenance

Actual results:
In certain cases, shortly after exiting global maintenance, the engine VM is shut down.

Expected results:
The engine VM stays up, or is at least shut down only somewhat longer after noticing a high CPU load that is not clearly caused by non-engine-VM tasks.

Additional info:
I spent quite some time on this and failed to reproduce it locally. I already wrote a patch and verified it in a somewhat artificial environment, by also patching VDSM to always report wrong CPU stats as described. With this patch, it should take around 5 minutes until the engine VM is shut down.
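To make the failure mode concrete, here is a minimal Python sketch, with hypothetical names and thresholds (this is not the actual ovirt-hosted-engine-ha code): when VDSM falsely reports cpuUser and cpuSys as 0.00, nothing is subtracted from the host load, so the full load is blamed on non-engine tasks; a persistence guard like the one the patch aims for would only confirm the penalty after the condition holds for several consecutive samples.

```python
# Hypothetical sketch; not the real ovirt-hosted-engine-ha broker code.

def load_without_engine(total_load, cpu_user, cpu_sys):
    """Return the host load with the engine VM's share removed.

    total_load: normalized host load (0.0-1.0).
    cpu_user/cpu_sys: the engine VM's CPU usage in percent, as reported
    by VDSM. If VDSM wrongly reports both as 0.00, nothing is subtracted
    and the full load is attributed to non-engine tasks.
    """
    engine_share = (cpu_user + cpu_sys) / 100.0
    return max(total_load - engine_share, 0.0)


class PenaltyGuard:
    """Confirm a high-load penalty only after it persists for several
    consecutive samples (e.g. 10 samples x 30 s is roughly 5 minutes)."""

    def __init__(self, required_hits=10):
        self.required_hits = required_hits
        self.hits = 0

    def confirm(self, is_high):
        # Reset the counter on any non-high sample, so only a sustained
        # high load triggers the penalty.
        self.hits = self.hits + 1 if is_high else 0
        return self.hits >= self.required_hits
```

For example, with a faulty report, `load_without_engine(0.9, 0.0, 0.0)` returns 0.9 (the whole load is penalized), while a correct report of 40% user plus 30% sys would leave only 0.2 attributed to non-engine tasks.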
QE: As I wrote here and on gerrit, I failed to reproduce this issue. To verify the patch, I also patched VDSM locally to force reproduction. If you want to try this yourself, you can find my patch at [1]. Otherwise, I suggest doing some sanity testing based on the steps from comment 0, and also running:

# grep -i cpu /var/log/ovirt-hosted-engine-ha/broker.log

on both broken and fixed versions, to have a chance of seeing some differences.

[1] https://gerrit.ovirt.org/c/vdsm/+/116915
Unable to reproduce; tested on:

ovirt-engine-4.4.9.2-0.6.el8ev.noarch
ovirt-hosted-engine-setup-2.5.4-2.el8ev.noarch
ovirt-hosted-engine-ha-2.4.9-1.el8ev.noarch
Red Hat Enterprise Linux release 8.5 (Ootpa)
Linux 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Moving to verified. In case this bug is still reproduced, please attach logs and detailed steps for reproduction.
This bugzilla is included in the oVirt 4.4.9 release, published on October 20th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.9, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.