Created attachment 1842337 [details]
logs and vm.conf

Description of problem:
The 1000-point score penalty applied due to CPU load is not removed after the load drops back to 0.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.9-1.el8ev.noarch
rhvm-4.4.9.5-0.1.el8ev.noarch

How reproducible: 100%

Steps to Reproduce:
1. On a hosted-engine setup, choose the host running the HE VM and load its CPU to the maximum (the command 'dd if=/dev/zero of=/dev/null' can be used for this). Wait until the score drops to 2400. As a result, the HE VM is migrated to the host with a score of 3400.
2. Stop the CPU load on the host and wait for the score to return to 3400.

Actual results:
The host score remains 2400 forever.

Expected results:
The host score must return to 3400.

Additional info:
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
A related change that was done on the engine side is bz 1991804
(In reply to Arik from comment #2)
> A related change that was done on the engine side is bz 1991804

In more detail: the max score is now set to 3400 on the engine side; previously it was 2400. The engine treats every score that is higher than the max the same way [1], so with the aforementioned patch there will be an attempt to migrate the VM out. However, I don't see how that is related to the score not being reset to the max score.

[1] https://gerrit.ovirt.org/#/c/ovirt-engine/+/116471/4/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/policyunits/HostedEngineHAClusterWeightPolicyUnit.java
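To illustrate the point about the engine-side max score, here is a minimal Python sketch. It is not the engine's Java code; the function name `effective_score` and the clamping with `min` are assumptions that only model the statement that every score above the max is treated the same way.

```python
def effective_score(score, max_score):
    """Hypothetical model: scores above max_score are all treated alike."""
    return min(score, max_score)


OLD_MAX = 2400  # engine-side max before the change mentioned above
NEW_MAX = 3400  # engine-side max after the change

healthy_host = 3400    # unpenalized host score
penalized_host = 2400  # host score after the full cpu-load penalty

# With the old max, both hosts look identical to the engine.
same_before = effective_score(healthy_host, OLD_MAX) == effective_score(penalized_host, OLD_MAX)

# With the new max, the healthy host is preferred, hence the migration attempt.
differ_after = effective_score(healthy_host, NEW_MAX) > effective_score(penalized_host, NEW_MAX)

print(same_before, differ_after)
```

Under this model the engine-side change explains the migration attempt, but not why the agent's own score stays at 2400.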
Seems like it's related to bz 1993957's fix.
QE: The flow is more or less exactly what's in comment 0:

1. Deploy HE on two hosts
2. Run 'dd if=/dev/zero of=/dev/null &' a number of times matching the number of cpu cores on the host running the engine VM
3. Wait until the agent stops the engine VM (and the agent on the other host should start it there, but that's not part of the current bug)
4. Optionally wait until the score is penalized by 1000 points due to the cpu, e.g.:

# grep 'cpu load' /var/log/ovirt-hosted-engine-ha/agent.log
MainThread::INFO::2021-11-22 09:59:58,539::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 11 due to cpu load
MainThread::INFO::2021-11-22 10:00:08,836::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 31 due to cpu load
MainThread::INFO::2021-11-22 10:00:19,113::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 50 due to cpu load
MainThread::INFO::2021-11-22 10:00:28,381::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 70 due to cpu load
MainThread::INFO::2021-11-22 10:00:38,648::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 90 due to cpu load
MainThread::INFO::2021-11-22 10:00:48,909::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 109 due to cpu load
...
MainThread::INFO::2021-11-22 10:08:33,798::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 927 due to cpu load
MainThread::INFO::2021-11-22 10:08:44,045::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 939 due to cpu load
MainThread::INFO::2021-11-22 10:08:53,304::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 952 due to cpu load
MainThread::INFO::2021-11-22 10:09:03,545::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 964 due to cpu load
MainThread::INFO::2021-11-22 10:09:13,760::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 976 due to cpu load
MainThread::INFO::2021-11-22 10:09:24,012::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 989 due to cpu load
MainThread::INFO::2021-11-22 10:09:33,200::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1000 due to cpu load

5. Kill the dd's

With a broken -ha, the score remains low and does not go up.
With a fixed -ha, the score will slowly go up, e.g.:

# grep -E 'score:|cpu load' /var/log/ovirt-hosted-engine-ha/agent.log
MainThread::INFO::2021-11-23 15:14:43,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1000 due to cpu load
MainThread::INFO::2021-11-23 15:14:43,402::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:14:52,583::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:02,740::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:12,930::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:23,133::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:33,321::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:42,465::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 995 due to cpu load
MainThread::INFO::2021-11-23 15:15:42,466::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2405)
MainThread::INFO::2021-11-23 15:15:52,660::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 974 due to cpu load
MainThread::INFO::2021-11-23 15:15:52,660::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2426)
MainThread::INFO::2021-11-23 15:16:02,849::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 952 due to cpu load
MainThread::INFO::2021-11-23 15:16:02,849::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2448)
...
MainThread::INFO::2021-11-23 15:23:14,202::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 23 due to cpu load
MainThread::INFO::2021-11-23 15:23:14,203::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3377)
MainThread::INFO::2021-11-23 15:23:14,203::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host node4491117 (id: 2, score: 3044)
MainThread::INFO::2021-11-23 15:23:24,399::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1 due to cpu load
MainThread::INFO::2021-11-23 15:23:24,399::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3399)
MainThread::INFO::2021-11-23 15:23:24,400::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host node4491117 (id: 2, score: 3026)
MainThread::INFO::2021-11-23 15:23:33,594::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)

The logic (both before and after the fix) is slightly complex, so you might want to play with this some more and check that things remain reasonable, i.e. that the fixes for the current bug and for bug 1993957 do not break other stuff.
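For checking a longer agent.log without reading it by eye, a small Python sketch along these lines might help. It is only an illustration: the regular expressions are derived from the log lines quoted in this comment, and the names `parse_agent_log` and `score_recovered`, as well as the default max score of 3400, are assumptions rather than anything shipped with ovirt-hosted-engine-ha.

```python
import re

# Patterns follow the agent.log lines quoted in this comment.
TIME_RE = re.compile(r"::(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+::")
PENALTY_RE = re.compile(r"Penalizing score by (\d+) due to cpu load")
SCORE_RE = re.compile(r"Current state (\w+) \(score: (\d+)\)")


def parse_agent_log(lines):
    """Return (timestamp, kind, value) tuples for penalty and score lines."""
    events = []
    for line in lines:
        ts = TIME_RE.search(line)
        stamp = ts.group(1) if ts else None
        penalty = PENALTY_RE.search(line)
        if penalty:
            events.append((stamp, "penalty", int(penalty.group(1))))
            continue
        score = SCORE_RE.search(line)
        if score:
            events.append((stamp, "score", int(score.group(2))))
    return events


def score_recovered(events, max_score=3400):
    """True if the last reported score is back at max_score (assumed 3400)."""
    scores = [value for _, kind, value in events if kind == "score"]
    return bool(scores) and scores[-1] == max_score


# Three lines copied from the fixed -ha output above.
SAMPLE = [
    "MainThread::INFO::2021-11-23 15:15:42,465::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 995 due to cpu load",
    "MainThread::INFO::2021-11-23 15:15:42,466::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2405)",
    "MainThread::INFO::2021-11-23 15:23:33,594::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)",
]

events = parse_agent_log(SAMPLE)
print(events)
print(score_recovered(events))
```

In the fixed -ha output above, each reported score equals 3400 minus the penalty logged in the same cycle (for example 3400 - 995 = 2405), so the same events can also be used to cross-check that relation on a real log.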
Some log messages you might run into in the broker log (with grep 'cpu_load_no_engine'):

- "VM not on this host" - normal state, score should go up/down based on cpu load
- "System load", "total", "engine", "non-engine" - normal
- "Ignoring cpuUser/cpuSys, init values" - should/might happen right after starting the engine VM
- "engine VM cpu usage is not up-to-date" - might happen more than 5 minutes after starting the engine VM, if cpu usage is still not reported correctly. This does affect score and might lead to the engine VM being stopped.
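The list of broker messages above can be turned into a rough triage helper. The Python sketch below is only an assumption-laden illustration: the message fragments and their meanings come from the list above, while the `BROKER_HINTS` table, the `classify` function, and the example input lines are hypothetical and do not reproduce the real broker log line format.

```python
# (fragment, meaning) pairs taken from the list of broker messages above.
BROKER_HINTS = [
    ("VM not on this host",
     "normal state; score goes up/down based on cpu load"),
    ("Ignoring cpuUser/cpuSys, init values",
     "expected right after starting the engine VM"),
    ("engine VM cpu usage is not up-to-date",
     "affects score; might lead to the engine VM being stopped"),
    ("System load", "normal"),
]


def classify(line):
    """Return the meaning of the first known fragment found in line, else None."""
    for fragment, meaning in BROKER_HINTS:
        if fragment in line:
            return meaning
    return None


# Hypothetical input lines: only the quoted fragments are from this bug.
examples = [
    "cpu_load_no_engine ... VM not on this host",
    "cpu_load_no_engine ... engine VM cpu usage is not up-to-date",
    "cpu_load_no_engine ... something else",
]
for example in examples:
    print(classify(example))
```

Only the "not up-to-date" message is described above as affecting the score, so that is the one worth alerting on if it persists.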
Workaround: If the score, with 4.4.9, remains low, despite cpu load becoming low, you can 'systemctl restart ovirt-ha-broker', and wait a few minutes.
(In reply to Yedidyah Bar David from comment #6)
> Workaround: If the score, with 4.4.9, remains low, despite cpu load becoming
> low, you can 'systemctl restart ovirt-ha-broker', and wait a few minutes.

Yes, restarting ovirt-ha-broker helps in such a case.
Verified on vdsm-4.50.0.10-1.el8ev.x86_64, ovirt-engine-4.5.0.1-601.f26e9ea8cac5.3.el8ev.noarch by running the automation test rhevmtests.compute.sla.hosted_engine.hosted_engine_ha.hosted_engine_ha_test.TestHostCpuLoadProblem
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.