Bug 2024161 - Penalizing score by 1000 due to cpu load is not canceled after load decreasing to 0
Summary: Penalizing score by 1000 due to cpu load is not canceled after load decreasin...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.4.9
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ovirt-4.5.0
: 2.5.0
Assignee: Yedidyah Bar David
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-11-17 13:02 UTC by Polina
Modified: 2022-04-20 06:33 UTC (History)
4 users (show)

Fixed In Version: ovirt-hosted-engine-ha-2.5.0
Doc Type: Bug Fix
Doc Text:
Cause: The fix to bug 1993957 (4.4.9) was wrong, and caused a low score of a hosted-engine host that was caused by high cpu load, to remain low, even if the cpu load later declined, after the engine VM was stopped and started on some other host. Consequence: The score remained low, so the host could not be used for running the engine VM. Fix: The code was fixed to handle this case correctly. Result: Once the cpu load declines, the score raises. doc team: Not sure we mention this in main docs, but both the calculated cpu load and the score change slowly over time, similarly to how the "load average" in general behaves. So if you e.g. raise in an instant the load from 0 to 100% (by starting a number of endless loops matching the number of your cpu cores), it will take around 5-10 minutes until this is reflected in the score. If this isn't mentioned, you might want to open a doc bug about this. You can see the score in the agent log and in 'hosted-engine --vm-status'.
Clone Of:
Environment:
Last Closed: 2022-04-20 06:33:59 UTC
oVirt Team: Integration
Embargoed:
pm-rhel: ovirt-4.5?
michal.skrivanek: blocker-


Attachments (Terms of Use)
logs and vm.conf (6.29 MB, application/gzip)
2021-11-17 13:02 UTC, Polina
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44045 0 None None None 2021-11-17 13:06:14 UTC
oVirt gerrit 117757 0 master MERGED Ignore engine vm cpu calculations if vm is missing 2021-11-24 07:52:50 UTC

Description Polina 2021-11-17 13:02:33 UTC
Created attachment 1842337 [details]
logs and vm.conf

Description of problem:
Penalizing score by 1000 due to cpu load is not canceled after load decreasing to 0

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.4.9-1.el8ev.noarch
rhvm-4.4.9.5-0.1.el8ev.noarch

How reproducible:100%

Steps to Reproduce:
1. On hosted-engine setup choose the host with HE VM running, load CPU on the host till maximum. (can use dd if=/dev/zero of=/dev/null command for this purpose). 
Wait until the score is dropped to 2400. As result, HE VM is migrated to the host with a score 3400.

2. Stop CPU load on the host and wait until the score will be returned to 3400. 

Actual results: the host score remains 2400 forever.

Expected results: the host score must reset to 3400.


Additional info:

Comment 1 RHEL Program Management 2021-11-17 15:05:31 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 2 Arik 2021-11-22 13:26:49 UTC
A related change that was done on the engine side is bz 1991804

Comment 3 Arik 2021-11-22 13:39:52 UTC
(In reply to Arik from comment #2)
> A related change that was done on the engine side is bz 1991804

And in some more details - the max score is now set to 3400 on the engine side, previously it was 2400
the engine considers every score that is higher than the max the same way [1] - so with the aforementioned patch, there will be an attempt to migrate out the VM but I don't see how it's related to the part that the score is not reset to the max score

[1] https://gerrit.ovirt.org/#/c/ovirt-engine/+/116471/4/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/policyunits/HostedEngineHAClusterWeightPolicyUnit.java

Comment 4 Asaf Rachmani 2021-11-22 16:16:51 UTC
Seems like it's related to bz 1993957's fix.

Comment 5 Yedidyah Bar David 2021-11-24 09:13:58 UTC
QE: The flow is more-or-less exactly what's in comment 0:

1. Deploy HE on two hosts

2. Run 'dd if=/dev/zero of=/dev/null &' a number of times matching the number of cpu cores on the host running the engine VM

3. Wait until the agent stops the engine VM (and the agent on the other host should start it there, but that's not part of current bug)

4. Optionally wait until the score is penalized by 1000 points due to the cpu, e.g.:
# grep 'cpu load' /var/log/ovirt-hosted-engine-ha/agent.log
MainThread::INFO::2021-11-22 09:59:58,539::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 11 due to cpu load
MainThread::INFO::2021-11-22 10:00:08,836::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 31 due to cpu load
MainThread::INFO::2021-11-22 10:00:19,113::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 50 due to cpu load
MainThread::INFO::2021-11-22 10:00:28,381::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 70 due to cpu load
MainThread::INFO::2021-11-22 10:00:38,648::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 90 due to cpu load
MainThread::INFO::2021-11-22 10:00:48,909::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 109 due to cpu load
...
MainThread::INFO::2021-11-22 10:08:33,798::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 927 due to cpu load
MainThread::INFO::2021-11-22 10:08:44,045::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 939 due to cpu load
MainThread::INFO::2021-11-22 10:08:53,304::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 952 due to cpu load
MainThread::INFO::2021-11-22 10:09:03,545::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 964 due to cpu load
MainThread::INFO::2021-11-22 10:09:13,760::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 976 due to cpu load
MainThread::INFO::2021-11-22 10:09:24,012::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 989 due to cpu load
MainThread::INFO::2021-11-22 10:09:33,200::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1000 due to cpu load


5. Kill the dd's

With a broken -ha, the score remains low and does not go up.
With a fixed -ha, the score will slowly go up, e.g.:
# grep -E 'score:|cpu load' /var/log/ovirt-hosted-engine-ha/agent.log
MainThread::INFO::2021-11-23 15:14:43,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1000 due to cpu load
MainThread::INFO::2021-11-23 15:14:43,402::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:14:52,583::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:02,740::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:12,930::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:23,133::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:33,321::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2400)
MainThread::INFO::2021-11-23 15:15:42,465::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 995 due to cpu load
MainThread::INFO::2021-11-23 15:15:42,466::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2405)
MainThread::INFO::2021-11-23 15:15:52,660::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 974 due to cpu load
MainThread::INFO::2021-11-23 15:15:52,660::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2426)
MainThread::INFO::2021-11-23 15:16:02,849::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 952 due to cpu load
MainThread::INFO::2021-11-23 15:16:02,849::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 2448)
...
MainThread::INFO::2021-11-23 15:23:14,202::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 23 due to cpu load
MainThread::INFO::2021-11-23 15:23:14,203::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3377)
MainThread::INFO::2021-11-23 15:23:14,203::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host node4491117 (id: 2, score: 3044)
MainThread::INFO::2021-11-23 15:23:24,399::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1 due to cpu load
MainThread::INFO::2021-11-23 15:23:24,399::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3399)
MainThread::INFO::2021-11-23 15:23:24,400::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host node4491117 (id: 2, score: 3026)
MainThread::INFO::2021-11-23 15:23:33,594::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)

The logic (both before and after the fix) is slightly complex, so you might want to play with this some more and see that things remain reasonable, to see that the fix to current bug, and to bug 1993957, do not break other stuff. Some log messages you might run into, the broker log (with grep 'cpu_load_no_engine'):

- "VM not on this host" - normal state, score should go up/down based on cpu load

- "System load", "total", "engine", "non-engine" - normal

- "Ignoring cpuUser/cpuSys, init values" - should/might happen right after starting the engine VM

- "engine VM cpu usage is not up-to-date" - might happen more than 5 minutes after starting the engine VM, if cpu usage is still not reported correctly. This does affect score and might lead to the engine VM being stopped.

Comment 6 Yedidyah Bar David 2021-11-24 09:17:45 UTC
Workaround: If the score, with 4.4.9, remains low, despite cpu load becoming low, you can 'systemctl restart ovirt-ha-broker', and wait a few minutes.

Comment 7 Polina 2021-11-28 14:57:32 UTC
(In reply to Yedidyah Bar David from comment #6)
> Workaround: If the score, with 4.4.9, remains low, despite cpu load becoming
> low, you can 'systemctl restart ovirt-ha-broker', and wait a few minutes.

yes, restarting of ovirt-ha-broker helps in such a case.

Comment 8 Polina 2022-03-20 11:11:32 UTC
verified on vdsm-4.50.0.10-1.el8ev.x86_64, ovirt-engine-4.5.0.1-601.f26e9ea8cac5.3.el8ev.noarch  by running automation test rhevmtests.compute.sla.hosted_engine.hosted_engine_ha.hosted_engine_ha_test.TestHostCpuLoadProblem

Comment 9 Sandro Bonazzola 2022-04-20 06:33:59 UTC
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.