Red Hat Bugzilla – Bug 1294772
Host upgrade manager checks updates for all hosts at the same time
Last modified: 2016-02-18 06:17:32 EST
Description of problem:
The scheduled job that checks for available host updates is scheduled at the same time for all hosts in the system.
For RHEL/next-gen-ovirt-node hosts, each check requires a dedicated thread for the duration of the SSH session, so for a minute or two there is a spike in thread consumption on the engine-server. If the repositories aren't refreshed, the check will take longer.
The purpose of this bug is to start each check at a random time during the first hours after the engine's startup, so that the load is spread out. Future checks will be triggered based on that initial offset.
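A minimal sketch of the idea, in Python for brevity (the engine itself is Java and this is not its code). The names `initial_offsets` and `check_times`, the window length, and the period are all hypothetical; the sketch only illustrates "random initial offset, then a fixed period per host".

```python
import random


def initial_offsets(host_ids, window_seconds, rng=None):
    """Pick a random start offset (in seconds) inside the window for each host."""
    rng = rng or random.Random()
    return {host_id: rng.uniform(0, window_seconds) for host_id in host_ids}


def check_times(offset_seconds, period_seconds, count):
    """Return the first `count` check times for one host.

    Every later check is derived from the host's initial offset, so the
    spread chosen at startup carries over to all following rounds.
    """
    return [offset_seconds + i * period_seconds for i in range(count)]


# Hypothetical values: three hosts, a one-hour window, a 24-hour period.
offsets = initial_offsets(["host-a", "host-b", "host-c"], window_seconds=3600)
schedule = {host: check_times(off, 24 * 3600, 3) for host, off in offsets.items()}
```

Because each host keeps its own offset, no cross-host coordination is needed after startup.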
Steps to Reproduce:
1. Have ovirt-engine with multiple hosts in up/maintenance status.
2. Start ovirt-engine, and wait for the host-upgrade-manager to check for updates.
The log (event-log/engine.log) indicates that all of the hosts are being examined at the same time.
Checking for host updates should be done more 'evenly'.
What will be the user experience? Clearly, we want to present to the user in a single event all hosts that need upgrade, and not one by one every several hours. Why not trigger them with a small sleep in between (say, 1 minute or so) ?
(In reply to Yaniv Kaul from comment #1)
> What will be the user experience? Clearly, we want to present to the user in
> a single event all hosts that need upgrade, and not one by one every several
> hours. Why not trigger them with a small sleep in between (say, 1 minute or
> so) ?
As in any host-deploy flow (add/upgrade/reinstall/enrol certificate), the user experience is action-per-host. There is no mass operation for hosts. Each host has its own life-cycle in the system. Each host has its own scheduled jobs, and each job is executed or skipped based on the host status or host lock. Therefore a single event will be created for each host that has updates available. In addition, in the UI the 'updates available' icon will appear next to that host's status as soon as the updates check is completed.
So if we have a DC with 100 hosts, the checks should be spread at 1-2 hosts per minute. After 1 hour, all of the hosts will have been checked.
The next check for each host will be X hours after its first check was completed, so the per-minute distribution of checks is preserved.
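The arithmetic above can be checked with a small sketch. Note that it uses evenly spaced offsets rather than random ones, purely to make the 1-2 hosts per minute figure reproducible; the helper names and the 24-hour value standing in for X are assumptions, and check duration is ignored.

```python
from collections import Counter


def even_offsets(host_count, window_seconds):
    """Evenly spaced start offsets in seconds (illustrative stand-in for random ones)."""
    return [i * window_seconds // host_count for i in range(host_count)]


def checks_per_minute(offsets_seconds):
    """Count how many checks start within each one-minute bucket."""
    return Counter(offset // 60 for offset in offsets_seconds)


offsets = even_offsets(100, 3600)        # 100 hosts over one hour, 36 s apart
first_round = checks_per_minute(offsets)

period_seconds = 24 * 3600               # hypothetical value for "X hours"
second_round = checks_per_minute([o + period_seconds for o in offsets])
```

Shifting every host by the same period moves the buckets but leaves the per-minute counts unchanged, which is why the initial spread is kept in later rounds.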
If we change that by introducing a single job that checks for upgrades sequentially, host by host, we'll face the following issues:
1. The event will be created only after all of the hosts were examined.
1.1. In the suggested fix, all hosts will report their upgrades status within ~1 hour.
2. The event will list the hosts' names, but will not be visible on the specific host context due to the event structure (it may refer to a single host at most).
3. We'll have to mix host-specific flows with a cross-system flow (this adds complexity that currently doesn't exist).
4. The host logic will be scattered around the code instead of being host-centric, which is less maintainable.
*** Bug 1294773 has been marked as a duplicate of this bug. ***
2016-02-08 15:29:42,774 INFO [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (DefaultQuartzScheduler_Worker-10) [74801aa8] Connected to host 10.34.63.223 with SSH key fingerprint: SHA256:s5KOBxuTA4QxlqJLvN8gjC00bDt9/sD22+Gt16VsBDs
2016-02-08 15:32:23,590 INFO [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (DefaultQuartzScheduler_Worker-30) [44acfb7c] Connected to host 10.34.62.205 with SSH key fingerprint: SHA256:KLrpafe4NKNwdt1Ri5m6zGa3NLBSHJV3Zd+ux50S+L0