Bug 1294772 - Host upgrade manager checks updates for all hosts at the same time
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.2
Target Release: 3.6.2.5
Assigned To: Moti Asayag
QA Contact: Jiri Belka
Duplicates: 1294773
Depends On:
Blocks:
Reported: 2015-12-30 04:03 EST by Moti Asayag
Modified: 2016-02-18 06:17 EST
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-18 06:17:32 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
masayag: ovirt-3.6.z?
rule-engine: planning_ack?
masayag: devel_ack+
pstehlik: testing_ack+



External Trackers
Tracker ID | Priority | Status | Summary | Last Updated
oVirt gerrit 51163 | master | MERGED | core: Randomly schedule check updates for host job | 2015-12-30 09:23 EST
oVirt gerrit 51177 | ovirt-engine-3.6 | MERGED | core: Randomly schedule check updates for host job | 2015-12-30 11:12 EST
oVirt gerrit 51179 | ovirt-engine-3.6.2 | MERGED | core: Randomly schedule check updates for host job | 2015-12-30 11:15 EST

Description Moti Asayag 2015-12-30 04:03:07 EST
Description of problem:
The scheduled job that checks for available host updates is scheduled at the same time for all hosts in the system.
For RHEL/next-gen-ovirt-node hosts, each check requires a dedicated thread for the duration of the SSH session, so for a minute or two there is a spike in thread consumption on the engine server. If the repositories aren't refreshed, the check takes even longer.

The purpose of this bug is to start each host's check at a random time within the first hour after engine startup, so that the load is spread out. Subsequent checks are then scheduled relative to that initial offset.
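The proposed scheduling (a random initial offset per host, then periodic checks relative to it) can be sketched as below. This is a minimal sketch, assuming a plain `java.util.concurrent.ScheduledExecutorService`; the engine log later in this bug shows Quartz worker threads (`DefaultQuartzScheduler_Worker-*`), so the actual fix presumably uses the engine's own scheduler instead. The class name, `checkForUpdates`, the host IDs, and the `window`/`interval` parameters are hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StaggeredUpdateChecks {

    /** Random initial delay in [0, window); window must be positive and fit in an int of millis. */
    static long randomInitialDelayMillis(Random random, Duration window) {
        return random.nextInt(Math.toIntExact(window.toMillis()));
    }

    /** Schedules one independent, repeating check per host (hypothetical helper). */
    static void scheduleChecks(ScheduledExecutorService scheduler, List<String> hostIds,
                               Duration window, Duration interval, Random random) {
        for (String hostId : hostIds) {
            long initialDelay = randomInitialDelayMillis(random, window);
            // scheduleWithFixedDelay measures the interval from the END of one run to the
            // start of the next, i.e. "X hours after the previous check completed".
            scheduler.scheduleWithFixedDelay(
                    () -> checkForUpdates(hostId),
                    initialDelay, interval.toMillis(), TimeUnit.MILLISECONDS);
        }
    }

    // Hypothetical placeholder for the SSH-based update check of a single host.
    static void checkForUpdates(String hostId) {
        System.out.println("Checking updates for " + hostId);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        // Scaled-down demo: a 2-second window instead of 1 hour, 24-hour interval.
        scheduleChecks(scheduler, List.of("host-1", "host-2", "host-3"),
                Duration.ofSeconds(2), Duration.ofHours(24), new Random());
        Thread.sleep(2500);   // let the first round of checks fire
        scheduler.shutdownNow();
    }
}
```

Using a fixed delay rather than a fixed rate keeps each host's next check anchored to when its previous check finished, so a slow check (for example, when repositories must be refreshed) does not cause checks to pile up.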

How reproducible:
always

Steps to Reproduce:
1. Have ovirt-engine with multiple hosts in up/maintenance status.
2. Start ovirt-engine, and wait for the host-upgrade-manager to check for updates.

Actual results:
The log (event-log/engine.log) indicates that all of the hosts are being examined at the same time.

Expected results:
Host update checks should be spread more evenly over time.
Comment 1 Yaniv Kaul 2015-12-30 04:32:03 EST
What will be the user experience? Clearly, we want to present to the user in a single event all hosts that need an upgrade, and not one by one every several hours. Why not trigger them with a small sleep in between (say, 1 minute or so)?
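The alternative raised here, starting the host checks one after another with a small fixed pause, amounts to deterministic offsets of `i * spacing`. A minimal sketch of that offset calculation, with a hypothetical class and method name, using `java.time.Duration`:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class FixedSpacingOffsets {

    /** Start offset for the i-th host when hosts are triggered `spacing` apart. */
    static List<Duration> offsets(int hostCount, Duration spacing) {
        List<Duration> result = new ArrayList<>();
        for (int i = 0; i < hostCount; i++) {
            result.add(spacing.multipliedBy(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [PT0S, PT1M, PT2M, PT3M, PT4M]
        System.out.println(offsets(5, Duration.ofMinutes(1)));
    }
}
```

With 100 hosts and a 1-minute spacing, the last check in this sketch would start 99 minutes after the first. Unlike random offsets, this approach needs the ordered host list when the schedule is computed.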
Comment 2 Moti Asayag 2015-12-30 05:12:40 EST
(In reply to Yaniv Kaul from comment #1)
> What will be the user experience? Clearly, we want to present to the user in
> a single event all hosts that need upgrade, and not one by one every several
> hours. Why not trigger them with a small sleep in between (say, 1 minute or
> so) ?

As in all host-deploy flows (add/upgrade/reinstall/enroll certificate), the user experience is per-host. There is no mass operation for hosts: each host has its own life cycle in the system and its own scheduled jobs, and each job is evaluated to be executed or skipped based on the host's status or lock. Therefore, a separate event is created for each host that has updates available. In addition, the UI shows the 'updates available' icon next to that host's status as soon as its update check completes.
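The per-host model described above, where each host's job decides on its own whether to run or skip and emits its own event, can be sketched as follows. This is an illustrative sketch only: `HostStatus`, `Host`, `lockedHosts`, the event text, and the `updatesAvailable` predicate are hypothetical stand-ins, not ovirt-engine classes. The set of checkable statuses (Up and Maintenance) follows the reproduction steps in the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

public class PerHostUpdateCheck {

    enum HostStatus { UP, MAINTENANCE, NON_RESPONSIVE, INSTALLING }

    // Statuses in which this sketch allows an update check to run.
    static final Set<HostStatus> CHECKABLE = EnumSet.of(HostStatus.UP, HostStatus.MAINTENANCE);

    static final class Host {
        final String name;
        volatile HostStatus status;
        Host(String name, HostStatus status) { this.name = name; this.status = status; }
    }

    // Hosts currently locked by another flow (e.g. an ongoing upgrade); hypothetical.
    static final Set<String> lockedHosts = ConcurrentHashMap.newKeySet();
    // Stand-in for the event log: one entry per host that has updates.
    static final List<String> events = Collections.synchronizedList(new ArrayList<>());

    /** Body of one host's scheduled job; returns true if the check actually ran. */
    static boolean runCheck(Host host, Predicate<Host> updatesAvailable) {
        if (!CHECKABLE.contains(host.status) || lockedHosts.contains(host.name)) {
            return false; // skip this round; the job fires again at its next scheduled time
        }
        if (updatesAvailable.test(host)) {
            events.add("Host " + host.name + " has available updates.");
        }
        return true;
    }

    public static void main(String[] args) {
        Host a = new Host("host-a", HostStatus.UP);
        Host b = new Host("host-b", HostStatus.NON_RESPONSIVE);
        Host c = new Host("host-c", HostStatus.MAINTENANCE);
        lockedHosts.add("host-c");
        for (Host h : List.of(a, b, c)) {
            System.out.println(h.name + " checked: " + runCheck(h, host -> true));
        }
        System.out.println(events);
    }
}
```

Because each job only touches its own host, the event can be attached to that host's context, which is the property the comment argues a single cross-host job would lose.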

So with a DC of 100 hosts, roughly 1-2 hosts are checked for updates per minute, and after 1 hour all of the hosts have been checked.
The next check for each host runs X hours after its first check completed, so the per-minute distribution of checks is preserved.
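The 1-2 hosts per minute figure can be checked with a small simulation: place 100 random start offsets in a 60-minute window and count how many land in each minute. This sketch (class and method names hypothetical) assumes uniformly random offsets at one-second granularity.

```java
import java.util.Random;

public class CheckDistribution {

    /** Counts how many of hostCount random start offsets fall into each minute of the window. */
    static int[] perMinuteCounts(int hostCount, int windowMinutes, Random random) {
        int[] counts = new int[windowMinutes];
        for (int i = 0; i < hostCount; i++) {
            int offsetSeconds = random.nextInt(windowMinutes * 60); // [0, window) in seconds
            counts[offsetSeconds / 60]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int hosts = 100;
        int window = 60;
        int[] counts = perMinuteCounts(hosts, window, new Random(1));
        int busiest = 0;
        for (int c : counts) {
            busiest = Math.max(busiest, c);
        }
        System.out.printf("average %.2f hosts/minute, busiest minute %d hosts%n",
                (double) hosts / window, busiest);
    }
}
```

The average is 100 / 60, about 1.67 checks per minute; with random (rather than evenly spaced) offsets, an individual minute can occasionally see a few more than that.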

If we instead introduce a single job that checks for upgrades sequentially, host by host, we will face the following issues:
1. The event will be created only after all of the hosts have been examined.
1.1. With the suggested fix, all hosts report their upgrade status within ~1 hour.
2. The event will list the hosts' names, but will not be visible in the specific host's context, because an event can refer to at most a single host.
3. We would have to mix host-specific flows with a cross-system flow, adding complexity that doesn't exist today.
4. The host logic would be scattered around the code instead of being host-centric, which is less maintainable.
Comment 3 Moti Asayag 2016-01-10 02:29:45 EST
*** Bug 1294773 has been marked as a duplicate of this bug. ***
Comment 4 Jiri Belka 2016-02-08 09:33:18 EST
ok, rhevm-3.6.3-0.1.el6.noarch

2016-02-08 15:29:42,774 INFO  [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (DefaultQuartzScheduler_Worker-10) [74801aa8] Connected to host 10.34.63.223 with SSH key fingerprint: SHA256:s5KOBxuTA4QxlqJLvN8gjC00bDt9/sD22+Gt16VsBDs
2016-02-08 15:32:23,590 INFO  [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (DefaultQuartzScheduler_Worker-30) [44acfb7c] Connected to host 10.34.62.205 with SSH key fingerprint: SHA256:KLrpafe4NKNwdt1Ri5m6zGa3NLBSHJV3Zd+ux50S+L0
