Bug 1294772

Summary: Host upgrade manager checks updates for all hosts at the same time
Product: [oVirt] ovirt-engine
Reporter: Moti Asayag <masayag>
Component: BLL.Infra
Assignee: Moti Asayag <masayag>
Status: CLOSED CURRENTRELEASE
QA Contact: Jiri Belka <jbelka>
Severity: medium
Priority: medium
Version: 3.6.0
CC: bugs, masayag, oourfali, stirabos
Target Milestone: ovirt-3.6.2
Flags: masayag: ovirt-3.6.z?
       rule-engine: planning_ack?
       masayag: devel_ack+
       pstehlik: testing_ack+
Target Release: 3.6.2.5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-18 11:17:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Moti Asayag 2015-12-30 09:03:07 UTC
Description of problem:
The scheduled job for checking the availability of updates is scheduled at the same time for all hosts in the system.
For RHEL/next-gen-ovirt-node hosts, each check requires a dedicated thread for the duration of the SSH session. For a minute or two there is a spike in thread consumption on the engine server. If the repositories aren't refreshed, the check takes even longer.

The purpose of this bug is to start each check at a random time during the first hours after the engine's startup, so the load is spread out. Future checks will be triggered based on that initial offset.
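The staggering described above can be sketched roughly as follows. This is a minimal illustration, not the actual ovirt-engine code; the class name, spread window, and re-check period are all hypothetical:

```java
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class HostUpdateScheduler {
    // Hypothetical values: spread initial checks over the first hour,
    // then re-check each host every 24 hours from its own first run.
    static final long SPREAD_MINUTES = 60;
    static final long PERIOD_HOURS = 24;

    private final ScheduledExecutorService executor = Executors.newScheduledThreadPool(4);
    private final Random random = new Random();

    /** Pick a random initial delay so hosts don't all start at once. */
    static long initialDelayMinutes(Random random, long spreadMinutes) {
        return random.nextInt((int) spreadMinutes);
    }

    /** Schedule one host's update check with a random initial offset. */
    void scheduleCheck(String hostName, Runnable check) {
        long initialDelay = initialDelayMinutes(random, SPREAD_MINUTES);
        // Subsequent runs fire PERIOD_HOURS after the first, so the
        // per-host offset (and thus the spread) is preserved over time.
        executor.scheduleAtFixedRate(check, initialDelay,
                TimeUnit.HOURS.toMinutes(PERIOD_HOURS), TimeUnit.MINUTES);
    }
}
```

Because each host keeps its own fixed-rate schedule anchored at the random offset, the spike at engine startup is replaced by a roughly uniform trickle of checks.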

How reproducible:
always

Steps to Reproduce:
1. Have ovirt-engine with multiple hosts in up/maintenance status.
2. Start ovirt-engine, and wait for the host-upgrade-manager to check for updates.

Actual results:
The log (event-log/engine.log) indicates that all of the hosts are being examined at the same time.

Expected results:
Checking for host updates should be done more 'evenly'.

Comment 1 Yaniv Kaul 2015-12-30 09:32:03 UTC
What will be the user experience? Clearly, we want to present to the user in a single event all hosts that need upgrade, and not one by one every several hours. Why not trigger them with a small sleep in between (say, 1 minute or so) ?

Comment 2 Moti Asayag 2015-12-30 10:12:40 UTC
(In reply to Yaniv Kaul from comment #1)
> What will be the user experience? Clearly, we want to present to the user in
> a single event all hosts that need upgrade, and not one by one every several
> hours. Why not trigger them with a small sleep in between (say, 1 minute or
> so) ?

As in all host-deploy flows (add/upgrade/reinstall/enroll certificate), the user experience is action-per-host. There is no mass operation for hosts. Each host has its own life-cycle in the system and its own scheduled jobs, which are executed or skipped based on the host's status or lock. Therefore a single event will be created for each host that has updates available. In addition, in the UI the 'updates available' icon will appear next to that host's status as soon as the update check completes.

So if we have a DC with 100 hosts, the division should be 1-2 hosts checked for updates per minute. After 1 hour, all of the hosts will have been checked.
The next check for each host will run X hours after its first check completed, so the per-minute division of checks is preserved.
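The arithmetic above (100 hosts spread uniformly over 60 minutes averages 1-2 checks per minute) can be illustrated with a small simulation. This is a hypothetical sketch, not engine code:

```java
import java.util.Random;

class UpdateCheckSpread {
    /**
     * Distribute hostCount hosts over spreadMinutes one-minute buckets,
     * each host picking its start minute uniformly at random.
     * Returns the number of checks landing in each minute.
     */
    static int[] bucketCounts(int hostCount, int spreadMinutes, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[spreadMinutes];
        for (int i = 0; i < hostCount; i++) {
            counts[random.nextInt(spreadMinutes)]++;
        }
        return counts;
    }
}
```

With 100 hosts and a 60-minute window, the expected load is 100/60 ≈ 1.7 checks per minute, versus 100 simultaneous SSH sessions without the offset.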

If we change this by introducing a single job that checks for upgrades sequentially, host by host, we'll face the following issues:
1. The event will be created only after all of the hosts have been examined.
1.1. With the suggested fix, all hosts will report their upgrade status within ~1 hour.
2. The event will list the hosts' names, but will not be visible in the specific host's context due to the event structure (it may refer to a single host at most).
3. We'd have to mix host-specific flows with a cross-system flow (this adds complexity that currently doesn't exist).
4. The host logic would be scattered around the code instead of being host-centric, which is less maintainable.

Comment 3 Moti Asayag 2016-01-10 07:29:45 UTC
*** Bug 1294773 has been marked as a duplicate of this bug. ***

Comment 4 Jiri Belka 2016-02-08 14:33:18 UTC
ok, rhevm-3.6.3-0.1.el6.noarch

2016-02-08 15:29:42,774 INFO  [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (DefaultQuartzScheduler_Worker-10) [74801aa8] Connected to host 10.34.63.223 with SSH key fingerprint: SHA256:s5KOBxuTA4QxlqJLvN8gjC00bDt9/sD22+Gt16VsBDs
2016-02-08 15:32:23,590 INFO  [org.ovirt.engine.core.bll.hostdeploy.VdsDeployBase] (DefaultQuartzScheduler_Worker-30) [44acfb7c] Connected to host 10.34.62.205 with SSH key fingerprint: SHA256:KLrpafe4NKNwdt1Ri5m6zGa3NLBSHJV3Zd+ux50S+L0