Bug 584783

Summary: [RFE] job congestion measurement (number of queued jobs within a time period)
Product: [Retired] Beaker
Reporter: Ales Zelinka <azelinka>
Component: web UI
Assignee: Nick Coghlan <ncoghlan>
Status: CLOSED CURRENTRELEASE
QA Contact: Dan Callaghan <dcallagh>
Severity: high
Docs Contact:
Priority: high
Version: 0.5
CC: bpeck, dcallagh, dkovalsk, kbaker, mcermak, mcsontos, mishin, ohudlick, psplicha, rmancy
Target Milestone: 0.11
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard: Measurements
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
: 891841 (view as bug list)
Environment:
Last Closed: 2013-01-17 04:33:38 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 883606
Bug Blocks: 593663

Description Ales Zelinka 2010-04-22 12:11:52 UTC
Description of problem:
When the job queue becomes too long, it's useful to see how many jobs are waiting to be processed. From this number and from its changes over time I can extrapolate whether congestion is building up or diminishing and plan my work accordingly.

Please add the number of returned results to the Job search page, similar to the way RHTS does ("1 - 25 of 81" at the time of writing).


Version-Release number of selected component (if applicable):
Version - 0.5.25

Comment 1 Kevin Baker 2010-04-22 15:42:28 UTC
Extra points for showing the queue size over a time period: queue size one hour ago, 30 min ago, now. A bit like uptime.

More points if you had a small graph which shows the queue size over the last hour/week. A bit like http://www.google.com/finance?q=rht. It shows the latest, but then you can scroll in and out to get the larger picture. I can but ask.

Comment 2 Raymond Mancy 2010-04-23 00:11:02 UTC
I've already pushed a branch which will show the number of lines returned in the search pages (Jobs/System/Recipes etc.). The ticket for it is here:

https://bugzilla.redhat.com/show_bug.cgi?id=567788


Kevin,
I think perhaps what we are after here is a whole 'analysis' kind of page?

Comment 3 Kevin Baker 2010-04-23 01:00:22 UTC
(In reply to comment #2)
> I've already pushed a branch which will show the number of lines returned in
> the search pages (Jobs/System/Recipes etc etc). The ticket for it is here 
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=567788
> 
> 
> Kevin,
> I think perhaps what we are after here is a whole 'analysis' kind of page?    

No, not an analysis page, though that would be nice. Just a little bit of extra feedback on which way the queue is going. 

Ales can correct me if I'm wrong but I think he is referring to the jobs queued page in legacy [1]. It tells you how many jobs are currently in queue. 

It'd be helpful if the scheduler could take periodic measurements of the queue load and save it to a metrics table. Then it would be a simple case of rendering some stats from that.

Something like the output from uptime:
 20:58:21 up 10:31, 18 users,  load average: 0.69, 0.40, 0.29

That could be in this case:
123 Jobs Queued, load average: 120, 80, 5

This shows how many jobs are currently queued, and the queue load averages for the past 1 minute, 20 minutes, and 1 hour.
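
The uptime-style line described above could be rendered from periodic samples roughly as follows. This is only a sketch under stated assumptions: the `queue_summary` name and the in-memory list of `(timestamp, queued_jobs)` samples are hypothetical stand-ins for a metrics table, and it uses a plain mean over each window rather than the exponentially damped average that `uptime` reports.

```python
from datetime import datetime, timedelta


def queue_summary(samples, now, windows_minutes=(1, 20, 60)):
    """Format an uptime-style line from (timestamp, queued_jobs) samples.

    `samples` is a hypothetical stand-in for rows read from a metrics
    table, sorted oldest first.
    """
    current = samples[-1][1] if samples else 0
    averages = []
    for minutes in windows_minutes:
        cutoff = now - timedelta(minutes=minutes)
        recent = [count for taken_at, count in samples if taken_at >= cutoff]
        # Fall back to the current value when a window holds no samples.
        mean = sum(recent) / len(recent) if recent else current
        averages.append(round(mean))
    return "%d Jobs Queued, load average: %s" % (
        current, ", ".join(str(a) for a in averages))


now = datetime(2010, 4, 23, 1, 0)
samples = [
    (now - timedelta(minutes=50), 10),
    (now - timedelta(minutes=10), 100),
    (now, 120),
]
print(queue_summary(samples, now))
```

A rising sequence read right to left (hour, 20 minutes, 1 minute) would suggest congestion is building up; a falling one suggests it is clearing.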





[1] http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?status=Queued

Comment 4 Raymond Mancy 2010-04-23 01:12:09 UTC
He probably is referring to that page. 
That page though is equivalent to our Job Search page with the 'Queued' filter. Having an item count at the bottom does serve the same purpose as the "1 - 25 of 81". 


Re the metrics: I don't see why it couldn't be done at some point.

Comment 5 Ales Zelinka 2010-04-23 08:43:22 UTC
(In reply to comment #3)
> Just a little bit of extra
> feedback on which way the queue is going. 

That's exactly the use-case I have in mind.
> 
> Ales can correct me if I'm wrong but I think he is referring to the jobs queued
> page in legacy [1]. 
> 
Yes, I am.

> It'd be helpful if the scheduler could take periodic measurements of the queue
> load and save it to a metrics table. Then it would be a simple case of
> rendering some stats from that.
> 
> Something like the output from uptime:
>  20:58:21 up 10:31, 18 users,  load average: 0.69, 0.40, 0.29
> 
> That could be in this case:
> 123 Jobs Queued, load average: 120, 80, 5
> 
> Shows How many jobs are currently queued, and the queue load averages for the
> past 1minute, 20minutes, and 1 hour.

That would be a great enhancement. As the total number of queued jobs is already implemented in bz567788, let's use this bug for the uptime-style job queue size.

Comment 6 Petr Šplíchal 2011-03-17 14:36:06 UTC
That would be a very useful feature. Lately it happens here and
there that jobs are queued for a long time. Having a short "uptime"
message always displayed at the top / bottom of the page would let
users know what response time to expect and possibly keep them from
submitting low priority / experimental jobs.

Hovering on the message could show architecture-specific results
(waiting on a specific arch is quite a common scenario).

Comment 7 Ales Zelinka 2011-04-01 12:58:03 UTC
Raising priority, please consider implementing this soon. Lack of free machines and long waiting queues are still a problem (as the recent discussion on the mailing list indicates).

Comment 8 Petr Šplíchal 2011-06-21 11:50:43 UTC
Could we get at least some basic functionality implemented? Here and
there we encounter a situation where no machines are available for a
longer period of time, and it's quite hard to find out which people
are responsible for the resource exhaustion so that we could "manually"
ping them. For now, a simple table with machines per user, sorted
and updated hourly, perhaps with a week of history, would help a lot.
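
A minimal sketch of the machines-per-user table requested above, assuming the current reservations are available as `(user, system)` pairs. The `machines_per_user` name, the input shape, and the sample data are all hypothetical; this is not Beaker's actual data model.

```python
from collections import Counter


def machines_per_user(reservations):
    """Return (user, machine_count) rows, heaviest users first.

    `reservations` is a hypothetical iterable of (user, system) pairs.
    """
    counts = Counter(user for user, _system in reservations)
    # Sort by count descending, then by user name for a stable order.
    return sorted(counts.items(), key=lambda row: (-row[1], row[0]))


# Illustrative data only.
reservations = [("alice", "host1"), ("bob", "host2"), ("alice", "host3")]
for user, count in machines_per_user(reservations):
    print("%-10s %d" % (user, count))
```

Running this hourly and keeping a week of snapshots would give the history mentioned in the comment.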

Comment 13 Marian Csontos 2012-01-18 10:50:51 UTC
Looking at this, there are a "few" old jobs that have been sitting in Beaker for a long time (usually waiting for a particular machine).

My hypothesis is that jobs older than 1 week are unlikely to have a significant effect on current congestion, as these are more likely to be executed in sequence, while less specific jobs will run in parallel on anything.

My second hypothesis is that there is a negative correlation between "number of possible systems" and "time from queued to scheduled." This may be a better indicator to help predict when the job will be scheduled.

My hypotheses have not been tested; they are just what common sense tells me.

Also, there will always be exceptions to these "rules", like jobs running on "Group"-owned HW, which may be under-utilized while the common pool is crowded. For improved accuracy we would need to consider things system by system, which would generate far too much load.

If this RFE ever gets addressed, could we have a "congestion graph" that displays those in different colors? I have a "heat map" in mind: hottest - 0-1 hour, 1 hour-1 day, 1 day-1 week, and older - coldest.
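
The age buckets proposed above could be computed along these lines. This is a sketch, assuming each queued job's queue timestamp is available; the bucket boundaries come from the comment, while the names `BUCKETS`, `age_bucket`, and `bucket_counts` are hypothetical.

```python
from datetime import datetime, timedelta

# Upper age limit and label for each bucket, hottest first.
BUCKETS = [
    (timedelta(hours=1), "0-1 hour"),
    (timedelta(days=1), "1 hour-1 day"),
    (timedelta(weeks=1), "1 day-1 week"),
]


def age_bucket(queued_at, now):
    """Return the heat-map bucket label for a job queued at `queued_at`."""
    age = now - queued_at
    for limit, label in BUCKETS:
        if age < limit:
            return label
    return "older"


def bucket_counts(queued_times, now):
    """Count queued jobs per age bucket, including empty buckets."""
    counts = {label: 0 for _limit, label in BUCKETS}
    counts["older"] = 0
    for queued_at in queued_times:
        counts[age_bucket(queued_at, now)] += 1
    return counts
```

Each count could then be drawn as one colored band of the congestion graph, so long-waiting jobs do not mask the state of the recent queue.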

Comment 15 Ales Zelinka 2012-11-09 13:44:21 UTC
I've created a proof of concept for how I imagine this should work: 
https://wiki.test.redhat.com/azelinka/BeakerStatus

(please ignore the gaps in the graphs - I run it from my personal workstation)

Comment 16 Nick Coghlan 2012-12-13 08:41:39 UTC
Beaker 0.11 will include improved granularity for the stats sent to a configured Graphite metrics server. At the moment we have already defined the necessary metrics for stacked graphs showing:
- recipes split by status (aggregate)
- systems split by current utilisation (aggregate, per arch, per lab controller)
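
As an illustration of how such aggregate counts might be published, here is a sketch that renders them as lines in Graphite's plaintext protocol (`<metric path> <value> <timestamp>`, one metric per line). It assumes the metrics server accepts that protocol; the `graphite_lines` helper, the `beaker.recipes` prefix, and the numbers are hypothetical and do not reflect the metric names Beaker actually uses.

```python
def graphite_lines(prefix, counts, timestamp):
    """Render {name: value} gauges as Graphite plaintext-protocol lines.

    The prefix and metric names passed in are illustrative assumptions.
    """
    return ["%s.%s %d %d\n" % (prefix, name, value, timestamp)
            for name, value in sorted(counts.items())]


# Illustrative values: recipe counts split by status at one sample time.
for line in graphite_lines("beaker.recipes", {"queued": 81, "running": 40}, 1355387999):
    print(line, end="")
```

Because every status is sent with the same timestamp, the series line up and can be drawn as a stacked graph.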

To handle the job congestion use case, we'd also like to publish a "per arch" set of graphs for the recipe queue. However, I've realised I'm not clear on what a "per arch" recipe queue actually means. Is it:

- any recipe that *can* run on that arch? (would lead to a lot of double counting and make the per-arch queues look worse than they are)
- any recipe that can *only* run on that arch? (would leave out recipes that can run on multiple arches, including not specifying an arch constraint at all as well as recipes that can run only on the 32- and 64-bit variants of the same basic arch)

I'm thinking the second interpretation makes more sense, but then it's a matter of extracting that information from the host_requires data.

Comment 17 Dan Callaghan 2012-12-13 23:11:34 UTC
(In reply to comment #16)
> - any recipe that can *only* run on that arch? (would leave out recipes that
> can run on multiple arches, including not specifying an arch constraint at
> all as well as recipes that can run only on the 32- and 64-bit variants of
> the same basic arch)

Every recipe can by definition only run on exactly one arch because every recipe has exactly one distro tree, which has a specific arch.
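
Given that, a per-arch queue reduces to grouping queued recipes by the arch of their distro tree, with no double counting. A minimal sketch, assuming recipes are available as `(status, arch)` pairs; the `queued_by_arch` name, the input shape, and the `"Queued"` status string are hypothetical rather than Beaker's real model:

```python
from collections import Counter


def queued_by_arch(recipes):
    """Count queued recipes per arch.

    `recipes` is a hypothetical iterable of (status, arch) pairs, where
    arch is the arch of the recipe's single distro tree.
    """
    return Counter(arch for status, arch in recipes if status == "Queued")


# Illustrative data only.
recipes = [
    ("Queued", "x86_64"),
    ("Queued", "i386"),
    ("Running", "x86_64"),
    ("Queued", "x86_64"),
]
print(dict(queued_by_arch(recipes)))
```

The per-arch counts always sum to the overall queue length, since each recipe contributes to exactly one arch.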

Comment 18 Nick Coghlan 2012-12-17 06:33:03 UTC
Overall queue metrics reporting, as well as broken down by arch:
http://gerrit.beaker-project.org/#/c/1574/

Comment 21 Dan Callaghan 2012-12-20 05:51:10 UTC
2012-12-20 15:50:04,995 beakerd ERROR Exception in metrics loop
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 627, in metrics_loop
    recipe_count_metrics()
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 574, in recipe_count_metrics
    Recipe.filter(Recipe.virt_status == RecipeVirtStatus.possible))
AttributeError: type object 'Recipe' has no attribute 'filter'

Comment 23 Nick Coghlan 2013-01-04 07:28:15 UTC
Broken query fixed and unit tests added in:
http://gerrit.beaker-project.org/#/c/1602

Comment 26 Dan Callaghan 2013-01-17 04:33:38 UTC
Beaker 0.11.0 has been released.