Red Hat Bugzilla – Bug 584783
[RFE] job congestion measurement (number of queued jobs within a time period)
Last modified: 2018-02-05 19:41:31 EST
Description of problem:
When the job queue becomes too long, it's useful to see how many jobs are waiting to be processed. From this number and its changes over time I can extrapolate whether congestion is building up or diminishing, and plan my work accordingly.
Please add the number of returned results on the Job search page, in a similar way to RHTS ("1 - 25 of 81" at the time of writing).
Version-Release number of selected component (if applicable):
Version - 0.5.25
Extra points for showing the queue size over a time period: queue size one hour ago, 30 minutes ago, and now. A bit like uptime.
More points if there were a small graph showing the queue size over the last hour/week. A bit like http://www.google.com/finance?q=rht: it shows the latest data, but you can zoom in and out to get the larger picture. I can but ask.
I've already pushed a branch which will show the number of lines returned on the search pages (Jobs/Systems/Recipes, etc.). The ticket for it is here
I think perhaps what we are after here is a whole 'analysis' kind of page?
(In reply to comment #2)
> I've already pushed a branch which will show the number of lines returned in
> the search pages (Jobs/System/Recipes etc etc). The ticket for it is here
> I think perhaps what we are after here is a whole 'analysis' kind of page?
No, not an analysis page, though that would be nice. Just a little bit of extra feedback on which way the queue is going.
Ales can correct me if I'm wrong, but I think he is referring to the jobs queued page in legacy. It tells you how many jobs are currently in the queue.
It'd be helpful if the scheduler could take periodic measurements of the queue load and save them to a metrics table. Then it would be a simple case of rendering some stats from that.
Something like the output from uptime:
20:58:21 up 10:31, 18 users, load average: 0.69, 0.40, 0.29
That could be in this case:
123 Jobs Queued, load average: 120, 80, 5
This shows how many jobs are currently queued, and the queue load averages for the past 1 minute, 20 minutes, and 1 hour.
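The uptime-style idea above can be sketched in a few lines: given periodic samples of the queue length (as the proposed metrics table would hold), compute an average over several trailing windows. This is a minimal illustration, not Beaker's actual implementation; all names, the synthetic data, and the window sizes are assumptions.

```python
# Hypothetical sketch: compute uptime-style load averages for a job queue
# from periodically recorded (timestamp, queue_length) samples.
# Names and window sizes are illustrative, not Beaker's actual code.

def queue_load_averages(samples, now, windows=(60, 1200, 3600)):
    """samples: list of (unix_timestamp, queued_job_count) tuples.
    Returns one average per trailing window (window sizes in seconds)."""
    averages = []
    for window in windows:
        recent = [count for ts, count in samples if now - ts <= window]
        averages.append(sum(recent) / len(recent) if recent else 0.0)
    return averages

# Synthetic data: one sample per minute over an hour, queue slowly growing.
samples = [(t, 100 + t // 60) for t in range(0, 3600, 60)]
one_min, twenty_min, one_hour = queue_load_averages(samples, now=3600)
# A page footer could then render something like:
# "159 Jobs Queued, load average: 159.0, 149.5, 129.5"
```

In practice the samples would come from the metrics table the scheduler writes to on each pass, and the windows would match whatever intervals the UI advertises.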
He probably is referring to that page.
That page, though, is equivalent to our Job Search page with the 'Queued' filter. Having an item count at the bottom does serve the same purpose as the "1 - 25 of 81".
Re: the metrics, I don't see why it couldn't be done at some point.
(In reply to comment #3)
> Just a little bit of extra
> feedback on which way the queue is going.
That's exactly the use-case I have in mind.
> Ales can correct me if I'm wrong but I think he is referring to the jobs queued
> page in legacy .
> It'd be helpful if the scheduler could take periodic measurements of the queue
> load and save it to a metrics table. Then it would be a simple case of
> rendering some stats from that.
> Something like the output from uptime:
> 20:58:21 up 10:31, 18 users, load average: 0.69, 0.40, 0.29
> That could be in this case:
> 123 Jobs Queued, load average: 120, 80, 5
> Shows How many jobs are currently queued, and the queue load averages for the
> past 1minute, 20minutes, and 1 hour.
That would be a great enhancement. As the total number of queued jobs is already implemented in bz567788, let's use this bug for the uptime-style job queue size.
That would be a very useful feature. Lately it happens here and there that jobs are queued for a long time. Having a short "uptime" message always displayed at the top/bottom of the page would let users know what response time to expect, and possibly keep them from submitting low-priority/experimental jobs.
Hovering over the message could show architecture-specific results (waiting on a specific arch is quite a common scenario).
Raising priority; please consider implementing this soon. A lack of free machines and long waiting queues are still a problem (as a recent discussion on the mailing list indicates).
Could we get at least some basic functionality implemented? Here and there we encounter a situation where no machines are available for a longer period of time, and it's quite hard to find the people responsible for the resource exhaustion so that we could "manually" ping them. For now, a simple table of machines per user, sorted and updated hourly, perhaps with a week of history, would help a lot.
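The requested machines-per-user table could be derived from the current reservations with a simple aggregation. A minimal sketch, assuming reservations are available as (system, user) pairs; none of these names come from Beaker's actual schema.

```python
# Illustrative sketch of the requested "machines per user" report:
# given current reservations as (system, user) pairs, count and sort.
from collections import Counter

def machines_per_user(reservations):
    """reservations: iterable of (system_name, user_name) pairs.
    Returns (user, machine_count) pairs, heaviest users first."""
    counts = Counter(user for _system, user in reservations)
    return counts.most_common()

report = machines_per_user([
    ('host1', 'alice'),
    ('host2', 'alice'),
    ('host3', 'bob'),
])
# report == [('alice', 2), ('bob', 1)]
```

An hourly cron job could snapshot this into a history table to provide the week of history mentioned above.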
Looking at this, there are a "few" old jobs sitting in Beaker for a long time (usually waiting for a particular machine).
My hypothesis is that jobs older than 1 week are unlikely to have a significant effect on current congestion, as these will more likely be executed in sequence, while less specific jobs will run in parallel on anything.
My second hypothesis is that there is a negative correlation between the "number of possible systems" and the "time from queued to scheduled". This may be a better indicator to help predict when a job will be scheduled.
My hypotheses have not been tested; this is just common sense speaking.
Also, there will always be exceptions to these "rules", such as jobs running on "Group"-owned HW, which may be under-utilized while the common pool is crowded. For improved accuracy we would need to consider things system by system, which would generate far too much load.
If this RFE ever gets addressed, could we have a "congestion graph" somehow displaying these in different colors? I have a "heat map" in mind: hottest for 0-1 hour, then 1 hour-1 day, 1 day-1 week, and coldest for anything older.
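The proposed heat-map coloring amounts to bucketing queued jobs by age. A hypothetical sketch of the thresholds described above (the bucket labels are illustrative):

```python
# Hypothetical bucketing for the proposed age "heat map" of queued jobs.
# Thresholds follow the comment above: hottest bucket is the youngest.
HOUR = 3600
DAY = 86400
WEEK = 604800

def age_bucket(age_seconds):
    """Map a queued job's age (in seconds) to a heat-map bucket label."""
    if age_seconds < HOUR:
        return 'under 1 hour'
    if age_seconds < DAY:
        return '1 hour - 1 day'
    if age_seconds < WEEK:
        return '1 day - 1 week'
    return 'over 1 week'
```

A congestion graph could then stack one series per bucket, colored hot to cold.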
I've created a proof of concept for how I imagine this should work:
(please ignore the gaps in the graphs - I ran it from my personal workstation)
Beaker 0.11 will include improved granularity for the stats sent to a configured Graphite metrics server. We have already defined the necessary metrics for stacked graphs showing:
- recipes split by status (aggregate)
- systems split by current utilisation (aggregate, per arch, per lab controller)
To handle the job congestion use case, we'd also like to publish a "per arch" set of graphs for the recipe queue. However, I've realised I'm not clear on what a "per arch" recipe queue actually means. Is it:
- any recipe that *can* run on that arch? (would lead to a lot of double counting and make the per-arch queues look worse than they are)
- any recipe that can *only* run on that arch? (would leave out recipes that can run on multiple arches, including not specifying an arch constraint at all as well as recipes that can run only on the 32- and 64-bit variants of the same basic arch)
I'm thinking the second interpretation makes more sense, but then it's a matter of extracting that information from the host_requires data.
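Under the second interpretation above, a recipe counts toward an arch's queue only when that arch is the sole one it can run on. A minimal sketch, assuming the set of possible arches has already been extracted from the host_requires data (the function name and data shapes here are illustrative, not Beaker's actual API):

```python
# Sketch of the second interpretation: a recipe is counted against an
# arch's queue only when that is the sole arch it can run on.
# Input shapes are assumptions; in Beaker the arch set would have to be
# derived from each recipe's host_requires XML.

def per_arch_queue_counts(recipes):
    """recipes: iterable of (recipe_id, set_of_possible_arches) pairs.
    Returns {arch: count} over recipes constrained to exactly one arch."""
    counts = {}
    for _recipe_id, arches in recipes:
        if len(arches) == 1:            # constrained to exactly one arch
            (arch,) = arches
            counts[arch] = counts.get(arch, 0) + 1
    return counts

counts = per_arch_queue_counts([
    (1, {'x86_64'}),
    (2, {'x86_64'}),
    (3, {'i386', 'x86_64'}),   # multi-arch: excluded under interpretation 2
    (4, {'ppc64'}),
])
# counts == {'x86_64': 2, 'ppc64': 1}
```

Note that under this interpretation the multi-arch recipe (3) appears in no per-arch queue, which is exactly the trade-off described above.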
(In reply to comment #16)
> - any recipe that can *only* run on that arch? (would leave out recipes that
> can run on multiple arches, including not specifying an arch constraint at
> all as well as recipes that can run only on the 32- and 64-bit variants of
> the same basic arch)
Every recipe can by definition run on exactly one arch, because every recipe has exactly one distro tree, which has a specific arch.
Overall queue metrics reporting, as well as broken down by arch:
2012-12-20 15:50:04,995 beakerd ERROR Exception in metrics loop
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 627, in metrics_loop
File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 574, in recipe_count_metrics
Recipe.filter(Recipe.virt_status == RecipeVirtStatus.possible))
AttributeError: type object 'Recipe' has no attribute 'filter'
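For context, the AttributeError above occurs because plain SQLAlchemy declarative model classes have no .filter() method; filtering goes through a session query. A minimal standalone sketch of the working form (the model and status values here are stand-ins, not Beaker's actual schema):

```python
# Minimal SQLAlchemy sketch of the kind of query the traceback suggests.
# Model and values are illustrative stand-ins, not Beaker's real schema.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Recipe(Base):
    __tablename__ = 'recipe'
    id = Column(Integer, primary_key=True)
    virt_status = Column(String)

engine = create_engine('sqlite://')          # in-memory database
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Recipe(virt_status='possible'),
                     Recipe(virt_status='precluded')])
    session.commit()
    # Recipe.filter(...) raises AttributeError; the session query works:
    count = session.query(Recipe).filter(
        Recipe.virt_status == 'possible').count()
```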
Broken query fixed and unit tests added in:
Beaker 0.11.0 has been released.