Bug 584783
Summary: | [RFE] job congestion measurement (number of queued jobs within a time period) | |||
---|---|---|---|---|
Product: | [Retired] Beaker | Reporter: | Ales Zelinka <azelinka> | |
Component: | web UI | Assignee: | Nick Coghlan <ncoghlan> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Dan Callaghan <dcallagh> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 0.5 | CC: | bpeck, dcallagh, dkovalsk, kbaker, mcermak, mcsontos, mishin, ohudlick, psplicha, rmancy | |
Target Milestone: | 0.11 | Keywords: | FutureFeature | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | Measurements | |||
Fixed In Version: | Doc Type: | Enhancement | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 891841 | Environment: | ||
Last Closed: | 2013-01-17 04:33:38 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
Bug Depends On: | 883606 | |||
Bug Blocks: | 593663 |
Description
Ales Zelinka
2010-04-22 12:11:52 UTC
Extra points for showing the queue size over a time period. Queue size one hour ago, 30 min ago, now. Bit like uptime. More points if you had a small graph which shows the queue size over the last hour/week. Bit like http://www.google.com/finance?q=rht. Shows the latest but then you can scroll in and out to get the larger picture. I can but ask.

I've already pushed a branch which will show the number of lines returned in the search pages (Jobs/System/Recipes etc etc). The ticket for it is here https://bugzilla.redhat.com/show_bug.cgi?id=567788

Kevin, I think perhaps what we are after here is a whole 'analysis' kind of page?

(In reply to comment #2)
> I've already pushed a branch which will show the number of lines returned in
> the search pages (Jobs/System/Recipes etc etc). The ticket for it is here
>
> https://bugzilla.redhat.com/show_bug.cgi?id=567788
>
> Kevin,
> I think perhaps what we are after here is a whole 'analysis' kind of page?

No, not an analysis page, though that would be nice. Just a little bit of extra feedback on which way the queue is going.

Ales can correct me if I'm wrong but I think he is referring to the jobs queued page in legacy [1]. It tells you how many jobs are currently in queue.

It'd be helpful if the scheduler could take periodic measurements of the queue load and save it to a metrics table. Then it would be a simple case of rendering some stats from that.

Something like the output from uptime:
    20:58:21 up 10:31, 18 users, load average: 0.69, 0.40, 0.29

That could be in this case:
    123 Jobs Queued, load average: 120, 80, 5

Shows how many jobs are currently queued, and the queue load averages for the past 1 minute, 20 minutes, and 1 hour.

[1] http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?status=Queued

He probably is referring to that page. That page though is equivalent to our Job Search page with the 'Queued' filter. Having an item count at the bottom does serve the same purpose as the "1 - 25 of 81".

Re, the metrics.
I don't see why it couldn't be done at some point.

(In reply to comment #3)
> Just a little bit of extra
> feedback on which way the queue is going.

That's exactly the use-case I have in mind.

> Ales can correct me if I'm wrong but I think he is referring to the jobs queued
> page in legacy [1].

Yes, I am.

> It'd be helpful if the scheduler could take periodic measurements of the queue
> load and save it to a metrics table. Then it would be a simple case of
> rendering some stats from that.
>
> Something like the output from uptime:
> 20:58:21 up 10:31, 18 users, load average: 0.69, 0.40, 0.29
>
> That could be in this case:
> 123 Jobs Queued, load average: 120, 80, 5
>
> Shows how many jobs are currently queued, and the queue load averages for the
> past 1 minute, 20 minutes, and 1 hour.

That would be a great enhancement. As the total number of queued jobs is already implemented in bz567788, let's use this bug for the uptime-style job queue size.

That would be a very useful feature. Lately, jobs have occasionally been queued for a long time. Having a short "uptime" message always displayed at the top / bottom of the page would let users know what response time to expect and possibly keep them from submitting low priority / experimental jobs. Hovering on the message could show architecture-specific results (waiting on a specific arch is quite a common scenario).

Raising priority, please consider implementing this soon. Lack of free machines and long waiting queues are still a problem (as a recent discussion on the mailing list indicates).

Could we get at least some basic functionality implemented? Now and then we encounter a situation where no machines are available for a longer period of time, and it's quite hard to find out who is responsible for the resource exhaustion so that we could "manually" ping them. For now, a simple table with machines per user, sorted and updated hourly, perhaps with a week of history, would help a lot.
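
For illustration only, the uptime-style summary proposed above could be computed from periodic samples roughly as follows. This is a minimal sketch, not Beaker code: the `QueueLoad` class and its in-memory sample store are hypothetical stand-ins for the suggested metrics table, and the "load average" here is simply the mean of the samples that fall inside each window (1 minute, 20 minutes, 1 hour).

```python
from collections import deque
from datetime import datetime, timedelta

# Windows named in the proposal: past 1 minute, 20 minutes, and 1 hour.
WINDOWS = (timedelta(minutes=1), timedelta(minutes=20), timedelta(hours=1))


class QueueLoad:
    """Hypothetical in-memory stand-in for the proposed metrics table."""

    def __init__(self, max_age=timedelta(hours=1)):
        self.max_age = max_age
        self.samples = deque()  # (timestamp, queued job count), oldest first

    def record(self, when, queued):
        """Store one periodic measurement and drop samples older than max_age."""
        self.samples.append((when, queued))
        while self.samples and when - self.samples[0][0] > self.max_age:
            self.samples.popleft()

    def average(self, now, window):
        """Mean queue size over samples taken within `window` before `now`."""
        counts = [n for (t, n) in self.samples
                  if timedelta(0) <= now - t <= window]
        return sum(counts) / len(counts) if counts else 0.0

    def summary(self, now):
        """Render an uptime-style line from the latest sample and the averages."""
        current = self.samples[-1][1] if self.samples else 0
        loads = ", ".join("%.0f" % self.average(now, w) for w in WINDOWS)
        return "%d Jobs Queued, load average: %s" % (current, loads)


if __name__ == "__main__":
    start = datetime(2010, 4, 22, 12, 0, 0)
    load = QueueLoad()
    # (minutes, seconds, queued jobs) offsets from `start`; made-up sample data.
    for minutes, seconds, queued in [(0, 0, 10), (30, 0, 50), (45, 0, 80),
                                     (59, 30, 100), (60, 0, 120)]:
        load.record(start + timedelta(minutes=minutes, seconds=seconds), queued)
    print(load.summary(start + timedelta(hours=1)))
```

A rising sequence of averages from the longest window to the shortest (as in the made-up data above) would indicate a growing queue, which is the "which way is the queue going" feedback requested in this bug.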
Seeing this, there are a "few" old jobs sitting in Beaker for a long time (usually waiting for a particular machine). My hypothesis is that jobs older than 1 week are unlikely to have a significant effect on current congestion, as these will more likely be executed in sequence, while less specific jobs will run in parallel on anything. My second hypothesis is that there is a negative correlation between "number of possible systems" and "time from queued to scheduled." This may be a better indicator to help predict when the job will be scheduled.

My hypotheses were not tested; they are just what common sense suggests to me. There will also always be exceptions to these "rules", like jobs running on "Group" owned HW, which may be under-utilized while the common pool is crowded. For improved accuracy we would need to consider things system by system, which would generate far too large a load.

If this RFE ever gets addressed, could we have a "congestion graph" somehow displaying those in different colors? I have a "heat map" in mind: hottest - 0-1 hour, 1 hour-1 day, 1 day-1 week, and older - coldest.

I've created a proof of concept for how I imagine this should work: https://wiki.test.redhat.com/azelinka/BeakerStatus (please ignore the gaps in the graphs - I run it from my personal workstation)

Beaker 0.11 will include improved granularity for the stats sent to a configured Graphite metrics server. At the moment we have already defined the necessary metrics for stacked graphs showing:
- recipes split by status (aggregate)
- systems split by current utilisation (aggregate, per arch, per lab controller)

To handle the job congestion use case, we'd also like to publish a "per arch" set of graphs for the recipe queue. However, I've realised I'm not clear on what a "per arch" recipe queue actually means. Is it:
- any recipe that *can* run on that arch? (would lead to a lot of double counting and make the per-arch queues look worse than they are)
- any recipe that can *only* run on that arch?
  (would leave out recipes that can run on multiple arches, including not specifying an arch constraint at all as well as recipes that can run only on the 32- and 64-bit variants of the same basic arch)

I'm thinking the second interpretation makes more sense, but then it's a matter of extracting that information from the host_requires data.

(In reply to comment #16)
> - any recipe that can *only* run on that arch? (would leave out recipes that
> can run on multiple arches, including not specifying an arch constraint at
> all as well as recipes that can run only on the 32- and 64-bit variants of
> the same basic arch)

Every recipe can by definition only run on exactly one arch because every recipe has exactly one distro tree, which has a specific arch.

Overall queue metrics reporting, as well as broken down by arch: http://gerrit.beaker-project.org/#/c/1574/

2012-12-20 15:50:04,995 beakerd ERROR Exception in metrics loop
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 627, in metrics_loop
    recipe_count_metrics()
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 574, in recipe_count_metrics
    Recipe.filter(Recipe.virt_status == RecipeVirtStatus.possible))
AttributeError: type object 'Recipe' has no attribute 'filter'

Broken query fixed and unit tests added in: http://gerrit.beaker-project.org/#/c/1602

Beaker 0.11.0 has been released.
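
As a rough illustration of the per-arch breakdown discussed in the comments above: because each recipe has exactly one distro tree with one arch, a per-arch queue count reduces to a simple group-by with no double counting. The sketch below is not taken from the Beaker patches. It assumes recipes are given as `(status, arch)` pairs, and the metric names under `beaker.recipes.queued` are invented for the example; only the `name value timestamp` line layout follows Graphite's plaintext format.

```python
from collections import Counter


def queue_counts_by_arch(recipes):
    """Count queued recipes per arch.

    `recipes` is an iterable of (status, arch) pairs. This shape is an
    assumption for illustration, not Beaker's actual data model.
    """
    return Counter(arch for status, arch in recipes if status == "Queued")


def metric_lines(counts, timestamp, prefix="beaker.recipes.queued"):
    """Format counts as 'name value timestamp' lines (hypothetical names)."""
    lines = ["%s.all %d %d" % (prefix, sum(counts.values()), timestamp)]
    for arch in sorted(counts):
        lines.append("%s.by_arch.%s %d %d"
                     % (prefix, arch, counts[arch], timestamp))
    return lines


if __name__ == "__main__":
    # Made-up sample data: three queued recipes, one running.
    recipes = [("Queued", "x86_64"), ("Queued", "i386"),
               ("Running", "x86_64"), ("Queued", "x86_64")]
    for line in metric_lines(queue_counts_by_arch(recipes), 1356018604):
        print(line)
```

Since every queued recipe contributes to exactly one arch bucket, the per-arch values sum to the overall queue size, so the aggregate and per-arch graphs stay consistent with each other.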