Bug 1243809

Summary: drop kernel.percpu.interrupts from default pmlogconf
Product: Red Hat Enterprise Linux 7 Reporter: Frank Ch. Eigler <fche>
Component: pcp Assignee: Nathan Scott <nathans>
Status: CLOSED ERRATA QA Contact: Miloš Prchlík <mprchlik>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.2 CC: brolley, fche, jmario, lberk, mbenitez, mcermak, mgoodwin, mprchlik, nathans
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-19 11:55:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1249090    

Description Frank Ch. Eigler 2015-07-16 11:34:52 UTC
The kernel.percpu.interrupts metric indom scales badly when run on a larger machine (#cpus * #irqs, which can run into the tens of thousands).  That leads to much larger than usual log files.  We should stop recording this by default in pmlogconf's various files.  While looking for pcp-resident consumers of this information, I found only pmcollectl, and even that was aggregating across interrupt lines.  So the collection & logging effort is approximately 100% wasted.
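
To give a feel for the #cpus * #irqs scaling mentioned above, here is a minimal sketch. The CPU count (480) is the machine reported later in this bug; the number of interrupt lines is a made-up figure for illustration only, and the helper name is hypothetical.

```python
# Rough count of values fetched per sample for a metric whose instance
# domain scales as #cpus * #irqs (as described in the report).

def percpu_interrupt_values(n_cpus: int, n_irq_lines: int) -> int:
    """One value per (CPU, interrupt line) pair."""
    return n_cpus * n_irq_lines

n_cpus = 480        # from comment 6 in this report
n_irq_lines = 100   # assumption, not a measured figure
total = percpu_interrupt_values(n_cpus, n_irq_lines)
print(f"{total} values per sample")
```

Under these assumed numbers each sample would carry tens of thousands of values, which is the order of magnitude the description refers to.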

We already have a kernel.all.intr metric (and are logging it by default) to feed overall system stats.  If there were demand, we could perhaps add a kernel.percpu.intr (indom = cpus) aggregated across interrupt lines and/or a kernel.all.interrupt.FOO (PMNS pseudo-indom = interrupt-line) aggregated across cpus, and get those pmlogconf-defaulted.
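
The two aggregations proposed above can be illustrated with a small sketch. This is not PCP code: the table of counts and the interrupt-line names are invented, and kernel.percpu.intr and kernel.all.interrupt.FOO are only the names suggested in this report, not existing metrics.

```python
# Hypothetical counts[cpu][irq_line]; the numbers are made up.
counts = [
    [10, 0, 3],   # cpu0
    [4, 7, 0],    # cpu1
]
irq_names = ["timer", "eth0", "sda"]  # hypothetical interrupt lines

# Aggregate across interrupt lines: one value per CPU
# (the shape suggested for a kernel.percpu.intr metric).
per_cpu = [sum(row) for row in counts]

# Aggregate across CPUs: one value per interrupt line
# (the shape suggested for kernel.all.interrupt.FOO).
per_line = {
    name: sum(row[i] for row in counts)
    for i, name in enumerate(irq_names)
}

# Either aggregate sums to the same system-wide total.
grand_total = sum(per_cpu)
print(per_cpu, per_line, grand_total)
```

Either form reduces the value count from #cpus * #irqs to #cpus or #irqs respectively.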

Comment 4 Nathan Scott 2015-08-05 02:59:33 UTC
pcp-3.10.6-1 el7 build contains this fix.

Comment 6 Joe Mario 2015-08-23 23:31:57 UTC

Frank and Nathan:
 I'm on the large system again (an HP Dragonhawk with 480 cpus).  It's running RHEL 7.2 alpha.  The pcp version is pcp-3.10.6-1 el7.

But it doesn't look like the excessive logging decreased by much.  I expected it to be much lower, given comment #4 above says it's fixed.

The system is idle, and with pmlogger enabled, the log file is growing by 841308 bytes per minute.  
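
For scale, the per-minute growth quoted above can be converted to a daily figure. A minimal sketch, assuming the growth rate stays constant over the day (the report only gives the one idle-system measurement):

```python
BYTES_PER_MINUTE = 841308  # observed log growth from this comment

def per_day(bytes_per_minute: int) -> int:
    """Extrapolate a per-minute growth rate to 24 hours."""
    return bytes_per_minute * 60 * 24

daily = per_day(BYTES_PER_MINUTE)
print(f"{daily} bytes/day (~{daily / 1e9:.2f} GB)")
```

That works out to roughly 1.2 GB of uncompressed log per day at this rate.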

The "pminfo -f" command still takes 1 minute to run and generates 84 Meg of data.

The pminfo output is at http://perf1.lab.bos.redhat.com/jmario/scratch/pminfo_aug_23_bl920gen8.txt

Let me know if you need access to the system.  I currently have it reserved.

Joe

Comment 7 Nathan Scott 2015-08-24 04:21:08 UTC
Thanks Joe.  From a look around the system, here are a few notes I made:


pmlogger

- the default log size here will drop a fair bit again shortly, with this week's 7.2 pcp rebuild including the BZ 1254509 fix - I'll send you a note when that's ready if you like.

- after that, the default size will be around the 200MB per day mark uncompressed (140624 bytes every 60 sec).  For such a large system, this is pretty good I think - back when I was doing production system analysis we'd typically see daily logs on the order of 150-175MB (on smaller application servers), though that was logging much more frequently (~15 second sampling)

- when the log compression kicks in, after 3 days IIRC, that drops right down to around about 1MB (!) - on-the-fly-compression from pmlogger is in the long-term PCP roadmap.

- a lot of the space we're consuming currently is due to the per-cpu time metrics
(see "pminfo kernel.percpu.cpu") - simply because 480 * 11 64-bit values need to be part of the logged set at every sampling interval;
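
The figures in the notes above can be checked with a little arithmetic. This is a rough sketch: it counts only the raw 8-byte values for the per-cpu time metrics and ignores any per-record overhead in the archive format, so the share it prints is an assumption-laden lower bound rather than a measurement.

```python
SAMPLE_BYTES = 140624    # projected log growth per 60 s sample (from the note above)
N_CPUS = 480
METRICS_PER_CPU = 11     # per-cpu time metrics, per the note above
VALUE_BYTES = 8          # 64-bit values

# Daily uncompressed log size at one sample per minute.
daily_bytes = SAMPLE_BYTES * 60 * 24

# Raw value payload of the per-cpu time metrics in each sample.
percpu_payload = N_CPUS * METRICS_PER_CPU * VALUE_BYTES
share = percpu_payload / SAMPLE_BYTES

print(f"daily: {daily_bytes} bytes (~{daily_bytes / 1e6:.0f} MB)")
print(f"per-cpu payload: {percpu_payload} bytes per sample ({share:.0%} of the sample)")
```

The daily total lands close to the 200MB mark quoted above, and the raw per-cpu values alone account for roughly 30% of each sample under these assumptions.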


pminfo

- most of the pminfo time is being spent traversing and fetching the kernel.percpu.interrupts metrics (while we've stopped logging these now with my previous change, we've not done any optimisation work here yet) - there's plenty of scope for improving that code still.  But other metric trees are nice and quick to fetch values from, including the other percpu metrics...

# time pminfo -f kernel.percpu.cpu >/dev/null

real	0m0.058s
user	0m0.005s
sys	0m0.001s

So, once we come back to tackling the interrupts metrics, we'll see a noticeable improvement there.  It's not super-high priority yet though, just because running "pminfo -f" across every possible metric is not a common operation.  We do have other planned work in the interrupts metrics though, so it was good to see first hand the pain level there.

Comment 8 Miloš Prchlík 2015-10-20 11:13:50 UTC
Verified for build pcp-3.10.6-2.el7.

Comment 9 errata-xmlrpc 2015-11-19 11:55:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2096.html