Bug 1403095

Summary: pcp replay performance doesn't scale for large indoms
Product: Red Hat Enterprise Linux 7
Reporter: Mark Goodwin <mgoodwin>
Component: pcp
Assignee: Dave Brolley <brolley>
Status: CLOSED ERRATA
QA Contact: Miloš Prchlík <mprchlik>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.3
CC: brolley, fche, lberk, mbenitez, mcermak, mgoodwin, mprchlik, nathans
Target Milestone: rc
Keywords: Patch
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pcp-3.11.8-4.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1451533 (view as bug list)
Environment:
Last Closed: 2017-08-01 18:29:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1451533    
Attachments:
  patch for partial fix to pmGetInDomArchive (Flags: none)

Description Mark Goodwin 2016-12-09 03:05:05 UTC
Created attachment 1229787 [details]
patch for partial fix to pmGetInDomArchive

Description of problem: replaying PCP archives containing metrics with large indoms (e.g. proc) has severe scalability problems in the default interpolated replay mode. Switching to uninterpolated mode (e.g. pmrep -u and pmiostat -u) avoids this, but you lose the advantages of interp mode. In the repro example below, interp mode with the same sampling interval as the native recording is roughly 100x slower and is basically unusable - it replays at "go and have a cup of tea per step" speed.

Version-Release number of selected component (if applicable): pcp-3.11.7-1

How reproducible: easily. Note the test archive is ~158MB (avoid soggy noodle downloads) and contains no customer data, so this BZ can be shared with the PCP upstream community.

Steps to Reproduce:
1. # wget -O- http://people.redhat.com/~mgoodwin/pcp/archives/goody.tgz | tar xzvf -
2. # time pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null
3. # time pmrep -z -p -f%c -a goody -u -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null

Actual results:
Interp mode :
  real	18m11.026s
  user	17m11.077s
  sys	0m50.562s

Non-interp mode:
  real	0m13.090s
  user	0m11.950s
  sys	0m0.550s


Expected results: interp mode obviously has extra work to do, but it shouldn't be 100x slower.

Additional info: The 'goody' archives contain the default PCP metrics plus per-process (atop-proc config) data recorded with 2-minute samples. The repro recipe extracts 4 instances out of ~250 for one proc metric. The replay performance is even worse if you request more than one metric, especially if you want different instances from the same indom. The issue isn't limited to pmrep - it's in libpcp, and pcp-atop replay is especially affected.

I have a patch that fixes part of the issue - the O(N^2) nested loop in pmGetInDomArchive() becomes O(2N) using a hash table for the inner loop searches when a large indom is detected. But that's definitely not the whole fix. I'll attach the patch anyway.
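
For context, here is a minimal, self-contained sketch of the idea behind that patch. This is not the attached patch and not actual libpcp code; the names and table size are made up for illustration. The point is that testing each instance id against the set collected so far becomes a single hash probe instead of a linear scan over the result list:

    /* illustrative only: de-duplicating instance ids with a hash set */
    #include <stdio.h>

    #define HASH_SIZE 65536            /* power of two, > expected #instances */

    static int table[HASH_SIZE];       /* open-addressed set of instance ids */
    static int used[HASH_SIZE];

    /* return 1 if inst was already seen, else record it and return 0 */
    static int seen_before(int inst)
    {
        unsigned h = (unsigned)inst & (HASH_SIZE - 1);
        while (used[h]) {
            if (table[h] == inst)
                return 1;                      /* already in the result set */
            h = (h + 1) & (HASH_SIZE - 1);     /* linear probing */
        }
        used[h] = 1;
        table[h] = inst;
        return 0;
    }

    int main(void)
    {
        /* stand-in for walking every indom record in the archive */
        int records[] = { 12976, 16073, 16077, 16079, 16073, 12976 };
        int n = sizeof(records) / sizeof(records[0]);
        int unique = 0;

        for (int i = 0; i < n; i++)
            if (!seen_before(records[i]))      /* O(1) expected, vs. scanning the result list */
                unique++;

        printf("%d unique instance ids\n", unique);
        return 0;
    }

With N instances across all the indom records this does one probe per instance, i.e. roughly linear work, whereas re-scanning the result list for every instance is quadratic.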

Comment 1 Dave Brolley 2017-02-01 21:13:49 UTC
Work is ongoing on the branch at

git://git.pcp.io/brolley/pcp interp-performance

The test case now runs in 25% of the original time on my system. This does not include Mark's patch, which I would like to rework somewhat.

Comment 2 Mark Goodwin 2017-02-02 03:56:40 UTC
Dave, I built your interp-performance branch and ran before/after tests :

-- current 3.11.8 --
$ time pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null

real	16m0.684s
user	15m20.125s
sys	0m37.509s

-- build and install Dave's interp-performance branch --
$ time pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null

real	5m3.948s
user	4m26.105s
sys	0m36.891s

So that's a substantial 3X improvement! I also ran qa -g sanity and it passed.
I also compared the output for both cases and they're identical.

This must be close to being ready for review and merge, or should we hold off?

Thanks
-- Mark

Comment 3 Dave Brolley 2017-02-02 15:00:54 UTC
Hey Mark, I have a couple more things I want to try, and I also want to rework your initial patch a bit to get it in.

Comment 4 Dave Brolley 2017-02-15 15:29:39 UTC
Current work now merged upstream. Will appear in release 3.11.8.

Comment 6 Miloš Prchlík 2017-05-14 05:37:59 UTC
Verified with build pcp-3.11.8-5.el7, using the reproducers provided in the comments. The actual time on my boxes varies a bit but is basically in line with the expected results - between 5 and 7 minutes for interp mode (20 to 50 seconds for non-interp mode) vs. 25+ minutes with an unfixed pcp build.

Comment 7 Mark Goodwin 2017-05-15 22:41:35 UTC
Thanks Milos for confirming. I've also retested with the current 3.11.10-1 (release imminent):

== original repro: interp mode, with selecting 4 instances - as good as expected

# pmrep -z -p -f%c -a goody -t2m -i '016079.*|012976.*|016073.*|016077.*' proc.psinfo.stime >/dev/null

real	3m45.410s
user	3m9.117s
sys	0m35.790s

== original repro: non-interp mode, selecting 4 instances - all good

# pmrep -z -p -f%c -a goody -u -i '016079.*|012976.*|016073.*|016077.*' proc.psinfo.stime >/dev/null

real	0m7.475s
user	0m7.161s
sys	0m0.290s

== new test: interp mode, replaying all instances :

# pmrep -z -p -f%c -a goody -t2m proc.psinfo.stime >/dev/null

real	44m34.537s
user	43m39.822s
sys	0m39.287s

== new test: non-interp mode, replaying all instances :

# pmrep -z -p -f%c -a goody -u proc.psinfo.stime >/dev/null

real	98m12.911s
user	98m3.170s
sys	0m0.497s

Clearly something is not scaling when we ask for all instances rather than just 4, and I think that would make a good test case for a gprof profiling comparison. Also, replaying all instances in non-interp mode takes 2x as long as interp mode, which is unexpected and may be some other bug (?). As discussed on the conf call, we need a more radical/surgical indexing solution here - 1.5 hours to replay an archive of proc data isn't usable.

For all the above test runs the archive files were already in the pagecache, so no physical I/O was performed. The workload is entirely CPU/memory bound, and we can read the data (out of the pagecache) in about a tenth of a second:

# time cat goody/* >/dev/null

real	0m0.179s
user	0m0.001s
sys	0m0.177s

I'm re-running the same tests, but fetching two metrics instead: proc.psinfo.{stime,utime}. I'll post those results in a few hours.

Regards
-- Mark

Comment 10 errata-xmlrpc 2017-08-01 18:29:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1968