Created attachment 1229787 [details]
patch for partial fix to pmGetInDomArchive

Description of problem:
Replaying PCP archives containing metrics with large indoms (e.g. proc) has severe scalability problems in the default interpolated replay mode. Switching to uninterpolated mode (e.g. pmrep -u and pmiostat -u) avoids this, but you lose the advantages of interp mode. In the repro example below, interp mode with the same sampling interval as the native recording is O(100) times slower and is basically unusable - it replays at "go and have a cup of tea per step" speed.

Version-Release number of selected component (if applicable):
pcp-3.11.7-1

How reproducible:
Easily. Note the test archive is ~158MB (avoid soggy noodle downloads) and contains no customer data, so this BZ can be shared with the PCP upstream community.

Steps to Reproduce:
1. # wget -O- http://people.redhat.com/~mgoodwin/pcp/archives/goody.tgz | tar xzvf -
2. # time pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null
3. # time pmrep -z -p -f%c -a goody -u -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null

Actual results:
Interp mode:
real    18m11.026s
user    17m11.077s
sys     0m50.562s

Non-interp mode:
real    0m13.090s
user    0m11.950s
sys     0m0.550s

Expected results:
Interp mode obviously has extra work to do, but it shouldn't be 100x slower.

Additional info:
The 'goody' archives have the default PCP metrics plus per-process (atop-proc config) data recorded with 2 min samples. The repro recipe extracts 4 instances out of ~250 for one proc metric. The replay performance is even worse if you request more than one metric, especially if you want different instances from the same indom. The issue isn't limited to pmrep - it's in libpcp; pcp-atop replay is especially affected.

I have a patch that fixes part of the issue - the O(N^2) nested loop in pmGetInDomArchive() becomes O(2N) by using a hash table for the inner-loop searches when a large indom is detected. But that's definitely not the whole fix. I'll attach the patch anyway.
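For reference, here is a minimal, self-contained sketch of the idea behind the attached patch - it is not the patch itself, and names like inst_hash are made up purely for illustration. The point is simply that the inner linear scan over the instances seen so far gets replaced by a hash-table membership test, turning the nested O(N^2) loop into one pass to build the table plus amortized constant-time lookups:

/*
 * Sketch only, not the libpcp change: replace "for each incoming instance,
 * linear-scan the instances already collected" with a hash lookup.
 */
#include <stdio.h>
#include <stdlib.h>

/* simple open-addressing hash set for non-negative instance IDs */
typedef struct {
    int    *slot;    /* -1 marks an empty slot */
    size_t  size;    /* power of two, kept at most half full */
} inst_hash;

static inst_hash *
inst_hash_new(size_t nmemb)
{
    inst_hash *h = malloc(sizeof(*h));
    size_t sz = 16;
    while (sz < 2 * nmemb)        /* keep load factor below 0.5 */
        sz <<= 1;
    h->size = sz;
    h->slot = malloc(sz * sizeof(int));
    for (size_t i = 0; i < sz; i++)
        h->slot[i] = -1;
    return h;
}

static void
inst_hash_add(inst_hash *h, int inst)
{
    size_t i = (size_t)inst * 2654435761u & (h->size - 1);
    while (h->slot[i] != -1 && h->slot[i] != inst)
        i = (i + 1) & (h->size - 1);
    h->slot[i] = inst;
}

static int
inst_hash_has(const inst_hash *h, int inst)
{
    size_t i = (size_t)inst * 2654435761u & (h->size - 1);
    while (h->slot[i] != -1) {
        if (h->slot[i] == inst)
            return 1;
        i = (i + 1) & (h->size - 1);
    }
    return 0;
}

int
main(void)
{
    /* pretend these are the instance IDs collected so far ... */
    int known[] = { 16079, 12976, 16073, 16077 };
    /* ... and these are the IDs found in the next archive record */
    int incoming[] = { 16073, 16099, 12976 };
    size_t nknown = sizeof(known) / sizeof(known[0]);
    size_t nin = sizeof(incoming) / sizeof(incoming[0]);

    /*
     * The O(N^2) version scans "known" once per incoming ID; with ~250
     * process instances per record that inner scan dominates.  Building
     * the hash once makes each membership test O(1) amortized.
     */
    inst_hash *h = inst_hash_new(nknown);
    for (size_t i = 0; i < nknown; i++)
        inst_hash_add(h, known[i]);

    for (size_t i = 0; i < nin; i++)
        printf("instance %d %s\n", incoming[i],
               inst_hash_has(h, incoming[i]) ? "already seen" : "is new");

    free(h->slot);
    free(h);
    return 0;
}

As noted above, the attached patch only switches to the hash path when a large indom is detected, so small indoms keep the existing simple scan.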
Work is ongoing on the branch at git://git.pcp.io/brolley/pcp interp-performance. The test case now runs in 25% of the original time on my system. This does not include Mark's patch, which I would like to rework somewhat.
Dave, I built your interp-performance branch and ran before/after tests:

-- current 3.11.8 --
$ time pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null
real    16m0.684s
user    15m20.125s
sys     0m37.509s

-- after building and installing Dave's interp-performance branch --
$ time pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null
real    5m3.948s
user    4m26.105s
sys     0m36.891s

So that's a substantial 3x improvement! I also ran qa -g sanity and it passed, and I compared the output for both cases - they're identical.

This must be close to being ready for review and merge, or should we hold off?

Thanks -- Mark
Hey Mark, I have a couple more things I want to try, and I also want to rework your initial patch a bit to get it in.
Current work now merged upstream. Will appear in release 3.11.8.
Verified with build pcp-3.11.8-5.el7, using the reproducers provided in the comments. The actual time on my boxes varies a bit but is basically in line with the expected state - between 5 and 7 minutes for interp mode (20 to 50 seconds for non-interp mode) vs. 25+ minutes with the unfixed pcp build.
Thanks Milos for confirming. I've also retested with the current 3.11.10-1 (release imminent):

== original repro: interp mode, selecting 4 instances - as good as expected
# pmrep -z -p -f%c -a goody -t2m -i '016079.*|012976.*|016073.*|016077.*' proc.psinfo.stime >/dev/null
real    3m45.410s
user    3m9.117s
sys     0m35.790s

== original repro: non-interp mode, selecting 4 instances - all good
# pmrep -z -p -f%c -a goody -u -i '016079.*|012976.*|016073.*|016077.*' proc.psinfo.stime >/dev/null
real    0m7.475s
user    0m7.161s
sys     0m0.290s

== new test: interp mode, replaying all instances
# pmrep -z -p -f%c -a goody -t2m proc.psinfo.stime >/dev/null
real    44m34.537s
user    43m39.822s
sys     0m39.287s

== new test: non-interp mode, replaying all instances
# pmrep -z -p -f%c -a goody -u proc.psinfo.stime >/dev/null
real    98m12.911s
user    98m3.170s
sys     0m0.497s

Clearly something is not scaling when we ask for all instances rather than just 4, and I think that would be a good test for a gprof profiling comparison. Also, replaying all instances in non-interp mode takes 2x as long as interp mode, which is unexpected and may be some other bug (?).

As discussed on the conf call, we need a more radical/surgical indexing solution here - 1.5h to replay an archive of proc data isn't usable.

For all the above test runs, the archive files were already in pagecache and no physical I/O was performed. This is entirely CPU/memory bound, and we can read the data (out of pagecache) in ~a tenth of a second:

# time cat goody/* >/dev/null
real    0m0.179s
user    0m0.001s
sys     0m0.177s

I'm re-running the same tests but fetching two metrics, proc.psinfo.{stime,utime}, and will post those results in a few hours.

Regards -- Mark
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1968