Bug 1403095
| Summary: | pcp replay performance doesn't scale for large indoms | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Mark Goodwin <mgoodwin> |
| Component: | pcp | Assignee: | Dave Brolley <brolley> |
| Status: | CLOSED ERRATA | QA Contact: | Miloš Prchlík <mprchlik> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.3 | CC: | brolley, fche, lberk, mbenitez, mcermak, mgoodwin, mprchlik, nathans |
| Target Milestone: | rc | Keywords: | Patch |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pcp-3.11.8-4.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1451533 (view as bug list) | Environment: | |
| Last Closed: | 2017-08-01 18:29:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1451533 | | |
| Attachments: | | | |
Description Mark Goodwin 2016-12-09 03:05:05 UTC
Work is ongoing on the interp-performance branch at git://git.pcp.io/brolley/pcp. The test case now runs in 25% of the original time on my system. This does not include Mark's patch, which I would like to rework somewhat.

Dave, I built your interp-performance branch and ran before/after tests:

-- current 3.11.8 --
$ time pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null
real 16m0.684s
user 15m20.125s
sys 0m37.509s

-- after building and installing Dave's interp-performance branch --
$ time pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime >/dev/null
real 5m3.948s
user 4m26.105s
sys 0m36.891s

So that's a substantial 3x improvement! I also ran qa -g sanity and it passed, and I compared the output for both cases and it is identical (a sketch of that comparison is below). This must be close to being ready for review and merge, or should we hold off?
Thanks
-- Mark

Hey Mark, I have a couple more things I want to try, and I also want to rework your initial patch a bit to get it in.

Current work is now merged upstream. It will appear in release 3.11.8.

Verified with build pcp-3.11.8-5.el7, using the reproducers provided in the comments. The actual time on my boxes varies a bit but is basically in line with the expected state: between 5 and 7 minutes for interp mode (20 to 50 seconds for non-interp mode) vs. 25+ minutes with the unfixed pcp build.
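For reference, a minimal sketch of the kind of before/after output comparison mentioned above; the output file names and the comparison procedure are illustrative, while the pmrep invocation is the one from the test runs:

# with the stock pcp build installed, capture the replay output
$ pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime > before.out
# repeat after installing the interp-performance branch
$ pmrep -z -p -f%c -a goody -t2m -i "016079.*|012976.*|016073.*|016077.*" proc.psinfo.stime > after.out
# the two runs should be byte-for-byte identical
$ diff before.out after.out && echo "outputs identical"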
Thanks Milos for confirming. I've also retested with the current 3.11.10-1 (release imminent):
== original repro: interp mode, selecting 4 instances - as good as expected
# pmrep -z -p -f%c -a goody -t2m -i '016079.*|012976.*|016073.*|016077.*' proc.psinfo.stime >/dev/null
real 3m45.410s
user 3m9.117s
sys 0m35.790s
== original repro: non-interp mode, selecting 4 instances - all good
# pmrep -z -p -f%c -a goody -u -i '016079.*|012976.*|016073.*|016077.*' proc.psinfo.stime >/dev/null
real 0m7.475s
user 0m7.161s
sys 0m0.290s
== new test: interp mode, replaying all instances:
# pmrep -z -p -f%c -a goody -t2m proc.psinfo.stime >/dev/null
real 44m34.537s
user 43m39.822s
sys 0m39.287s
== new test: non-interp mode, replaying all instances:
# pmrep -z -p -f%c -a goody -u proc.psinfo.stime >/dev/null
real 98m12.911s
user 98m3.170s
sys 0m0.497s
Clearly something is not scaling when we ask for all instances rather than just 4, and I think that would be a good test for a gprof profiling comparison. Also, replaying all instances in non-interp mode takes 2x as long as interp mode, which is unexpected and may be some other bug(?). As discussed on the conf call, we need a more radical/surgical indexing solution here: 1.5 hours to replay an archive of proc data isn't usable.
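As a rough illustration of the kind of gprof comparison meant above; the small C client replay_test.c is hypothetical, and libpcp itself would need to be rebuilt with profiling enabled for the interpolation code to show up (gprof does not normally cover dynamically linked library code):

# build a small C replay client against a profiling-enabled libpcp
$ cc -pg -O2 -o replay_test replay_test.c -lpcp
# replay all instances from the archive to produce gmon.out
$ ./replay_test -a goody
# list the hottest functions, for the all-instance vs 4-instance runs
$ gprof ./replay_test gmon.out | head -40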
For all the above test runs the archive files were already in the page cache, so no physical I/O was performed. This is entirely CPU/memory bound, and we can read the data (out of the page cache) in about a tenth of a second:
# time cat goody/* >/dev/null
real 0m0.179s
user 0m0.001s
sys 0m0.177s
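For completeness, the warm-cache figure above could be contrasted with a cold-cache read by dropping the page cache first; this uses the standard kernel drop_caches interface and the same archive directory:

# flush dirty pages, then drop the page cache so the next read hits disk
$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
# time a cold read of the archive for comparison with the warm-cache figure
$ time cat goody/* >/dev/null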
I'm re-running the same tests, but this time fetching two metrics, proc.psinfo.{stime,utime}; I will post those results in a few hours.
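For reference, the interp-mode variant of that two-metric run would look something like the following (the exact invocation is illustrative, based on the commands above; the non-interp variant just swaps -t2m for -u):

# same archive and interval, now fetching stime and utime together
$ pmrep -z -p -f%c -a goody -t2m proc.psinfo.stime proc.psinfo.utime >/dev/null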
Regards
-- Mark
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1968