Bug 1359975

Summary: pmchart run-away mem leak replaying multi-archive when rewinding
Product: [Fedora] Fedora Reporter: Mark Goodwin <mgoodwin>
Component: pcpAssignee: Dave Brolley <brolley>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 24CC: brolley, fche, lberk, mbenitez, mgoodwin, nathans, pcp, scox
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 3.11.7-1 pcp-3.11.7-1.fc25 pcp-3.11.7-1.fc24 pcp-3.11.7-1.el5 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-02 19:50:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
possible reproducer
none
fix for one small leak
none
Proposed patch for indom duplication none

Description Mark Goodwin 2016-07-26 04:00:50 UTC
Description of problem: pmchart goes into a run-away memory leak and eventually gets killed by oom killer

Version-Release number of selected component (if applicable): pcp-3.11.4-1 / F24

How reproducible: easily

Steps to Reproduce:
1. wget http://people.redhat.com/~mgoodwin/pcp/archives/goody.tgz; tar xzf goody.tgz
2. pmchart -z -a goody -t 10m -O-0 -s 400 -v 400 -geometry 800x450 -c CPU
3. expose time controls, select FAST, press REWIND then press STOP

Actual results: pmtime dialog stops, but pmchart does not scroll and spins out of control and is eventually killed by oom.

Expected results: pmchart should scroll and stop as per pmtime

Additional info: this only occurs when replaying multiple archives (in the directory 'goody'). It is not reproducible if those archives are merged with pmlogextract and we only open the one merged archive.

Comment 1 Mark Goodwin 2016-11-25 22:58:49 UTC
This is still easily reproducible with pmchart from pcp-3.11.6 and latest from the current 3.11.7, as per the Steps to Reproduce in Comment #1

I re-tested since pmchart has had some changes recently. I guess maybe the priority should be bumped up a bit since it basically kills your desktop if you hit it (until the oom killer steps in to stop it).

Comment 2 Mark Goodwin 2016-11-25 23:21:43 UTC
Here's an ltrace of the repeating library call sequence - this repeats until the system eventually runs out of memory and the pmchart process is oom/killed

10440 pmFetch(7, 0x11e8866b0, 0x7ffc4a5b02c0, 7) = 0
10440 QVectorData::free(QVectorData*, int)(0x11e8866a0, 8, 0x149c400, 0) = 0
10440 __pmtimevalSub(0x3616fa0, 0x3616fb0, 0x62b2c, 0) = 0
10440 pmFreeResult(0x11e8866e0, 0x7ffc4a5b01c0, 7, 0) = 0
10440 pmUseContext(0, 1, 0x3616fd0, 0)           = 0
10440 pmCtime(0x7ffc4a5b0358, 0x7677a0, 0x7f89f42141e0, 0) = 0x7677a0
10440 strlen("Mon Jul 18 13:53:12 2016\n")       = 25
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x3689d10
10440 memmove(0x42c1560, "DDDDD\304$@\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42c1560
10440 memmove(0x42c9570, "\256&\026;\264\242\320?\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42c9570
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36aa820
10440 memmove(0x42d1580, "DDDDDD\020@\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42d1580
10440 memmove(0x370b360, "\342\306o\264\243&\327?\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x370b360
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36a7340
10440 memmove(0x42d9590, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42d9590
10440 memmove(0x36f0bc0, "\342\306o\264\243&\327?\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x36f0bc0
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36ba4d0
10440 memmove(0x3713380, "\021\021\021\021\021\021\221?\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x3713380
10440 memmove(0x42e15a0, "\227\310}\253y-\327?\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42e15a0
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36bfe20
10440 memmove(0x42e95b0, "UUUUUU"@\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42e95b0
10440 memmove(0x371b3a0, "`\271\3129\225\356\342?\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x371b3a0
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36caeb0
10440 memmove(0x42f15c0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42f15c0
10440 memmove(0x36f8be0, "`\271\3129\225\356\342?\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x36f8be0
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36c7730
10440 memmove(0x42f95d0, "~\261\344\027\v\006\257@\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42f95d0
10440 memmove(0x37233c0, "\0\0\0\0\0\0Y@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x37233c0
10440 qIsNaN(double)(0x366f350, 0xffffffff, 1, 5) = 0
10440 qIsNaN(double)(0x36a8610, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36b3b70, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36bb680, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36c1f70, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36ca620, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36d1a70, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x366f350, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36a8610, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36b3b70, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36bb680, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36c1f70, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36ca620, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36d1a70, 0xffffffff, 1, 1) = 0
10440 __pmtimevalFromReal(0x7ffc4a5b03d0, 1, 1, 1) = 0
10440 pmUseContext(0, 1, 0x3616f60, 0x7ffc4a5b0320) = 0
10440 pmSetMode(0x1030001, 0x7ffc4a5b03d0, 0xfffffda8, 0) = 0
10440 pmUseContext(0, 0, 0x3616fd0, 0)           = 0
10440 pmCtime(0x7ffc4a5b0358, 0x7677a0, 0x7f89f42141e0, 0) = 0x7677a0
10440 strlen("Mon Jul 18 13:43:12 2016\n")       = 25
10440 pmUseContext(0, 0xffffcfc1, 0, 0)          = 0
10440 QVectorData::allocate(int, int)(44, 8, 0, 0) = 0x11f80d620
10440 qMemSet(void*, int, unsigned long)(0x11f80d630, 0, 28, 0) = 0x11f80d630

-- pause, then the cycle repeats


10440 pmFetch(7, 0x11f80d630, 0x7ffc4a5b02c0, 7) = 0
10440 QVectorData::free(QVectorData*, int)(0x11f80d620, 8, 0x149c400, 0) = 0
10440 __pmtimevalSub(0x3616fa0, 0x3616fb0, 0x62b2c, 0) = 0
10440 pmExtractValue(1, 0x11ec7bf30, 3, 0x7ffc4a5b01d0) = 0
10440 pmExtractValue(1, 0x11ec7c1b0, 3, 0x7ffc4a5b01d0) = 0
10440 pmExtractValue(1, 0x11ec7c200, 3, 0x7ffc4a5b01d0) = 0
10440 pmExtractValue(1, 0x11ec7c250, 3, 0x7ffc4a5b01d0) = 0
10440 pmExtractValue(1, 0x11ec7c2a0, 3, 0x7ffc4a5b01d0) = 0
10440 pmExtractValue(1, 0x11ec7c2f0, 3, 0x7ffc4a5b01d0) = 0
10440 pmExtractValue(1, 0x11ec7c340, 3, 0x7ffc4a5b01d0) = 0
10440 pmFreeResult(0x11f80d660, 0x7f89f3ffa6d8, 7, 0) = 0
10440 pmUseContext(0, 1, 0x3616fd0, 0)           = 0
10440 pmCtime(0x7ffc4a5b0358, 0x7677a0, 0x7f89f42141e0, 0) = 0x7677a0
10440 strlen("Mon Jul 18 13:43:12 2016\n")       = 25
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x3689d10
10440 memmove(0x42c1560, "\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42c1560
10440 memmove(0x42c9570, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42c9570
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36aa820
10440 memmove(0x42d1580, "\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42d1580
10440 memmove(0x370b360, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x370b360
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36a7340
10440 memmove(0x42d9590, "\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42d9590
10440 memmove(0x36f0bc0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x36f0bc0
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36ba4d0
10440 memmove(0x3713380, "\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x3713380
10440 memmove(0x42e15a0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42e15a0
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36bfe20
10440 memmove(0x42e95b0, "\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42e95b0
10440 memmove(0x371b3a0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x371b3a0
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36caeb0
10440 memmove(0x42f15c0, "\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42f15c0
10440 memmove(0x36f8be0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x36f8be0
10440 qQNaN()(1, 0xffffcfc1, 1, 0x3685400)       = 0x36c7730
10440 memmove(0x42f95d0, "\0\0\0\0\0\0\370\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x42f95d0
10440 memmove(0x37233c0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 25536) = 0x37233c0
10440 qIsNaN(double)(0x366f350, 0xffffffff, 1, 5) = 0
10440 qIsNaN(double)(0x36a8610, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36b3b70, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36bb680, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36c1f70, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36ca620, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36d1a70, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x366f350, 0xffffffff, 1, 0) = 0
10440 qIsNaN(double)(0x36a8610, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36b3b70, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36bb680, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36c1f70, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36ca620, 0xffffffff, 1, 1) = 0
10440 qIsNaN(double)(0x36d1a70, 0xffffffff, 1, 1) = 0
10440 __pmtimevalFromReal(0x7ffc4a5b03d0, 1, 1, 1) = 0
10440 pmUseContext(0, 1, 0x3616f60, 0x7ffc4a5b0320) = 0
10440 pmSetMode(0x1030001, 0x7ffc4a5b03d0, 0xfffffda8, 0) = 0
10440 pmUseContext(0, 0, 0x3616fd0, 0)           = 0
10440 pmCtime(0x7ffc4a5b0358, 0x7677a0, 0x7f89f42141e0, 0) = 0x7677a0
10440 strlen("Mon Jul 18 13:33:12 2016\n")       = 25
10440 pmUseContext(0, 0, 0, 0)                   = 0
10440 QVectorData::allocate(int, int)(44, 8, 0, 0) = 0x120794690
10440 qMemSet(void*, int, unsigned long)(0x1207946a0, 0, 28, 0) = 0x1207946a0

.. continues to repeat, as above

Comment 3 Frank Ch. Eigler 2016-11-26 04:06:11 UTC
env PCP_DEDBUG=14336  pmchart ...  

indicates that once you hit the backwards button, the app disregards stop, and keeps on chugging.

Comment 4 Frank Ch. Eigler 2016-11-26 04:32:51 UTC
One more bit of data:

% build a copy of pmchart with debuginfo in it, if able (pcp configure --with-qt=debug one way)
% gdb -args pmchart ....
(gdb) run
... until popping up the time selector
^C
(gdb) break mmap
(gdb) cont

#0  0x00002aaaad2ed3e0 in __mmap (addr=addr@entry=0x0, len=len@entry=1048576, prot=prot@entry=3, flags=flags@entry=34, fd=fd@entry=-1, offset=offset@entry=0) at ../sysdeps/unix/sysv/linux/wordsize-64/mmap.c:33
#1  0x00002aaaad26f070 in sysmalloc (nb=nb@entry=2032, av=0x2aaaad5acb00 <main_arena>) at malloc.c:2515
#2  0x00002aaaad26fa44 in _int_malloc (av=av@entry=0x2aaaad5acb00 <main_arena>, bytes=bytes@entry=2024) at malloc.c:3825
#3  0x00002aaaad2710b0 in __GI___libc_malloc (bytes=2024) at malloc.c:2911
#4  0x00002aaaaacfed0a in __pmLogLoadMeta (lcp=lcp@entry=0xb43790) at logmeta.c:345
#5  0x00002aaaaad03320 in __pmLogOpen (name=name@entry=0xaa76c0 "goody/20160719.12.55", ctxp=ctxp@entry=0xa08070)
    at logutil.c:1086
#6  0x00002aaaaace4349 in __pmFindOrOpenArchive (ctxp=ctxp@entry=0xa08070, name=0xaa76c0 "goody/20160719.12.55", multi_arch=multi_arch@entry=1) at context.c:555
#7  0x00002aaaaad03779 in __pmLogChangeArchive (ctxp=ctxp@entry=0xa08070, arch=arch@entry=1) at logutil.c:2829
#8  0x00002aaaaad053c6 in __pmLogSetTime (ctxp=ctxp@entry=0xa08070) at logutil.c:2292
#9  0x00002aaaaace78b8 in pmSetMode (mode=16973825, when=0x7fffffffcf90, delta=-600) at fetch.c:264
#10 0x00000000004a38a3 in QmcGroup::setArchiveMode(int, timeval const*, int) ()
#11 0x000000000048729e in GroupControl::adjustArchiveWorldViewBackward(QmcTime::Packet*, bool) (this=0xa07e70, packet=0xaad9c0, setup=false) at groupcontrol.cpp:478

... and try a (gdb)cont a few times; many such requests come in this way.

This seems to implicate libpcp multi-archive mode time scanning.

Comment 5 Mark Goodwin 2016-11-26 04:40:02 UTC
(In reply to Frank Ch. Eigler from comment #4)
>
> This seems to implicate libpcp multi-archive mode time scanning.

yes, and this more or less agrees with the repeating library call sequence in Comment #2, which is leaking memory on every iteration. It only occurs in multi-archive mode.

Thanks
-- Mark

Comment 6 Frank Ch. Eigler 2016-11-28 01:47:55 UTC
Created attachment 1225074 [details]
possible reproducer

this little script seems to access the archive in a similar idiosyncratic way as pmchart does (pmSetMode calls between pmFetch).  Needs the sample multi-archive directory under /tmp/goody.

% valgrind --leak-check=full FOO.py

After a few seconds' run, notice how it's getting slower and slower and slower.  Then ^C, valgrind reports:

[...]

==24643== 206,400 bytes in 10,320 blocks are definitely lost in loss record 2,497 of 2,497
==24643==    at 0x4C2BBAD: malloc (vg_replace_malloc.c:299)
==24643==    by 0xDCCFA72: __pmLogLoadMeta (logmeta.c:143)
==24643==    by 0xDCD431F: __pmLogOpen (logutil.c:1086)
==24643==    by 0xDCB5348: __pmFindOrOpenArchive (context.c:555)
==24643==    by 0xDCD4778: __pmLogChangeArchive (logutil.c:2829)
==24643==    by 0xDCD63C5: __pmLogSetTime (logutil.c:2292)
==24643==    by 0xDCB88B7: pmSetMode (fetch.c:264)
==24643==    by 0x11A20C57: ffi_call_unix64 (unix64.S:76)
==24643==    by 0x11A206B9: ffi_call (ffi64.c:525)
==24643==    by 0x1180D67E: _call_function_pointer (callproc.c:837)
==24643==    by 0x1180D67E: _ctypes_callproc (callproc.c:1180)
==24643==    by 0x11807371: PyCFuncPtr_call (_ctypes.c:3954)
==24643==    by 0x4E81FD2: PyObject_Call (abstract.c:2546)

% python FOO.py
[... runs ... and bloats memory more and more and more ...]

Comment 7 Frank Ch. Eigler 2016-11-28 03:57:59 UTC
Created attachment 1225100 [details]
fix for one small leak

One leak down with this patch.

One monster leak found with valgrind on the python reproducer, with a SIGUSR1 interrupt (to preclude libpcp cleanup on context creation):

==26053== 38,135,288 bytes in 5,072 blocks are still reachable in loss record 3,616 of 3,616
==26053==    at 0x4C2BBAD: malloc (vg_replace_malloc.c:299)
==26053==    by 0xDCD0FD7: __pmLogLoadMeta (logmeta.c:313)
==26053==    by 0xDCD567F: __pmLogOpen (logutil.c:1086)
==26053==    by 0xDCB66B8: __pmFindOrOpenArchive (context.c:555)
==26053==    by 0xDCD5AD8: __pmLogChangeArchive (logutil.c:2829)
==26053==    by 0xDCD5EC9: __pmLogChangeToPreviousArchive (logutil.c:3010)
==26053==    by 0xDCD645C: __pmLogRead (logutil.c:1617)
==26053==    by 0xDCD9DE2: cache_read (interp.c:230)
==26053==    by 0xDCDC2A7: __pmLogFetchInterp (interp.c:1034)
==26053==    by 0xDCD7609: __pmLogFetch (logutil.c:1944)
==26053==    by 0xDCB9664: pmFetch (fetch.c:154)
==26053==    by 0xDCBBA61: pmFetchGroup (fetchgroup.c:1559)

[etc]

What's happening is that as interp.c and pals scan back and forth in time across the archives, a new copies of the entire instance domains are constantly allocated and saved via addindom().  addindom() does not perform any duplicate elimination over e.g. idp->stamp, or the contents of the indom, so these copies pile up in memory at an impressive pace.

Comment 8 Mark Goodwin 2016-11-28 04:02:10 UTC
Heh, I just arrived at exactly the same patch :)  It fixes the original pmchart issue but your FOO.py repro still seems to leak (another python gc issue maybe??)

Comment 9 Frank Ch. Eigler 2016-11-28 04:06:09 UTC
(In reply to Mark Goodwin from comment #8)
> [...] (another python gc issue maybe??)

Nope, see above re. monster leak: a design problem in (multi-?) archive work.  indoms are being repeatedly stored in memory.

Comment 10 Mark Goodwin 2016-11-28 04:14:42 UTC
yes you are correct - it doesn't fix the original pmchart issue (but it does fix a leak nevertheless, so at least some progress has been made).

# pmchart -z -a goody -t 10m -O-0 -s 400 -v 400 -geometry 800x450 -c CPU
(where 'goody' is a directory containing multiple archives)

The 'monster' indom leak is going to need some careful thought ...

Cheers

Comment 12 Dave Brolley 2016-12-02 15:46:21 UTC
Created attachment 1227374 [details]
Proposed patch for indom duplication

Here is a proposed patch for filtering out the duplicate indoms. It stops the runaway memory growth, but pm chart still ends up in a loop reading from the multi-archive and remains unresposive.

Comment 13 Mark Goodwin 2016-12-06 04:09:33 UTC
Hi Dave, I applied your patch from Comment #12 but it didn't go too well.
Looks like an invalid free at __pmLogLoadMeta (logmeta.c:434), which is :

    430                 /*
    431                  * This pmid is already known, and matches.  We can free the newly
    432                  * read copy and use the one in the hash table.
    433                  */
    434                 free(dp);
    435                 dp = olddp;

i.e. your new patch seems to be incompatible with the following commit which is in my tree but maybe not in yours :

commit 6ed4b2452ff24356f90d37ac658cdbae4053e8c0
Author: Frank Ch. Eigler <fche>
Date:   Thu Dec 1 15:03:01 2016 +1100

    libpcp: plug a small mem leak in __pmLogLoadMeta()
    
    Plug a leaking pmDesc in __pmLogLoadMeta() for the case where the pmid
    is already in the hash table. Partial fix for RHBZ#1359975. There are
    still other leaks involving multi-archive replay remaining for this bug.

Should we just back-out that change instead since the code has now been re-worked?


$ valgrind pmchart -a goody -O-0 -c CPU
==25136== Memcheck, a memory error detector
==25136== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==25136== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==25136== Command: pmchart -a goody -O-0 -c CPU
==25136== 
==25136== Conditional jump or move depends on uninitialised value(s)
==25136==    at 0x4C2CD0D: free (vg_replace_malloc.c:530)
==25136==    by 0x4E65E7A: __pmLogLoadMeta (logmeta.c:434)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136== 
==25136== Invalid free() / delete / delete[] / realloc()
==25136==    at 0x4C2CD5A: free (vg_replace_malloc.c:530)
==25136==    by 0x4E65E7A: __pmLogLoadMeta (logmeta.c:434)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Address 0x4e68af9 is in the Text segment of /usr/lib64/libpcp.so.3
==25136==    at 0x4E68AF9: __pmLogChkLabel (logutil.c:233)
==25136== 
==25136== Invalid read of size 4
==25136==    at 0x4E65C01: __pmLogLoadMeta (logmeta.c:243)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Address 0x1f35e374 is 4 bytes inside an unallocated block of size 112 in arena "client"
==25136== 
==25136== Invalid read of size 4
==25136==    at 0x4E65C0E: __pmLogLoadMeta (logmeta.c:248)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Address 0x1f35e37c is 12 bytes inside an unallocated block of size 112 in arena "client"
==25136== 
==25136== Invalid read of size 4
==25136==    at 0x4E65C1B: __pmLogLoadMeta (logmeta.c:253)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Address 0x1f35e378 is 8 bytes inside an unallocated block of size 112 in arena "client"
==25136== 
==25136== Invalid read of size 1
==25136==    at 0x4E65C28: __pmLogLoadMeta (logmeta.c:258)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Address 0x1f35e383 is 19 bytes inside an unallocated block of size 112 in arena "client"
==25136== 
==25136== Invalid read of size 1
==25136==    at 0x4E65C36: __pmLogLoadMeta (logmeta.c:260)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Address 0x1f35e382 is 18 bytes inside an unallocated block of size 112 in arena "client"
==25136== 
==25136== Invalid read of size 1
==25136==    at 0x4E65C44: __pmLogLoadMeta (logmeta.c:262)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Address 0x1f35e381 is 17 bytes inside an unallocated block of size 112 in arena "client"
==25136== 
==25136== Invalid read of size 4
==25136==    at 0x4E65CA7: __pmLogLoadMeta (logmeta.c:349)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Address 0x1f35e370 is 0 bytes inside an unallocated block of size 112 in arena "client"
==25136== 
==25136== Invalid free() / delete / delete[] / realloc()
==25136==    at 0x4C2CD5A: free (vg_replace_malloc.c:530)
==25136==    by 0x4E68260: logFreePMNS (logutil.c:707)
==25136==    by 0x4E6A393: __pmLogOpen (logutil.c:1107)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Address 0x1f361640 is 0 bytes inside a block of size 20 free'd
==25136==    at 0x4C2CD5A: free (vg_replace_malloc.c:530)
==25136==    by 0x4E65C59: __pmLogLoadMeta (logmeta.c:273)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==  Block was alloc'd at
==25136==    at 0x4C2BBAD: malloc (vg_replace_malloc.c:299)
==25136==    by 0x4E65B92: __pmLogLoadMeta (logmeta.c:208)
==25136==    by 0x4E6A43F: __pmLogOpen (logutil.c:1086)
==25136==    by 0x4E4B348: __pmFindOrOpenArchive (context.c:555)
==25136==    by 0x4E4CAF1: initarchive (context.c:759)
==25136==    by 0x4E4CAF1: pmNewContext (context.c:1126)
==25136==    by 0x4A603B: QmcSource::retryConnect(int, QString&) (in /usr/bin/pmchart)
==25136==    by 0x4A6F16: QmcSource::QmcSource(int, QString&, int) (in /usr/bin/pmchart)
==25136==    by 0x4A71C6: QmcSource::getSource(int, QString&, int, bool) (in /usr/bin/pmchart)
==25136==    by 0x49C1B1: QmcGroup::use(int, QString const&, int) (in /usr/bin/pmchart)
==25136==    by 0x424DB0: main (in /usr/bin/pmchart)
==25136==

Comment 14 Mark Goodwin 2016-12-06 06:41:32 UTC
I backed out 6ed4b2452ff24356f90d37ac658cdbae4053e8c0 from my tree and rebuilt just with Dave's patch - it's much better behaved, even works actually, doesn't spin nor get killed like it used to using the repro from Comment #0

I think this patch is a tremendous improvement. 6ed4b2452ff24 can be reverted since on inspection it's included in Dave's patch (hence the double free reported in previous comment)

-- Mark

Comment 15 Dave Brolley 2016-12-06 15:28:34 UTC
Both leaks must be fixed in the final version. I'll review and make sure that's the case.

Comment 16 Mark Goodwin 2016-12-07 23:11:16 UTC
Dave's current patch - a quick test shows it seems to work

In the tree/branch:  git://git.pcp.io/brolley/pcp rhbz1359975

commit 0015487d1fd1c70f94bb75b08109a13de81dc878
Author: Dave Brolley <brolley>
Date:   Wed Dec 7 11:50:40 2016 -0500

    RHBZ 1359975: pmchart run-away mem leak replaying multi-archive when rewinding
    
    Two leaks were fixed. Both occurred during the transition between archives
    is a multi-archive context.
    
    1) a leak of pmDesc * used to check for consistency of the PMNS
       between archives.
    
    2) a build up of duplicate instance domains.

Comment 17 Dave Brolley 2016-12-12 19:50:08 UTC
Additional proposed commit (on top of the previous one):

In the tree/branch:  git://git.pcp.io/brolley/pcp rhbz1359975

commit 82e02ffea204d26e7617a4bcf4be0b9326c64457
Author: Dave Brolley <brolley>
Date:   Mon Dec 12 14:39:09 2016 -0500

    RHBZ 1359975: pmchart run-away mem leak replaying multi-archive when rewinding
    
    Refinement. Exposed by qa regressions. Ensure that instance domains are in
    the correct order in the hash chain:
    
    - Primary sort by timestamp (descending)
    - Secondary sort: latest added at the head of each time slot. This
      includes moving duplicates to the head of their time slot when detected.
    
    Before multi-archive contexts and duplicate-filtering, this happened
    automatically.

Comment 18 Dave Brolley 2016-12-14 16:19:19 UTC
More commits for this bug:

In the tree/branch:  git://git.pcp.io/brolley/pcp rhbz1359975

commit e574b71de8420c719ae2924a64ca9a5a4411bd52
Author: Dave Brolley <brolley>
Date:   Wed Dec 14 11:10:45 2016 -0500

    RHBZ 1359975: pmchart run-away mem leak replaying multi-archive when rewinding.
    
    Don't assume that instances within the same instance domain are
    always sorted in the same order.

commit f32bf1e7a669a2bee719e9d3f6c3828acfb8c518
Author: Dave Brolley <brolley>
Date:   Tue Dec 13 12:35:30 2016 -0500

    Create and use __pmTimevalCmp for comparing __pmTimeval.
    
    Similar to __pmtimevalcmp which compares struct timeval.

commit af8cac3b71b3104157a268c6ebb6c652c87d01ca
Author: Nathan Scott <nathans>
Date:   Tue Dec 13 12:06:52 2016 -0500

    Typos and style corrections for recent duplicate __pmLogInDom changes.

Comment 19 Dave Brolley 2016-12-14 19:00:43 UTC
All commits have been pushed upstream to git://git.pcp.io/pcp master

Comment 20 Fedora Update System 2016-12-22 00:41:41 UTC
pcp-3.11.7-1.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-10df136b72

Comment 21 Fedora Update System 2016-12-22 00:43:44 UTC
pcp-3.11.7-1.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-98254aaaa3

Comment 22 Fedora Update System 2016-12-22 00:45:07 UTC
pcp-3.11.7-1.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-9cbe6566bd

Comment 23 Fedora Update System 2016-12-22 19:19:19 UTC
pcp-3.11.7-1.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-98254aaaa3

Comment 24 Fedora Update System 2016-12-22 19:20:35 UTC
pcp-3.11.7-1.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-9cbe6566bd

Comment 25 Fedora Update System 2016-12-22 20:47:04 UTC
pcp-3.11.7-1.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-10df136b72

Comment 26 Fedora Update System 2017-01-02 19:50:59 UTC
pcp-3.11.7-1.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.

Comment 27 Fedora Update System 2017-01-03 01:21:30 UTC
pcp-3.11.7-1.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 28 Fedora Update System 2017-01-06 23:48:39 UTC
pcp-3.11.7-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.