Bug 1721107 - pmdaproc leaking memory
Summary: pmdaproc leaking memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcp
Version: 7.6
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Mark Goodwin
QA Contact: Jan Kurik
URL:
Whiteboard:
Depends On:
Blocks: 1742051
 
Reported: 2019-06-17 11:20 UTC by Renaud Métrich
Modified: 2020-03-31 19:09 UTC
CC List: 8 users

Fixed In Version: pcp-4.3.4-1
Doc Type: Bug Fix
Doc Text:
No Doc Update
Clone Of:
Clones: 1742051
Environment:
Last Closed: 2020-03-31 19:09:14 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System:       Red Hat Product Errata
ID:           RHBA-2020:0994
Private:      0
Priority:     None
Status:       None
Summary:      None
Last Updated: 2020-03-31 19:09:36 UTC

Description Renaud Métrich 2019-06-17 11:20:42 UTC
Description of problem:

A customer reports his pmdaproc process leaking memory (up to 490 MB). Running pmdaproc under valgrind indeed shows leaks, in particular when hotproc is in use:


==4962== Command: /var/lib/pcp/pmdas/proc/pmdaproc -d 3
==4962== Parent PID: 4961
==4962== 785 bytes in 47 blocks are definitely lost in loss record 69 of 112
==4962==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==4962==    by 0x409FA0: refresh_proc_pidlist.isra.3 (proc_pid.c:973)
==4962==    by 0x40A99E: refresh_proc_pid (proc_pid.c:1070)
==4962==    by 0x403AEB: proc_refresh.isra.1 (pmda.c:1686)
==4962==    by 0x403EA2: proc_fetch (pmda.c:3179)
==4962==    by 0x4E3F69F: __pmdaMainPDU (mainloop.c:169)
==4962==    by 0x4E400EF: pmdaMain (mainloop.c:397)
==4962==    by 0x403016: main (pmda.c:3626)
==4962==
==4962== 1,288 bytes in 61 blocks are definitely lost in loss record 71 of 112
==4962==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==4962==    by 0x53780C9: strdup (in /usr/lib64/libc-2.17.so)
==4962==    by 0x40A2B7: refresh_proc_pidlist.isra.3 (proc_pid.c:979)
==4962==    by 0x40A99E: refresh_proc_pid (proc_pid.c:1070)
==4962==    by 0x403AEB: proc_refresh.isra.1 (pmda.c:1686)
==4962==    by 0x403EA2: proc_fetch (pmda.c:3179)
==4962==    by 0x4E3F69F: __pmdaMainPDU (mainloop.c:169)
==4962==    by 0x4E400EF: pmdaMain (mainloop.c:397)
==4962==    by 0x403016: main (pmda.c:3626)
==4962==
==4962== 365,560 bytes in 17,315 blocks are definitely lost in loss record 111 of 112
==4962==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==4962==    by 0x409FA0: refresh_proc_pidlist.isra.3 (proc_pid.c:973)
==4962==    by 0x40C876: hotproc_eval_procs (proc_pid.c:468)
==4962==    by 0x50963C9: onalarm (AF.c:310)
==4962==    by 0x532227F: ??? (in /usr/lib64/libc-2.17.so)
==4962==    by 0x53DAF6D: __read_nocancel (in /usr/lib64/libc-2.17.so)
==4962==    by 0x507379A: UnknownInlinedFun (unistd.h:44)
==4962==    by 0x507379A: pduread (pdu.c:251)
==4962==    by 0x507435C: __pmGetPDU (pdu.c:434)
==4962==    by 0x4E3F4FD: __pmdaMainPDU (mainloop.c:81)
==4962==    by 0x4E400EF: pmdaMain (mainloop.c:397)
==4962==    by 0x403016: main (pmda.c:3626)
==4962==
==4962== 710,693 bytes in 32,809 blocks are definitely lost in loss record 112 of 112
==4962==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==4962==    by 0x53780C9: strdup (in /usr/lib64/libc-2.17.so)
==4962==    by 0x40A2B7: refresh_proc_pidlist.isra.3 (proc_pid.c:979)
==4962==    by 0x40C876: hotproc_eval_procs (proc_pid.c:468)
==4962==    by 0x50963C9: onalarm (AF.c:310)
==4962==    by 0x532227F: ??? (in /usr/lib64/libc-2.17.so)
==4962==    by 0x53DAF6D: __read_nocancel (in /usr/lib64/libc-2.17.so)
==4962==    by 0x507379A: UnknownInlinedFun (unistd.h:44)
==4962==    by 0x507379A: pduread (pdu.c:251)
==4962==    by 0x507435C: __pmGetPDU (pdu.c:434)
==4962==    by 0x4E3F4FD: __pmdaMainPDU (mainloop.c:81)
==4962==    by 0x4E400EF: pmdaMain (mainloop.c:397)
==4962==    by 0x403016: main (pmda.c:3626)
==4962==
==4962== LEAK SUMMARY:
==4962==    definitely lost: 1,078,326 bytes in 50,232 blocks
...


Version-Release number of selected component (if applicable):

pcp-4.1.0-5.el7_6.x86_64


How reproducible:

Always on customer setup


Steps to Reproduce:

1. Enable hotproc:

# cat /var/lib/pcp/pmdas/proc/hotproc.conf
#pmdahotproc
Version 1.0
(uname == "root" && cpuburn > 0.05) || (uname != "root" && cpuburn > 0.0001)

2. Modify pmcd/pmdaproc component to run under valgrind

Edit /etc/pcp/pmcd/pmcd.conf and replace the "pmdaproc" line as shown below:

Original line:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
proc    3       pipe    binary          /var/lib/pcp/pmdas/proc/pmdaproc -d 3
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

New line:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
proc    3       pipe    binary          /usr/bin/valgrind --tool=memcheck --leak-check=yes --log-file=/var/log/pcp/valgrind.pmdaproc.log /var/lib/pcp/pmdas/proc/pmdaproc -d 3
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

3. Restart the pmcd service, making sure the pcp_pmcd_t SELinux domain is permissive:

# yum -y install valgrind
# debuginfo-install -y pcp
# systemctl stop pmcd
# semanage permissive -a pcp_pmcd_t
# systemctl start pmcd


Actual results:

Leaks reported by valgrind


Expected results:

No leak


Additional info:

src/pmdas/linux_proc/proc_pid.c:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 820 static void
 821 refresh_proc_pidlist(proc_pid_t *proc_pid, proc_pid_list_t *pids)
 822 {
 [...]
 831     if (indomp->it_numinst < pids->count) {
 832         indomp->it_set = (pmdaInstid *)realloc(indomp->it_set,
 833             pids->count * sizeof(pmdaInstid));
 834         memset(&indomp->it_set[indomp->it_numinst], 0,
 835             (pids->count - indomp->it_numinst) * sizeof(pmdaInstid));
 836     }
 [...]
 949         /* refresh the indom pointer */
 950         indomp->it_set[i].i_inst = ep->id;
 951         if (indomp->it_set[i].i_name) {
 952             int len = strlen(indomp->it_set[i].i_name);
 953             if (strncmp(indomp->it_set[i].i_name, ep->name, len) != 0) {
 954                 free(indomp->it_set[i].i_name);
 955                 indomp->it_set[i].i_name = NULL;
 956             }
 957             else if (pmDebugOptions.libpmda) {
 958                 fprintf(stderr, "refresh_proc_pidlist: Instance id=%d \"%s\" no change\n",
 959                     ep->id, indomp->it_set[i].i_name);
 960             }
 961         }
 962         if (indomp->it_set[i].i_name == NULL) {
 963             /*
 964              * The external instance name is the pid followed by
 965              * a copy of the psargs truncated at the first space.
 966              * e.g. "012345 /path/to/command". Command line args,
 967              * if any, are truncated. The full command line is
 968              * available in the proc.psinfo.psargs metric.
 969              */
 970             if ((p = strchr(ep->name, ' ')) != NULL) {
 971                 if ((p = strchr(p+1, ' ')) != NULL) {
 972                     int len = p - ep->name;
 973                     indomp->it_set[i].i_name = (char *)malloc(len+1);
 974                     strncpy(indomp->it_set[i].i_name, ep->name, len);
 975                     indomp->it_set[i].i_name[len] = '\0';
 976                 }
 977             }
 978             if (indomp->it_set[i].i_name == NULL)
 979                 indomp->it_set[i].i_name = strdup(ep->name);
 980         }
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

It's unclear to me where exactly the issue is; the code looks correct to me: the indomp->it_set[i].i_name pointer is freed when required (line 954) and then malloc'ed/strdup'ed again (lines 973 and 979), and indomp->it_set is never freed, only realloc'ed, so indomp->it_set[i].i_name should always be a valid pointer.

In any case, there is certainly an issue with the realloc() statement on line 832 if realloc() fails: pmdaproc will then crash on line 834 due to a NULL pointer dereference.
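
For illustration, a NULL-safe variant of the realloc() on line 832 could look like the sketch below (the early return and the error message are illustrative assumptions, not the actual upstream code); it keeps the previous, still-valid it_set when the allocation fails instead of dereferencing NULL on line 834:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
/* Sketch only: needs <stdio.h>, <stdlib.h> and <string.h>, plus
 * <pcp/pmapi.h> and <pcp/pmda.h> for pmdaInstid; 'indomp' and 'pids'
 * are the variables from the snippet above. */
    if (indomp->it_numinst < pids->count) {
        pmdaInstid *tmp = (pmdaInstid *)realloc(indomp->it_set,
            pids->count * sizeof(pmdaInstid));
        if (tmp == NULL) {
            fprintf(stderr, "refresh_proc_pidlist: realloc failed\n");
            return;    /* keep the old it_set / it_numinst untouched */
        }
        indomp->it_set = tmp;
        memset(&indomp->it_set[indomp->it_numinst], 0,
            (pids->count - indomp->it_numinst) * sizeof(pmdaInstid));
    }
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------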

Comment 3 Peter Vreman 2019-06-17 15:49:35 UTC
The 500 MB was detected on a small VM. On e.g. a Nagios or Satellite server, memory usages of 1+ GB were seen.
On even bigger systems with 500 CPUs, memory usage can reach 20+ GB within 40 days, even though the number of concurrently running processes is stable at 2000-2500 (mostly kernel threads due to the 500 CPUs):
----
pcp        8411 14.7  0.1 22900504 20884936 ?   S    May06 8932:10      \_ /var/lib/pcp/pmdas/proc/pmdaproc -d 3
----

Comment 4 Nathan Scott 2019-06-18 20:31:59 UTC
Mark, you know this code well - could you resolve this issue for 7.8 please?  Taa.

Comment 5 Peter Vreman 2019-07-22 12:55:58 UTC
The high memory usage of the base OS is raising more questions from application teams. In particular, the SAP teams managing the HANA databases are challenging why they can use only 20 GB out of 32 GB for their database. For me it is hard to explain that the 'simple' base OS uses up to 8 GB of a 32 GB VM (when the application creates and deletes threads actively).

On the bigger SAP HANA systems we see pmdaproc using 20+ GB.

Can it also be made sure that the latest SAP Solution release (which will still be supported for 4 more years from now, at least until 2023Q3) includes the fix?

Comment 6 Mark Goodwin 2019-07-29 06:33:10 UTC
Fixed upstream for pcp-4.3.4 (once it releases) :

commit 775bd386c2b9a6e553c61a5c0f86d29d9c2320a8
Author: Mark Goodwin <mgoodwin>
Date:   Thu Jul 25 13:30:05 2019 +1000

    pmdaproc: fix memory leak in pidlist refresh
    
    RHBZ #1721107
    
    The pidlist refresh function was leaking instance name buffers for
    processes that had exited between refreshes. Fix this by storing the
    instance name (truncated psargs) in a new field of the hash table
    entry, and harvesting those buffers correctly after each refresh
    once the process has exited. The indom table instance names now
    share a pointer to the new field in the hash table entry, so they
    don't need to be separately freed. This simplifies the code and
    memory management.
    
    Also put a 4K cap on the maximum length of a psargs string for the
    unlikely scenario of a psargs string that is not NULL terminated,
    i.e. use strndup() rather than strdup(). This may be possible if
    a rogue process modified its argv[0] at run-time.
    
    For testing, the recipe in RHBZ#1721107 was repro'd using valgrind
    and a workload with lots of process churn (i.e. a couple of large
    builds running). Before the fix, valgrind reported numerous leaks
    similar to those reported in the bug. With the fix applied, those
    leaks are gone. The refresh is more efficient now too because we
    only have one copy of the psargs strings (in the hash table entry).
    
    Also thoroughly exercised all tests in the QA pmda.proc and
    pmda.hotproc groups. A new test might be possible, but the nature
    of the required workload would make it difficult (and expensive)
    to run.
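
As a rough illustration of the strategy described in that commit message, the sketch below keeps a single strndup()'d copy of the truncated instance name in the per-process hash table entry and lets the indom table share the pointer. The struct layout, the "instname" field and the helper name are hypothetical, for illustration only; the real change is the upstream commit quoted above.

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
/* Sketch only: needs <string.h> for strchr()/strndup(), plus
 * <pcp/pmapi.h> and <pcp/pmda.h> for pmdaInstid. */

#define MAX_PSARGS_LEN 4096   /* cap, in case psargs is not NULL terminated */

typedef struct {
    int   id;          /* pid */
    char *name;        /* "pid psargs ..." as built during the refresh */
    char *instname;    /* truncated copy, owned by this hash table entry */
} proc_entry_sketch_t;

static void
share_instance_name(proc_entry_sketch_t *ep, pmdaInstid *inst)
{
    if (ep->instname == NULL) {
        char *p = strchr(ep->name, ' ');

        /* truncate at the second space: "012345 /path/to/command args..." */
        if (p != NULL && (p = strchr(p + 1, ' ')) != NULL)
            ep->instname = strndup(ep->name, p - ep->name);
        else
            ep->instname = strndup(ep->name, MAX_PSARGS_LEN);
    }
    /* The indom table only borrows the pointer; the buffer is freed
     * exactly once, when the hash entry is harvested after the process
     * has exited. */
    inst->i_inst = ep->id;
    inst->i_name = ep->instname;
}
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

With a single owner per name buffer, the names of processes that exit between refreshes are freed exactly once when their hash entries are harvested, instead of leaking as before.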

Comment 16 errata-xmlrpc 2020-03-31 19:09:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0994

