Bug 1721107
| Summary: | pmdaproc leaking memory | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Renaud Métrich <rmetrich> | |
| Component: | pcp | Assignee: | Mark Goodwin <mgoodwin> | |
| Status: | CLOSED ERRATA | QA Contact: | Jan Kurik <jkurik> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 7.6 | CC: | agerstmayr, fkrska, jkurik, mgoodwin, mkolar, nathans, patrickm, peter.vreman | |
| Target Milestone: | rc | Keywords: | ZStream | |
| Target Release: | --- | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | pcp-4.3.4-1 | Doc Type: | Bug Fix | |
| Doc Text: |
No Doc Update
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1742051 (view as bug list) | Environment: | ||
| Last Closed: | 2020-03-31 19:09:14 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1742051 | |||
The 500MB was detected on a small VM. On e.g. a nagios server or satellite memory usages of 1+GB were seen. And on even bigger systems with 500 cpus the memory can be 20+GB within 40 days. Although the number of concurrent running processes is stable at 2000-2500 process (mostly kernel threads due to the 500 cpus) ---- pcp 8411 14.7 0.1 22900504 20884936 ? S May06 8932:10 \_ /var/lib/pcp/pmdas/proc/pmdaproc -d 3 ---- Mark, you know this code well - could you resolve this issue for 7.8 please? Taa. The high memory usage of the base OS is raising more questions from application teams. Especially the SAP teams managing the HANA Databases are challenging why they can use only 20GB out of 32GB for their database. For me it is hard to explain that the 'simple' Base OS uses (in case the application uses thread create/delete actively) up to 8GB out of 32GB VM. On the bigger SAP HANA systems we see pmdaproc using 20+GB Can it also be made sure that the latest SAP Solution release (that will still be supported 4 more years from now, at least until 2023Q3) includes the fix
Fixed upstream for pcp-4.3.4 (once it releases) :
commit 775bd386c2b9a6e553c61a5c0f86d29d9c2320a8
Author: Mark Goodwin <mgoodwin>
Date: Thu Jul 25 13:30:05 2019 +1000
pmdaproc: fix memory leak in pidlist refresh
RHBZ #1721107
The pidlist refresh function was leaking instance name buffers for
processes that had exited between refreshes. Fix this by storing the
instance name (truncated psargs) in a new field of the hash table
entry, and harvest them correctly if they've exited after each refresh.
The indom table instance names now share a pointer to the new field
in the hash table entry so they dont need to be separately free'd.
This similifies the code and memory management.
Also put a 4K cap on the maximum length of a psargs string for the
unlikely scenario of a psargs string that is not NULL terminated,
i.e. use strndup() rather than strdup(). This may be possible if
a rogue process modified it's argv[0] at run-time.
For testing, the recipe in RHBZ#1721107 was repro'd using valgind
and a workload with lots of process churn (i.e. a couple of large
builds running). Before the fix, valgrind reported numerous similar
leaks as reported in the bug. With the fix applied, those leaks
are gone. The refresh is more efficient now too because we only
have one copy of the psargs strings (in the hash table entry).
Also thoroughly exercised all tests in the QA pmda.proc and
pmda.hotproc groups. A new test might be possible, but the nature
of the required workload would make it difficult (and expensive)
to run.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0994 |
Description of problem: A customer reports his pmdaproc process leaking memory (up to 490 MB). Running pmdaproc under valgrind shows indeed leaks, in particular when using hotproc: ==4962== Command: /var/lib/pcp/pmdas/proc/pmdaproc -d 3 ==4962== Parent PID: 4961 ==4962== 785 bytes in 47 blocks are definitely lost in loss record 69 of 112 ==4962== at 0x4C29BC3: malloc (vg_replace_malloc.c:299) ==4962== by 0x409FA0: refresh_proc_pidlist.isra.3 (proc_pid.c:973) ==4962== by 0x40A99E: refresh_proc_pid (proc_pid.c:1070) ==4962== by 0x403AEB: proc_refresh.isra.1 (pmda.c:1686) ==4962== by 0x403EA2: proc_fetch (pmda.c:3179) ==4962== by 0x4E3F69F: __pmdaMainPDU (mainloop.c:169) ==4962== by 0x4E400EF: pmdaMain (mainloop.c:397) ==4962== by 0x403016: main (pmda.c:3626) ==4962== ==4962== 1,288 bytes in 61 blocks are definitely lost in loss record 71 of 112 ==4962== at 0x4C29BC3: malloc (vg_replace_malloc.c:299) ==4962== by 0x53780C9: strdup (in /usr/lib64/libc-2.17.so) ==4962== by 0x40A2B7: refresh_proc_pidlist.isra.3 (proc_pid.c:979) ==4962== by 0x40A99E: refresh_proc_pid (proc_pid.c:1070) ==4962== by 0x403AEB: proc_refresh.isra.1 (pmda.c:1686) ==4962== by 0x403EA2: proc_fetch (pmda.c:3179) ==4962== by 0x4E3F69F: __pmdaMainPDU (mainloop.c:169) ==4962== by 0x4E400EF: pmdaMain (mainloop.c:397) ==4962== by 0x403016: main (pmda.c:3626) ==4962== ==4962== 365,560 bytes in 17,315 blocks are definitely lost in loss record 111 of 112 ==4962== at 0x4C29BC3: malloc (vg_replace_malloc.c:299) ==4962== by 0x409FA0: refresh_proc_pidlist.isra.3 (proc_pid.c:973) ==4962== by 0x40C876: hotproc_eval_procs (proc_pid.c:468) ==4962== by 0x50963C9: onalarm (AF.c:310) ==4962== by 0x532227F: ??? (in /usr/lib64/libc-2.17.so) ==4962== by 0x53DAF6D: __read_nocancel (in /usr/lib64/libc-2.17.so) ==4962== by 0x507379A: UnknownInlinedFun (unistd.h:44) ==4962== by 0x507379A: pduread (pdu.c:251) ==4962== by 0x507435C: __pmGetPDU (pdu.c:434) ==4962== by 0x4E3F4FD: __pmdaMainPDU (mainloop.c:81) ==4962== by 0x4E400EF: pmdaMain (mainloop.c:397) ==4962== by 0x403016: main (pmda.c:3626) ==4962== ==4962== 710,693 bytes in 32,809 blocks are definitely lost in loss record 112 of 112 ==4962== at 0x4C29BC3: malloc (vg_replace_malloc.c:299) ==4962== by 0x53780C9: strdup (in /usr/lib64/libc-2.17.so) ==4962== by 0x40A2B7: refresh_proc_pidlist.isra.3 (proc_pid.c:979) ==4962== by 0x40C876: hotproc_eval_procs (proc_pid.c:468) ==4962== by 0x50963C9: onalarm (AF.c:310) ==4962== by 0x532227F: ??? (in /usr/lib64/libc-2.17.so) ==4962== by 0x53DAF6D: __read_nocancel (in /usr/lib64/libc-2.17.so) ==4962== by 0x507379A: UnknownInlinedFun (unistd.h:44) ==4962== by 0x507379A: pduread (pdu.c:251) ==4962== by 0x507435C: __pmGetPDU (pdu.c:434) ==4962== by 0x4E3F4FD: __pmdaMainPDU (mainloop.c:81) ==4962== by 0x4E400EF: pmdaMain (mainloop.c:397) ==4962== by 0x403016: main (pmda.c:3626) ==4962== ==4962== LEAK SUMMARY: ==4962== definitely lost: 1,078,326 bytes in 50,232 blocks ... Version-Release number of selected component (if applicable): pcp-4.1.0-5.el7_6.x86_64 How reproducible: Always on customer setup Steps to Reproduce: 1. Enable hotproc: # cat /var/lib/pcp/pmdas/proc/hotproc.conf #pmdahotproc Version 1.0 (uname == "root" && cpuburn > 0.05) || (uname != "root" && cpuburn > 0.0001) 2. Modify pmcd/pmdaproc component to run under valgrind Edit /etc/pcp/pmcd/pmcd.conf and replace "pmdaproc" line below: Original line: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- proc 3 pipe binary /var/lib/pcp/pmdas/proc/pmdaproc -d 3 -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- New line: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- proc 3 pipe binary /usr/bin/valgrind --tool=memcheck --leak-check=yes --log-file=/var/log/pcp/valgrind.pmdaproc.log /var/lib/pcp/pmdas/proc/pmdaproc -d 3 -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 3. Restart pmcd service, making sure pcp_pmcd_t domain is permissive # yum -y install valgrind # debuginfo-install -y pcp # systemctl stop pmcd # semanage permissive -a pcp_pmcd_t # systemctl start pmcd Actual results: Leaks reported by valgrind Expected results: No leak Additional info: src/pmdas/linux_proc/proc_pid.c: -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- 820 static void 821 refresh_proc_pidlist(proc_pid_t *proc_pid, proc_pid_list_t *pids) 822 { [...] 831 if (indomp->it_numinst < pids->count) { 832 indomp->it_set = (pmdaInstid *)realloc(indomp->it_set, 833 pids->count * sizeof(pmdaInstid)); 834 memset(&indomp->it_set[indomp->it_numinst], 0, 835 (pids->count - indomp->it_numinst) * sizeof(pmdaInstid)); 836 } [...] 949 /* refresh the indom pointer */ 950 indomp->it_set[i].i_inst = ep->id; 951 if (indomp->it_set[i].i_name) { 952 int len = strlen(indomp->it_set[i].i_name); 953 if (strncmp(indomp->it_set[i].i_name, ep->name, len) != 0) { 954 free(indomp->it_set[i].i_name); 955 indomp->it_set[i].i_name = NULL; 956 } 957 else if (pmDebugOptions.libpmda) { 958 fprintf(stderr, "refresh_proc_pidlist: Instance id=%d \"%s\" no change\n", 959 ep->id, indomp->it_set[i].i_name); 960 } 961 } 962 if (indomp->it_set[i].i_name == NULL) { 963 /* 964 * The external instance name is the pid followed by 965 * a copy of the psargs truncated at the first space. 966 * e.g. "012345 /path/to/command". Command line args, 967 * if any, are truncated. The full command line is 968 * available in the proc.psinfo.psargs metric. 969 */ 970 if ((p = strchr(ep->name, ' ')) != NULL) { 971 if ((p = strchr(p+1, ' ')) != NULL) { 972 int len = p - ep->name; 973 indomp->it_set[i].i_name = (char *)malloc(len+1); 974 strncpy(indomp->it_set[i].i_name, ep->name, len); 975 indomp->it_set[i].i_name[len] = '\0'; 976 } 977 } 978 if (indomp->it_set[i].i_name == NULL) 979 indomp->it_set[i].i_name = strdup(ep->name); 980 } -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< -------- It's unclear to me where the issue exactly is, the code looks good to me: the indomp->it_set[i].i_name pointer is freed when required (line 954) then malloc'ed/strdup'ed again (lines 973 and 979), and indomp->it_set is never freed but always realloc'ed, so indomp->it_set[i].i_name should always be valid pointers. Anyway, there is for sure an issue with the realloc() statement on line 832 in case of realloc() failing (pmdaproc will then crash on line 834 due to NULL pointer dereference).