Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1600262

Summary:	pmdaproc only reads the first 1024 bytes of the /proc/*/status file resulting in lost metric values
Product:	Red Hat Enterprise Linux 7	Reporter:	Kyle Walker <kwalker>
Component:	pcp	Assignee:	Mark Goodwin <mgoodwin>
Status:	CLOSED ERRATA	QA Contact:	Michal Kolar <mkolar>
Severity:	high	Docs Contact:	Vladimír Slávik <vslavik>
Priority:	urgent
Version:	7.5	CC:	fche, fkrska, lberk, mcermak, mgoodwin, mkolar, nathans, patrickm, tbowling, toneata, vslavik
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	pcp-4.1.1	Doc Type:	Bug Fix
Doc Text:	.PCP now reports all process details on large systems Previously, the Performance Co-Pilot (PCP) toolkit failed to report certain process details in some cases on very large systems. The code reading the process details files was changed so that it can read data of arbitrary length, instead of only the first 1024 bytes. As a result, the described PCP error can no longer happen.	Story Points:	---
Clone Of:
Clones:	1636959 1636960 (view as bug list)		Environment:
Last Closed:	2019-08-06 12:48:17 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1636959, 1636960

Description Kyle Walker 2018-07-11 20:50:35 UTC

Description of problem:
 The pmdaproc collector agent only reads the first 1024 bytes of the /proc/*/status files. This results, on exceptionally large systems, in metrics at the end of the file being unavailable intermittently. This can result in metrics being unavailable even for smaller systems. 

For example:

# diff -yW70 <(cat /proc/1/status) <(head -c 1024  /proc/1/status)
Name:	systemd				Name:	systemd
Umask:	0000				Umask:	0000
State:	S (sleeping)			State:	S (sleeping)
Tgid:	1				Tgid:	1
Ngid:	0				Ngid:	0
Pid:	1				Pid:	1
PPid:	0				PPid:	0
TracerPid:	0			TracerPid:	0
Uid:	0	0	0		Uid:	0	0	0
Gid:	0	0	0		Gid:	0	0	0
FDSize:	128				FDSize:	128
Groups:					Groups:	
VmPeak:	  193568 kB			VmPeak:	  193568 kB
VmSize:	  128032 kB			VmSize:	  128032 kB
VmLck:	       0 kB			VmLck:	       0 kB
VmPin:	       0 kB			VmPin:	       0 kB
VmHWM:	    6624 kB			VmHWM:	    6624 kB
VmRSS:	    6308 kB			VmRSS:	    6308 kB
RssAnon:	    2548 kB		RssAnon:	    2548 kB
RssFile:	    3760 kB		RssFile:	    3760 kB
RssShmem:	       0 kB		RssShmem:	       0 kB
VmData:	   84220 kB			VmData:	   84220 kB
VmStk:	     132 kB			VmStk:	     132 kB
VmExe:	    1408 kB			VmExe:	    1408 kB
VmLib:	    3716 kB			VmLib:	    3716 kB
VmPTE:	     124 kB			VmPTE:	     124 kB
VmSwap:	       0 kB			VmSwap:	       0 kB
Threads:	1			Threads:	1
SigQ:	0/3860			  |	SigQ:	1/3860
SigPnd:	0000000000000000		SigPnd:	0000000000000000
ShdPnd:	0000000000000000		ShdPnd:	0000000000000000
SigBlk:	7be3c0fe28014a03		SigBlk:	7be3c0fe28014a03
SigIgn:	0000000000001000		SigIgn:	0000000000001000
SigCgt:	00000001800004ec		SigCgt:	00000001800004ec
CapInh:	0000000000000000		CapInh:	0000000000000000
CapPrm:	0000001fffffffff		CapPrm:	0000001fffffffff
CapEff:	0000001fffffffff		CapEff:	0000001fffffffff
CapBnd:	0000001fffffffff		CapBnd:	0000001fffffffff
CapAmb:	0000000000000000		CapAmb:	0000000000000000
Seccomp:	0			Seccomp:	0
Speculation_Store_Bypass:		Speculation_Store_Bypass:
Cpus_allowed:	3			Cpus_allowed:	3
Cpus_allowed_list:	0-1		Cpus_allowed_list:	0-1
Mems_allowed:	00000000,00000	  /	Mems_allowed:	00000000,00000
Mems_allowed_list:	0	  <
voluntary_ctxt_switches:	  <
nonvoluntary_ctxt_switches:	  <

In the above, the proc.psinfo.{nvctxsw,vctxsw} entries will be 0.


Version-Release number of selected component (if applicable):
    pcp-3.12.2-5.el7.x86_64


How reproducible:
    Easily

Steps to Reproduce:
1. Install the pcp-zeroconf package
    # yum install pcp-zeroconf -y

2. Instantiate a hotproc configuration to match "systemd"
    # pmstore hotproc.control.config 'fname == "systemd"' 

3. Verify the contents of the proc.psinfo.nvctxsw entries against the underlying file contents:
    # pminfo -f hotproc.psinfo.nvctxsw

Actual results:

# tail -1 /proc/1/status
nonvoluntary_ctxt_switches:	2348

# pminfo -f hotproc.psinfo.nvctxsw

hotproc.psinfo.nvctxsw
    inst [1 or "000001 /usr/lib/systemd/systemd"] value 0

Expected results:

# tail -1 /proc/1/status
nonvoluntary_ctxt_switches:	2348

# pminfo -f hotproc.psinfo.nvctxsw

hotproc.psinfo.nvctxsw
    inst [1 or "000001 /usr/lib/systemd/systemd"] value 2348


Additional info:
 Though the above example is simplest to achieve, the issue can present itself as PM_ERR_APPVERSION for string metrics such as {,hot}proc.psinfo.cpusallowed if the system is sufficiently large enough the that Cpus_allowed_list field is beyond the 1024 bytes limit. Issue seems to be present upstream.

Comment 2 Mark Goodwin 2018-07-12 04:52:31 UTC

Thanks for the report Kyle, it is indeed a bug - the buffer is only 1K. The simplistic patch would just increase that to 8K (or so), but a looping read would be better. Related code should also be audited, which I'll undertake too.

Regards

Comment 3 Kyle Walker 2018-07-12 16:10:53 UTC

(In reply to Mark Goodwin from comment #2)
> Thanks for the report Kyle, it is indeed a bug - the buffer is only 1K. The
> simplistic patch would just increase that to 8K (or so), but a looping read
> would be better.

I agree. The looping read doesn't look like the simplest change based on the current implementation, but much more effective than increasing the buffer.

Thanks for picking it up so quickly!

- Kyle Walker

Comment 4 Mark Goodwin 2018-07-20 07:07:57 UTC

upstream fix posted (still need a QA test). See: 

https://github.com/goodwinos/pcp/commit/73ce19502de10cad6c8a10b2cc7d787bb4717575


commit 73ce19502de10cad6c8a10b2cc7d787bb4717575
Author: Mark Goodwin <mgoodwin>
Date:   Fri Jul 20 16:24:45 2018 +1000

    pmdaproc: rework fetch_proc_pid_stat to use a looping read
    
    RHBZ #1600262 https://bugzilla.redhat.com/show_bug.cgi?id=1600262
    
    fetch_proc_pid_stat() was reading /proc/PID/status with a fixed
    size 1024 byte buffer. This is no longer big enough on most
    machines, causing some metrics to have incorrect values; notably
    the Mems_allowed, Mems_allowed_list, voluntary_ctxt_switches and
    nonvoluntary_ctxt_switches fields were not being read or parsed.
    
    Rework this to use a looping read with a suitably sized malloc'd
    buffer. QA test forthcoming.
    
    modified:   src/pmdas/linux_proc/proc_pid.c

Comment 5 Mark Goodwin 2018-07-31 06:18:15 UTC

Additional fixes and QA tests for this BZ that will be in pcp-4.1.1

commit a3f27c07ea7f480c16070ada22835375d7c311a0
Author: Mark Goodwin <mgoodwin>
Date:   Sun Jul 29 16:22:52 2018 +1000

    pmdaproc: tweak proc.memory.maps to return empty string for kernel workers
    
    Recent changes broke qa/860 by returning -ENODATA when a process
    has zero length /proc/PID/maps. WHilst this is technically correct,
    it's not historically correct and we need to clear the error and
    instead return a zero length string (as pmdaproc has always done).

commit a5a9195e2c16dd042e547001d67ab47c5b50b3ad
Author: Mark Goodwin <mgoodwin>
Date:   Tue Jul 24 13:41:21 2018 +1000

    pmdaproc: additional proc_pid rework to fix short buffer issues
    
    RHBZ #1600262
    
    Fix short buffer/read issues with some additional proc metrics.
    proc.psinfo.environ (/proc/PID/environ), proc.memory.maps
    (/proc/PID/maps), proc.psinfo.nvctxsw and proc.psinfo.vctxsw
    (/proc/PID/status) and a few others.
    
    modified:   src/pmdas/linux_proc/proc_pid.c

commit 12304c9d71711951b763bbdfd03a4834e0d03114
Author: Mark Goodwin <mgoodwin>
Date:   Sun Jul 29 16:01:03 2018 +1000

    qa: remake qa/1350 to check voluntary_ctxt_switches
    
    nonvoluntary_ctxt_switches in /proc/PID/status can sometimes be zero

commit 0b1fb9c7d8e04173e1036a2186b96393727223e3
Author: Mark Goodwin <mgoodwin>
Date:   Sun Jul 29 12:11:04 2018 +1000

    qa: add test 1350 to test pmdaproc reading /proc/PID/status
    
    RHBZ #1600262 https://bugzilla.redhat.com/show_bug.cgi?id=1600262
    
    new file:   qa/1350
    new file:   qa/1350.out
    modified:   qa/group

Comment 6 Fedora Update System 2018-08-04 00:21:52 UTC

pcp-4.1.1-2.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-0baea70d93

Comment 7 Fedora Update System 2018-08-04 00:34:05 UTC

pcp-4.1.1-2.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-bedf3e464f

Comment 8 Fedora Update System 2018-08-04 20:32:04 UTC

pcp-4.1.1-2.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-bedf3e464f

Comment 9 Fedora Update System 2018-08-04 22:37:00 UTC

pcp-4.1.1-2.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-0baea70d93

Comment 18 Michal Kolar 2019-01-30 11:31:55 UTC

Reproduced against pcp-4.1.0-4.el7 and verified against pcp-4.3.0-2.el7.

Comment 20 errata-xmlrpc 2019-08-06 12:48:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2111