Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1293661

Summary:	-t 1s global option not used in all cases
Product:	Red Hat Enterprise Linux 6	Reporter:	Dwight (Bud) Brown <bubrown>
Component:	pcp	Assignee:	Nathan Scott <nathans>
Status:	CLOSED ERRATA	QA Contact:	Miloš Prchlík <mprchlik>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	6.7	CC:	brolley, fche, lberk, loberman, mbenitez, mcermak, mgoodwin, mprchlik, myllynen
Target Milestone:	rc	Keywords:	Reopened
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	pcp-3.10.9-8.el6	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-03-21 11:20:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Dwight (Bud) Brown 2015-12-22 15:44:12 UTC

Description of problem:
set sample collection rate to 1s, pmlogger.log shows 1s sample rate, but pmstat -a  outputs at a different sampling time.

Version-Release number of selected component (if applicable):
3.10.3 (latest yum install available from Red Hat)

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
@ Tue Dec 22 00:10:12 2015
 loadavg                      memory      swap        io    system                 cpu
   1 min   swpd   free   buff  cache   pi   po   bi   bo   in   cs  us  sy  id  wa  st
    2.99      0 37621m 352332  5152m    ?    ?    ?    ?    ?    ?   ?   ?   ?   ?   ?
    2.99      0 37620m 352332  5153m    0    0    0   30  109   83   0   0 100   0   0
    2.99      0 37619m 352332  5153m    0    0    0    6  115   89   0   0 100   0   0
    3.07      0 36781m 352332  5967m    0    0    0    8  985  167   8   3  89   0   0
    3.23      0 35422m 352332  7285m    0    0  137  134 1116   91  10   3  87   0   0
    3.29      0 34067m 352332  8601m    0    0    0   30 1147  106   9   3  87   0   0
    3.34      0 37519m 352336  5155m    0    0    1 1416 1453  118  11   2  87   0   0
    3.40      0 37518m 352336  5156m    0    0    0  115 1117   97  13   0  87   0   0
    3.36      0 37518m 352336  5156m    0    0    0   32 1111  111  13   0  87   0   0
    3.58      0 37508m 352336  5157m    0    0    1  23K 1144  122  13   0  82   5   0
    3.77      0 37508m 352344  5158m    0    0    1  63K 1212 3085  12   1  70  17   0
    3.95      0 37508m 352352  5158m    0    0    1  63K 1187  596  13   0  70  17   0
    4.11      0 37508m 352360  5159m    0    0    3  68K 1174 1324  12   1  72  16   0
    4.18      0 37752m 352364  5011m    0    0    3  71K  350 8628   0   1  73  26   0
    4.25      0 37751m 352372  5012m    0    0    2  68K  240  179   0   1  85  14   0
    4.25      0 37751m 352372  5012m    0    0    1  71K  148  201   0   0  84  16   0
    4.52      0 37751m 352384  5012m    0    0    2  69K  160  161   0   0  86  13   0
    4.56      0 37751m 352392  5012m    0    0    1  65K  199 1365   0   1  84  15   0
    4.59      0 37750m 352408  5013m    0    0    8  58K  194 4014   0   1  78  21   0
    4.62      0 37749m 352424  5013m    0    0    4  60K  162   96   0   0  87  12   0
    4.65      0 37749m 352428  5014m    0    0    3  63K  209 6753   0   1  76  23   0
@ Tue Dec 22 00:11:57 2015

105s between timestamps, 21 samples = 105/21 = 5s per sample vs 1s expected.

pmlogger indicates data is collected at 1s sample rate:

:
} logged every 1 sec:

Expected results:


Additional info:

Need to align/coordinate with pmiostat -x t -a <same archive>, but cannot with the above output.

Comment 2 Frank Ch. Eigler 2015-12-22 15:58:34 UTC

To paraphrase, you are seeing high-rate logging at the requested interval in the archive.  You are expecting to see an archive-consumer tool like "pmstat -a ARCHIVE" to automatically match the sampling interval.

That's an intuitive expectation, but not one that pcp customarily provides.  

Clients choose their own preferred sampling interval.  It can easily be higher or lower temporal resolution than the underlying archive, leading libpcp to interpolate or subsample.  The sampling rate within archives is not a fixed quantity either; pmlogger can vary its rates due to pmlc, or the data might have come from an importer with varying time intervals.

Comment 3 Mark Goodwin 2015-12-22 22:42:34 UTC

Thanks Frank - concur, this is not a bug, it's a feature(!). In general, there is no single 'sampling interval' in a PCP archive - different metrics can be logged at different sampling intervals, and those intervals can be changed dynamically. e.g. you could write a pmie(1) rule that periodically checks for high i/o wait, and when detected, invoke pmlc(1) to increase the sampling rate for assorted disk metrics and also enable logging for assorted per-process metrics (e.g proc.io.{read,write}_bytes) for the next hour. Some time later, the analyst can examine the logs and determine the application/process level activity that correlates with the increased wait time or disk activity.

The interpolation, sub-sampling and averaging by PCP clients when replaying archives doesn't at first seem intuitive, but it really is a great feature.
Another example: replay an archive with pmiostat -a somearchive -t 1h  to see hourly averaged i/o stats. Then zoom in on a particular time of interesting activity with the -S and -t flags, e.g. pmiostat -a somearchive -S 'Wed Dec 23 08:31:20' -t 1m -s 60
will report starting at that date and time for 60 samples of one minute each, i.e. zoom into an "interesting hour". The reverse ctime parser supported by -S and other flags is really handy, especially since you can cut/paste from syslog for the interesting times, etc. See description of all the time related command line arguments in the "TIME WINDOW SPECIFICATION" section of pcpintro(1).

Comment 4 Mark Goodwin 2015-12-23 05:53:16 UTC

above example should be -S '@Wed Dec 23 08:31:20'

Need the literal  @  to designate an absolute timestamp as opposed to a time interval or offset. See pcpintro(1)

Comment 5 Dwight (Bud) Brown 2015-12-23 16:43:54 UTC

The issue is we currently are able to easily coordinate measured timestamped data for cpu, memory, io, interrupt, et.al. resources using sysstat and even more easily with collectl. However with PCP, each replay client seems to be doing its own thing, which might individually seem like a good idea, but overly complicates coordinating resource use across the different types of resources.

Cases require manipulating just one resource report, but multiples which depends on the timestamp and the measured data (rather than resampled/interpolated) at that timestamp across the multiple resources in order to understand whats going on.

The standard "sweet spot" sample period we use for storage is 10s. Anything higher, and too often the data is averaged out over too long of a sample time to be trustworthy of an interpretation. In some cases even going to 1s intervals is insufficient (at which point we switch over to using blktrace). At least a couple times a month I have to resort to using blktrace to validate to the customer that the iostat data being captured/displayed is accurate. Having a tool that resamples/interpolates data will make that much more difficult.

"those intervals can be changed dynamically. e.g. you could write a pmie(1) rule that periodically checks for high i/o wait, and when detected, invoke pmlc(1)"

again, great for monitoring but makes analysis more difficult because the sampling period is non-constant and shifting. With a steady sampling rate you can see the backlog shift in queue depth on a device that ultimately causes high await time vs just faster samples after the problem has already happened -- i.e. faster samples won't show you the actual historical root cause which was happening prior to the await time increase.

Comment 6 Dwight (Bud) Brown 2015-12-23 16:44:47 UTC

Cases *don't* require manipulating just one resource report...

Comment 7 Mark Goodwin 2015-12-23 22:34:46 UTC

Ok, so configure the PCP logging config to record all the required metrics with the same interval, notionally 10s. Then use -t 10s with all tools when replaying. FOr consistency, all PCP tools should support the -A and -S flags (see pcpintro(1), so they can all align replay, starting at the same common offset from the start of the log. This *should* suffice, ?

Comment 11 Mark Goodwin 2015-12-24 00:42:08 UTC

Ah yep thanks Frank - I'd forgotten about that option for pmSetMode(3).
Perhaps more conveniently and less intrusively, we could introduce an environment variable for the default pmSetMode setting in archive mode, either PM_MODE_FORW or PM_MODE_INTERP, defaulting to the latter. Otherwise we'd burn another flag (-U) everywhere.

I still maintain that once everyone new to PCP gets used to interp mode, it will become preferred - apart from a few very special corner cases :D

Cheers

Comment 12 Frank Ch. Eigler 2015-12-24 00:52:53 UTC

"Otherwise we'd burn another flag (-U) everywhere."

IMHO, flags are there to be burned!

   Burn free, as free as the wind blows,
   as free as the grass grows,
   burn free to follow your heart.

But seriously, an env. var that acts so much like a command line option should just -be- a command line option.  We'd have the same sort of conflict resolution logic between the two of them.

(I fully agree about interp mode being a good & useful default.)

Comment 16 Nathan Scott 2016-01-18 00:38:11 UTC

Bud has pointed out to me that this BZ has been closed prematurely, given all the followup comments and possible fixes there - re-opening.

Comment 17 Mark Goodwin 2016-02-10 05:45:59 UTC

Upstream commit (as follows below) fixes this for the most needed case (pmiostat aka pcp-iostat). This will be in next release: pcp-3.11.1-1. Note that customer performance data collected using already-released PCP versions can be replayed with the new version of pmiostat supporting -u (non-interpolating mode), which can be installed on your workstation. So there probably isn't precedent to push this into RHEL with a spec/patch release - can we update this BZ to target fedora/rawhide instead of RHEL6?

commit d8fdf340a692d53ec530372c1df17981825351da
Author: Mark Goodwin <mgoodwin>
Date:   Wed Feb 10 13:00:39 2016 +1100

    pmiostat: support uninterpolated archive replay with -u flag
    
    The -u flag disables interpolation when replaying an archive.
    Instead the native underlying replay interval in the archive
    is used. See the -u option in the man page for details.

Comment 24 Miloš Prchlík 2017-01-28 14:28:28 UTC

Verified with build pcp-3.10.9-9.el6.

Comment 26 errata-xmlrpc 2017-03-21 11:20:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0735.html