Hide Forgot
Description of problem: set sample collection rate to 1s, pmlogger.log shows 1s sample rate, but pmstat -a outputs at a different sampling time. Version-Release number of selected component (if applicable): 3.10.3 (latest yum install available from Red Hat) How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: @ Tue Dec 22 00:10:12 2015 loadavg memory swap io system cpu 1 min swpd free buff cache pi po bi bo in cs us sy id wa st 2.99 0 37621m 352332 5152m ? ? ? ? ? ? ? ? ? ? ? 2.99 0 37620m 352332 5153m 0 0 0 30 109 83 0 0 100 0 0 2.99 0 37619m 352332 5153m 0 0 0 6 115 89 0 0 100 0 0 3.07 0 36781m 352332 5967m 0 0 0 8 985 167 8 3 89 0 0 3.23 0 35422m 352332 7285m 0 0 137 134 1116 91 10 3 87 0 0 3.29 0 34067m 352332 8601m 0 0 0 30 1147 106 9 3 87 0 0 3.34 0 37519m 352336 5155m 0 0 1 1416 1453 118 11 2 87 0 0 3.40 0 37518m 352336 5156m 0 0 0 115 1117 97 13 0 87 0 0 3.36 0 37518m 352336 5156m 0 0 0 32 1111 111 13 0 87 0 0 3.58 0 37508m 352336 5157m 0 0 1 23K 1144 122 13 0 82 5 0 3.77 0 37508m 352344 5158m 0 0 1 63K 1212 3085 12 1 70 17 0 3.95 0 37508m 352352 5158m 0 0 1 63K 1187 596 13 0 70 17 0 4.11 0 37508m 352360 5159m 0 0 3 68K 1174 1324 12 1 72 16 0 4.18 0 37752m 352364 5011m 0 0 3 71K 350 8628 0 1 73 26 0 4.25 0 37751m 352372 5012m 0 0 2 68K 240 179 0 1 85 14 0 4.25 0 37751m 352372 5012m 0 0 1 71K 148 201 0 0 84 16 0 4.52 0 37751m 352384 5012m 0 0 2 69K 160 161 0 0 86 13 0 4.56 0 37751m 352392 5012m 0 0 1 65K 199 1365 0 1 84 15 0 4.59 0 37750m 352408 5013m 0 0 8 58K 194 4014 0 1 78 21 0 4.62 0 37749m 352424 5013m 0 0 4 60K 162 96 0 0 87 12 0 4.65 0 37749m 352428 5014m 0 0 3 63K 209 6753 0 1 76 23 0 @ Tue Dec 22 00:11:57 2015 105s between timestamps, 21 samples = 105/21 = 5s per sample vs 1s expected. pmlogger indicates data is collected at 1s sample rate: : } logged every 1 sec: Expected results: Additional info: Need to align/coordinate with pmiostat -x t -a <same archive>, but cannot with the above output.
To paraphrase, you are seeing high-rate logging at the requested interval in the archive. You are expecting to see an archive-consumer tool like "pmstat -a ARCHIVE" to automatically match the sampling interval. That's an intuitive expectation, but not one that pcp customarily provides. Clients choose their own preferred sampling interval. It can easily be higher or lower temporal resolution than the underlying archive, leading libpcp to interpolate or subsample. The sampling rate within archives is not a fixed quantity either; pmlogger can vary its rates due to pmlc, or the data might have come from an importer with varying time intervals.
Thanks Frank - concur, this is not a bug, it's a feature(!). In general, there is no single 'sampling interval' in a PCP archive - different metrics can be logged at different sampling intervals, and those intervals can be changed dynamically. e.g. you could write a pmie(1) rule that periodically checks for high i/o wait, and when detected, invoke pmlc(1) to increase the sampling rate for assorted disk metrics and also enable logging for assorted per-process metrics (e.g proc.io.{read,write}_bytes) for the next hour. Some time later, the analyst can examine the logs and determine the application/process level activity that correlates with the increased wait time or disk activity. The interpolation, sub-sampling and averaging by PCP clients when replaying archives doesn't at first seem intuitive, but it really is a great feature. Another example: replay an archive with pmiostat -a somearchive -t 1h to see hourly averaged i/o stats. Then zoom in on a particular time of interesting activity with the -S and -t flags, e.g. pmiostat -a somearchive -S 'Wed Dec 23 08:31:20' -t 1m -s 60 will report starting at that date and time for 60 samples of one minute each, i.e. zoom into an "interesting hour". The reverse ctime parser supported by -S and other flags is really handy, especially since you can cut/paste from syslog for the interesting times, etc. See description of all the time related command line arguments in the "TIME WINDOW SPECIFICATION" section of pcpintro(1).
above example should be -S '@Wed Dec 23 08:31:20' Need the literal @ to designate an absolute timestamp as opposed to a time interval or offset. See pcpintro(1)
The issue is we currently are able to easily coordinate measured timestamped data for cpu, memory, io, interrupt, et.al. resources using sysstat and even more easily with collectl. However with PCP, each replay client seems to be doing its own thing, which might individually seem like a good idea, but overly complicates coordinating resource use across the different types of resources. Cases require manipulating just one resource report, but multiples which depends on the timestamp and the measured data (rather than resampled/interpolated) at that timestamp across the multiple resources in order to understand whats going on. The standard "sweet spot" sample period we use for storage is 10s. Anything higher, and too often the data is averaged out over too long of a sample time to be trustworthy of an interpretation. In some cases even going to 1s intervals is insufficient (at which point we switch over to using blktrace). At least a couple times a month I have to resort to using blktrace to validate to the customer that the iostat data being captured/displayed is accurate. Having a tool that resamples/interpolates data will make that much more difficult. "those intervals can be changed dynamically. e.g. you could write a pmie(1) rule that periodically checks for high i/o wait, and when detected, invoke pmlc(1)" again, great for monitoring but makes analysis more difficult because the sampling period is non-constant and shifting. With a steady sampling rate you can see the backlog shift in queue depth on a device that ultimately causes high await time vs just faster samples after the problem has already happened -- i.e. faster samples won't show you the actual historical root cause which was happening prior to the await time increase.
Cases *don't* require manipulating just one resource report...
Ok, so configure the PCP logging config to record all the required metrics with the same interval, notionally 10s. Then use -t 10s with all tools when replaying. FOr consistency, all PCP tools should support the -A and -S flags (see pcpintro(1), so they can all align replay, starting at the same common offset from the start of the log. This *should* suffice, ?
Ah yep thanks Frank - I'd forgotten about that option for pmSetMode(3). Perhaps more conveniently and less intrusively, we could introduce an environment variable for the default pmSetMode setting in archive mode, either PM_MODE_FORW or PM_MODE_INTERP, defaulting to the latter. Otherwise we'd burn another flag (-U) everywhere. I still maintain that once everyone new to PCP gets used to interp mode, it will become preferred - apart from a few very special corner cases :D Cheers
"Otherwise we'd burn another flag (-U) everywhere." IMHO, flags are there to be burned! Burn free, as free as the wind blows, as free as the grass grows, burn free to follow your heart. But seriously, an env. var that acts so much like a command line option should just -be- a command line option. We'd have the same sort of conflict resolution logic between the two of them. (I fully agree about interp mode being a good & useful default.)
Bud has pointed out to me that this BZ has been closed prematurely, given all the followup comments and possible fixes there - re-opening.
Upstream commit (as follows below) fixes this for the most needed case (pmiostat aka pcp-iostat). This will be in next release: pcp-3.11.1-1. Note that customer performance data collected using already-released PCP versions can be replayed with the new version of pmiostat supporting -u (non-interpolating mode), which can be installed on your workstation. So there probably isn't precedent to push this into RHEL with a spec/patch release - can we update this BZ to target fedora/rawhide instead of RHEL6? commit d8fdf340a692d53ec530372c1df17981825351da Author: Mark Goodwin <mgoodwin> Date: Wed Feb 10 13:00:39 2016 +1100 pmiostat: support uninterpolated archive replay with -u flag The -u flag disables interpolation when replaying an archive. Instead the native underlying replay interval in the archive is used. See the -u option in the man page for details.
Verified with build pcp-3.10.9-9.el6.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0735.html