Created attachment 1112686 [details]
repro script - needs the sample PMDA enabled

Description of problem:
libpcp seems to be interpolating counters across mark records in merged
archives. If a counter gets reset to zero between the two archives (e.g.
following a reboot), then most tools replaying the merged archive will report
a negative rate. pmval and pmdumptext actually check if the rate converted
value is negative and report '?', but it seems to me the library should
return PM_ERR_VALUE since the interpolated result is unlikely to be correct.

Version-Release number of selected component (if applicable):
pcp-3.11

How reproducible:
easily, see attached repro script

Steps to Reproduce:
1. create an archive containing a counter metric
2. reset the counter metric to zero (e.g. sample.byte_ctr)
3. create another archive containing the same counter metric
4. merge the two archives
5. use a client to report the interpolated rate converted value across the
   time of the resulting mark record

Actual results:
negative rate converted value

Expected results:
either every tool should check for a counter going backwards (this is
different to a counter wrap), or the library should return PM_ERR_VALUE
since the interpolated counter value is likely to be bogus.

Additional info:
a common support scenario is to merge all archives on a customer system,
then replay with a large sampling interval, e.g. pmiostat -t 4h on a merged
archive spanning a week or more to see when the problems are occurring. On
most such merged archives, I'm seeing many negative values.
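To make the failure mode concrete, here is a minimal, self-contained sketch (plain Python, not libpcp or PMAPI code) of naive rate conversion over counter samples, with values loosely modelled on the repro output below. When the counter is reset between archives and the discontinuity is not detected, the delta goes negative:

```python
# Illustrative simulation only: naive rate conversion over counter
# samples, showing how a counter reset across a reboot / <mark>
# produces a negative rate if the discontinuity is not detected.

def rates(samples):
    """samples: list of (timestamp_seconds, counter_value) pairs.
    Returns the naive per-interval rates, with no reset handling."""
    out = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        out.append((v1 - v0) / (t1 - t0))
    return out

# counter climbs in the first archive, then resets to a small value
samples = [(0, 14067), (10, 16398), (20, 534), (30, 5400)]
print(rates(samples))   # the second rate is negative: bogus
```

The values and timestamps here are made up for illustration; the point is only that a blind `(v1 - v0) / (t1 - t0)` across the reset is what the tools end up reporting.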
Example output from the repro.sh script

$ ./repro.sh
first archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
  commencing Fri Jan  8 13:00:18.729 2016
  ending     Fri Jan  8 13:00:22.749 2016

second archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
  commencing Fri Jan  8 13:00:23.853 2016
  ending     Fri Jan  8 13:00:52.873 2016

merged archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
  commencing Fri Jan  8 13:00:18.729 2016
  ending     Fri Jan  8 13:00:52.873 2016

raw (uninterpolated) values in merged archive :

metric:    sample.byte_ctr
archive:   third
host:      kilcunda
start:     Fri Jan  8 13:00:18 2016
end:       Fri Jan  8 13:00:52 2016
semantics: cumulative counter
units:     byte
samples:   15
13:00:18.749   14067
13:00:19.749   14603
13:00:20.749   15163
13:00:21.749   15616
13:00:22.749   16398
13:00:22.750   Archive logging suspended
13:00:23.873       0
13:00:24.873     534
13:00:25.873    1085
13:00:26.873    1567
13:00:27.873    1909
13:00:28.873    2089
13:00:29.873    2290
13:00:30.873    2879
13:00:31.873    3291

mark record is here:
13:00:22.750  <mark>

RAW values across mark record :
  s.byte_ctr
        byte
13:00:20       15152
13:00:30        2794
13:00:40        7755
13:00:50       12635

RATE converted values across mark record :
  s.byte_ctr
      byte/s
13:00:20         N/A
13:00:30   -1235.800
13:00:40     496.100
13:00:50     488.000
> or the library should return PM_ERR_VALUE since the interpolated counter
> value is likely to be bogus.

Paraphrasing, you would like to see this happen:

  pmSetMode(PM_MODE_INTERP ...)
  # time = 13:00:20
  pmFetch(counter_pmid) -> 15152
  # time = 13:00:30 , includes <MARK> this time interval
  pmFetch(counter_pmid) -> numvals=0 (missing value), instead of 2794
  # time = 13:00:40
  pmFetch(counter_pmid) -> 7755
  # time = 13:00:50
  pmFetch(counter_pmid) -> 12635

Then rate-conversion logic in pmrep/etc. running for the 13:00:30-13:00:40
time interval would result in what fetchgroups would represent as a
PM_ERR_AGAIN. (Likewise for the 13:00:20-13:00:30 time interval.)

So libpcp would have to figure out that it's interpolating a PM_SEM_COUNTER
metric across a MARK, and reject it. (I'm surprised it's not already doing
that.)
The behavior Mark is seeing is just plain WRONG.

At some point we (probably I) have introduced a regression in interp.c ...
most likely in the performance optimization changes.

You should not need any app changes; the correct response when interpolating
from 13:00:20 to 13:00:30 in the example above is "No values available".

Thanks for repro.sh, I've reproduced the problem locally and will start to
triage ... I expect this will be ugly!
(In reply to Ken McDonell from comment #2)
> The behavior Mark is seeing is just plain WRONG.

+1 -- and agreed re relatively recent regression, ISTR this working not too
long ago.

> Thanks for repro.sh, I've reproduced the problem locally and will start to
> triage ...

Thanks Ken.
I suspect this is NOT a regression in interp.c (there is already a bunch of
QA tests for interpolation across <mark> records, so I'd have expected those
to catch one). But it is a problem associated with the use of the -O2sec
time window on this archive ...

In my archive the <mark> is at 07:13:21.705

pmrep gets the WRONG answers with -O2sec

$ pmrep -O2s -t 10s -pf%H:%M:%S -a third sample.byte_ctr
  s.byte_ctr
      byte/s
07:13:19         N/A
07:13:29     -48.900   <--- sample after <mark>
07:13:39     494.100
07:13:49     488.700

and pmrep gets the RIGHT answers without -O

$ pmrep -t 10s -pf%H:%M:%S -a third sample.byte_ctr
  s.byte_ctr
      byte/s
07:13:17         N/A
07:13:27         N/A   <--- sample after <mark>
07:13:37     457.400
07:13:47     550.500

I'll post as I discover more.
This is pretty subtle, but I don't think there is an error in interpolation
here. We may have a (different? bigger?) problem associated with rate
conversion.

Consider this sequence of values in an archive:

    time   value
    t0     n/a
    t1     v1
    t2     v2
    t3     <mark>
    t4     v4
    t5     v5

Now consider pmFetch requests at time t for the following cases:

    t0 <= t < t1   no value available
    t1 <= t < t2   value based on v1 and v2
    t2 <= t < t4   no value available
    t4 <= t < t5   value based on v4 and v5

This is both correct and what the current implementation does. It is also
the same behaviour that libpcp has always exhibited, so there is no
regression.

The problem would be most evident in cases where the <mark> record is
associated with a pmcd restart, rather than the daily cron archive rotation,
and the reporting interval from the client is long in relation to the
pmlogger sampling intervals.

Now consider an application that is doing rate conversion on the values for
consecutive samples from the example above. There is no problem when
consecutive samples are both in either of the regions t0 <= t < t4 or
t4 <= t < t5. The problem comes when the consecutive samples come from
t1 <= t < t2 AND t4 <= t < t5 ... in this case both values are well defined
from an interpolation perspective, but the resultant rate calculation is
bogus.

A solution here is not within the realm of pmFetch() (and hence nothing to
do with interp.c) as there are two pmFetch() calls involved, both of which
return valid and well-defined values.

Seems like we're going to need some additional functionality here ...
perhaps a real pmlogreduce that converts these counter metrics to rates and
could accommodate <mark> records and "minimal interval coverage" to decide
if there is a sensible representative value for an interval, and/or a new
API function to allow PMAPI clients using archive contexts to determine if
the archive data is continuous from time tx to time ty.

Not sure what to do with this bug now.
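The t0..t5 analysis above can be sketched as a small simulation (plain Python, not interp.c code; the record times and values are invented). Interpolation is well defined inside [t1,t2) and [t4,t5) and refused elsewhere, yet a rate computed from one sample in each region is still bogus:

```python
# Simulation of the t0..t5 example: n/a and <mark> records are both
# modelled as value None, so interpolation only succeeds between two
# adjacent records that both carry real values.

def interp(records, t):
    """Linear interpolation at time t; None where no value is available."""
    for (ta, va), (tb, vb) in zip(records, records[1:]):
        if ta <= t < tb:
            if va is None or vb is None:
                return None        # interval touches n/a or a <mark>
            return va + (vb - va) * (t - ta) / (tb - ta)
    return None

#           t0         t1        t2         t3 = <mark>  t4        t5
records = [(0, None), (10, 100), (20, 300), (30, None), (40, 50), (50, 250)]

assert interp(records, 5) is None    # t0 <= t < t1: no value
assert interp(records, 15) == 200    # t1 <= t < t2: from v1 and v2
assert interp(records, 25) is None   # t2 <= t < t4: no value
assert interp(records, 45) == 150    # t4 <= t < t5: from v4 and v5

# Both fetches below are individually well defined, but the rate spans
# the <mark>, so the result is meaningless (the counter was reset):
v_a, v_b = interp(records, 15), interp(records, 45)
print((v_b - v_a) / (45 - 15))       # a negative "rate"
```

This is exactly why the fix cannot live in a single pmFetch(): each fetch in isolation returns a perfectly valid value.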
Thanks for your analysis!

One possibility for a solution would be reinterpreting the responsibility of
interp.c: not simply to deliver a value between two samples in an archive
(ie., samples at time=t{1,2,3,4,5}), but to also consider the times of the
pmFetch time-cursor values t=t{A,B}. interp.c could recognize that between
the previous pmFetch and the current one (tA and tB), there occurred a
<mark>, and refuse to provide a result:

    t1  value
            tA (fetch)
    t2  value
    t3  <mark>
    t4  value
            tB (fetch)
    t5  value

... so even though tA and tB could be defined in isolation (as a function of
t{1,2} and t{4,5} respectively), interp.c could reject the tB result because
of the <mark> between tA and tB. (Maybe it could do this for SEM_COUNTER
only.)

This would cost applications a lost interp-mode sample across mark
boundaries, but that's all. pmrep etc. would just work (tm).

This reading of PM_MODE_INTERP doesn't seem incompatible with pmSetMode's
documentation, or the programming guide. The former just says it computes
values "in the proximity of the time origin", not just the two adjacent
ones. This could be a good time to document <mark> in pmSetMode's man page.
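The fetch-interval-aware rule proposed above can be sketched as follows (again plain Python, not interp.c; record times, mark times and the function name are all invented for illustration). Interpolation at t_now is refused when a <mark> lies between the previous fetch time and t_now, even though a value could be computed from the adjacent archive records:

```python
# Sketch of the proposed rule: reject an otherwise-computable
# interpolated value when a <mark> falls between consecutive fetches.

def interp_fetch_aware(records, marks, t_prev, t_now):
    """records: (time, value) pairs; marks: <mark> timestamps."""
    if any(t_prev < m <= t_now for m in marks):
        return None                # <mark> between fetches: refuse
    for (ta, va), (tb, vb) in zip(records, records[1:]):
        if ta <= t_now < tb:
            return va + (vb - va) * (t_now - ta) / (tb - ta)
    return None

records = [(10, 100), (20, 300), (40, 50), (50, 250)]
marks = [30]                       # the <mark> record's timestamp

assert interp_fetch_aware(records, marks, 5, 15) == 200    # no <mark> spanned
assert interp_fetch_aware(records, marks, 15, 45) is None  # spans the <mark>
assert interp_fetch_aware(records, marks, 35, 45) == 150   # after the <mark>
```

The cost is exactly as described: one lost interp-mode sample per mark boundary, after which values flow again.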
> This reading of PM_MODE_INTERP doesn't seem incompatible with pmSetMode's
> documentation, or the programming guide.

Thinking about it more, it's even more appropriate in a way. The whole idea
of INTERP mode is that the client doesn't want to know the ebb and flow of
the actual underlying data: it wants to pretend as if only its pmFetch()
timing mattered. So one could argue that interp.c would be more correct to
analyze by pmFetch time intervals rather than archive records' time
intervals.

(This could go beyond the <mark> analysis we're talking about now; it could
relate to taking interpolation/averaging over multiple samples in the given
time window, not just the immediately-adjacent ones.)
Frank,

What you're proposing is a very important change to the design assumptions
that underpin pmFetch() and indeed the broader PCP architecture. Namely:

- archives and live sources should be semantically as close as we can make
  them
- one client can fetch different metrics at different points in time
- pmFetch() delivers a snapshot of metric values at a single point in time
- as a general rule temporal averaging (and rate conversion is one example
  of this) should be a client-side function (to reduce complexity, state and
  semantic uncertainty at pmcd)

I think the point of interp mode is to provide values at a point in time
that might be different to the sample times in the archive, which is subtly
different to your suggestion of "the client doesn't want to know the ebb and
flow of the actual underlying data".

Critically, interp.c has no visibility into the history of pmFetch() calls;
it operates on the state of the archive and the point in time for the
current pmFetch().

So, I'm loath to push functionality into interp.c when I believe that
functionality is better done in the PMAPI client.

Note that all of this discussion is moot for non-counter metrics, so any
interp.c changes would have to work for a mixture of counter and non-counter
metrics in a pmFetch() and possibly different metric-instances appearing in
consecutive pmFetch() requests.

I can't see any way to overload the data continuity status in the result
from the current pmFetch(). I think I can implement a pmCheckContinuity
(suggestions for a better name would be appreciated) that takes two
timestamps (as returned by pmFetch) and returns a true/false value depending
on the presence of a <mark> record in the timestamped interval ... and I
think I can do this in a way that involves no extra archive scanning for the
common use case. Then PMAPI clients processing archives that want to do rate
conversion on counters and care could call pmCheckContinuity() after each
pmFetch() and take appropriate action.
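The client-side pattern being proposed would look roughly like this (a hedged sketch: pmCheckContinuity is only a proposed name at this point, not an existing PMAPI call, and this is a plain-Python stand-in for it, with invented timestamps and values):

```python
# Sketch of the proposed client-side pattern: after each fetch, check
# whether a <mark> lies between the previous and current result
# timestamps before rate-converting a counter.

def continuous(mark_times, t_prev, t_now):
    """Stand-in for the proposed pmCheckContinuity(): True when no
    <mark> record lies in (t_prev, t_now]."""
    return not any(t_prev < m <= t_now for m in mark_times)

marks = [30.0]                     # <mark> timestamps from the archive
prev_t, prev_v = 15.0, 200.0       # previous fetch: timestamp, counter
now_t, now_v = 45.0, 150.0         # current fetch: timestamp, counter

if continuous(marks, prev_t, now_t):
    print((now_v - prev_v) / (now_t - prev_t))
else:
    print("?")                     # no sensible rate across the <mark>
```

This keeps rate conversion (and the decision about what to report) entirely in the client, consistent with the design assumptions listed above.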
This is obviously tricky; I just wish we had a better forum than BZ (like a
workshop) to resolve issues like this.
> What you're proposing is a very important change to the design assumptions
> that underpin pmFetch() and indeed the broader PCP architecture. Namely:

Yeah.

> - archives and live sources should be semantically as close as we can make
>   them
> - one client can fetch different metrics at different points in time
> - pmFetch() delivers a snapshot of metric values at a single point in time
> - as a general rule temporal averaging (and rate conversion is one example
>   of this) should be a client-side function (to reduce complexity, state
>   and semantic uncertainty at pmcd)

Actually I think each of these properties would be preserved in the
fetch-timing-aware interpolation model.

> I think the point of interp mode is to provide values at a point in time
> that might be different to the sample times in the archive, which is subtly
> different to your suggestion of "the client doesn't want to know the ebb
> and flow of the actual underlying data".

Yeah, subtly different, but probably closer to the user/programmer
intuition.

> Critically, interp.c has no visibility into the history of pmFetch() calls;
> it operates on the state of the archive and the point in time for the
> current pmFetch().

Sure, but that is an implementation detail within libpcp. interp.c could
have access to the current context's pmSetMode time delta; it wouldn't need
to keep an actual history of prior pmFetches or their pmResult timestamps.

> So, I'm loath to push functionality into interp.c when I believe that
> functionality is better done in the PMAPI client.

That's a reasonable alternative, if changing pmFetch itself is unpalatable.

> Note that all of this discussion is moot for non-counter metrics

Why so? For a SEM_INSTANT value interpolated between time [x,x+dt), it would
be reasonable to inspect all actual values in or bounding that interval, in
case e.g. the value immediately before x+dt was a random outlier.
A weighted or windowed sum over all the values could be quite a defensible
calculation too.

> [...] Then PMAPI clients processing archives that want to do rate
> conversion on counters and care could call pmCheckContinuity()
> after each pmFetch() and take appropriate action. [...]

That would be workable too, though it would mandate changes to all existing
clients, or at least all those that do rate conversion.
(In reply to Ken McDonell from comment #8)
[..]
> I think I can implement a pmCheckContinuity (suggestions for a better name
> would be appreciated) that takes two timestamps (as returned by pmFetch)
> and returns a true/false value depending on the presence of a <mark>
> record in the timestamped interval ... and I think I can do this in a way
> that involves no extra archive scanning for the common use case. Then
> PMAPI clients processing archives that want to do rate conversion on
> counters and care could call pmCheckContinuity() after each pmFetch() and
> take appropriate action.

Hi Ken

If pmCheckContinuity() is feasible, then couldn't pmFetchArchive() do the
same for the current timestamp plus (or minus) the current pmSetMode delta,
and thus return PM_ERR_VALUE when an interpolated fetch spans a <mark>?

I have an archive that I'm about to commit to qa/archives that has a mark
record in it, and all the tools seem to be able to replay it without getting
any PM_ERR_VALUE errors for a fetch that spans the mark timestamp. Both
pmval and pmdumptext check for counters going backward (and more recently
pmiostat does too) and report '?', but pmrep just faithfully reports a
negative rate.

Cheers -- Mark
> Both pmval and pmdumptext check for counters going backward (and more
> recently pmiostat does too) and report '?', but pmrep just faithfully
> reports a negative rate.

Negative rates are an orthogonal issue. If cross-<mark>-interpolated counter
values happened to be larger than before, you'd see a positive rate, but it
would be a false positive. The correct answer would still be "no answer".
(In reply to Frank Ch. Eigler from comment #11)
> > Both pmval and pmdumptext check for counters going backward (and more
> > recently pmiostat does too) and report '?', but pmrep just faithfully
> > reports a negative rate.
>
> Negative rates are an orthogonal issue. If cross-<mark>-interpolated
> counter values happened to be larger than before, you'd see a positive
> rate, but it would be a false positive. The correct answer would
> still be "no answer".

Yes, I know - negative rates are symptomatic of cases where the correct
answer should have been no answer. The most common symptom is a counter
going backwards (and not due to a wrap).

Cheers
(In reply to Frank Ch. Eigler from comment #11)
> ... If cross-<mark>-interpolated
> counter values happened to be larger than before, you'd see a positive
> rate, but it would be a false positive. The correct answer would
> still be "no answer".

I should point out this ^^ is the scenario that scares me - if a counter
didn't go backwards across a mark, then the reported rate is likely to be
bogus, but currently undetectable. No answer is the only right answer here :)
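A quick arithmetic illustration of this false-positive scenario (invented numbers, loosely based on the sample.byte_ctr repro): if the counter climbs past its pre-reset value before the next sample, the rate across the <mark> comes out positive and plausible-looking, yet is still meaningless:

```python
# The counter was reset at the <mark>, but by the next sample it has
# already exceeded its pre-reset value, so the naive rate is positive.

before = (100.0, 16398)    # (time, value) just before the <mark>/reset
after = (400.0, 90000)     # well after: counter already exceeds 16398

rate = (after[1] - before[1]) / (after[0] - before[0])
print(rate)                # positive and plausible-looking, but bogus
```

No backwards-counter check catches this case, which is why "no answer" is the only safe result across a <mark>.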
I changed my mind, and agree with Mark and Frank. Fixed in commit 9ab96e1 in my tree.
Fixed in pcp-3.11.0-1
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle. Changing version to '24'. More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase
As Mark said in #c15 - fixed in pcp-3.11.0-1