Bug 1296750 - incorrect interpolation across <mark> record in a merged archive
Summary: incorrect interpolation across <mark> record in a merged archive
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: pcp
Version: 24
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
Assignee: Nathan Scott
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-01-08 02:11 UTC by Mark Goodwin
Modified: 2016-03-15 00:57 UTC (History)
8 users (show)

Fixed In Version: pcp-3.11.0-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-15 00:57:18 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
repro script - needs the sample PMDA enabled (838 bytes, application/x-shellscript)
2016-01-08 02:11 UTC, Mark Goodwin
no flags Details

Description Mark Goodwin 2016-01-08 02:11:42 UTC
Created attachment 1112686 [details]
repro script - needs the sample PMDA enabled

Description of problem: libpcp seems to be interpolating counters across mark records in merged archives. If a counter gets reset to zero between the two archives (e.g. following a reboot), then most tools replaying the merged archive will report a negative rate.

pmval and pmdumptext actually check if the rate converted value is negative and report '?', but seems to me the library should return PM_ERR_VALUE since the interpolated result is unlikely to be correct.

Version-Release number of selected component (if applicable): pcp-3.11

How reproducible: easily, see attached repro script

Steps to Reproduce:
1. create an archive containing a counter metric
2. reset the counter metric to zero (e.g. sample.byte_ctr)
3. create another archive containing the same counter metric
4. merge the two archives
5. use a client to report the interpolated rate converted value across the time of the resulting mark record

Actual results: negative rate converted value

Expected results: either every tool should check for a counter going backwards (this is different to a counter wrap), or the library should return PM_ERR_VALUE since the interpolated counter value is likely to be bogus.

Additional info: a common support scenario is to merge all archives on a customer system, then replay with a large sampling interval, e.g. pmiostat -t 4h on a merged archive spanning a week or more to see when the problems are occurring. On most such merged archives, I'm seeing many negative values.

Example output from the repro.sh script

$ ./repro.sh 
first archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
  commencing Fri Jan  8 13:00:18.729 2016
  ending     Fri Jan  8 13:00:22.749 2016

second archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
  commencing Fri Jan  8 13:00:23.853 2016
  ending     Fri Jan  8 13:00:52.873 2016

merged archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
  commencing Fri Jan  8 13:00:18.729 2016
  ending     Fri Jan  8 13:00:52.873 2016

raw (uninterpolated) values in merged archive :

metric:    sample.byte_ctr
archive:   third
host:      kilcunda
start:     Fri Jan  8 13:00:18 2016
end:       Fri Jan  8 13:00:52 2016
semantics: cumulative counter
units:     byte
samples:   15
13:00:18.749      14067
13:00:19.749      14603
13:00:20.749      15163
13:00:21.749      15616
13:00:22.749      16398
13:00:22.750  Archive logging suspended
13:00:23.873          0
13:00:24.873        534
13:00:25.873       1085
13:00:26.873       1567
13:00:27.873       1909
13:00:28.873       2089
13:00:29.873       2290
13:00:30.873       2879
13:00:31.873       3291

mark record is here:
13:00:22.750  <mark>

RAW values across mark record :
          s.byte_ctr
                byte
13:00:20       15152
13:00:30        2794
13:00:40        7755
13:00:50       12635

RATE converted values across mark record :
          s.byte_ctr
              byte/s
13:00:20         N/A
13:00:30   -1235.800
13:00:40     496.100
13:00:50     488.000

Comment 1 Frank Ch. Eigler 2016-01-08 19:32:44 UTC
> or the library should return PM_ERR_VALUE since the interpolated counter
> value is likely to be bogus.

Paraphrasing, you would like to see this happen: ?

pmSetMode(PM_MODE_INTERP ...)
# time = 13:00:20
pmFetch(counter_pmid) -> 15152
# time = 13:00:30 , includes <MARK> this time interval
pmFetch(counter_pmid) -> numvals=0 (missing value), instead of 2794
# time = 13:00:40
pmFetch(counter_pmid) -> 7755
# time = 13:00:50
pmFetch(counter_pmid) -> 12635

Then rate-conversion logic in pmrep/etc. running for the
13:00:30-13:00:40 time interval would result in what
fetchgroups would be represent as a PM_ERR_AGAIN.
(Likewise for the 13:00:20-13:00:30 time interval.)

So libpcp would have to figure out that it's interpolating a
PM_SEM_COUNTER metric across a MARK, and reject it.  (I'm
surprised it's not already doing that.)

Comment 2 Ken McDonell 2016-01-08 20:19:58 UTC
The behavior Mark is seeing is just plain WRONG.

At some point we (probably I) have introduced a regression in interp.c ... most likely in the performance optimization changes.

You should not need any app changes, the correct response when interpolating from 13:00:20 to 13:00:00 in the example above is "No values available".

Thanks for repro.sh, I've reproduced the problem locally and will start to triage ... I expect this will be ugly!

Comment 3 Nathan Scott 2016-01-10 23:07:05 UTC
(In reply to Ken McDonell from comment #2)
> The behavior Mark is seeing is just plain WRONG.

+1  -- and agreed re relatively recent regression, ISTR this working not too long ago.

> Thanks for repro.sh, I've reproduced the problem locally and will start to
> triage ... 

Thanks Ken.

Comment 4 Ken McDonell 2016-01-11 20:06:30 UTC
I'm suspect this is NOT a regression in interp.c (there is already a bunch of QA tests for interpolation across <mark> records, so that's sort of what I'd have expected).

But it is a problem associated with the use of the -O2sec time window on this archive ...

In my archive the <mark> is at 07:13:21.705

pmrep gets the WRONG answers with -O2sec

$ pmrep -O2s -t 10s -pf%H:%M:%S -a third sample.byte_ctr
          s.byte_ctr
              byte/s
07:13:19         N/A
07:13:29     -48.900                <--- sample after <mark>
07:13:39     494.100
07:13:49     488.700

and pmrep gets the RIGHT answers without -O  

$ pmrep -t 10s -pf%H:%M:%S -a third sample.byte_ctr
          s.byte_ctr
              byte/s
07:13:17         N/A
07:13:27         N/A                 <--- sample after <mark>
07:13:37     457.400
07:13:47     550.500

I'll post as I discover more.

Comment 5 Ken McDonell 2016-01-14 05:52:24 UTC
This is pretty subtle, but I don't think there is an error in interpolation here.  We may have a (different? bigger?) problem associated with rate conversion.

Consider this sequence of values in an archive

time value
t0   n/a
t1   v1
t2   v2
t3   <mark>
t4   v4
t5   v5

Now consider pmFetch requests at time t for the following cases:

t0 <= t < t1    no value available
t1 <= t < t2    value based on v1 and v2
t2 <= t < t4    no value available
t4 <= t < t5    value based on v4 and v5

This is both correct and what the current implementation does.  It is also the same behaviour that libpcp has always exhibited, so there is no regression.

The problem would be most evident in cases where the <mark> record is associated with a pmcd restart, rather than the daily cron archive rotation, and the reporting interval from the client is long in relation to the pmlogger sampling intervals.

Now consider an application that is doing rate conversion on the values for consecutive samples from the example above.  There is no problem when consecutive samples are both in either of the regions t0 <= t <t4 or t4 <= t < t5.  The problem comes when the consecutive samples come from t1 <= t < t2 AND t4 <= t < t5 ... in this case both values are well defined from an interpolation perspective, but the resultant rate calculation is bogus.

A solution here is not within the realm of pmfetch() (and hence nothing to do with interp.c) as there are two pmFetch() calls involved, both of which return valid and well-defined values.

Seems like we're going to need some additional functionality here ... perhaps a real pmlogreduce that converts these counter metrics to rates and could accommodate <mark> records and "minimal interval coverage" to decide if there is a sensible representative value for an interval, and/or a new API function to allow PMAPI clients using archive contexts to determine if the archive data is continuous from time tx to time ty.

Not sure what to do with this bug now.

Comment 6 Frank Ch. Eigler 2016-01-14 17:03:13 UTC
Thanks for your analysis!

One possibility for solution would be reinterpreting the responsibility
of interp.c, not simply to deliver a value between two samples in an
archive (ie., samples at time=t{1,2,3,4,5}), but to also consider the
times of the pmFetch time-cursor values t=t{A,B,C,D}.

interp.c could recognize that between the previous pmFetch and the
current one (tA and tB), there occurred a <mark>, and refuse to provide
a result:

t1  value
tA           (fetch)
t2  value
t3  <mark>
t4  value
tB           (fetch)
t5  value

... so even though tA and tB could be defined in isolation (as a function
of t{1,2} and t{4,5} respectively), interp.c could reject it because of 
the <mark> between tA and tB.  (Maybe it could do this for SEM_COUNTER
only.)

This would cost applications a lost interp-mode sample across mark
boundaries, but that's all.  pmrep etc. would just work (tm).

This reading of PM_MODE_INTERP doesn't seem incompatible with pmSetMode's
documentation, or the programming guide.  The first just says it computes
"values [...] in the "proximity of the time origin", not just the two
adjacent ones.  This could be a good time to document <mark> in pmSetMode's
man page.

Comment 7 Frank Ch. Eigler 2016-01-14 19:19:50 UTC
> This reading of PM_MODE_INTERP doesn't seem incompatible with pmSetMode's
> documentation, or the programming guide.

Thinking about it more, it's even more appropriate in a way.  The whole
idea of INTERP mode is that the client doesn't want to know the ebb and
flow of the actual underlying data: it wants to pretend as if only its
pmFetch() timing mattered.  So one could argue that interp.c would be
more correct to analyze by pmFetch time intervals rather than archive
records' time intervals.

(This could go beyond the <mark> analysis we're talking about now; it
could relate to taking interpolation/averaging over multiple samples
in the given time window, not just the immediately-adjacent ones.)

Comment 8 Ken McDonell 2016-01-15 01:33:08 UTC
Frank,

What you're proposing is a very important change to the design assumptions that underpin pmFetch() and indeed the broader PCP architecture.  Namely:

- archives and live sources should be semantically as close as we can make them
- one client can fetch different metrics at different points in time
- pmFetch() delivers a snapshot of metric values at a single point in time
- as a general rule temporal averaging (and rate conversion is one example of this) should be a client-side function (to reduce complexity, state and semantic uncertainty at pmcd)

I think the point of interp mode is to provide values at a point in time that might be different to the sample times in the archive, which is subtly different to your suggestion of "the client doesn't want to know the ebb and flow of the actual underlying data".

Critically, interp.c has no visibility to the history of pmFetch() calls, it operates on the state of the archive and the point in time for the current pmFetch().

So, I'm loathe to push functionality into interp.c when I believe that functionality is better done in the PMAPI client.

Note that all of this discussion is moot for non-counter metrics, so any interp.c changes would have to work for a mixture of counter and non-counter metrics in a pmFetch() and possibly different metric-instances appearing in consecutive pmFetch() requests.

I can't see any way to overload the data continuity status in the result from the current pmFetch().

I think I can implement a pmCheckContinuity (suggestions for a better name would be appreciated) that takes two timestamps (as returned by pmFetch) and returns a true/false value depending on the presence of a <mark> record in the timestamped interval ... and I think I can do this in a way that involves no extra archive scanning for the common use case.  Then PMAPI clients processing archives that want to do rate conversion on counters and care could call pmCheckContinuity() after each pmFetch() and take appropriate action.

This is obviously tricky, I just wish we had a better forum than BZ (like a workshop) to resolve issues like this.

Comment 9 Frank Ch. Eigler 2016-01-15 02:34:17 UTC
> What you're proposing is a very important change to the design assumptions
> that underpin pmFetch() and indeed the broader PCP architecture.  Namely:

Yeah.

> - archives and live sources should be semantically as close as we can make
> them
> - one client can fetch different metrics at different points in time
> - pmFetch() delivers a snapshot of metric values at a single point in time
> - as a general rule temporal averaging (and rate conversion is one example
> of this) should be a client-side function (to reduce complexity, state and
> semantic uncertainty at pmcd)

Actually I think each of these properties would be preserved in the
fetch-timing-aware interpolation model.


> I think the point of interp mode is to provide values at a point in time
> that might be different to the sample times in the archive, which is subtly
> different to your suggestion of "the client doesn't want to know the ebb and
> flow of the actual underlying data".

Yeah, subtly different, but probably closer to the user/programmer
intuition.


> Critically, interp.c has no visibility to the history of pmFetch() calls, it
> operates on the state of the archive and the point in time for the current
> pmFetch().

Sure, but that is an implementation detail within libpcp.  interp.c
could have access to the current context's pmSetMode time delta;
it wouldn't need to keep an actual history of prior pmFetches or their
pmResult timestamps.


> So, I'm loathe to push functionality into interp.c when I believe that
> functionality is better done in the PMAPI client.

That's a reasonable alternative, if changing pmFetch itself is
unpalatable.


> Note that all of this discussion is moot for non-counter metrics

Why so?  For an SEM_INSTANT value interpolated between time [x,x+dt),
it would be reasonable to inspect all actual values in or bounding that
interval, in case e.g. the value immediately before x+dt was a random
outlier.  A weighted or windowed sum over all the values could be
quite a defensible calculation too.


> [...] Then PMAPI clients processing archives that want to do rate
> conversion on counters and care could call pmCheckContinuity()a
> after each pmFetch() and take appropriate action. [...]

That would be workable too, though would mandate changes to all
existing clients, or at least all those that do rate conversion.

Comment 10 Mark Goodwin 2016-01-15 08:32:26 UTC
(In reply to Ken McDonell from comment #8)
[..]
> I think I can implement a pmCheckContinuity (suggestions for a better name
> would be appreciated) that takes two timestamps (as returned by pmFetch) and
> returns a true/false value depending on the presence of a <mark> record in
> the timestamped interval ... and I think I can do this in a way that
> involves no extra archive scanning for the common use case.  Then PMAPI
> clients processing archives that want to do rate conversion on counters and
> care could call pmCheckContinuity() after each pmFetch() and take
> appropriate action.

Hi Ken

If pmCheckContinuity() is feasible, then couldn't pmFetchArchive() do the same for the current timestamp plus (or minus) the current pmSetMode delta? and thus return PM_ERR_VALUE when an interpolated fetch spans a <mark>?

I have an archive that I'm about to commit to qa/archives that has a mark record in it and all the tools seem to be able to replay it without getting any PM_ERR_VALUE errors for a fetch that spans the mark timestamp. Both pmval and pmdumptext check for counters going backward (and more recently pmiostat does too) and report '?', but pmrep just faithfully reports a negative rate.

Cheers
-- Mark

Comment 11 Frank Ch. Eigler 2016-01-15 14:50:55 UTC
> Both pmval and pmdumptext check for counters going backward (and more recently
> pmiostat does too) and report '?', but pmrep just faithfully reports a
> negative rate.

Negative rates are an orthogonal issue.  If cross-<mark>-interpolated
counter values happened to be larger than before, you'd see a positive
rate, but it would be a false positive.  The correct answer would
still be "no answer".

Comment 12 Mark Goodwin 2016-01-15 23:16:15 UTC
(In reply to Frank Ch. Eigler from comment #11)
> > Both pmval and pmdumptext check for counters going backward (and more recently
> > pmiostat does too) and report '?', but pmrep just faithfully reports a
> > negative rate.
> 
> Negative rates are an orthogonal issue.  If cross-<mark>-interpolated
> counter values happened to be larger than before, you'd see a positive
> rate, but it would be a false positive.  The correct answer would
> still be "no answer".

yes I know - negative rates are symptomatic of when the correct answer
should have been no answer. The most common symptom is a counter going
backwards (and not due to a wrap).

Cheers

Comment 13 Mark Goodwin 2016-01-16 00:18:46 UTC
(In reply to Frank Ch. Eigler from comment #11)
>... If cross-<mark>-interpolated
> counter values happened to be larger than before, you'd see a positive
> rate, but it would be a false positive.  The correct answer would
> still be "no answer".

I should point out this ^^ is the scenario that scares me - if a counter
didn't go backwards across a mark, then the reported rate is likely to be
bogus, but currently undetectable. No answer is the only right answer here :)

Comment 14 Ken McDonell 2016-01-19 03:40:45 UTC
I changed my mind, and agree with Mark and Frank.

Fixed in commit 9ab96e1 in my tree.

Comment 15 Mark Goodwin 2016-02-02 23:45:21 UTC
Fixed in pcp-3.11.0-1

Comment 16 Jan Kurik 2016-02-24 15:46:52 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle.
Changing version to '24'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase

Comment 17 Nathan Scott 2016-03-15 00:57:18 UTC
As Mark said in #c15 - fixed in pcp-3.11.0-1


Note You need to log in before you can comment on or make changes to this bug.