This service will be undergoing maintenance at 00:00 UTC, 2017-10-23 It is expected to last about 30 minutes
Bug 1325037 - archive PM_MODE_FORW fetches not scanning to first occurrence of given metric
archive PM_MODE_FORW fetches not scanning to first occurrence of given metric
Status: NEW
Product: Fedora
Classification: Fedora
Component: pcp (Show other bugs)
25
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: pcp-maint
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-04-07 22:17 EDT by Frank Ch. Eigler
Modified: 2017-08-23 11:13 EDT (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
pmlogreduce'd archive tarball (516.99 KB, application/octet-stream)
2016-04-07 22:17 EDT, Frank Ch. Eigler
no flags Details
archive file (pmlogger against a container) (3.46 KB, application/x-gzip)
2016-04-08 07:10 EDT, Frank Ch. Eigler
no flags Details

  None (edit)
Description Frank Ch. Eigler 2016-04-07 22:17:47 EDT
Created attachment 1144938 [details]
pmlogreduce'd archive tarball

As I read the pmFetch / pmSetMode APIs, if an archive contains a given metric, but not in the first few of its recorded result-sets, a PM_MODE_FORW fetch should skip over as many as required until the given metric is found with a value.  That doesn't seem to be happening with archives that are the output of pmlogreduce.

With given archive tarball:

% pmval -U reduced-20151231.000006.meta hinv.ncpu -s 1
host:      easy.elastic.org
start:     Wed Dec 23 19:00:11 2015
end:       Wed Dec 30 18:50:11 2015
semantics: discrete instantaneous value
units:     none
samples:   1
19:00:11.533  No values available

even though % pmdumplog .... hinv.ncpu clearly shows a value at 19:10:11.533.

A longer % pmval run (without the -s 1) does show the hinv.ncpu value later.
Comment 1 Frank Ch. Eigler 2016-04-08 07:10 EDT
Created attachment 1145087 [details]
archive file (pmlogger against a container)

Possibly related; if a metric is completely absent, sometimes junk pmResult comes back from a pmFetch.  With archive in newly attached tarball:

% PCP_DEBUG=65535  pmval -U FILE.meta hinv.ncpu

pmResult dump from 0x2577aa0 timestamp: 1456356755.639190 18:32:35.639 numpmid: 107
  60.26.0 (kernel.all.uptime): Bad file descriptor
  60.25.0 (kernel.all.nusers): Bad file descriptor
  60.7.67 (nfs4.server.reqs): Bad file descriptor
  60.7.66 (nfs4.server.calls): Bad file descriptor
  60.7.65 (nfs4.client.reqs): Bad file descriptor
  60.7.64 (nfs4.client.calls): Bad file descriptor
[...]
  [no hinv.ncpu metrics in this archive]

Those -EBADF codes are a mystery.  They also show up with pmdumplog.
Comment 2 Jan Kurik 2016-07-26 00:49:51 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 25 development cycle.
Changing version to '25'.
Comment 3 Ken McDonell 2017-05-02 02:07:58 EDT
Frank,

I'm looking at the first problem you reported.  The behaviour you observe is exactly what would be expected with this archive ... it contains the following pmResults after the label record:

0. no header pmResult

1. pmResult dump from 0x1ca16c0 timestamp: 1451519411.533333 18:50:11.533 numpmid: 309 ... this is BEFORE the start of the archive (in time) according to the log label ... Wed Dec 23 19:00:11.533 2015.  hinv.ncpu == 2 in this pmResult, but because it is before the start of the archive, most tools will neither see not process it

2. pmResult dump from 0x1c9dfb0 timestamp: 1450915211.533333 19:00:11.533 numpmid: 339 ... which is at the start of the archive according to the log label, but most of the metrics in this pmResult (including hinv.ncpu) have numval == 0, the only exceptions are the metrics that should have appeared in the header (pmcd.pmlogger.archive, pmcd.pmlogger.port and pmcd.pmlogger.host), namely
2.3.2 (pmcd.pmlogger.archive): numval: 1 valfmt: 1 vlist[]:
    inst [10762 or ???] value "/var/log/pcp/pmmgr/easy.elastic.org/archive-20151224.000011"
  2.3.0 (pmcd.pmlogger.port): numval: 1 valfmt: 0 vlist[]:
    inst [10762 or ???] value 4330
  2.3.3 (pmcd.pmlogger.host): numval: 1 valfmt: 1 vlist[]:
    inst [10762 or ???] value "very.elastic.org"

3. pmResult dump from 0x1ca8190 timestamp: 1450915811.533333 19:10:11.533 numpmid: 336 ... which contains a hinv.ncpu == 2 value.

I do not believe this is a libpcp nor a pmval nor a pmdumplog issue.  Given the archive, the results are as expected.

But there is clearly a problem with this archive, so we need to focus on how it was created ... I am sure the input archives to pmlogreduce are long-gone by now, so all I can ask you to do is to watch for this happening again and try to snag the input archive(s) and the pmlogreduce command line options so I can triage that.

Now the second problem you report ... this archive is even more severely damaged than the first.

Initially, the timestamps and label record are all out of whack.

$ TZ=EST+5 pmloglabel -L archive-20160224.233205
Log Label (Log Format Version 2)
Performance metrics from host vm-rhel7
  commencing Wed Feb 24 18:32:05.616 2016
  ending     Wed Feb 24 18:32:35.639 2016
Archive timezone: EST+5
PID for pmlogger: 26139
$ pmdumplog -Dpdu -M -z archive-20160224.233205 2>&1 | grep '^pmResult'
pmResult dump from 0x24e32b0 timestamp: 1456356755.639190 18:32:35.639 numpmid: 107
pmResult dump from 0x24df890 timestamp: 1456356725.616728 18:32:05.616 numpmid: 3
pmResult dump from 0x24e1440 timestamp: 1456356725.621474 18:32:05.621 numpmid: 54

So the first pmResult is bogus and contains numval == -2 for all 107 metrics, and numval < 0 => error number and EBADF is 2 ... so the "junk" is in the archive.

Again, we need to understand how this archive was created.
Comment 4 Ken McDonell 2017-05-02 02:14:25 EDT
Correction for the second case, the numval in the bad pmResult is -9 not -2 ... which is indeed EBADF ... I just botched the ntohl() and hextodecimal()

$ pmerr -9
Code: -9 0xfffffff7 Text: Bad file descriptor

__pmLogRead timestamp=18:32:35.639 0x21edf3c ... 0x21ee2a8
000: 933dce56 d6c00900 6b000000   68000f f7ffffff   64000f f7ffffff 431c000f
008: f7ffffff 421c000f f7ffffff 411c000f f7ffffff 401c000f f7ffffff 3f1c000f
016: f7ffffff 3e1c000f f7ffffff 3d1c000f f7ffffff 3c1c000f f7ffffff 321c000f
024: f7ffffff  c1c000f f7ffffff  41c000f f7ffffff  11c000f f7ffffff  814000f
032: f7ffffff  d0c000f f7ffffff  b0c000f f7ffffff  a0c000f f7ffffff  90c000f
040: f7ffffff  80c000f f7ffffff  30c000f f7ffffff  20c000f f7ffffff  10c000f
048: f7ffffff    c000f f7ffffff  208000f f7ffffff    8000f f7ffffff 3a04000f
056: f7ffffff 3904000f f7ffffff 3804000f f7ffffff 3704000f f7ffffff 3604000f
...
Comment 5 Frank Ch. Eigler 2017-05-02 13:37:31 EDT
Ken, thanks for looking in on this problem.

Indeed, the input archives are long gone, but they were created via pmmgr's invocations of pmlogger / pmlogcheck / pmlogrewrite (just because) / pmlogextract (to merge archives from a given day, and to age out too-old data) / pmlogreduce (some weeks later).  Fairly vanilla stuff; if you run a pmmgr with "-v -v", you'll see the exact operations in pmmgr.log.

With a little scan, I can find no other archives with this sort of log-label vs. first-real-entry timestamp glitch.  (I have just one with a loglabel-to-pmresult timestamp mismatch, and that is with a <mark> with its 0.001s offset.)

Would it be worth adding a pmlogcheck test to detect this case?

Note You need to log in before you can comment on or make changes to this bug.