Bug 1325037

Summary: archive PM_MODE_FORW fetches not scanning to first occurrence of given metric
Product: [Fedora] Fedora Reporter: Frank Ch. Eigler <fche>
Component: pcpAssignee: pcp-maint <pcp-maint>
Status: CLOSED NOTABUG QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 28CC: fche, kenj, lberk, mgoodwin, nathans, pcp
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-03-05 03:33:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
pmlogreduce'd archive tarball
none
archive file (pmlogger against a container) none

Description Frank Ch. Eigler 2016-04-08 02:17:47 UTC
Created attachment 1144938 [details]
pmlogreduce'd archive tarball

As I read the pmFetch / pmSetMode APIs, if an archive contains a given metric, but not in the first few of its recorded result-sets, a PM_MODE_FORW fetch should skip over as many as required until the given metric is found with a value.  That doesn't seem to be happening with archives that are the output of pmlogreduce.

With given archive tarball:

% pmval -U reduced-20151231.000006.meta hinv.ncpu -s 1
host:      easy.elastic.org
start:     Wed Dec 23 19:00:11 2015
end:       Wed Dec 30 18:50:11 2015
semantics: discrete instantaneous value
units:     none
samples:   1
19:00:11.533  No values available

even though % pmdumplog .... hinv.ncpu clearly shows a value at 19:10:11.533.

A longer % pmval run (without the -s 1) does show the hinv.ncpu value later.

Comment 1 Frank Ch. Eigler 2016-04-08 11:10:36 UTC
Created attachment 1145087 [details]
archive file (pmlogger against a container)

Possibly related; if a metric is completely absent, sometimes junk pmResult comes back from a pmFetch.  With archive in newly attached tarball:

% PCP_DEBUG=65535  pmval -U FILE.meta hinv.ncpu

pmResult dump from 0x2577aa0 timestamp: 1456356755.639190 18:32:35.639 numpmid: 107
  60.26.0 (kernel.all.uptime): Bad file descriptor
  60.25.0 (kernel.all.nusers): Bad file descriptor
  60.7.67 (nfs4.server.reqs): Bad file descriptor
  60.7.66 (nfs4.server.calls): Bad file descriptor
  60.7.65 (nfs4.client.reqs): Bad file descriptor
  60.7.64 (nfs4.client.calls): Bad file descriptor
[...]
  [no hinv.ncpu metrics in this archive]

Those -EBADF codes are a mystery.  They also show up with pmdumplog.

Comment 2 Jan Kurik 2016-07-26 04:49:51 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 25 development cycle.
Changing version to '25'.

Comment 3 Ken McDonell 2017-05-02 06:07:58 UTC
Frank,

I'm looking at the first problem you reported.  The behaviour you observe is exactly what would be expected with this archive ... it contains the following pmResults after the label record:

0. no header pmResult

1. pmResult dump from 0x1ca16c0 timestamp: 1451519411.533333 18:50:11.533 numpmid: 309 ... this is BEFORE the start of the archive (in time) according to the log label ... Wed Dec 23 19:00:11.533 2015.  hinv.ncpu == 2 in this pmResult, but because it is before the start of the archive, most tools will neither see not process it

2. pmResult dump from 0x1c9dfb0 timestamp: 1450915211.533333 19:00:11.533 numpmid: 339 ... which is at the start of the archive according to the log label, but most of the metrics in this pmResult (including hinv.ncpu) have numval == 0, the only exceptions are the metrics that should have appeared in the header (pmcd.pmlogger.archive, pmcd.pmlogger.port and pmcd.pmlogger.host), namely
2.3.2 (pmcd.pmlogger.archive): numval: 1 valfmt: 1 vlist[]:
    inst [10762 or ???] value "/var/log/pcp/pmmgr/easy.elastic.org/archive-20151224.000011"
  2.3.0 (pmcd.pmlogger.port): numval: 1 valfmt: 0 vlist[]:
    inst [10762 or ???] value 4330
  2.3.3 (pmcd.pmlogger.host): numval: 1 valfmt: 1 vlist[]:
    inst [10762 or ???] value "very.elastic.org"

3. pmResult dump from 0x1ca8190 timestamp: 1450915811.533333 19:10:11.533 numpmid: 336 ... which contains a hinv.ncpu == 2 value.

I do not believe this is a libpcp nor a pmval nor a pmdumplog issue.  Given the archive, the results are as expected.

But there is clearly a problem with this archive, so we need to focus on how it was created ... I am sure the input archives to pmlogreduce are long-gone by now, so all I can ask you to do is to watch for this happening again and try to snag the input archive(s) and the pmlogreduce command line options so I can triage that.

Now the second problem you report ... this archive is even more severely damaged than the first.

Initially, the timestamps and label record are all out of whack.

$ TZ=EST+5 pmloglabel -L archive-20160224.233205
Log Label (Log Format Version 2)
Performance metrics from host vm-rhel7
  commencing Wed Feb 24 18:32:05.616 2016
  ending     Wed Feb 24 18:32:35.639 2016
Archive timezone: EST+5
PID for pmlogger: 26139
$ pmdumplog -Dpdu -M -z archive-20160224.233205 2>&1 | grep '^pmResult'
pmResult dump from 0x24e32b0 timestamp: 1456356755.639190 18:32:35.639 numpmid: 107
pmResult dump from 0x24df890 timestamp: 1456356725.616728 18:32:05.616 numpmid: 3
pmResult dump from 0x24e1440 timestamp: 1456356725.621474 18:32:05.621 numpmid: 54

So the first pmResult is bogus and contains numval == -2 for all 107 metrics, and numval < 0 => error number and EBADF is 2 ... so the "junk" is in the archive.

Again, we need to understand how this archive was created.

Comment 4 Ken McDonell 2017-05-02 06:14:25 UTC
Correction for the second case, the numval in the bad pmResult is -9 not -2 ... which is indeed EBADF ... I just botched the ntohl() and hextodecimal()

$ pmerr -9
Code: -9 0xfffffff7 Text: Bad file descriptor

__pmLogRead timestamp=18:32:35.639 0x21edf3c ... 0x21ee2a8
000: 933dce56 d6c00900 6b000000   68000f f7ffffff   64000f f7ffffff 431c000f
008: f7ffffff 421c000f f7ffffff 411c000f f7ffffff 401c000f f7ffffff 3f1c000f
016: f7ffffff 3e1c000f f7ffffff 3d1c000f f7ffffff 3c1c000f f7ffffff 321c000f
024: f7ffffff  c1c000f f7ffffff  41c000f f7ffffff  11c000f f7ffffff  814000f
032: f7ffffff  d0c000f f7ffffff  b0c000f f7ffffff  a0c000f f7ffffff  90c000f
040: f7ffffff  80c000f f7ffffff  30c000f f7ffffff  20c000f f7ffffff  10c000f
048: f7ffffff    c000f f7ffffff  208000f f7ffffff    8000f f7ffffff 3a04000f
056: f7ffffff 3904000f f7ffffff 3804000f f7ffffff 3704000f f7ffffff 3604000f
...

Comment 5 Frank Ch. Eigler 2017-05-02 17:37:31 UTC
Ken, thanks for looking in on this problem.

Indeed, the input archives are long gone, but they were created via pmmgr's invocations of pmlogger / pmlogcheck / pmlogrewrite (just because) / pmlogextract (to merge archives from a given day, and to age out too-old data) / pmlogreduce (some weeks later).  Fairly vanilla stuff; if you run a pmmgr with "-v -v", you'll see the exact operations in pmmgr.log.

With a little scan, I can find no other archives with this sort of log-label vs. first-real-entry timestamp glitch.  (I have just one with a loglabel-to-pmresult timestamp mismatch, and that is with a <mark> with its 0.001s offset.)

Would it be worth adding a pmlogcheck test to detect this case?

Comment 6 Fedora End Of Life 2017-11-16 18:31:11 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged  change the 'version' to a later Fedora
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 7 Fedora End Of Life 2018-02-20 15:28:18 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.

Comment 8 Nathan Scott 2019-03-05 03:33:34 UTC
Closing as per Ken's notes in #c3 - "Given the archive, the results are as expected."