Red Hat Bugzilla – Bug 1325037
archive PM_MODE_FORW fetches not scanning to first occurrence of given metric
Last modified: 2017-08-23 11:13:20 EDT
Created attachment 1144938 [details]
pmlogreduce'd archive tarball

As I read the pmFetch / pmSetMode APIs, if an archive contains a given metric, but not in the first few of its recorded result-sets, a PM_MODE_FORW fetch should skip over as many as required until the given metric is found with a value. That doesn't seem to be happening with archives that are the output of pmlogreduce.

With given archive tarball:

% pmval -U reduced-20151231.000006.meta hinv.ncpu -s 1
host:      easy.elastic.org
start:     Wed Dec 23 19:00:11 2015
end:       Wed Dec 30 18:50:11 2015
semantics: discrete instantaneous value
units:     none
samples:   1
19:00:11.533   No values available

even though

% pmdumplog .... hinv.ncpu

clearly shows a value at 19:10:11.533. A longer % pmval run (without the -s 1) does show the hinv.ncpu value later.
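For reference, this is roughly the client-side loop I have in mind (a minimal sketch against the libpcp C API; error checking is trimmed, and the exact pmSetMode()/pmLookupName() prototypes vary a little between PCP releases):

#include <pcp/pmapi.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    char        *names[] = { "hinv.ncpu" };
    pmID        pmid;
    pmLogLabel  label;
    pmResult    *rp;
    int         sts;

    /* argv[1] names the archive (basename or .meta file) */
    if ((sts = pmNewContext(PM_CONTEXT_ARCHIVE, argv[1])) < 0) {
        fprintf(stderr, "pmNewContext: %s\n", pmErrStr(sts));
        return 1;
    }
    pmLookupName(1, names, &pmid);
    pmGetArchiveLabel(&label);
    /* position at the label start time, scanning forwards */
    pmSetMode(PM_MODE_FORW, &label.ll_start, 0);

    /* keep fetching until a result-set carries a hinv.ncpu value,
     * or we run off the end of the archive (PM_ERR_EOL) */
    while ((sts = pmFetch(1, &pmid, &rp)) >= 0) {
        int numval = rp->vset[0]->numval;   /* 0 => no value in this result-set */
        if (numval > 0) {
            printf("first hinv.ncpu value at %ld.%06ld\n",
                   (long)rp->timestamp.tv_sec, (long)rp->timestamp.tv_usec);
            pmFreeResult(rp);
            return 0;
        }
        pmFreeResult(rp);
    }
    fprintf(stderr, "pmFetch: %s\n", pmErrStr(sts));
    return 1;
}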
Created attachment 1145087 [details]
archive file (pmlogger against a container)

Possibly related; if a metric is completely absent, sometimes junk pmResult comes back from a pmFetch. With archive in newly attached tarball:

% PCP_DEBUG=65535 pmval -U FILE.meta hinv.ncpu
pmResult dump from 0x2577aa0 timestamp: 1456356755.639190 18:32:35.639 numpmid: 107
  60.26.0 (kernel.all.uptime): Bad file descriptor
  60.25.0 (kernel.all.nusers): Bad file descriptor
  60.7.67 (nfs4.server.reqs): Bad file descriptor
  60.7.66 (nfs4.server.calls): Bad file descriptor
  60.7.65 (nfs4.client.reqs): Bad file descriptor
  60.7.64 (nfs4.client.calls): Bad file descriptor
  [...]

[no hinv.ncpu metrics in this archive]

Those -EBADF codes are a mystery. They also show up with pmdumplog.
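As I understand the pmResult layout, a pmValueSet whose numval is negative is carrying an error code for that metric rather than a value count. A minimal sketch of decoding them, assuming rp is a pmResult freshly returned by pmFetch():

#include <pcp/pmapi.h>
#include <stdio.h>

/* sketch: report per-metric errors in a fetched pmResult; a pmValueSet
 * with numval < 0 holds a PMAPI error code, not a value count */
static void
report_errors(const pmResult *rp)
{
    int i;

    for (i = 0; i < rp->numpmid; i++) {
        const pmValueSet *vsp = rp->vset[i];

        if (vsp->numval < 0)
            fprintf(stderr, "%s: %s\n", pmIDStr(vsp->pmid), pmErrStr(vsp->numval));
    }
}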
This bug appears to have been reported against 'rawhide' during the Fedora 25 development cycle. Changing version to '25'.
Frank, I'm looking at the first problem you reported.

The behaviour you observe is exactly what would be expected with this archive ... it contains the following pmResults after the label record:

0. no header pmResult

1. pmResult dump from 0x1ca16c0 timestamp: 1451519411.533333 18:50:11.533 numpmid: 309
   ... this is BEFORE the start of the archive (in time) according to the log label ... Wed Dec 23 19:00:11.533 2015. hinv.ncpu == 2 in this pmResult, but because it is before the start of the archive, most tools will neither see nor process it

2. pmResult dump from 0x1c9dfb0 timestamp: 1450915211.533333 19:00:11.533 numpmid: 339
   ... which is at the start of the archive according to the log label, but most of the metrics in this pmResult (including hinv.ncpu) have numval == 0; the only exceptions are the metrics that should have appeared in the header (pmcd.pmlogger.archive, pmcd.pmlogger.port and pmcd.pmlogger.host), namely

     2.3.2 (pmcd.pmlogger.archive): numval: 1 valfmt: 1 vlist[]:
       inst [10762 or ???] value "/var/log/pcp/pmmgr/easy.elastic.org/archive-20151224.000011"
     2.3.0 (pmcd.pmlogger.port): numval: 1 valfmt: 0 vlist[]:
       inst [10762 or ???] value 4330
     2.3.3 (pmcd.pmlogger.host): numval: 1 valfmt: 1 vlist[]:
       inst [10762 or ???] value "very.elastic.org"

3. pmResult dump from 0x1ca8190 timestamp: 1450915811.533333 19:10:11.533 numpmid: 336
   ... which contains a hinv.ncpu == 2 value.

I do not believe this is a libpcp, pmval or pmdumplog issue. Given the archive, the results are as expected.

But there is clearly a problem with this archive, so we need to focus on how it was created ... I am sure the input archives to pmlogreduce are long gone by now, so all I can ask you to do is to watch for this happening again and try to snag the input archive(s) and the pmlogreduce command line options so I can triage that.

Now the second problem you report ... this archive is even more severely damaged than the first. Initially, the timestamps and label record are all out of whack.

$ TZ=EST+5 pmloglabel -L archive-20160224.233205
Log Label (Log Format Version 2)
Performance metrics from host vm-rhel7
  commencing Wed Feb 24 18:32:05.616 2016
  ending     Wed Feb 24 18:32:35.639 2016
Archive timezone: EST+5
PID for pmlogger: 26139

$ pmdumplog -Dpdu -M -z archive-20160224.233205 2>&1 | grep '^pmResult'
pmResult dump from 0x24e32b0 timestamp: 1456356755.639190 18:32:35.639 numpmid: 107
pmResult dump from 0x24df890 timestamp: 1456356725.616728 18:32:05.616 numpmid: 3
pmResult dump from 0x24e1440 timestamp: 1456356725.621474 18:32:05.621 numpmid: 54

So the first pmResult is bogus and contains numval == -2 for all 107 metrics, and numval < 0 => error number and EBADF is 2 ... so the "junk" is in the archive. Again, we need to understand how this archive was created.
Correction for the second case, the numval in the bad pmResult is -9 not -2 ... which is indeed EBADF ... I just botched the ntohl() and hextodecimal()

$ pmerr -9
Code: -9  0xfffffff7  Text: Bad file descriptor

__pmLogRead timestamp=18:32:35.639 0x21edf3c ... 0x21ee2a8
000: 933dce56 d6c00900 6b000000 68000f f7ffffff 64000f f7ffffff 431c000f
008: f7ffffff 421c000f f7ffffff 411c000f f7ffffff 401c000f f7ffffff 3f1c000f
016: f7ffffff 3e1c000f f7ffffff 3d1c000f f7ffffff 3c1c000f f7ffffff 321c000f
024: f7ffffff c1c000f f7ffffff 41c000f f7ffffff 11c000f f7ffffff 814000f
032: f7ffffff d0c000f f7ffffff b0c000f f7ffffff a0c000f f7ffffff 90c000f
040: f7ffffff 80c000f f7ffffff 30c000f f7ffffff 20c000f f7ffffff 10c000f
048: f7ffffff c000f f7ffffff 208000f f7ffffff 8000f f7ffffff 3a04000f
056: f7ffffff 3904000f f7ffffff 3804000f f7ffffff 3704000f f7ffffff 3604000f
...
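For anyone else squinting at that dump: the words are raw host-order reads of network byte order data, so on a little-endian box the decode looks like this minimal sketch (the constant is just the first numval word from the dump above):

#include <arpa/inet.h>
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    /* the word as it appears in the raw dump on a little-endian host */
    uint32_t ondisk = 0xf7ffffffU;
    /* archive data is in network byte order, so convert first ... */
    int numval = (int)ntohl(ondisk);

    /* prints: numval = -9 (0xfffffff7), i.e. EBADF */
    printf("numval = %d (0x%08x)\n", numval, (unsigned)numval);
    return 0;
}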
Ken, thanks for looking in on this problem. Indeed, the input archives are long gone, but they were created via pmmgr's invocations of pmlogger / pmlogcheck / pmlogrewrite (just because) / pmlogextract (to merge archives from a given day, and to age out too-old data) / pmlogreduce (some weeks later). Fairly vanilla stuff; if you run pmmgr with "-v -v", you'll see the exact operations in pmmgr.log.

A quick scan turns up no other archives with this sort of log-label vs. first-real-entry timestamp glitch. (I have just one with a log-label-to-pmResult timestamp mismatch, and that is from a <mark> record with its 0.001s offset.) Would it be worth adding a pmlogcheck test to detect this case?
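To make the suggestion concrete, this is the rough shape of check I am imagining (purely a hypothetical sketch, not existing pmlogcheck code; it assumes pmFetchArchive() hands records back in on-disk order, and a real test would probably need the lower-level __pmLogRead() interface that pmdumplog uses in order to also see records that precede the label start):

#include <pcp/pmapi.h>
#include <stdio.h>

/* sketch: flag results whose timestamps precede the label start or go
 * backwards while walking the archive record by record */
static int
check_timestamps(const char *archive)
{
    pmLogLabel  label;
    pmResult    *rp;
    double      start, prev = 0.0, t;
    int         sts, bad = 0;

    if ((sts = pmNewContext(PM_CONTEXT_ARCHIVE, archive)) < 0)
        return sts;
    if ((sts = pmGetArchiveLabel(&label)) < 0)
        return sts;
    if ((sts = pmSetMode(PM_MODE_FORW, &label.ll_start, 0)) < 0)
        return sts;
    start = label.ll_start.tv_sec + label.ll_start.tv_usec / 1000000.0;

    while ((sts = pmFetchArchive(&rp)) >= 0) {
        t = rp->timestamp.tv_sec + rp->timestamp.tv_usec / 1000000.0;
        if (t < start) {
            printf("%s: result at %.6f precedes label start\n", archive, t);
            bad++;
        }
        if (t < prev) {
            printf("%s: result at %.6f is out of time order\n", archive, t);
            bad++;
        }
        prev = t;
        pmFreeResult(rp);
    }
    /* sts should be PM_ERR_EOL once the archive is exhausted */
    return bad ? -1 : 0;
}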