Bug 1483557 - Loss of statistics due to PCP log rotation
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcp
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: pcp-maint
QA Contact: Michal Kolar
Depends On: 1472153
Reported: 2017-08-21 12:05 UTC by Renaud Métrich
Modified: 2021-06-10 12:51 UTC (History)
Last Closed: 2018-04-10 17:06:22 UTC

Links:
Red Hat Product Errata RHBA-2018:0926 (last updated 2018-04-10 17:07:42 UTC)

Description Renaud Métrich 2017-08-21 12:05:26 UTC
Description of problem:

By default, the PCP logs are rotated every day at 00:10, via a cron entry:
# grep pmlogger_daily /etc/cron.d/pcp-pmlogger
10     0  *  *  *  pcp  /usr/libexec/pcp/bin/pmlogger_daily -X xz -x 3

When logs are rotated, samples are lost, causing "pmval" to display "No values available" for the time interval in which the log was rotated.

This is annoying when consolidating logs over large intervals, such as 8 hours: in that case, a whole 8-hour slice is lost.

Version-Release number of selected component (if applicable):

pcp-3.11.8-7.el7.x86_64

How reproducible:

Always

Steps to Reproduce:
1. start PCP
  systemctl start pmcd.service pmlogger.service

2. wait for 10 minutes or more to gather statistics

3. force a log rotation as specified in the cron entry:
  sudo -u pcp /usr/libexec/pcp/bin/pmlogger_daily -X xz -x 3

4. wait for 10 minutes or more to gather statistics

Actual results:

"No values available" is displayed for the time slice in which the log rotation happened.

Expected results:

No loss of statistics

Additional info:

See sample below for log rotation around 13:37.

"Every minute" samples

# pmval -a /var/log/pcp/pmlogger/vm-rhel73 -t 1m -z network.interface.in.bytes
[...]
                          eth0                    lo    
13:14:02.045  No values available
13:15:02.045             233.5                   0.0    
13:16:02.045              65.92                  0.0    
13:17:02.045             121.7                   0.0    
13:18:02.045             240.9                   0.0    
13:19:02.045             233.5                   0.0    
13:20:02.045             335.2                   0.0    
13:21:02.045             102.6                   0.0    
13:22:02.045              32.23                  0.0    
13:23:02.045             213.3                   0.0    
13:24:02.045             115.1                   0.0    
13:25:02.045             150.5                   0.0    
13:26:02.045              54.67                  0.0    
13:27:02.045              41.17                  0.0    
13:28:02.045              38.43                  0.0    
13:29:02.045             147.0                   0.0    
13:30:02.045              55.83                  0.0    
13:31:02.045             118.0                   0.0    
13:32:02.045             186.5                   0.0    
13:33:02.045              81.88                  0.0    
13:34:02.045              92.93                  0.0    
13:35:02.045              60.00                  0.0    
13:36:02.045             180.8                   0.0    
13:37:02.045  No values available

                          eth0                    lo    
13:38:02.045  No values available
13:39:02.045              94.83                  0.0    
13:40:02.045              91.12                  0.0    

"Every 10 minutes" consolidation (whole 13:35 -> 13:45 slice is lost):

# pmval -a /var/log/pcp/pmlogger/vm-rhel73 -t 10m -z network.interface.in.bytes
[...]
                          eth0                    lo    
13:15:02.045  No values available
13:25:02.045             161.1                   0.0    
13:35:02.045              87.64                  0.0    
13:45:02.045  No values available

Comment 2 Mark Goodwin 2017-08-21 22:18:53 UTC
This is due to the '<mark>' record that gets inserted between archives, either when multiple archives are merged or when you replay more than one archive.
A <mark> record is a pmlogger record that signifies a temporal gap in an archive (due to said merging and certain other events). libpcp currently will return no values when the current replay interval traverses the <mark>.
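
As a rough illustration of the behaviour described above, here is a minimal Python sketch. It is not libpcp code: the functions `interpolate` and `replay_rates`, the sample data, and the mark position are all made up for this example, and the model assumes only that a reporting interval containing a <mark> yields no value.

```python
from bisect import bisect_left


def interpolate(samples, t):
    """Linearly interpolate a counter value at time t.

    samples is a list of (time, value) pairs sorted by time.
    Returns None when t lies outside the logged range.
    """
    times = [s[0] for s in samples]
    i = bisect_left(times, t)
    if i < len(times) and times[i] == t:
        return samples[i][1]
    if i == 0 or i == len(times):
        return None
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def replay_rates(samples, marks, start, step, count):
    """Return (row_time, rate) pairs for `count` reporting intervals.

    Simplified assumption: a row is None ("No values available") whenever
    a mark timestamp falls inside the interval (row_time - step, row_time].
    """
    rows = []
    prev_t = start
    for k in range(1, count + 1):
        t = start + k * step
        if any(prev_t < m <= t for m in marks):
            rows.append((t, None))
        else:
            a, b = interpolate(samples, prev_t), interpolate(samples, t)
            rate = None if a is None or b is None else (b - a) / (t - prev_t)
            rows.append((t, rate))
        prev_t = t
    return rows


# Hypothetical data: a byte counter growing at 100 B/s, logged every 60 s
# for one hour, with a single mark (log rotation) at t = 1350 s.
samples = [(t, 100 * t) for t in range(0, 3601, 60)]
marks = [1350]

for step in (60, 600):
    rows = replay_rates(samples, marks, 0, step, 3600 // step)
    lost = [t for t, rate in rows if rate is None]
    print(f"step={step}s: no values for rows ending at {lost}")
```

With this made-up data, a 1-minute replay loses only the row ending at 1380 s, while a 10-minute replay loses the whole row ending at 1800 s. The same reasoning would make an 8-hour interval lose a full 8-hour row. Real archives also have a gap in the samples while pmlogger restarts, which this sketch ignores.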

For context, see BZ #1296750 - incorrect interpolation across <mark> record in a merged archive

We're actively working on this and expect to have a solution in the current upstream release (pcp-3.12.2) soon - there are some circumstances where <mark> records are tolerable for replay purposes.

In the meantime, you should be able to use pmval and other tools in non-interpolating mode with the -U flag; see the man page.

Comment 3 Frank Ch. Eigler 2017-08-22 19:52:45 UTC
> By default, the PCP logs are rotated every day at 00:10, via a cron entry:
> # grep pmlogger_daily /etc/cron.d/pcp-pmlogger
> 10     0  *  *  *  pcp  /usr/libexec/pcp/bin/pmlogger_daily -X xz -x 3
> 
> When logs are rotated, samples are getting lost, causing "pmval" to display
> "No values available" for the time interval where log got rotated.
> 
> This is annoying when consolidating logs for large intervals, such as 8
> hours (in such case, a whole 8 hour slice gets lost)

You may find pmmgr a useful alternative to pmlogger_daily, when it comes to consolidation & sensitivity to daily processing edge cases, because you have greater control over the granularity of the log files.  For example,

# yum install pcp-manager
# echo '7days' > /etc/pcp/pmmgr/pmlogmerge
# echo '7days' > /etc/pcp/pmmgr/pmlogmerge-retain
# echo '-t 3600' > /etc/pcp/pmmgr/pmlogreduce
# systemctl enable pmmgr; systemctl start pmmgr
# systemctl stop pmlogger; systemctl disable pmlogger
# ls -l /var/log/pcp/pmmgr/$HOSTNAME

This would give you 7-day-long archives, with older ones being compressed by
time-wise subsampling (3600s).

Comment 4 Nathan Scott 2017-08-25 00:08:16 UTC
Hi Renaud,

We're discussing possible ways to tackle this properly; the underlying issue is exactly as Mark described.  In the meantime, you might find the stripmark utility in the PCP testsuite to be of use ...

https://github.com/performancecopilot/pcp/blob/master/qa/src/stripmark.c

You can use this to remove "mark" records from archives - use it only when you know there was no discontinuity in the PCP data (i.e. end-of-day archive processing).  The utility strips all mark records, and obviously its use is a manual step that should not be necessary - I suggest it only as a stop-gap measure until we come up with a viable longer-term plan.

Using pmmgr is not a long-term solution to this issue either.  We'll fix it properly without these workarounds as soon as we are able to.

cheers.

Comment 9 Michal Kolar 2018-02-21 11:55:45 UTC
Verified against pcp-3.12.2-5.el7.

Comment 13 errata-xmlrpc 2018-04-10 17:06:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0926