Bug 2083897

Summary: 'pcp dstat' outputs 'missed clock ticks' on systems with many devices
Product: Red Hat Enterprise Linux 8
Reporter: Christian Horn <chorn>
Component: pcp
Assignee: Nathan Scott <nathans>
Status: CLOSED ERRATA
QA Contact: Jan Kurik <jkurik>
Severity: unspecified
Docs Contact: Jacob Taylor Valdez <jvaldez>
Priority: unspecified
Version: 8.0
CC: agerstmayr, jkurik, nathans, thgardne
Target Milestone: rc
Keywords: FutureFeature, Triaged
Target Release: 8.8
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pcp-5.3.7-11.el8
Doc Type: No Doc Update
Last Closed: 2023-05-16 08:13:26 UTC
Type: Bug

Description Christian Horn 2022-05-11 00:37:31 UTC
Description of problem:
'pcp dstat' outputs 'missed clock ticks' on systems with many devices 

Version-Release number of selected component (if applicable):
All pcp-4 and pcp-5 versions seem affected

How reproducible:
always

Steps to Reproduce:
1. deploy rhel7, 8 or 9; a virtual machine (e.g. a KVM guest) is fine
2. yum -y install pcp-zeroconf pcp-system-tools
3. modprobe scsi-debug add_host=16 num_parts=32 num_tgts=150
4. pcp dstat
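
For scale, step 3 above creates a very large number of block devices. A rough count (a sketch only: it assumes the scsi_debug default of one LUN per target, and the actual totals vary with the parameters used):

  # Rough device count for the step-3 parameters; assumes the scsi_debug
  # default of one LUN per target, so treat the totals as an estimate.
  hosts, targets, parts = 16, 150, 32
  disks = hosts * targets               # 2400 simulated disks
  rows = disks * (1 + parts)            # each disk plus its partitions
  print("%d disks, ~%d /proc/diskstats entries" % (disks, rows))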

Actual results:
----total-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
  2  17  80   0   0|   0     0 | 163   159 |   0     0 | 917   157
 20   4  75   0   0|   0     0 | 398   430 |   0     0 |1572   129
 21   4  75   0   0|   0     0 | 687   208 |   0     0 |1561   114  missed 2 ticks
 24   7  68   0   0|   0     0 | 314   207 |   0     0 |2094   239
 23   3  72   0   0|   0     0 | 113   238 |   0     0 |1753   131  missed 2 ticks
 20   4  75   0   0|   0     0 |1983   220 |   0     0 |1558   117
 20   4  75   0   0|   0     0 | 115   242 |   0     0 |1572   156  missed 2 ticks

Expected results:
----total-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
  2  17  80   0   0|   0     0 | 163   159 |   0     0 | 917   157
 20   4  75   0   0|   0     0 | 398   430 |   0     0 |1572   129
 21   4  75   0   0|   0     0 | 687   208 |   0     0 |1561   114
 24   7  68   0   0|   0     0 | 314   207 |   0     0 |2094   239

Additional info:
- the issue is only seen with a high number of devices
- on rhel7, a further fix is required
  ( https://access.redhat.com/solutions/6957052 ) to get dsk/total
  output with a high number of devices
- pcp-6 does not have the issue; a build of upstream pcp-6 from 2022-05-10
  on rhel8.6GA does not show the issue
- This does not fit rhel7 errata criteria, but getting a fix into
  rhel8 and 9 might make sense (a rebase to pcp-6 would also fix it)
- I suspect the 'missed XX ticks' can be ignored, but did not look
  deeper.  Also, increasing the sample interval (e.g. pcp dstat -t 5)
  does not solve the issue.  Opinions on that are welcome.

Comment 1 Nathan Scott 2022-05-11 05:27:26 UTC
Hi Christian,

Could you provide some timing measurements from this system for me please?  Thanks!

$ time cat /proc/diskstats 
$ time cat /proc/stat 
$ time cat /proc/net/dev

Looking at the code, we cannot tackle this using a larger sample time - it's using a hard-coded sanity check that communicating with the kernel takes less than 500 msec.
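
For illustration only, a minimal sketch of that kind of check (hypothetical names, not the actual pcp-dstat source):

  # Hypothetical sketch: time the kernel metric fetch against a fixed
  # 500 msec budget and report any overrun as missed ticks.
  import time

  BUDGET = 0.5  # hard-coded sanity threshold, in seconds

  def fetch_samples():
      for path in ("/proc/stat", "/proc/diskstats", "/proc/net/dev"):
          with open(path) as f:
              f.read()

  start = time.monotonic()
  fetch_samples()
  elapsed = time.monotonic() - start
  if elapsed > BUDGET:
      print("missed %d ticks" % int(elapsed // BUDGET))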

cheers.

Comment 2 Christian Horn 2022-05-11 05:49:00 UTC
Hi,

let me get these, plus output from

 $ time cat /proc/diskstats >/dev/null
 $ time cat /proc/stat >/dev/null
 $ time cat /proc/net/dev>/dev/null

just to rule out that a slow terminal impacts the runtime here.

On the KVM guest the times are as follows:
1609  block devices: 0.01s 0.001s 0.006s
12009 block devices: 0.22s 0.001s 0.014s
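
(For reference, the same measurement can be scripted; a small Python helper along these lines - an illustration, not anything shipped with pcp - avoids terminal overhead entirely:)

  # Time each kernel stats read in-process, printing only afterwards.
  import time

  def time_read(path):
      start = time.monotonic()
      with open(path) as f:
          f.read()
      return time.monotonic() - start

  for path in ("/proc/diskstats", "/proc/stat", "/proc/net/dev"):
      print("%-18s %.3fs" % (path, time_read(path)))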

Getting these from the customer's real system, he also got
'missed 6 ticks' and 'missed 10 ticks' outputs, not just
'missed 2 ticks' as in my reproducer.

Comment 11 Christian Horn 2022-06-14 00:48:19 UTC
We got a question on the details of

self.pmconfig.validate_metrics(False, 16364)

in /usr/libexec/pcp/bin/pcp-dstat:

- Can we know what each value in the statement above indicates,
  i.e. pmconfig, validate_metrics, False, and the number that follows?
- What is the unit of the number 16364, e.g. GB or MB?

As per my understanding, this applies:
- This should not be seen as a hierarchical metric name,
  but as an object name, i.e. one created with
    self.pmconfig = pmconfig.pmConfig(self)
  in /usr/libexec/pcp/bin/pcp-dstat
- 16364 relates to the number of metrics dstat can deal with.
  We hit an unusually high number of block devices here, leading to
  more metrics (indom) being dealt with, bringing up this issue.

Comments/corrections?

Comment 12 Nathan Scott 2022-06-14 01:08:53 UTC
(In reply to Christian Horn from comment #11)
> We got a question on the details of
> 
> self.pmconfig.validate_metrics(False, 16364)
> 
> in /usr/libexec/pcp/bin/pcp-dstat:
> 
> - Can we know what each value in the statement above indicates,
>   i.e. pmconfig, validate_metrics, False, and the number that follows?
> - What is the unit of the number 16364, e.g. GB or MB?
> 
> As per my understanding, this applies:
> - This should not be seen as a hierarchical metric name,
>   but as an object name, i.e. one created with
>     self.pmconfig = pmconfig.pmConfig(self)
>   in /usr/libexec/pcp/bin/pcp-dstat

That's correct.  It's Python code - we're calling the validate_metrics method here, which is part of the pmConfig Python class, and that instance in turn is part of the DstatTool class (self).

> - 16364 relates to the number of metrics dstat can deal with.
>   We hit an unusually high number of block devices here, leading to
>   more metrics (indom) being dealt with, bringing up this issue.

Not quite - it's an upper bound on the number of instances, not the number of metrics.
Instances are things like sda, sdb, sdc - metrics are things like disk.dev.read, disk.dev.write.
So this value (which has no units) limits the number of values that can be associated with the instances of a metric.
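
To make the distinction concrete, a plain-Python illustration (not the PCP API): each metric carries one value per instance, and 16364 caps the instance count.

  # Metrics name *what* is measured; instances name *which* device a
  # value belongs to.  The 16364 cap is unitless: a per-metric limit
  # on instances, not a size in bytes.
  MAX_INSTS = 16364

  metrics = ["disk.dev.read", "disk.dev.write"]  # what is measured
  instances = ["sda", "sdb", "sdc"]              # grows with device count

  values = {m: {i: 0 for i in instances} for m in metrics}
  assert all(len(v) <= MAX_INSTS for v in values.values())
  print("%d metrics x %d instances" % (len(metrics), len(instances)))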

> Comments/corrections?

HTH.

Comment 13 Christian Horn 2022-06-14 02:27:44 UTC
Thanks a bunch!

Comment 16 Thomas Gardner 2022-09-15 19:24:35 UTC
For what it's worth, I asked the customer of case #03270143 what their preference would be regarding the options listed in comment #4.  This particular customer voted for:

- Make the 500ms warning time configurable from the command line
- Command line option to disable these warnings entirely

I would say you could probably do both with one command line option: going with the first option, a special value like 0 or -1 could be used as a special case to effectively give the second option.  It's a reasonably common method for things like this - see the sketch below.
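
A minimal sketch of that pattern (hypothetical option name, not what was eventually implemented):

  # Hypothetical single-option approach: a configurable threshold
  # where the special value 0 disables the warning entirely.
  import argparse

  parser = argparse.ArgumentParser(prog="dstat-sketch")
  parser.add_argument("--missed-msec", type=int, default=500,
                      help="warn when sampling exceeds N msec; 0 disables")
  args = parser.parse_args(["--missed-msec", "0"])

  def report_missed(elapsed_msec):
      if args.missed_msec and elapsed_msec > args.missed_msec:
          print("missed %d ticks" % (elapsed_msec // args.missed_msec))

  report_missed(1200)  # prints nothing: 0 disabled the warning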

Comment 17 Nathan Scott 2022-09-20 22:00:16 UTC
Thanks Thomas.  At this stage I'm leaning towards the simpler option of just the command line option to disable the warnings entirely.

On reflection (since my notes in #c4) I don't see a strong case for letting the user fine-tune the timing here.  Once they've been informed of the relative slowness of kernel sampling, they're most likely to just make a mental note of it and then want to clear the issue from dstat reporting permanently.

If anyone feels strongly for/against this approach, please let me know - otherwise I'll get on with making it happen.

Comment 18 Nathan Scott 2022-09-26 05:05:42 UTC
Upstream now:

commit 5e19baea8fdb32a685a2bbf06450679baf36185b
Author: Nathan Scott <nathans>
Date:   Wed Sep 21 09:35:49 2022 +1000

    pcp-dstat: add --nomissed command line option for large systems
    
    After a number of complaints about the "N missed ticks" reporting
    from dstat from users on very large systems with many disks, this
    introduces a command line argument to optionally suppress it.
    
    In the cases we've seen, it's been confirmed that the increased
    sample time is spent in the kernel with the large device count -
    not something we can do anything about anyway.
    
    This change has been tested by artificially inducing the failure
    mode as I have no local systems showing this behaviour live.  We
    cannot use archives for this either as we do not do sub-sampling
    for historical data in pcp-dstat.
    
    Resolves Red Hat BZ #2083897

Comment 19 Thomas Gardner 2022-09-29 18:29:39 UTC
Are we gonna pull that into ours, then?

Comment 20 Nathan Scott 2022-09-29 22:05:38 UTC
At some point, for sure.  Exact timing is still under discussion but expect an update here soon.

Comment 21 Jan Kurik 2022-10-20 10:14:19 UTC
As I have no local system where I could successfully reproduce this behavior, I am marking the verification as SanityOnly.
The warning message suppression is implemented (code review has been done) and documented in the man page as well as in the '--help' cmdline output.

Comment 23 Christian Horn 2022-11-24 00:28:54 UTC
The details in #0 should be good for verification.

Comment 25 errata-xmlrpc 2023-05-16 08:13:26 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pcp bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2745