Bug 2083897 - 'pcp dstat' outputs 'missed clock ticks' on systems with many devices
Summary: 'pcp dstat' outputs 'missed clock ticks' on systems with many devices
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pcp
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.8
Assignee: Nathan Scott
QA Contact: Jan Kurik
Docs Contact: Jacob Taylor Valdez
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-11 00:37 UTC by Christian Horn
Modified: 2023-05-16 08:40 UTC
CC: 4 users

Fixed In Version: pcp-5.3.7-11.el8
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-16 08:13:26 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-121655 0 None None None 2022-05-11 00:40:04 UTC
Red Hat Knowledge Base (Solution) 6957205 0 None None None 2022-05-11 00:37:30 UTC
Red Hat Product Errata RHBA-2023:2745 0 None None None 2023-05-16 08:13:33 UTC

Description Christian Horn 2022-05-11 00:37:31 UTC
Description of problem:
'pcp dstat' outputs 'missed clock ticks' on systems with many devices 

Version-Release number of selected component (if applicable):
All pcp-4 and pcp-5 versions seem affected

How reproducible:
always

Steps to Reproduce:
1. deploy rhel7, 8 or 9 (can be virtual, e.g. a KVM guest)
2. yum -y install pcp-zeroconf pcp-system-tools
3. modprobe scsi-debug add_host=16 num_parts=32 num_tgts=150
4. pcp dstat

Actual results:
----total-usage---- -dsk/total- -net/total- ---paging-- ---system--                                                                                                                                                 
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw                                                                                                                                                  
  2  17  80   0   0|   0     0 | 163   159 |   0     0 | 917   157                                                                                                                                                  
 20   4  75   0   0|   0     0 | 398   430 |   0     0 |1572   129                                                                                                                                                  
 21   4  75   0   0|   0     0 | 687   208 |   0     0 |1561   114  missed 2 ticks                                                                                                                                  
 24   7  68   0   0|   0     0 | 314   207 |   0     0 |2094   239                                                                                                                                                  
 23   3  72   0   0|   0     0 | 113   238 |   0     0 |1753   131  missed 2 ticks                                                                                                                                  
 20   4  75   0   0|   0     0 |1983   220 |   0     0 |1558   117                                                                                                                                                  
 20   4  75   0   0|   0     0 | 115   242 |   0     0 |1572   156  missed 2 ticks

Expected results:
----total-usage---- -dsk/total- -net/total- ---paging-- ---system--                                                                                                                                                 
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw                                                                                                                                                  
  2  17  80   0   0|   0     0 | 163   159 |   0     0 | 917   157                                                                                                                                                  
 20   4  75   0   0|   0     0 | 398   430 |   0     0 |1572   129                                                                                                                                                  
 21   4  75   0   0|   0     0 | 687   208 |   0     0 |1561   114                                                                                                                                   
 24   7  68   0   0|   0     0 | 314   207 |   0     0 |2094   239                                                                                                                                                  

Additional info:
- the issue is only seen with a high number of devices
- on rhel7, a further fix is required
  ( https://access.redhat.com/solutions/6957052 ) to get dsk/total
  output with a high number of devices
- pcp-6 does not have the issue: a build of upstream pcp-6 from
  2022-05-10 on rhel8.6GA does not show it
- This does not fit rhel7 errata criteria, but getting a fix into
  rhel8 and 9 might make sense (a rebase to pcp-6 would also fix it)
- I suspect the 'missed XX ticks' warnings can be ignored, but I did
  not look deeper.  Also, increasing the sample interval (e.g. pcp
  dstat -t 5) does not solve the issue.  Opinions on that are welcome.

Comment 1 Nathan Scott 2022-05-11 05:27:26 UTC
Hi Christian,

Could you provide some timing measurements from this system for me please?  Thanks!

$ time cat /proc/diskstats 
$ time cat /proc/stat 
$ time cat /proc/net/dev

Looking at the code, we cannot tackle this with a larger sample time - it uses a hard-coded sanity check that communicating with the kernel takes less than 500 msec.
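The check described here can be sketched roughly as follows (hypothetical Python for illustration; the names and the exact formula are assumptions, not the actual pcp-dstat code):

```python
# Hypothetical sketch of the missed-tick check described above; the
# names and the exact formula are assumptions, not actual pcp-dstat code.
KERNEL_SLACK = 0.5  # the hard-coded 500 msec sanity threshold

def missed_ticks(started, finished, interval):
    """Return how many whole sampling intervals appear to have been skipped."""
    elapsed = finished - started
    if elapsed <= interval + KERNEL_SLACK:
        return 0  # within tolerance: no warning
    return int(elapsed / interval)

# A 1-second interval that actually took 2.6 seconds reports 2 missed ticks.
print(missed_ticks(0.0, 2.6, 1.0))  # prints 2
```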

cheers.

Comment 2 Christian Horn 2022-05-11 05:49:00 UTC
Hi,

let me get these, plus output from

 $ time cat /proc/diskstats >/dev/null
 $ time cat /proc/stat >/dev/null
 $ time cat /proc/net/dev>/dev/null

just to rule out that a slow terminal impacts the runtimes here.

On the KVM guest the times are as follows:
1609  block devices: 0.01s 0.001s 0.006s
12009 block devices: 0.22s 0.001s 0.014s

Getting these from the customer's real system, he also got
'missed 6 ticks' and 'missed 10 ticks' outputs, not just
'missed 2 ticks' as in my reproducer.

Comment 11 Christian Horn 2022-06-14 00:48:19 UTC
We got a question on the details of

self.pmconfig.validate_metrics(False, 16364)

in /usr/libexec/pcp/bin/pcp-dstat:

- Can we know what each value in the statement above indicate,
  like pmconfig, validate_metrics, False and the number that is followed ? 
- What is the unit of the number 16364 like GB or MB like that ?

As per my understanding, this applies:
- This should not be seen as a hierarchical metric name,
  but as an object name, i.e. one created with
    self.pmconfig = pmconfig.pmConfig(self)
  in /usr/libexec/pcp/bin/pcp-dstat
- 16364 relates to the number of metrics dstat can deal with.
  We hit an unusually high number of block devices here, leading to
  more metrics (indom) being dealt with, bringing up this issue.

Comments/corrections?

Comment 12 Nathan Scott 2022-06-14 01:08:53 UTC
(In reply to Christian Horn from comment #11)
> We got a question on the details of
> 
> self.pmconfig.validate_metrics(False, 16364)
> 
> in /usr/libexec/pcp/bin/pcp-dstat:
> 
> - Can we know what each value in the statement above indicate,
>   like pmconfig, validate_metrics, False and the number that is followed ? 
> - What is the unit of the number 16364 like GB or MB like that ?
> 
> As per my understanding, this applies:
> - This should not be seen as a hierarchical metric name,
>   but as an object name, i.e. created with
>     self.pmconfig = pmconfig.pmConfig(self)
>   in /usr/libexec/pcp/bin/pcp-dstat

That's correct.  It's Python code - we're calling the validate_metrics method here, which is part of the pmConfig Python class, and that in turn is part of the DstatTool class (self).

> - 16364 is relating to the number of metrics dstat can deal with.
>   We hit here an unusually high number of block devices, leading to
>   more metrics (indom) being dealt with, bringing up this issue.

Not quite - it's an upper bound on the number of instances, not the number of metrics.
Instances are things like sda, sdb, sdc; metrics are things like disk.dev.read and disk.dev.write.
So, this value (which has no units) limits the number of values that can be associated with the instances of a metric.
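The instance cap described here can be illustrated with a small sketch (the function name and structure below are stand-ins for this explanation, not the real pmConfig API):

```python
# Illustrative sketch only: a unitless cap on how many instances a
# metric may have, per the explanation above.  validate_metrics and
# MAX_INSTANCES here are stand-ins, not the real pmConfig API.
MAX_INSTANCES = 16364

def validate_metrics(metrics, limit=MAX_INSTANCES):
    """Reject any metric whose instance list exceeds the cap."""
    for name, instances in metrics.items():
        if len(instances) > limit:
            raise ValueError(
                f"{name}: {len(instances)} instances exceeds limit {limit}")
    return True

# Instances (sda, sdb, ...) belong to metrics (disk.dev.read, ...).
sample = {
    "disk.dev.read":  ["sda", "sdb", "sdc"],
    "disk.dev.write": ["sda", "sdb", "sdc"],
}
print(validate_metrics(sample))  # prints True
```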

> Comments/corrections?

HTH.

Comment 13 Christian Horn 2022-06-14 02:27:44 UTC
Thanks a bunch!

Comment 16 Thomas Gardner 2022-09-15 19:24:35 UTC
For what it's worth, I asked the customer of case #03270143 what their preference would be regarding the options listed in comment #4.  This particular customer voted for:

Make the 500ms warning time configurable from the command line
Command line option to disable these warnings entirely

I would say you could probably do both with one command-line option.  Going with the first option, a special value such as 0 or -1 could be treated as a special case that effectively gives the second option.  It's a reasonably common approach for things like this.
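The combined approach suggested here could look something like this (a hypothetical argparse sketch; the option name --missed-warn-ms is made up for illustration and is not an actual pcp-dstat flag):

```python
import argparse

# A sketch of the combined approach: one option sets the warning
# threshold, and a special value of 0 disables warnings entirely.
# The option name --missed-warn-ms is made up for this illustration;
# it is not an actual pcp-dstat flag.
parser = argparse.ArgumentParser(prog="dstat-sketch")
parser.add_argument("--missed-warn-ms", type=int, default=500,
                    help="warn when sampling overruns by this many "
                         "milliseconds; 0 disables the warning")

args = parser.parse_args(["--missed-warn-ms", "0"])
warn_enabled = args.missed_warn_ms > 0
print(warn_enabled)  # prints False
```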

Comment 17 Nathan Scott 2022-09-20 22:00:16 UTC
Thanks Thomas.  At this stage I'm leaning towards the simpler option of just the command line option to disable the warnings entirely.

On reflection (since my notes in #c4) I don't see a strong case for letting the user fine-tune the timing here.  Once they've been informed of the relative slowness of kernel sampling, they're most likely to make a mental note of it and then want to clear the issue from dstat reporting permanently.

If anyone feels strongly for/against this approach, please let me know - otherwise I'll get on with making it happen.

Comment 18 Nathan Scott 2022-09-26 05:05:42 UTC
Upstream now:

commit 5e19baea8fdb32a685a2bbf06450679baf36185b
Author: Nathan Scott <nathans>
Date:   Wed Sep 21 09:35:49 2022 +1000

    pcp-dstat: add --nomissed command line option for large systems
    
    After a number of complaints about the "N missed ticks" reporting
    from dstat from users on very large systems with many disks, this
    introduces a command line argument to optionally suppress it.
    
    In the cases we've seen, its been confirmed that the increased
    sample time is spent in the kernel with the large device count -
    not something we can do anything about anyway.
    
    This change has been tested by artificially inducing the failure
    mode as I have no local systems showing this behaviour live.  We
    cannot use archives for this either as we do not do sub-sampling
    for historical data in pcp-dstat.
    
    Resolves Red Hat BZ #2083897

Comment 19 Thomas Gardner 2022-09-29 18:29:39 UTC
Are we gonna pull that in to ours, then?

Comment 20 Nathan Scott 2022-09-29 22:05:38 UTC
At some point, for sure.  Exact timing is still under discussion but expect an update here soon.

Comment 21 Jan Kurik 2022-10-20 10:14:19 UTC
As I have no local system where I could successfully reproduce this behavior, I am marking the verification as SanityOnly.
The warning-message suppression is implemented (code review has been done) and documented in the man page as well as in the '--help' output.
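The suppression behaviour can be sketched as a simple flag-gated warning (illustrative Python only; the function below is not the actual pcp-dstat implementation):

```python
# Sketch of how a boolean suppression flag like --nomissed can gate the
# warning text appended to a sample line; names here are illustrative,
# not the actual pcp-dstat implementation.
def tick_warning(missed, nomissed=False):
    """Return the trailing warning text for a sample line, if any."""
    if nomissed or missed == 0:
        return ""
    return f" missed {missed} ticks"

print(repr(tick_warning(2)))                 # prints ' missed 2 ticks'
print(repr(tick_warning(2, nomissed=True)))  # prints ''
```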

Comment 23 Christian Horn 2022-11-24 00:28:54 UTC
The details in #0 should be good for verification.

Comment 25 errata-xmlrpc 2023-05-16 08:13:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcp bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2745

