Bug 2083897
| Summary: | 'pcp dstat' outputs 'missed clock ticks' on systems with many devices | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Christian Horn <chorn> |
| Component: | pcp | Assignee: | Nathan Scott <nathans> |
| Status: | CLOSED ERRATA | QA Contact: | Jan Kurik <jkurik> |
| Severity: | unspecified | Docs Contact: | Jacob Taylor Valdez <jvaldez> |
| Priority: | unspecified | | |
| Version: | 8.0 | CC: | agerstmayr, jkurik, nathans, thgardne |
| Target Milestone: | rc | Keywords: | FutureFeature, Triaged |
| Target Release: | 8.8 | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pcp-5.3.7-11.el8 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-16 08:13:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Christian Horn
2022-05-11 00:37:31 UTC
Hi Christian,

Could you provide some timing measurements from this system for me please? Thanks!

$ time cat /proc/diskstats
$ time cat /proc/stat
$ time cat /proc/net/dev

Looking at the code, we cannot tackle this using a larger sample time - it's using a hard-coded sanity check that communicating with the kernel takes less than 500 msec.

cheers.

Hi, let me get these, plus output from

$ time cat /proc/diskstats >/dev/null
$ time cat /proc/stat >/dev/null
$ time cat /proc/net/dev >/dev/null

just to rule out that a slow terminal impacts the runtime here.

On the KVM guest the times are as follows:

1609 block devices: 0.01s 0.001s 0.006s
12009 block devices: 0.22s 0.001s 0.014s

Getting these from the customer's real system, he also got 'missed 6 ticks' and 'missed 10 ticks' outputs, not just 'missed 2 ticks' as with my reproducer.

We got a question on the details of
self.pmconfig.validate_metrics(False, 16364)
in /usr/libexec/pcp/bin/pcp-dstat:
- Can we know what each value in the statement above indicates,
like pmconfig, validate_metrics, False and the number that follows?
- What is the unit of the number 16364, like GB or MB?
As per my understanding, this applies:
- This should not be seen as a hierarchical metric name,
but as an object name, i.e. created with
self.pmconfig = pmconfig.pmConfig(self)
in /usr/libexec/pcp/bin/pcp-dstat
- 16364 relates to the number of metrics dstat can deal with.
We hit an unusually high number of block devices here, leading to
more metrics (indom) being dealt with, which brings up this issue.
Comments/corrections?
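(As an aside on the "missed ticks" behaviour being debugged here: the following is a minimal, hypothetical sketch of how a sampling loop can detect and report overruns against a 500 msec budget. It is not the actual pcp-dstat code; the names, threshold handling and tick arithmetic are assumptions for illustration only.)

```python
import time

# Illustrative values only; pcp-dstat's real check is a hard-coded test that
# talking to the kernel takes less than 500 msec, as noted earlier in this bug.
SAMPLE_INTERVAL = 1.0   # seconds between samples (the dstat "delay")
WARN_THRESHOLD = 0.5    # warn when kernel sampling overruns this budget

def collect_sample():
    # One sample means re-reading the relevant /proc files.
    for path in ("/proc/diskstats", "/proc/stat", "/proc/net/dev"):
        with open(path) as f:
            f.read()

def sample_loop(count):
    next_tick = time.monotonic()
    for _ in range(count):
        start = time.monotonic()
        collect_sample()
        elapsed = time.monotonic() - start
        # With tens of thousands of block devices, /proc/diskstats alone can
        # take long enough that scheduled intervals ("ticks") slip.
        if elapsed > WARN_THRESHOLD:
            missed = int(elapsed // SAMPLE_INTERVAL) + 1
            print("missed %d ticks (sampling took %.2fs)" % (missed, elapsed))
        next_tick += SAMPLE_INTERVAL
        time.sleep(max(0.0, next_tick - time.monotonic()))

sample_loop(5)
```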
(In reply to Christian Horn from comment #11)

> We got a question on the details of
> self.pmconfig.validate_metrics(False, 16364)
> in /usr/libexec/pcp/bin/pcp-dstat:
> - Can we know what each value in the statement above indicates,
>   like pmconfig, validate_metrics, False and the number that follows?
> - What is the unit of the number 16364, like GB or MB?
>
> As per my understanding, this applies:
> - This should not be seen as a hierarchical metric name,
>   but as an object name, i.e. created with
>   self.pmconfig = pmconfig.pmConfig(self)
>   in /usr/libexec/pcp/bin/pcp-dstat

That's correct. It's Python code - we're calling the validate_metrics method here, which is part of the pmConfig Python class, and that in turn is part of the DstatTool class (self).

> - 16364 relates to the number of metrics dstat can deal with.
>   We hit an unusually high number of block devices here, leading to
>   more metrics (indom) being dealt with, bringing up this issue.

Not quite - it's an upper bound on the number of instances, not the number of metrics. Instances are like sda, sdb, sdc - metrics are like disk.dev.read, disk.dev.write. So this value (which has no units) applies a limit to the number of values that can be associated with the instances of a metric.

> Comments/corrections?

HTH.

Thanks a bunch!

For what it's worth, I asked the customer of case #03270143 what their preference would be regarding the options listed in comment #4. This particular customer voted for:

- Make the 500 ms warning time configurable from the command line
- Command line option to disable these warnings entirely

I would say you could probably do both with one command line option. Going with the first option, a special value of either 0 or -1 or the like could be used as a special case to effectively do the second option. It's a reasonably commonly used method for things like this.

Thanks Thomas. At this stage I'm leaning towards the simpler option of just the command line option to disable the warnings entirely. On reflection (since my notes in #c4) I don't see a strong case to allow the user to fine-tune the timing here. After they've been informed initially of the relative slowness of kernel sampling, they're most likely to just make a mental note of that and then want to clear the issue from dstat reporting permanently. If anyone feels strongly for/against this approach, please let me know - otherwise I'll get on with making it happen.

Upstream now:
commit 5e19baea8fdb32a685a2bbf06450679baf36185b
Author: Nathan Scott <nathans>
Date: Wed Sep 21 09:35:49 2022 +1000
pcp-dstat: add --nomissed command line option for large systems
After a number of complaints about the "N missed ticks" reporting
from dstat from users on very large systems with many disks, this
introduces a command line argument to optionally suppress it.
In the cases we've seen, its been confirmed that the increased
sample time is spent in the kernel with the large device count -
not something we can do anything about anyway.
This change has been tested by artificially inducing the failure
mode as I have no local systems showing this behaviour live. We
cannot use archives for this either as we do not do sub-sampling
for historical data in pcp-dstat.
Resolves Red Hat BZ #2083897
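For reference, once a build containing this change is installed (the "Fixed In Version" field above lists pcp-5.3.7-11.el8 for RHEL 8), the warning can be suppressed with the new flag. The delay and count values below are only an example invocation, not taken from this bug; the option is also described in the pcp-dstat man page and --help output, per the verification note further down.

$ pcp dstat --nomissed 1 10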
Are we gonna pull that in to ours, then?

At some point, for sure. Exact timing is still under discussion, but expect an update here soon.

As I have no local systems where I successfully reproduced this behavior, I am marking the verification as SanityOnly. The warning message suppression is implemented (code review has been done) and documented in the man page as well as in the '--help' command line output. The details in #0 should be good for verification.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pcp bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2745
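As a closing aside on the metric/instance distinction explained earlier in this bug: the PCP Python bindings can list the instances behind a disk metric directly. This is a rough sketch based on the pcp.pmapi module; the exact calls and the "local:" context spec are assumptions for illustration, not code taken from pcp-dstat or this report.

```python
from pcp import pmapi
import cpmapi as c_api

# Connect to the local pmcd; "local:" is a commonly used context spec (assumption).
ctx = pmapi.pmContext(c_api.PM_CONTEXT_HOST, "local:")

# Metrics are names like disk.dev.read; instances are the per-device
# entries (sda, sdb, ...) hanging off the metric's instance domain.
pmids = ctx.pmLookupName(["disk.dev.read"])
descs = ctx.pmLookupDescs(pmids)
insts, names = ctx.pmGetInDom(descs[0])

# On systems like the one in this report the count runs into the thousands,
# which is what makes each sampling pass slow.
print("disk.dev.read has %d instances, e.g. %s" % (len(names), names[:3]))
```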