Bug 1779419
| Summary: | dstat does not show disk statistics | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | olaf.weiser <olaf.weiser> |
| Component: | pcp | Assignee: | Nathan Scott <nathans> |
| Status: | CLOSED ERRATA | QA Contact: | Jan Kurik <jkurik> |
| Severity: | medium | Priority: | unspecified |
| Version: | 8.1 | CC: | agerstmayr, chorn, jkurik, mgoodwin, mnewsome, myllynen, nathans, patrickm |
| Target Milestone: | rc | Keywords: | Bugfix, Triaged |
| Target Release: | 8.2 | Flags: | pm-rhel: mirror+ |
| Hardware: | ppc64le | OS: | Unspecified |
| Fixed In Version: | pcp-5.0.2 | Doc Type: | No Doc Update |
| Last Closed: | 2020-04-28 15:40:22 UTC | Type: | Bug |
| Attachments: | output of commands (attachment 1642094) | | |
Description olaf.weiser@de.ibm.com 2019-12-03 23:34:51 UTC
Hi there Olaf,

The values in the disk columns come from ...

$ head -14 /etc/pcp/dstat/disk
#
# pcp-dstat(1) configuration file - see pcp-dstat(5)
#

[disk]
label = dsk/%I
printtype = b
precision = 0
grouptype = 2
reads = disk.dev.read_bytes
reads.label = read
writes = disk.dev.write_bytes
writes.label = writ
...

Could you attach the output from the following commands on this system?

$ pminfo -L -f disk.dev.read_bytes
$ cat /proc/diskstats

Thanks!

[root@ ~]# pminfo -L -f disk.dev.read_bytes | wc -l
1274
[root@ ~]# pminfo -L -f disk.dev.read_bytes | head -20
disk.dev.read_bytes
inst [0 or "sda"] value 46088
inst [1 or "sdb"] value 4148913
inst [2 or "sde"] value 40380656
inst [3 or "sdc"] value 36606744
inst [4 or "sdr"] value 91868924
inst [5 or "sdk"] value 40163592
inst [6 or "sdd"] value 35397968
inst [7 or "sdaf"] value 40063352
inst [8 or "sdaa"] value 54474488
inst [9 or "sdah"] value 55986024
inst [10 or "sdg"] value 40295152
inst [11 or "sdaj"] value 40170128
inst [12 or "sdx"] value 39838808
inst [13 or "sdab"] value 55742472
inst [14 or "sdn"] value 91288616
inst [15 or "sdp"] value 56656768
inst [16 or "sdl"] value 91364784
inst [17 or "sdh"] value 61786808
[root@ ~]# cat /proc/diskstats | wc -l
2551
[root@ ~]# cat /proc/diskstats | head -20
8 0 sda 982 0 92177 3212 115 8 8176 733 0 4560 2310 0 0 0 0
8 1 sda1 750 0 72585 2059 87 8 8176 523 0 3870 1520 0 0 0 0
8 16 sdb 73077 11217 8297826 443241 4639861 305612 214851048 40944798 0 26608180 27116860 0 0 0 0
8 17 sdb1 1298 0 161232 8308 0 0 0 0 0 5720 5480 0 0 0 0
8 18 sdb2 662 0 170089 4408 44 0 4528 415 0 3590 3130 0 0 0 0
8 19 sdb3 70899 11217 7948705 425815 3838946 305612 214846520 22308160 0 26163540 12570110 0 0 0 0
8 64 sde 1224255 7781 80761312 745075 1274 0 5169680 34945 0 481770 440650 0 0 0 0
8 65 sde1 13992 0 449568 2445 0 0 0 0 0 2490 10 0 0 0 0
8 32 sdc 19625 0 73213488 784906 186683 0 762787208 12988736 0 3694200 12747410 0 0 0 0
65 16 sdr 1248817 10671 183737848 1908524 204695 0 837042984 17617101 0 4600100 17979960 0 0 0 0
65 17 sdr1 14014 0 540032 3426 0 0 0 0 0 2990 610 0 0 0 0
8 160 sdk 1224443 7542 80327184 706375 1266 0 5144880 33811 0 477930 403840 0 0 0 0
8 161 sdk1 13992 0 449568 2481 0 0 0 0 0 2520 20 0 0 0 0
8 48 sdd 19095 0 70795936 735641 181047 0 740351848 10617225 0 3605480 10357910 0 0 0 0
65 240 sdaf 1222902 9016 80126704 769254 1243 0 5053904 33106 0 485770 448070 0 0 0 0
65 241 sdaf1 13965 27 449568 2510 0 0 0 0 0 2560 50 0 0 0 0
65 160 sdaa 1232979 8369 108948976 1199837 176427 0 721319160 10416379 0 3781320 10351750 0 0 0 0
65 161 sdaa1 14009 0 519472 3102 0 0 0 0 0 2930 480 0 0 0 0
66 16 sdah 1234511 7510 111972048 1212308 196714 1 804880552 14425492 0 4105150 14257540 0 0 0 0
66 17 sdah1 14014 0 540032 3587 0 0 0 0 0 3000 670 0 0 0 0
Please tell me if you need all lines ..
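An editorial aside on how the two outputs above line up: the disk.dev.read_bytes values are in Kbytes, derived from the "sectors read" field (512-byte sectors) in /proc/diskstats, so sda gives 92177 / 2 = 46088 and sdb gives 8297826 / 2 = 4148913, exactly the pminfo values shown. A minimal stdlib-only sketch of that mapping (not PCP code, just an illustration):

```python
def read_kbytes(path='/proc/diskstats'):
    """Map device names to Kbytes read, the way disk.dev.read_bytes
    reports them: the 'sectors read' field (512-byte sectors) / 2.
    Note /proc/diskstats lists partitions (sda1, ...) as well, while
    the disk.dev instance domain covers whole disks only."""
    values = {}
    with open(path) as statsfile:
        for line in statsfile:
            fields = line.split()
            # fields: major minor name reads merged sectors_read ...
            values[fields[2]] = int(fields[5]) // 2
    return values

if __name__ == '__main__':
    print(read_kbytes().get('sda'))   # 46088 on the system in this report
```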
> Please tell me if you need all lines ..

Yes please (via 'Add an attachment' in Bugzilla) - that should allow me to reproduce the problem exactly as you see it and confirm any fix we come up with.

Created attachment 1642094 [details]
output of commands
Thanks Olaf, I've been able to reproduce the problem with those files. I also have a workaround for you.
The fundamental issue is that you have more devices than a hard limit imposed by one of our Python libraries, and the way dstat interfaces with that library is not dynamic enough to cope with that limit (resolving this is what I'll work on next).
The temporary workaround is to edit your local dstat (python) script as follows:
diff --git a/src/pcp/dstat/pcp-dstat.py b/src/pcp/dstat/pcp-dstat.py
index 29d63bf01..632f83795 100755
--- a/src/pcp/dstat/pcp-dstat.py
+++ b/src/pcp/dstat/pcp-dstat.py
@@ -1024,7 +1024,7 @@ class DstatTool(object):
             sys.exit(1)

         self.pmconfig.validate_common_options()
-        self.pmconfig.validate_metrics()
+        self.pmconfig.validate_metrics(False, 2048)

         for i, plugin in enumerate(self.totlist):
             for name in self.metrics:
This raises an internal limit to cater for your case of more than 1024 disk instances. In the meantime, I'll work on making the code dynamically adjust this value based on the number of instances observed.
cheers.
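For illustration only, a minimal sketch of that dynamic idea - this is not the upstream patch, just the shape of it: derive the limit from the number of device lines the kernel currently exposes, keep the old default as a floor, and pass the result as the second argument to validate_metrics() as in the workaround above.

```python
import math

def instance_limit(path='/proc/diskstats', floor=1024):
    """Derive a per-metric instance limit from the number of block
    device lines the kernel currently exposes, rounded up to the
    next power of two, with the old default of 1024 as a floor."""
    with open(path) as statsfile:
        ndev = sum(1 for line in statsfile if line.strip())
    return max(floor, 2 ** math.ceil(math.log2(max(ndev, 1))))

if __name__ == '__main__':
    # On the system in this report (2551 diskstats lines, 1274
    # disk.dev instances) this yields 4096, comfortably above the
    # hard-coded 1024 that dstat was hitting.
    print(instance_limit())
```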
Hallo Nathan,

I changed that line as proposed .. and now .. it works like a charm, thank you very much.

To document what it looked like before:

[root@ ~]# dstat
You did not select any stats, using -cdngy by default.
----total-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read writ| recv send| in out | int csw
1 2 97 0 0| | 33k 133k| 0 0 |5309 3772
1 1 98 0 0| | 26k 191k| 0 0 |4844 3597
0 1 99 0 0| | 14k 1628 | 0 0 |1705 3341
^C

After, with the changed line:

[root@ ~]# dstat
You did not select any stats, using -cdngy by default.
----total-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read writ| recv send| in out | int csw
0 1 98 0 0| 0 0 | 13k 642 | 0 0 |1347 3222
1 1 98 0 0| 291M 0 | 12k 306 | 0 0 |1541 1383
1 1 98 0 0| 251M 0 | 11k 330 | 0 0 |1208 1837
1 1 98 0 0| 20M 36k| 12k 330 | 0 0 |1387 3110
^C

Nathan, anything else you need from our side?

Nothing else thanks Olaf - the complete fix is now merged upstream and will make its way into the next PCP RHEL update (later this week, for 8.2).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1628

Hi Olaf and Nathan,
just an FYI that we are hitting this on RHEL 7 as well.
Given the RHEL 7 life cycle, this probably does not meet the RHEL 7.x errata criteria.
Reproducer for anybody who wants to replicate (a KVM guest with RHEL 7 is enough):
# modprobe scsi-debug add_host=16 num_parts=32 num_tgts=64
# wc -l /proc/partitions
5130 /proc/partitions
# pcp dstat
You did not select any stats, using -cdngy by default.
----total-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read writ| recv send| in out | int csw
1 3 96 0 0| | 63 159 | 0 0 | 236 158
0 1 99 0 0| | 398 666 | 0 0 | 103 106
0 1 99 0 0| | 66 308 | 0 0 | 105 79
The patch from comment #5 works on the reproducer.
Christian
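For anyone replaying this, a quick way to confirm the reproducer has pushed the system past the old 1024-instance ceiling, without running dstat at all, is to count the entries in /proc/partitions. A small stdlib-only sketch:

```python
def count_partitions(path='/proc/partitions'):
    """Count device entries, skipping the header line and blanks."""
    with open(path) as partsfile:
        lines = [line for line in partsfile if line.strip()]
    return len(lines) - 1   # drop the 'major minor #blocks name' header

if __name__ == '__main__':
    # With the scsi_debug module loaded as above this reports several
    # thousand devices, well past the old 1024-instance limit.
    print('%d devices; old dstat limit was 1024' % count_partitions())
```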
Thanks Christian,

One more thing worth mentioning for anyone reading this BZ... Since dstat (aka pcp-dstat) is now a PCP client tool, you can use a remote (newer) system for running analysis. In other words, if you have a production system that cannot be upgraded, you can run pcp-dstat from a remote, recent (even Fedora or Ubuntu) distribution and point it at the production host:

$ pcp --host <hostname> dstat

Thus, you can use the newer, fixed version of pcp-dstat against the pmcd and metrics of a RHEL 7 host. Similarly, we can take PCP archives from that production host and analyse them with more recent versions of PCP:

$ pcp --archive <path> dstat --all --time

cheers.

Great idea, had not thought of that. kbase modified accordingly; that might help some affected customers.
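As a footnote to Nathan's remote-analysis suggestion above: the same "PCP client" idea works programmatically too. A rough sketch with the PCP Python bindings (python3-pcp), fetching the per-disk counters from a remote pmcd; the hostname is a placeholder and error handling (pmapi.pmErr) is omitted for brevity:

```python
from pcp import pmapi
import cpmapi as c_api

# Connect to pmcd on the (possibly older, e.g. RHEL 7) production host.
ctx = pmapi.pmContext(c_api.PM_CONTEXT_HOST, 'acme.example.com')

pmids = ctx.pmLookupName('disk.dev.read_bytes')
descs = ctx.pmLookupDescs(pmids)
insts, names = ctx.pmGetInDom(descs[0])     # instance ids -> disk names
byname = dict(zip(insts, names))

result = ctx.pmFetch(pmids)
for i in range(result.contents.get_numval(0)):
    value = result.contents.get_vlist(0, i)
    atom = ctx.pmExtractValue(result.contents.get_valfmt(0), value,
                              descs[0].contents.type, c_api.PM_TYPE_U64)
    print(byname.get(value.inst, value.inst), atom.ull)
ctx.pmFreeResult(result)
```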