Bug 1724288

Summary: Wrong wait calculation for collectl-dm-sD and collectl-sD in /etc/pcp/pmrep/pmrep.conf
Product: [Fedora] Fedora Reporter: Alexandros Panagiotou <apanagio>
Component: pcp Assignee: Mark Goodwin <mgoodwin>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: medium    
Version: 29 CC: brolley, fche, lberk, mgoodwin, myllynen, nathans
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pcp-4.3.3-1 pcp-4.3.4-1.fc30 pcp-4.3.4-1.fc29 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-08-20 01:48:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alexandros Panagiotou 2019-06-26 16:32:24 UTC
Description of problem:
The average wait value (covering both reads and writes) in the collectl-sD and collectl-dm-sD reports of pmrep is calculated incorrectly: it simply adds r_await and w_await. This results in lines like the following:

           Mapped Device   DM   rkB/s rrqm/s     r/s rareq-sz r_await     wkB/s   wrqm/s     w/s wareq-sz w_await    wait aqu-sz  svctm  %util
                                 KB/s count/ count/s Kbyte/co ms/coun      KB/s  count/s count/s Kbyte/co ms/coun ms/coun        s/coun
11:00:25      <dev_name> dm-3 6435.30   0.00  804.30     8.00    3.27      5.60     0.20    0.80     7.00    3.25    6.52   2.63   1.16  93.65

6.52 ms cannot be the average wait time for all requests to dm-3 when its read requests waited 3.27 ms on average and its write requests 3.25 ms.
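
As a quick sanity check, here is a minimal Python sketch of what a request-weighted average would give for the dm-3 line above. The function name is hypothetical, and the sketch assumes the r/s and w/s columns can stand in as request-count weights for the interval; it is not how pmrep computes anything.

```python
def combined_await(r_await_ms, reads_per_s, w_await_ms, writes_per_s):
    """Request-weighted average wait (ms) across reads and writes."""
    total_requests = reads_per_s + writes_per_s
    if total_requests == 0:
        return 0.0  # no I/O in the interval, so nothing waited
    total_wait = r_await_ms * reads_per_s + w_await_ms * writes_per_s
    return total_wait / total_requests


# Values copied from the dm-3 sample line in this report.
weighted = combined_await(3.27, 804.30, 3.25, 0.80)
naive_sum = 3.27 + 3.25  # what the report currently shows in the wait column

print(f"weighted average: {weighted:.2f} ms")   # 3.27 ms
print(f"reported wait:    {naive_sum:.2f} ms")  # 6.52 ms
```

Because reads dominate the request mix here, the weighted value stays close to r_await and always lies between r_await and w_await, which the plain sum cannot do.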

Version-Release number of selected component (if applicable):
pcp-system-tools-4.3.2-1.fc29.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Collect a PCP archive
2. Run the collectl-dm-sD or collectl-sD report from pmrep on that archive - e.g.:

   pmrep -a <archive_name>  -t10s :collectl-dm-sD

   
3. Compare the columns r_await, w_await and wait

Actual results:

wait is calculated as:

wait.formula         = delta(disk.dev.read_rawactive) / delta(disk.dev.read) + delta(disk.dev.write_rawactive) / delta(disk.dev.write)

which is effectively r_await + w_await

Expected results:

Maybe a better formula would be:

wait.formula         = (delta(disk.dev.read_rawactive) + delta(disk.dev.write_rawactive)) / (delta(disk.dev.read) + delta(disk.dev.write))

Additional info:
I'm currently using Fedora 29 (hence the BZ is filed against Fedora 29), but the same calculation is likely present in the pcp versions shipped with RHEL and other Fedora releases as well.

Comment 1 Alexandros Panagiotou 2019-06-27 18:29:23 UTC
Hello,
One more note about the collectl-dm-sD report in pmrep.conf: 

The width of hinv.map.dmname is set to 4. This fits dm-3 but truncates dm-35 making it look like dm-3. Probably setting it to 6 is safer.
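
To illustrate the effect, the following Python sketch mimics a fixed-width column that cuts off overflowing values. The helper `fit_column` is hypothetical and only models the truncation described above; it is not pmrep's actual formatting code.

```python
def fit_column(value, width):
    """Right-align value in a fixed-width column, cutting anything that overflows."""
    return value[:width].rjust(width)


for width in (4, 6):
    cells = [fit_column(name, width) for name in ("dm-3", "dm-35")]
    print(f"width={width}: {cells}")
```

With a width of 4 both names render identically, while a width of 6 keeps them distinct.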

Regards,
Alexandros

Comment 2 Mark Goodwin 2019-06-27 23:59:18 UTC

Hi Alexandros, I agree your proposed formula change is correct:

wait.formula = (delta(disk.dev.read_rawactive) + delta(disk.dev.write_rawactive)) / (delta(disk.dev.read) + delta(disk.dev.write))

Since we already have disk.dev.total_rawactive and disk.dev.total, it can be simplified to:

wait.formula = delta(disk.dev.total_rawactive) / delta(disk.dev.total)
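
A minimal Python sketch of the difference between the formulas, using made-up counter samples. It assumes that the total counters equal the sum of the read and write counters and that the rawactive counters accumulate milliseconds; the dictionary keys only mirror the metric names and `delta` is a stand-in for the pmrep function of the same name.

```python
# Hypothetical cumulative counter samples, one sampling interval apart.
prev = {"read": 1000, "write": 200, "read_rawactive": 5000, "write_rawactive": 900}
curr = {"read": 1800, "write": 210, "read_rawactive": 7600, "write_rawactive": 930}


def delta(name):
    """Change of a counter between the two samples."""
    return curr[name] - prev[name]


r_await = delta("read_rawactive") / delta("read")
w_await = delta("write_rawactive") / delta("write")

# Old formula: sum of the two per-direction averages.
old_wait = r_await + w_await

# Assumed equivalents of disk.dev.total and disk.dev.total_rawactive.
d_total = delta("read") + delta("write")
d_total_rawactive = delta("read_rawactive") + delta("write_rawactive")

# Simplified formula: total active time divided by total requests.
new_wait = d_total_rawactive / d_total

print(f"r_await={r_await:.2f} w_await={w_await:.2f}")
print(f"old wait={old_wait:.2f} new wait={new_wait:.2f}")
```

With these numbers the old formula exceeds both per-direction averages, whereas the simplified one falls between them, weighted toward the direction with more requests.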

So I will make that change, thanks!

Will also update the global derived metrics definitions for disk.{dev,dm,md}.await since these are also incorrect.

Regards

Comment 3 Mark Goodwin 2019-06-28 02:31:02 UTC
Fixed with the following upstream commit for pcp-4.3.3-1 :

commit c119577d449868e45268bacad670e9a6b5dd9a7a
Author: Mark Goodwin <mgoodwin>
Date:   Fri Jun 28 12:18:18 2019 +1000

    pmrep: fix wait.formula for collectl-dm-sD and collectl-sD
    
    Resolves: Fedora BZ #1724288
    
    wait = r_wait + w_wait is incorrect, see BZ #1724288.
    Instead use the disk.*.total (reads + writes) metrics to
    derive the total average wait time.
    
    Fixed in both pmrep.conf and derived iostat.conf.
    
    QA tested for groups derive, pmrep and pmiostat.

Comment 4 Fedora Update System 2019-06-28 06:40:05 UTC
FEDORA-2019-4076c8c0d7 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-4076c8c0d7

Comment 5 Fedora Update System 2019-06-28 06:46:46 UTC
FEDORA-2019-cdb6bafc6d has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-cdb6bafc6d

Comment 6 Fedora Update System 2019-06-28 18:25:46 UTC
pcp-4.3.3-1.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-4076c8c0d7

Comment 7 Fedora Update System 2019-06-28 21:44:28 UTC
pcp-4.3.3-1.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-cdb6bafc6d

Comment 8 Alexandros Panagiotou 2019-07-01 10:32:24 UTC
Hello Mark,
Thanks for the very quick response on this one. I'm far from being an expert on the metrics that are collected, so I certainly trust you more than myself. Do you think it makes sense to also make the field for hinv.map.dmname wider in collectl-dm-sD? It is a different thing, so strictly speaking it should be a different BZ, but I think it is a fairly trivial presentation issue, thus mentioning it here. As is right now, with only 4 characters, it truncates device names (e.g. dm-35 will appear as dm-3 which can be surprising at a first look). 6 would probably be better - I guess dm device names with more than 3 digits are quite rare. I'm not sure if this applies to other reports or tools.

Regards,
Alexandros

Comment 9 Mark Goodwin 2019-07-04 06:55:34 UTC
(In reply to Alexandros Panagiotou from comment #8)
> Hello Mark,
> Thanks for the very quick response on this one. I'm far from being an expert
> on the metrics that are collected, so I certainly trust you more than
> myself. Do you think it makes sense to also make the field for
> hinv.map.dmname wider in collectl-dm-sD? It is a different thing, so
> strictly speaking it should be a different BZ, but I think it is a fairly
> trivial presentation issue, thus mentioning it here. As is right now, with
> only 4 characters, it truncates device names (e.g. dm-35 will appear as dm-3
> which can be surprising at a first look). 6 would probably be better - I
> guess dm device names with more than 3 digits are quite rare. I'm not sure
> if this applies to other reports or tools.
> 
 
Hi Alexandros,

Yes, it definitely makes sense to increase the column width to avoid truncating the DM name.
I made that change too, as part of the same commit; see upstream c119577d44986:
https://github.com/performancecopilot/pcp/commit/c119577d449868e45268bacad670e9a6b5dd9a7a

Regards

Comment 10 Alexandros Panagiotou 2019-07-04 15:59:41 UTC
Hello,
Indeed, I must have been blind when I was checking that the other day.

Thanks!
Alexandros

Comment 11 Fedora Update System 2019-08-16 01:53:01 UTC
FEDORA-2019-97183bed56 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-97183bed56

Comment 12 Fedora Update System 2019-08-16 01:53:31 UTC
FEDORA-2019-44b383ec91 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-44b383ec91

Comment 13 Fedora Update System 2019-08-17 01:27:37 UTC
pcp-4.3.4-1.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-97183bed56

Comment 14 Fedora Update System 2019-08-17 02:23:45 UTC
pcp-4.3.4-1.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-44b383ec91

Comment 15 Fedora Update System 2019-08-20 01:48:53 UTC
pcp-4.3.4-1.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.

Comment 16 Fedora Update System 2019-08-25 03:03:18 UTC
pcp-4.3.4-1.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.