Bug 1569854

Summary: pcp-pmda-dm dmstats not working
Product: Red Hat Enterprise Linux 7
Component: pcp
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: ---
Fixed In Version: pcp-4.2.0-1

Reporter: Marko Myllynen <myllynen>
Assignee: Mark Goodwin <mgoodwin>
QA Contact: qe-baseos-tools-bugs
CC: agk, bmr, cww, fche, lberk, mgoodwin, mlinden, nathans, tbowling

Status: CLOSED DUPLICATE
Type: Bug
Last Closed: 2019-03-05 06:11:14 UTC
Bug Depends On: 1647308
Bug Blocks: 1594286

Description Marko Myllynen 2018-04-20 06:41:07 UTC
Description of problem:
Following the instructions on the pmdadm(1) man page:

# yum install pcp pcp-pmda-dm pcp-system-tools
# dmstats create --alldevices --bounds 10us,30us,50us
# systemctl restart pmcd
# cd /var/lib/pcp/pmdas/dm
# echo c | ./Install
# pmrep dmcache dmstats dmthin

The last command gives no values at all; pminfo -dfmtT shows that the metric
metadata is otherwise as expected. dmsetup table reports two devices.

Version-Release number of selected component (if applicable):
RHEL 7.5 / pcp-3.12.2-5.el7.x86_64

Comment 2 Nathan Scott 2018-04-20 06:46:44 UTC
I believe both Bryn and Fumiya are looking into this currently (just FYI).

Comment 3 Nathan Scott 2018-05-23 00:19:31 UTC
Thanks for following up with this one, Mark.

Comment 4 Mark Goodwin 2018-05-23 03:18:44 UTC
Easily repro'd this on F27 with current pcp-4.1.0 master branch installed.

pmrep is not reporting any values because pmdadm is segfaulting. It looks like the pmdaCache opaque pointer (pw) is corrupted or not being saved properly: pw->region_id is clearly out of bounds (just a bit!). Things seem to work OK with only one dm device configured (e.g. dmstats create rootvg-root --bounds 10us,30us,50us), so this is probably a memory management issue when more than one device is configured; with --alldevices it fails with this segfault:

Program received signal SIGSEGV, Segmentation fault.
0x00007fa92eb15100 in dm_stats_get_histogram (dms=dms@entry=0x55939badf590, region_id=94092439760656, area_id=7594024938858622240) at libdm-stats.c:3424
3424		if (!dms->regions[region_id].counters)
(gdb) where
#0  0x00007fa92eb15100 in dm_stats_get_histogram (dms=dms@entry=0x55939badf590, region_id=94092439760656, area_id=7594024938858622240) at libdm-stats.c:3424
#1  0x000055939946fa9e in _pm_dm_refresh_stats_histogram_update (pw2=0x55939a72a160, pw=0x55939a72af20) at dmstats.c:243
#2  pm_dm_refresh_stats (pw=0x55939a72af20, instance=<optimized out>) at dmstats.c:324
#3  0x000055939946d37f in dm_fetch_refresh (pmda=0x55939a71c970, need_refresh=0x7ffca1d58d00) at pmda.c:1091
#4  dm_fetch (numpmid=15, pmidlist=0x55939badd85c, resp=0x7ffca1d58dc8, pmda=0x55939a71c970) at pmda.c:1131
#5  0x00007fa92efe22d0 in __pmdaMainPDU (dispatch=dispatch@entry=0x7ffca1d58e70) at mainloop.c:169
#6  0x00007fa92efe2dc0 in pmdaMain (dispatch=0x7ffca1d58e70) at mainloop.c:397
#7  0x000055939946cdfc in main (argc=3, argv=0x7ffca1d59ff8) at pmda.c:1270
(gdb) p dms
$1 = (const struct dm_stats *) 0x55939badf590
(gdb) p *dms
$2 = {bind_major = -1, bind_minor = -1, bind_name = 0x55939bac8880 "apps-virtimages", bind_uuid = 0x0, program_id = 0x55939badf0f0 "pmdadm\n", name = 0x55939badf660 "apps-virtimages", mem = 0x55939bae7bf0, 
  hist_mem = 0x55939bae7ba0, group_mem = 0x55939bae7b50, nr_regions = 1, max_region = 0, interval_ns = 0, timescale = 1000000, precise = 0, regions = 0x55939bac88a0, groups = 0x55939bae7c68, walk_flags = 562949953421312, 
  cur_flags = 562949953421312, cur_group = 18446744073709551615, cur_region = 1, cur_area = 562949953421312}
(gdb) p region_id
$3 = 94092439760656
(gdb) up
#1  0x000055939946fa9e in _pm_dm_refresh_stats_histogram_update (pw2=0x55939a72a160, pw=0x55939a72af20) at dmstats.c:243
243		if (!(dmh = dm_stats_get_histogram(dms, region_id, area_id))) {
(gdb) l 239
234	    static int number_of_bins = 0, bin = 0;
235	    uint64_t region_id, area_id;
236	    int i;
237	
238	    dms = pw->dms;
239	    region_id = pw->region_id;
240	    area_id = pw->area_id;
241	
242	    if (bin == 0) {
243		if (!(dmh = dm_stats_get_histogram(dms, region_id, area_id))) {
(gdb) p pw
$4 = (struct pm_wrap *) 0x55939a72af20
(gdb) p *pw
$5 = {dms = 0x55939badf590, dmsc = 0x55939a72afd0, pdmh = 0x55939a72b040, region_id = 94092439760656, area_id = 7594024938858622240, dev = "apps-virtimages\000stogram:10000,30000,50000,100000\n", '\000' <repeats 78 times>}
(gdb) up
#2  pm_dm_refresh_stats (pw=0x55939a72af20, instance=<optimized out>) at dmstats.c:324
324			  _pm_dm_refresh_stats_histogram_update(pw, pw2);
(gdb) l
319		    if ((inst = pmdaCacheOp(indom, PMDA_CACHE_WALK_NEXT)) < 0)
320			break;
321		    if (!pmdaCacheLookup(indom, inst, &name, (void **)&pw2) || !pw2)
322			continue;
323		    if (!strcmp(pw2->dev, pw->dev))
324			  _pm_dm_refresh_stats_histogram_update(pw, pw2);
325		}
326	    } else if (instance == DM_HISTOGRAM_INDOM) {
327		if (!(pw->dms = _dm_stats_get_region(pw->dev)))
328		    goto nostats;
(gdb) p inst
$6 = <optimized out>
(gdb) p name
$7 = 0x55939a729e10 "apps-virtimages:0:0s"
(gdb) p pw, pw2
$8 = (struct pm_wrap *) 0x55939a72a160
(gdb) p *pw
$9 = {dms = 0x55939badf590, dmsc = 0x55939a72afd0, pdmh = 0x55939a72b040, region_id = 94092439760656, area_id = 7594024938858622240, dev = "apps-virtimages\000stogram:10000,30000,50000,100000\n", '\000' <repeats 78 times>}
(gdb) up
#3  0x000055939946d37f in dm_fetch_refresh (pmda=0x55939a71c970, need_refresh=0x7ffca1d58d00) at pmda.c:1091
1091	                pm_dm_refresh_stats(pw, DM_STATS_INDOM);
(gdb) p pw
$10 = (struct pm_wrap *) 0x55939a72af20
(gdb) p *pw
$11 = {dms = 0x55939badf590, dmsc = 0x55939a72afd0, pdmh = 0x55939a72b040, region_id = 94092439760656, area_id = 7594024938858622240, dev = "apps-virtimages\000stogram:10000,30000,50000,100000\n", '\000' <repeats 78 times>}
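
For readers unfamiliar with PCP internals: the "pmdaCache opaque pointer" mentioned above is a per-instance pointer that a PMDA attaches to each instance-domain entry and gets handed back on every fetch. Below is a minimal sketch of that round trip; the struct and field names follow the backtrace above, while everything else (build line, indom numbers) is illustrative only and not the PMDA source:

/*
 * Sketch of the pmdaCache opaque-pointer round trip (illustrative only).
 * Build roughly as: cc sketch.c -lpcp_pmda -lpcp
 * (older PCP releases may also need #include <pcp/impl.h>)
 */
#include <stdio.h>
#include <stdint.h>
#include <pcp/pmapi.h>
#include <pcp/pmda.h>

struct pm_wrap {                    /* field names as seen in the backtrace */
    void     *dms, *dmsc, *pdmh;    /* struct dm_stats * etc. in the real code */
    uint64_t  region_id, area_id;
    char      dev[128];
};

int
main(void)
{
    pmInDom indom = pmInDom_build(129, 0);  /* arbitrary serial; 129 is the dm PMDA domain */
    struct pm_wrap wrap = { 0 }, *pw;       /* the PMDA malloc()s these instead */
    char *name;
    int inst;

    /* instance refresh: attach the wrapper as this instance's opaque pointer */
    pmdaCacheStore(indom, PMDA_CACHE_ADD, "apps-virtimages", (void *)&wrap);

    /* fetch path: walk the cache and get the same pointer back */
    pmdaCacheOp(indom, PMDA_CACHE_WALK_REWIND);
    while ((inst = pmdaCacheOp(indom, PMDA_CACHE_WALK_NEXT)) >= 0) {
        if (pmdaCacheLookup(indom, inst, &name, (void **)&pw) != PMDA_CACHE_ACTIVE)
            continue;
        /* at this point the dm PMDA hands pw->region_id / pw->area_id to libdevmapper */
        printf("%s: region_id=%llu area_id=%llu\n", name,
               (unsigned long long)pw->region_id,
               (unsigned long long)pw->area_id);
    }
    return 0;
}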

Comment 5 Mark Goodwin 2018-05-23 06:06:21 UTC
The pw structures are being malloc'd but not initialized. Valgrind shows the region and area IDs being referenced while still uninitialized, down in dm_stats_get_histogram (libdm-stats.c), and elsewhere:

[Wed May 23 15:07:11] pmdadm(17446) Debug: pmdaConnect: PMDA pmdadm: opened pipe to pmcd, infd = 0, outfd = 1
dbpmda> fetch 129.3.0
PMID(s): 129.3.0
[Wed May 23 15:07:34] pmdadm(17446) Debug: Received PDU_PROFILE
[Wed May 23 15:07:34] pmdadm(17446) Debug: Received PDU_FETCH
==17446== Conditional jump or move depends on uninitialised value(s)
==17446==    at 0x5321FF5: dm_stats_get_histogram (libdm-stats.c:3401)
==17446==    by 0x10DA9D: _pm_dm_refresh_stats_histogram_update (dmstats.c:243)
==17446==    by 0x10DA9D: pm_dm_refresh_stats (dmstats.c:324)
==17446==    by 0x10B37E: dm_fetch_refresh (pmda.c:1091)
==17446==    by 0x10B37E: dm_fetch (pmda.c:1131)
==17446==    by 0x4E422CF: __pmdaMainPDU (mainloop.c:169)
==17446==    by 0x4E42DBF: pmdaMain (mainloop.c:397)
==17446==    by 0x10ADFB: main (pmda.c:1270)
==17446== 
==17446== Conditional jump or move depends on uninitialised value(s)
==17446==    at 0x5322003: dm_stats_get_histogram (libdm-stats.c:3407)
==17446==    by 0x10DA9D: _pm_dm_refresh_stats_histogram_update (dmstats.c:243)
==17446==    by 0x10DA9D: pm_dm_refresh_stats (dmstats.c:324)
==17446==    by 0x10B37E: dm_fetch_refresh (pmda.c:1091)
==17446==    by 0x10B37E: dm_fetch (pmda.c:1131)
==17446==    by 0x4E422CF: __pmdaMainPDU (mainloop.c:169)
==17446==    by 0x4E42DBF: pmdaMain (mainloop.c:397)
==17446==    by 0x10ADFB: main (pmda.c:1270)
==17446== 
==17446== Conditional jump or move depends on uninitialised value(s)
==17446==    at 0x53220C2: dm_stats_get_histogram (libdm-stats.c:3413)
==17446==    by 0x10DA9D: _pm_dm_refresh_stats_histogram_update (dmstats.c:243)
==17446==    by 0x10DA9D: pm_dm_refresh_stats (dmstats.c:324)
==17446==    by 0x10B37E: dm_fetch_refresh (pmda.c:1091)
==17446==    by 0x10B37E: dm_fetch (pmda.c:1131)
==17446==    by 0x4E422CF: __pmdaMainPDU (mainloop.c:169)
==17446==    by 0x4E42DBF: pmdaMain (mainloop.c:397)
==17446==    by 0x10ADFB: main (pmda.c:1270)
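
The underlying failure mode can be reproduced outside the PMDA with plain libc; the following is a standalone sketch (not the PMDA source) whose field names follow the gdb output above. Running it under valgrind flags the reads of the never-assigned fields, the same class of error reported above:

/*
 * Sketch: malloc() leaves the pm_wrap fields indeterminate, so anything read
 * before being assigned (region_id/area_id on the crashing path) is garbage.
 */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

struct pm_wrap {
    void     *dms, *dmsc, *pdmh;
    uint64_t  region_id;        /* never assigned on the crashing path */
    uint64_t  area_id;          /* never assigned on the crashing path */
    char      dev[128];
};

int
main(void)
{
    struct pm_wrap *pw = malloc(sizeof(*pw));   /* contents indeterminate */

    if (pw == NULL)
        return 1;
    /* equivalent of handing pw->region_id / pw->area_id straight to
     * dm_stats_get_histogram(): values like 94092439760656 above */
    printf("region_id=%" PRIu64 " area_id=%" PRIu64 "\n",
           pw->region_id, pw->area_id);

    memset(pw, 0, sizeof(*pw));                 /* the fix: zero after malloc */
    /* calloc(1, sizeof(*pw)) would achieve the same in a single call */
    free(pw);
    return 0;
}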


Zeroing the just-malloc'd pw structures (stored via the pmdaCache entry opaque pointer) seems to fix it. Here's the patch:

diff --git a/src/pmdas/dm/dmstats.c b/src/pmdas/dm/dmstats.c
index 56c7f9ec7..45be0a477 100644
--- a/src/pmdas/dm/dmstats.c
+++ b/src/pmdas/dm/dmstats.c
@@ -437,6 +437,7 @@ pm_dm_stats_instance_refresh(void)
        sts = pmdaCacheLookupName(indom, names->name, NULL, (void **)&pw);
        if (sts == PM_ERR_INST || (sts >= 0 && pw == NULL)) {
            pw = (struct pm_wrap *)malloc(sizeof(*pw));
+           memset(pw, 0, sizeof(*pw)); /* RH BZ 1569854 */
            pw->dmsc = calloc(1, sizeof(*pw->dmsc));
            pw->pdmh = calloc(1, sizeof(*pw->pdmh));
            if (pw == NULL)
@@ -523,6 +524,7 @@ pm_dm_histogram_instance_refresh(void)
                sts = pmdaCacheLookupName(indom, buffer, NULL, (void **)&pw);
                if (sts == PM_ERR_INST || (sts >= 0 && pw == NULL)) {
                    pw = (struct pm_wrap *)malloc(sizeof(*pw));
+                   memset(pw, 0, sizeof(*pw)); /* RH BZ 1569854 */
                    pw->dmsc = calloc(1, sizeof(*pw->dmsc));
                    pw->pdmh = calloc(1, sizeof(*pw->pdmh));


With this patch applied I'm no longer getting the segfaults, and I get a clean valgrind run:

goblin:root@/var/lib/pcp/pmdas/dm[]# dbpmda -n root -q 600
dbpmda> open pipe /bin/valgrind ./pmdadm -l- -d129
Start valgrind PMDA: /bin/valgrind ./pmdadm -l- -d129
==22266== Memcheck, a memory error detector
==22266== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==22266== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==22266== Command: ./pmdadm -l- -d129
==22266== 
Log for pmdadm on goblin started Wed May 23 15:57:41 2018

dbpmda> fetch 129.3.2
PMID(s): 129.3.2
pmResult dump from 0x55de1751f980 timestamp: 0.000000 10:00:00.000 numpmid: 1
  129.3.2 (dmstats.read_bytes): numval: 5 valfmt: 1 vlist[]:
    inst [0 or ???] value 24
    inst [1 or ???] value 0
    inst [2 or ???] value 56
    inst [3 or ???] value 48
    inst [4 or ???] value 8
dbpmda> close
dbpmda> 
Log finished Wed May 23 15:58:04 2018
You have a memory leak (not released memory pool):
 [0xaad94f0] stats_pool
 [0xaad9580] histogram_pool
...

The memleak is probably expected (memory would be released on exit), but may need further investigation.

Comment 6 Mark Goodwin 2018-05-23 06:27:30 UTC
# for i in $(seq 1 10000); do pminfo -f dmstats >/dev/null; done
shows that the pmdadm process is definitely leaking memory - vsz/rss grow on every fetch.

Running qa/1188 with pre-existing stats regions for all devices highlights the leak:

1188 53s ... - output mismatch (see 1188.out.bad)
1042a1043,1068
> You have a memory leak (not released memory pool):
>  [0xa2f8850] stats_pool
>  [0xa2f88e0] histogram_pool
>  [0xa2f8970] group_pool
>  [0xa306420] stats_pool
>  [0xa3064b0] histogram_pool
>  [0xa306540] group_pool
>  [0xa314030] stats_pool
>  [0xa3140c0] histogram_pool
>  [0xa314150] group_pool
>  [0xa321c80] stats_pool
>  [0xa321d10] histogram_pool
>  [0xa321da0] group_pool
>  [0xa3417b0] stats_pool
>  [0xa341840] histogram_pool
>  [0xa3418d0] group_pool
>  [0xa34fc60] stats_pool
>  [0xa34fcf0] histogram_pool
>  [0xa34fd80] group_pool
>  [0xa35e170] stats_pool
>  [0xa35e200] histogram_pool
>  [0xa35e290] group_pool
>  [0xa36c6f0] stats_pool
>  [0xa36c780] histogram_pool
>  [0xa36c810] group_pool
> Internal error: Unreleased memory pool(s) found.

I'll resume on this tomorrow; valgrind --leak-check=full should show what's going on.

Comment 7 Mark Goodwin 2018-05-24 04:02:31 UTC
ltrace on pmdadm shows we're creating a task and then multiple stats handles before the dm_task_destroy, but only calling dm_stats_destroy on the last dm_stats handle. All the intervening handles thus leak:

9218 dm_task_create(13, 0x7ffdd9eca214, 0, 0)                                                   = 0x1dba120
9218 dm_task_enable_checks(0x1dba120, 0, 0x1dba180, 0)                                          = 1
9218 dm_task_run(0x1dba120, 0, 0x1dba180, 0)                                                    = 1
9218 dm_task_get_names(0x1dba120, 0, 0x7fb959ed1458, 13)                                        = 0x59e2a028
9218 dm_stats_create(0x405cb2, 0, 312, 13)                                                      = 0x59e75790
9218 dm_stats_bind_name(0x59e75790, 0x59e2a034, 8, 0)                                           = 1
9218 dm_stats_list(0x59e75790, 0x405cb2, 0x59e73370, 0x59e73370)                                = 1
9218 dm_stats_get_nr_regions(0x59e75790, 0x1622010, 0x1b28fc0, 0)                               = 1
9218 dm_stats_create(0x405cb2, 0x59e2a030, 0, 0x1626190)                                        = 0x59e63930
9218 dm_stats_bind_name(0x59e63930, 0x59e2a05c, 8, 0)                                           = 1
9218 dm_stats_list(0x59e63930, 0x405cb2, 0x59e771e0, 0x59e771e0)                                = 1
9218 dm_stats_get_nr_regions(0x59e63930, 0x1622010, 0x1b28fc0, 0)                               = 1
9218 dm_stats_create(0x405cb2, 0x59e2a060, 0, 0x1631610)                                        = 0x59e63b10
9218 dm_stats_bind_name(0x59e63b10, 0x59e2a08c, 8, 0)                                           = 1
9218 dm_stats_list(0x59e63b10, 0x405cb2, 0x59e79600, 0x59e79600)                                = 1
9218 dm_stats_get_nr_regions(0x59e63b10, 0x1622010, 0x1b28fc0, 0)                               = 1
9218 dm_stats_create(0x405cb2, 0x59e2a090, 0, 0x1632240)                                        = 0x59e76a50
9218 dm_stats_bind_name(0x59e76a50, 0x59e2a0cc, 8, 0)                                           = 1
9218 dm_stats_list(0x59e76a50, 0x405cb2, 0x59e7b610, 0x59e7b610)                                = 1
9218 dm_stats_get_nr_regions(0x59e76a50, 0x1622010, 0x1b28fc0, 0)                               = 1
9218 dm_stats_create(0x405cb2, 0x59e2a0cc, 0, 0x1634e80)                                        = 0x59e76cc0
9218 dm_stats_bind_name(0x59e76cc0, 0x59e2a0ec, 8, 0)                                           = 1
9218 dm_stats_list(0x59e76cc0, 0x405cb2, 0x59e7f620, 0x59e7f620)                                = 1
9218 dm_stats_get_nr_regions(0x59e76cc0, 0x1622010, 0x1b28fc0, 0)                               = 1
9218 dm_stats_destroy(0x59e76cc0, 0x59e2a0ec, 0, 0x1639ad0)                                     = 0
9218 dm_task_destroy(0x1dba120, 0x1622010, 0, 9)                                                = 0

Moving the dm_stats_destroy calls inside the loop in the PMDA instance refresh functions seems to fix it.
Here's the full patch now, avoiding both the segfaults and the memory leak:
diff --git a/src/pmdas/dm/dmstats.c b/src/pmdas/dm/dmstats.c
index 56c7f9ec7..a91dc6cff 100644
--- a/src/pmdas/dm/dmstats.c
+++ b/src/pmdas/dm/dmstats.c
@@ -437,18 +437,18 @@ pm_dm_stats_instance_refresh(void)
 	sts = pmdaCacheLookupName(indom, names->name, NULL, (void **)&pw);
 	if (sts == PM_ERR_INST || (sts >= 0 && pw == NULL)) {
 	    pw = (struct pm_wrap *)malloc(sizeof(*pw));
-	    pw->dmsc = calloc(1, sizeof(*pw->dmsc));
-	    pw->pdmh = calloc(1, sizeof(*pw->pdmh));
 	    if (pw == NULL)
 		return PM_ERR_AGAIN;
+	    memset(pw, 0, sizeof(*pw)); /* RH BZ 1569854 */
+	    pw->dmsc = calloc(1, sizeof(*pw->dmsc));
+	    pw->pdmh = calloc(1, sizeof(*pw->pdmh));
 	}
 	strcpy(pw->dev, names->name);
 	pmdaCacheStore(indom, PMDA_CACHE_ADD, names->name, (void *)pw);
 	next = names->next;
+	dm_stats_destroy(dms); /* free dms allocated by _dm_stats_search_region */
     } while(next);
 
-    if (dms)
-	dm_stats_destroy(dms);
     dm_task_destroy(dmt);
 
     return 0;
@@ -523,11 +523,11 @@ pm_dm_histogram_instance_refresh(void)
 		sts = pmdaCacheLookupName(indom, buffer, NULL, (void **)&pw);
 		if (sts == PM_ERR_INST || (sts >= 0 && pw == NULL)) {
 		    pw = (struct pm_wrap *)malloc(sizeof(*pw));
-		    pw->dmsc = calloc(1, sizeof(*pw->dmsc));
-		    pw->pdmh = calloc(1, sizeof(*pw->pdmh));
-
 		    if (pw == NULL)
 			return PM_ERR_AGAIN;
+		    memset(pw, 0, sizeof(*pw)); /* RH BZ 1569854 */
+		    pw->dmsc = calloc(1, sizeof(*pw->dmsc));
+		    pw->pdmh = calloc(1, sizeof(*pw->pdmh));
 		}
 		pw->region_id = region_id;
 		pw->area_id = area_id;
@@ -536,10 +536,9 @@ pm_dm_histogram_instance_refresh(void)
 	    }
 	}
 	next = names->next;
+	dm_stats_destroy(dms); /* free dms allocated by _dm_stats_search_region */
     } while(next);
 
-    if (dms)
-	dm_stats_destroy(dms);
     dm_task_destroy(dmt);
 
     return 0;
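
For reference, the corrected per-device pattern can be exercised outside the PMDA with a small standalone program. This is a sketch only (device walk condensed, error handling minimal), not the PMDA code; it assumes the device-mapper-devel headers and root privileges, and links with -ldevmapper:

/*
 * Walk all device-mapper devices and create/list/destroy a dm_stats handle
 * per device, so no handle outlives its loop iteration (cf. the leak above).
 */
#include <inttypes.h>
#include <stdio.h>
#include <libdevmapper.h>

int
main(void)
{
    struct dm_task *dmt;
    struct dm_names *names;
    unsigned next = 0;

    if (!(dmt = dm_task_create(DM_DEVICE_LIST)))
        return 1;
    if (!dm_task_run(dmt) || !(names = dm_task_get_names(dmt))) {
        dm_task_destroy(dmt);
        return 1;
    }
    if (names->dev) {
        do {
            struct dm_stats *dms;

            names = (struct dm_names *)((char *)names + next);
            if ((dms = dm_stats_create("")) != NULL) {  /* "" matches all program IDs */
                if (dm_stats_bind_name(dms, names->name) && dm_stats_list(dms, ""))
                    printf("%s: %" PRIu64 " stats region(s)\n",
                           names->name, dm_stats_get_nr_regions(dms));
                dm_stats_destroy(dms);                  /* destroy inside the loop */
            }
            next = names->next;
        } while (next);
    }
    dm_task_destroy(dmt);
    return 0;
}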

Comment 8 Mark Goodwin 2018-05-24 05:40:00 UTC
Fixed upstream for pcp-4.1.0:

Mark Goodwin (1):
      pmdadm: fix segfault and memleak in dmstats

 src/pmdas/dm/dmstats.c |   17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

Details :

commit 413e10778c0664047287c828ed8a3faff045f9a1
Author: Mark Goodwin <mgoodwin>
Date:   Thu May 24 15:20:49 2018 +1000

    pmdadm: fix segfault and memleak in dmstats
    
    RH BZ #1569854
    
    pmdadm dmstats was calling into libdevmapper with uninitialized
    region and area ID arguments passed from the pw structure, resulting
    in segfaults. Fix this by zeroing each pw structure immediately
    after it's malloc'd. Once this was fixed, a memleak was identified
    where only the last dm_stats handle was destroyed when traversing
    all of the devices in a task - all but one handle thus leaked, so
    this does not show up if there is only one device.

Comment 9 Marko Myllynen 2018-09-07 06:00:07 UTC
Hi Mark,

Thanks for the patch; I no longer see the segfault (on RHEL 7.6 Beta or Fedora/upstream). However, I still don't see any values reported. Below is my testing procedure; is there perhaps a step missing here?

# yum install pcp pcp-pmda-dm pcp-system-tools
# dmsetup table                                      
swap: 0 4349952 linear 8:2 2048
root: 0 31457280 linear 8:2 4352000
# dmstats create --alldevices --bounds 10us,30us,50us
# systemctl restart pmcd
# cd /var/lib/pcp/pmdas/dm
# echo c | ./Install
# pmrep dmcache dmstats dmthin

Thanks.

Comment 10 Marko Myllynen 2018-10-01 08:35:45 UTC
Since the fix isn't fully working, marking this ASSIGNED.

Comment 11 Mark Goodwin 2018-10-30 03:29:08 UTC
Clearing Fixed In Version and working on this again.

Comment 12 Mark Goodwin 2018-11-01 09:21:18 UTC
It turns out that build/rpm/fedora.spec (and the equivalent RHEL spec) is missing a BuildRequires on device-mapper-devel. Hence configure doesn't find the libdevmapper.h header and the PMDA gets built without dmstats support (though some parts of the PMDA may still be operational, e.g. the vdo metrics).

diff --git a/build/rpm/fedora.spec b/build/rpm/fedora.spec
index ee08f0a31..6b940f355 100644
--- a/build/rpm/fedora.spec
+++ b/build/rpm/fedora.spec
@@ -1478,6 +1478,7 @@ Group: Applications/System
 Summary: Performance Co-Pilot (PCP) metrics for the Device Mapper Cache and Thin Client
 URL: https://pcp.io
 Requires: pcp-libs = %{version}-%{release}
+BuildRequires: device-mapper-devel

The above patch fixes it on RHEL and Fedora, but I need to check whether the device-mapper-devel package has the same name on other RPM-based platforms (e.g. openSUSE, SLES); that may need some conditional gunk in the RPM spec. Need to check Debian too.

Also, some of the tests in the pmda.dm group will need a tweak, e.g. to _notrun if dmstats are not available.
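
For context, here is a minimal sketch of the kind of build-time guard involved (illustrative only; HAVE_DEVMAPPER is the configure macro referenced in the later upstream commit, everything else here is assumed):

#include <stdio.h>
#ifdef HAVE_DEVMAPPER
#include <libdevmapper.h>
#endif

/* When configure cannot find libdevmapper.h (no device-mapper-devel at build
 * time), HAVE_DEVMAPPER stays undefined and the dmstats support is compiled
 * out, while the rest of the PMDA (e.g. the vdo metrics) still builds. */
int
main(void)
{
#ifdef HAVE_DEVMAPPER
    printf("built with dmstats support\n");
#else
    printf("built without dmstats support (HAVE_DEVMAPPER undefined)\n");
#endif
    return 0;
}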

Comment 13 Marko Myllynen 2018-11-01 12:10:39 UTC
Thanks, I'm happy to confirm that after rebuilding the RHEL 7.6 pcp.src.rpm with device-mapper-devel installed, I'm now seeing dmstats metrics as expected.

Comment 14 Marko Myllynen 2018-11-06 07:14:38 UTC
Since the stats regions are created in memory, the DM metrics are no longer available after a reboot without superuser intervention. It would seem appropriate for the PMDA to create the needed stats regions during its startup if they do not already exist. Thanks.

Comment 15 Mark Goodwin 2018-11-08 07:18:34 UTC
Fixed upstream for pcp-4.2.0-1, releasing Nov 16 2018.

commit ec46029112f108db587253e8e4ec5ae44bafdfb4
Author: Mark Goodwin <mgoodwin>
Date:   Thu Nov 8 10:10:33 2018 +1100

    packaging: spec and manifest updates for pmdadm dmstats

commit 2e39031848e2120328bdeedddd9444c2dabbcd35 (HEAD -> master, upstream-master/master, upstream-goodwinos/master, origin/master, origin/HEAD)
Author: Mark Goodwin <mgoodwin>
Date:   Thu Nov 8 11:05:44 2018 +1100

    pmdadm and QA: log a message if HAVE_DEVMAPPER is not defined

Comment 16 Marko Myllynen 2018-11-08 07:27:11 UTC
@Mark: I don't see those changes addressing the concern mentioned in comment 14 - reboot will render the PMDA dysfunctional.

Comment 17 Mark Goodwin 2018-11-09 03:16:39 UTC
(In reply to Marko Myllynen from comment #16)
> @Mark: I don't see those changes addressing the concern mentioned in comment
> 14 - reboot will render the PMDA dysfunctional.

Hi Marko, the changes address the remaining issues, as originally reported.
The issues in Comment #14 are orthogonal and should really be a separate RFE.

dmstats has to be enabled for devices of interest - this is similar to perf event capture, which also has to be manually enabled. The device-mapper default is not to collect any stats because there are overheads associated with enabling this, and also because PCP may not be the only consumer of that data. So it's a site-specific admin task to enable dmstats collection for devices of interest (and histo bins, etc). A suitable configuration might be in rc.local, where an admin could run e.g. "dmstats create --alldevices" or whatever they want enabled. Thoughts?

Regards
-- Mark

Comment 18 Marko Myllynen 2018-11-09 07:14:03 UTC
(In reply to Mark Goodwin from comment #17)
> 
> dmstats has to be enabled for devices of interest - this is similar to perf
> event capture, which also has to be manually enabled. The device-mapper
> default is not to collect any stats because there are overheads associated
> with enabling this, and also because PCP may not be the only consumer of
> that data. So it's a site-specific admin task to enable dmstats collection
> for devices of interest (and histo bins, etc). A suitable configuration
> might be in rc.local, where an admin could run e.g. "dmstats create
> --alldevices" or whatever they want enabled. Thoughts?

I think having this information included in the man page would be enough; the page is already very good, but it doesn't mention that the stats are created in memory. Otherwise I don't see a need for additional changes. Thanks.

Comment 19 Nathan Scott 2019-03-05 06:11:14 UTC
Mark's upstream fixes will arrive in 7.7 via rebase.

*** This bug has been marked as a duplicate of bug 1647308 ***