Bug 1730492 - Some numa nodes have no instances [NEEDINFO]
Summary: Some numa nodes have no instances
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcp
Version: 7.8
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 7.9
Assignee: Mark Goodwin
QA Contact: Jan Kurik
URL:
Whiteboard:
Depends On:
Blocks: 1782202
Reported: 2019-07-16 21:54 UTC by Charles Haithcock
Modified: 2020-02-25 22:42 UTC

Fixed In Version: pcp-5.0.0
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
mgoodwin: needinfo? (chaithco)


Description Charles Haithcock 2019-07-16 21:54:07 UTC
Description of problem:

In a customer case, only NUMA nodes 0 and 1 had instances; nodes 16 and 17 had no instances and therefore no data. Note that on ppc and s390 systems, NUMA nodes can be numbered non-sequentially as above (nodes 0, 1, 16, 17), depending on how the LPARs are set up on the host.


Version-Release number of selected component (if applicable):

From the customer's sosreport:

 $ grep pcp sosreport-<HOSTNAME>-20190710161811/installed-rpms 
pcp-4.1.0-5.el7_6.ppc64le                                   Tue Jul  9 19:19:43 2019
pcp-conf-4.1.0-5.el7_6.ppc64le                              Tue Jul  9 19:16:19 2019
pcp-doc-4.1.0-5.el7_6.noarch                                Tue Jul  9 19:20:06 2019
pcp-libs-4.1.0-5.el7_6.ppc64le                              Tue Jul  9 19:16:29 2019
pcp-pmda-dm-4.1.0-5.el7_6.ppc64le                           Tue Jul  9 19:17:05 2019
pcp-pmda-nfsclient-4.1.0-5.el7_6.ppc64le                    Tue Jul  9 19:17:20 2019
pcp-selinux-4.1.0-5.el7_6.ppc64le                           Tue Jul  9 19:16:39 2019
pcp-system-tools-4.1.0-5.el7_6.ppc64le                      Tue Jul  9 19:20:22 2019
pcp-zeroconf-4.1.0-5.el7_6.ppc64le                          Tue Jul  9 19:20:30 2019
python-pcp-4.1.0-5.el7_6.ppc64le                            Tue Jul  9 19:20:14 2019



How reproducible:

Cannot reproduce due to lack of access to multi-NUMA-node s390 or ppc systems. However, all of the customer's archive files show this behavior.


Steps to Reproduce:
1. Install pcp-zeroconf

Actual results:

Missing instances: 

 $ pminfo -dtf mem.numa.util -a pcp/pmlogger/<HOSTNAME>/20190712.00.10.0 | less
- - - - - - - - - - - - [SNIP] - - - - - - - - - - - - 
mem.numa.util.writeback [per-node count of memory locked for writeback to stable storage]
    Data Type: 64-bit unsigned int  InDom: 60.19 0xf000013
    Semantics: instant  Units: Kbyte
    inst [0 or "node0"] value 0
    inst [1 or "node1"] value 0

mem.numa.util.filePages [per-node count of memory backed by files]
    Data Type: 64-bit unsigned int  InDom: 60.19 0xf000013
    Semantics: instant  Units: Kbyte
    inst [0 or "node0"] value 4954240
    inst [1 or "node1"] value 3953472

mem.numa.util.mapped [per-node mapped memory]
    Data Type: 64-bit unsigned int  InDom: 60.19 0xf000013
    Semantics: instant  Units: Kbyte
    inst [0 or "node0"] value 291008
    inst [1 or "node1"] value 240256

mem.numa.util.anonpages [per-node anonymous memory]
    Data Type: 64-bit unsigned int  InDom: 60.19 0xf000013
    Semantics: instant  Units: Kbyte
    inst [0 or "node0"] value 81925888
    inst [1 or "node1"] value 81858432

mem.numa.util.shmem [per-node amount of shared memory]
    Data Type: 64-bit unsigned int  InDom: 60.19 0xf000013
    Semantics: instant  Units: Kbyte
    inst [0 or "node0"] value 1862656
    inst [1 or "node1"] value 566848

mem.numa.util.kernelStack [per-node memory used as kernel stacks]
    Data Type: 64-bit unsigned int  InDom: 60.19 0xf000013
    Semantics: instant  Units: Kbyte
    inst [0 or "node0"] value 31936
    inst [1 or "node1"] value 15040

mem.numa.util.pageTables [per-node memory used for pagetables]
    Data Type: 64-bit unsigned int  InDom: 60.19 0xf000013
    Semantics: instant  Units: Kbyte
    inst [0 or "node0"] value 21824
    inst [1 or "node1"] value 21312

mem.numa.util.NFS_Unstable [per-node memory holding NFS data that needs writeback]
    Data Type: 64-bit unsigned int  InDom: 60.19 0xf000013
    Semantics: instant  Units: Kbyte
    inst [0 or "node0"] value 0
    inst [1 or "node1"] value 0

mem.numa.util.bounce [per-node memory used for bounce buffers]
    Data Type: 64-bit unsigned int  InDom: 60.19 0xf000013
    Semantics: instant  Units: Kbyte
    inst [0 or "node0"] value 0
    inst [1 or "node1"] value 0
- - - - - - - - - - - - [SNIP] - - - - - - - - - - - - 

This happens even though pcp at some point recorded the other two nodes as instances (without values):


 $ pmrep mem.numa.util.slab -a pcp/pmlogger/gl_essgl6s_801a/20190709.0.xz -t 1m --timestamps -z | less
          m.n.u.slab  m.n.u.slab  m.n.u.slab  m.n.u.slab
               node0       node1      node16      node17
               Kbyte       Kbyte       Kbyte       Kbyte
19:25:12         N/A         N/A         N/A         N/A
19:26:12    38657344    32863872         N/A         N/A
19:27:12    38657344    32863872         N/A         N/A
19:28:12    38657344    32863872         N/A         N/A
19:29:12    38657344    32864384         N/A         N/A
19:30:12         N/A         N/A         N/A         N/A
19:31:12    38657408    32864448         N/A         N/A
19:32:12    38656960    32863936         N/A         N/A
19:33:12    38656960    32863872         N/A         N/A
- - - - - - - - - - - - [SNIP] - - - - - - - - - - - - 



Expected results:

Instances created for all numa nodes


Additional info:

- This would likely not be reproducible on x86 systems (either physical or virtual), as KVM, VMware, and Hyper-V do not expose NUMA nodes like this.
- If needed I can provide data from the customer's sosreports and pcp tarballs.

Comment 2 Nathan Scott 2019-07-16 22:45:31 UTC
Hi Charles,

Auditing the code, it does seem to have been written to handle sparse node numbering, so it's not immediately obvious why this is not working.
Can you extract the following from the system exhibiting the problem...?

$ find /sys/devices/system/node/ -name node\*
$ find /sys/devices/system/node/ -name cpu\*
$ head -n 50 /sys/devices/system/node/node*/meminfo
$ cat /proc/stat

thanks.

Comment 3 Charles Haithcock 2019-07-17 15:49:04 UTC
(In reply to Nathan Scott from comment #2)
[...]
> $ find /sys/devices/system/node/ -name node\*
> $ find /sys/devices/system/node/ -name cpu\*
> $ head -n 50 /sys/devices/system/node/node*/meminfo
> $ cat /proc/stat

Unfortunately, the /sys/devices/system/node/ directory is not captured in a sosreport (not sure why as that would actually be useful for per-node stuff). The /proc/stat is captured, however:


 $ cat proc/stat 
cpu  7086634190 55834 3058237925 107365182369 1551745694 1 387879053 0 0 0
cpu0 185792643 682 62088344 2712165770 36813418 0 1044699 0 0 0
cpu1 213506559 1463 79583354 2640949215 36458767 0 10977194 0 0 0
cpu8 212633969 881 102551658 2589989196 30235948 0 27511824 0 0 0
cpu9 242975080 1333 99920960 2587387645 31434598 0 16905230 0 0 0
cpu16 202143277 784 96596561 2613171917 27805954 0 27591939 0 0 0
cpu17 225566181 1178 111329027 2570141105 26764048 0 28172258 0 0 0
cpu24 205830261 570 78198628 2659059783 26436443 0 10753161 0 0 0
cpu25 123639355 638 75842544 2718327589 38698815 0 21550753 0 0 0
cpu32 189939283 461 89832495 2615029248 26109777 1 27734877 0 0 0
cpu33 229863748 1181 100478175 2589491196 27130547 0 22088009 0 0 0
cpu40 152746898 1057 60480959 2731454117 44397543 0 3357252 0 0 0
cpu41 155849782 2450 62695129 2720737881 44594942 0 5013858 0 0 0
cpu48 133288816 954 57963583 2767485041 34600520 0 2615845 0 0 0
cpu49 174002780 2853 108575421 2544367696 27544914 0 41500014 0 0 0
cpu56 162667690 909 113392057 2585414899 29905549 0 42709265 0 0 0
cpu57 145693407 2203 64421328 2742869790 35218903 0 5468207 0 0 0
cpu64 138409437 1127 58742156 2743893276 38368759 0 6980604 0 0 0
cpu65 147514254 2147 68335549 2715861796 37767667 0 11094552 0 0 0
cpu72 172167870 980 117664038 2531262010 22067610 0 46762504 0 0 0
cpu73 137811681 1692 67546258 2746177905 31143136 0 8451970 0 0 0
cpu80 165521073 942 68834028 2717819651 50848439 0 1290231 0 0 0
cpu81 181842511 1536 74504905 2696038953 51570418 0 1261129 0 0 0
cpu88 153974157 990 62661230 2741156243 46490838 0 1046822 0 0 0
cpu89 173653698 1586 69611344 2714504764 47067105 0 1033943 0 0 0
cpu96 151535192 892 60868254 2745630083 46672718 0 954471 0 0 0
cpu97 172818015 1740 68203003 2717163334 46906067 0 937288 0 0 0
cpu104 150357219 1199 59832718 2749418902 45311977 0 889590 0 0 0
cpu105 171296765 1910 67375443 2721057507 45553146 0 883636 0 0 0
cpu112 145707268 829 57033981 2752072745 47854986 0 1176195 0 0 0
cpu113 170063906 1428 65986006 2720156546 48309032 0 1055202 0 0 0
cpu120 165551994 1214 67241674 2722721017 44479656 0 1048624 0 0 0
cpu121 225806169 3024 85667556 2645403734 43823225 0 1014982 0 0 0
cpu128 159983141 949 63370151 2734260633 42534120 0 939193 0 0 0
cpu129 218492170 2679 81798870 2659908611 41600316 0 902195 0 0 0
cpu136 158121091 1119 62534797 2738086549 41957739 0 892799 0 0 0
cpu137 218494671 2549 81500886 2661012573 41253676 0 855378 0 0 0
cpu144 156673889 840 61512151 2741628422 41754017 0 865457 0 0 0
cpu145 219729183 2132 81017106 2661029564 40737033 0 834613 0 0 0
cpu152 155573155 792 61255168 2742632604 41909712 0 852008 0 0 0
cpu153 219390721 1941 81111382 2657253655 41604124 0 861234 0 0 0
intr 107611082529 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1349037038 13512625 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 1 1 3493544114 1378 656 679 3315142424 415 512 213 3158145212 273 126 276 3609892987 268 426 411 3229959948 670 374 347 3274827251 427 299 447 2543381245 423 308 242 2699207779 210 185 165 2976886114 430 364 166 2354146065 244 429 176 3446301982 279 540 370 3321007989 567 266 141 3299933732 450 283 548 3282689619 365 262 257 3246536443 440 439 369 3371792392 897 822 913 1707870717 1601512889 708 608 571 534 567 363 1690939510 1587557566 390 506 329 485 301 210 1691325023 1582396892 589 613 427 429 568 461 1684939512 1573059813 176 593 414 258 414 585 963927466 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 173591 40682329 62711524 1148471 1206352 3518926977 0 0 0 0 851 626 462 3341459095 733 558 722 3182922800 388 359 429 3638281658 612 573 656 3255688599 567 373 585 3300992140 210 357 299 2563554145 342 430 206 2720888234 639 357 229 3001242267 823 442 0 218 2373482438 428 287 297 3472962415 207 141 53 3347016060 310 157 292 3325976895 375 299 450 3308370593 172 87 186 3271999510 508 406 422 3394498858 1416 1381 1056 1720154519 1613162254 613 579 355 729 545 781 1703313248 1599090179 96 89 130 134 219 87 1704167748 1594483102 771 301 482 277 0 60 0 0 0 360 619 1698700613 1589520387 93 68 96 163 250 149 0 1292674668 3494219325 616 325 335 3316263937 410 239 281 3158974743 685 578 446 3611224071 439 325 215 3231157233 410 576 157 3277552681 2248 2646 2071 2544420904 1681 1228 1328 2701745418 1240 639 879 2981681740 905 708 808 2356940175 425 332 314 3445936137 314 174 347 3321304318 233 428 77 3300384600 172 270 316 3283062788 463 349 376 3246618180 374 462 307 3364996849 364 311 221 1707060773 1601984609 240 113 74 195 96 240 1689777399 1587401302 150 121 287 235 192 140 1690387480 1582231708 284 260 272 135 53 76 1682776316 1570906639 249 370 143 265 68 24 2171658754 2038004142 750583897 124 0 0 0 524732685 106 104 
1696174059 1583504713 78 187 171 187 1703705722 1594800634 189 143 70 117 1702931409 1599988725 55 62 15 86 1720402348 1614651227 138 179 83 227 3271508153 288 363 222 449 208 3325742202 242 403 228 196 349 3470529088 135 222 180 346 224 2998508838 298 231 174 469 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2567391194 394 727 0 0 0 0 1026 3298526885 839 3259064800 1153 769 3641530203 1308 1043 3184989286 1289 1933 3343068188 1146 670940744 3523205901 2660 670940745 670940743 670940748 0 0 670940740 670940749 670940746 670940741 670940744 32 670940759 670940744 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 670940747 670940744 670940742 670940747 1797 661 768 905 663 563 1383 345 185 2722161950 319 2375734759 257 3346678666 189 3308516645 425 3391258176 60 55 42 91 19 415 21 92 210 3521920670 963887967 1513580925 1862849250 3464313520 1045575562 3215579350 3871146795 2466583765 0 0 0 0 2894791955 1993794049 1873679198 4206668420 215605518 0 0 0 0 0 0 0 0 963927476 2210619051 1329917989 1502622346 2472866358 2175468312 844084459 1137308389 1297862212 0 0 0 0 0 0 4129649598 2306159353 694190075 3112142173 3268914545 3788049289 0 1 1 1 1 1
ctxt 1420257704773
btime 1532589383
processes 370870310
procs_running 6
procs_blocked 0
softirq 267434729566 122 3428701972 178586867 937176605 927355500 1469097910 2453896564 1169279190 0 3467564372




Let me see if I can't get the numa node info you're requesting from the customer.

Comment 4 Nathan Scott 2019-07-17 21:41:55 UTC
| Let me see if I can't get the numa node info you're requesting from the customer.

Many thanks Charles - I think we'll struggle to resolve this one without that info.  Auditing the code has not come up with any smoking gun so far.

cheers.

Comment 7 Charles Haithcock 2019-09-05 17:24:24 UTC
Hey Nathan, 

Sorry to bring this bz back from the dead, but the customer brought their case back from the dead with the requested info: 

# find /sys/devices/system/node/ -name node\*
/sys/devices/system/node/
/sys/devices/system/node/node0
/sys/devices/system/node/node1
/sys/devices/system/node/node16
/sys/devices/system/node/node17

# find /sys/devices/system/node/ -name cpu\*
/sys/devices/system/node/node0/cpu0
/sys/devices/system/node/node0/cpu1
/sys/devices/system/node/node0/cpu2
/sys/devices/system/node/node0/cpu3
/sys/devices/system/node/node0/cpu4
/sys/devices/system/node/node0/cpu5
/sys/devices/system/node/node0/cpu6
/sys/devices/system/node/node0/cpu7
/sys/devices/system/node/node0/cpu8
/sys/devices/system/node/node0/cpu9
/sys/devices/system/node/node0/cpulist
/sys/devices/system/node/node0/cpu10
/sys/devices/system/node/node0/cpu11
/sys/devices/system/node/node0/cpu12
/sys/devices/system/node/node0/cpu13
/sys/devices/system/node/node0/cpu14
/sys/devices/system/node/node0/cpu15
/sys/devices/system/node/node0/cpu16
/sys/devices/system/node/node0/cpu17
/sys/devices/system/node/node0/cpu18
/sys/devices/system/node/node0/cpu19
/sys/devices/system/node/node0/cpu20
/sys/devices/system/node/node0/cpu21
/sys/devices/system/node/node0/cpu22
/sys/devices/system/node/node0/cpu23
/sys/devices/system/node/node0/cpu24
/sys/devices/system/node/node0/cpu25
/sys/devices/system/node/node0/cpu26
/sys/devices/system/node/node0/cpu27
/sys/devices/system/node/node0/cpu28
/sys/devices/system/node/node0/cpu29
/sys/devices/system/node/node0/cpu30
/sys/devices/system/node/node0/cpu31
/sys/devices/system/node/node0/cpu32
/sys/devices/system/node/node0/cpu33
/sys/devices/system/node/node0/cpu34
/sys/devices/system/node/node0/cpu35
/sys/devices/system/node/node0/cpu36
/sys/devices/system/node/node0/cpu37
/sys/devices/system/node/node0/cpu38
/sys/devices/system/node/node0/cpu39
/sys/devices/system/node/node0/cpumap
/sys/devices/system/node/node1/cpulist
/sys/devices/system/node/node1/cpu40
/sys/devices/system/node/node1/cpu41
/sys/devices/system/node/node1/cpu42
/sys/devices/system/node/node1/cpu43
/sys/devices/system/node/node1/cpu44
/sys/devices/system/node/node1/cpu45
/sys/devices/system/node/node1/cpu46
/sys/devices/system/node/node1/cpu47
/sys/devices/system/node/node1/cpu48
/sys/devices/system/node/node1/cpu49
/sys/devices/system/node/node1/cpu50
/sys/devices/system/node/node1/cpu51
/sys/devices/system/node/node1/cpu52
/sys/devices/system/node/node1/cpu53
/sys/devices/system/node/node1/cpu54
/sys/devices/system/node/node1/cpu55
/sys/devices/system/node/node1/cpu56
/sys/devices/system/node/node1/cpu57
/sys/devices/system/node/node1/cpu58
/sys/devices/system/node/node1/cpu59
/sys/devices/system/node/node1/cpu60
/sys/devices/system/node/node1/cpu61
/sys/devices/system/node/node1/cpu62
/sys/devices/system/node/node1/cpu63
/sys/devices/system/node/node1/cpu64
/sys/devices/system/node/node1/cpu65
/sys/devices/system/node/node1/cpu66
/sys/devices/system/node/node1/cpu67
/sys/devices/system/node/node1/cpu68
/sys/devices/system/node/node1/cpu69
/sys/devices/system/node/node1/cpu70
/sys/devices/system/node/node1/cpu71
/sys/devices/system/node/node1/cpu72
/sys/devices/system/node/node1/cpu73
/sys/devices/system/node/node1/cpu74
/sys/devices/system/node/node1/cpu75
/sys/devices/system/node/node1/cpu76
/sys/devices/system/node/node1/cpu77
/sys/devices/system/node/node1/cpu78
/sys/devices/system/node/node1/cpu79
/sys/devices/system/node/node1/cpumap
/sys/devices/system/node/node16/cpulist
/sys/devices/system/node/node16/cpu80
/sys/devices/system/node/node16/cpu81
/sys/devices/system/node/node16/cpu82
/sys/devices/system/node/node16/cpu83
/sys/devices/system/node/node16/cpu84
/sys/devices/system/node/node16/cpu85
/sys/devices/system/node/node16/cpu86
/sys/devices/system/node/node16/cpu87
/sys/devices/system/node/node16/cpu88
/sys/devices/system/node/node16/cpu89
/sys/devices/system/node/node16/cpu90
/sys/devices/system/node/node16/cpu91
/sys/devices/system/node/node16/cpu92
/sys/devices/system/node/node16/cpu93
/sys/devices/system/node/node16/cpu94
/sys/devices/system/node/node16/cpu95
/sys/devices/system/node/node16/cpu96
/sys/devices/system/node/node16/cpu97
/sys/devices/system/node/node16/cpu98
/sys/devices/system/node/node16/cpu99
/sys/devices/system/node/node16/cpu100
/sys/devices/system/node/node16/cpu101
/sys/devices/system/node/node16/cpu102
/sys/devices/system/node/node16/cpu103
/sys/devices/system/node/node16/cpu104
/sys/devices/system/node/node16/cpu105
/sys/devices/system/node/node16/cpu106
/sys/devices/system/node/node16/cpu107
/sys/devices/system/node/node16/cpu108
/sys/devices/system/node/node16/cpu109
/sys/devices/system/node/node16/cpu110
/sys/devices/system/node/node16/cpu111
/sys/devices/system/node/node16/cpu112
/sys/devices/system/node/node16/cpu113
/sys/devices/system/node/node16/cpu114
/sys/devices/system/node/node16/cpu115
/sys/devices/system/node/node16/cpu116
/sys/devices/system/node/node16/cpu117
/sys/devices/system/node/node16/cpu118
/sys/devices/system/node/node16/cpu119
/sys/devices/system/node/node16/cpumap
/sys/devices/system/node/node17/cpulist
/sys/devices/system/node/node17/cpu120
/sys/devices/system/node/node17/cpu121
/sys/devices/system/node/node17/cpu122
/sys/devices/system/node/node17/cpu123
/sys/devices/system/node/node17/cpu124
/sys/devices/system/node/node17/cpu125
/sys/devices/system/node/node17/cpu126
/sys/devices/system/node/node17/cpu127
/sys/devices/system/node/node17/cpu128
/sys/devices/system/node/node17/cpu129
/sys/devices/system/node/node17/cpu130
/sys/devices/system/node/node17/cpu131
/sys/devices/system/node/node17/cpu132
/sys/devices/system/node/node17/cpu133
/sys/devices/system/node/node17/cpu134
/sys/devices/system/node/node17/cpu135
/sys/devices/system/node/node17/cpu136
/sys/devices/system/node/node17/cpu137
/sys/devices/system/node/node17/cpu138
/sys/devices/system/node/node17/cpu139
/sys/devices/system/node/node17/cpu140
/sys/devices/system/node/node17/cpu141
/sys/devices/system/node/node17/cpu142
/sys/devices/system/node/node17/cpu143
/sys/devices/system/node/node17/cpu144
/sys/devices/system/node/node17/cpu145
/sys/devices/system/node/node17/cpu146
/sys/devices/system/node/node17/cpu147
/sys/devices/system/node/node17/cpu148
/sys/devices/system/node/node17/cpu149
/sys/devices/system/node/node17/cpu150
/sys/devices/system/node/node17/cpu151
/sys/devices/system/node/node17/cpu152
/sys/devices/system/node/node17/cpu153
/sys/devices/system/node/node17/cpu154
/sys/devices/system/node/node17/cpu155
/sys/devices/system/node/node17/cpu156
/sys/devices/system/node/node17/cpu157
/sys/devices/system/node/node17/cpu158
/sys/devices/system/node/node17/cpu159
/sys/devices/system/node/node17/cpumap

# head -n 50 /sys/devices/system/node/node*/meminfo
==> /sys/devices/system/node/node0/meminfo <==
Node 0 MemTotal:       134217728 kB
Node 0 MemFree:         3465088 kB
Node 0 MemUsed:        130752640 kB
Node 0 Active:          5079424 kB
Node 0 Inactive:        1734400 kB
Node 0 Active(anon):    3436480 kB
Node 0 Inactive(anon):   688576 kB
Node 0 Active(file):    1642944 kB
Node 0 Inactive(file):  1045824 kB
Node 0 Unevictable:    79736320 kB
Node 0 Mlocked:        79736320 kB
Node 0 Dirty:                64 kB
Node 0 Writeback:             0 kB
Node 0 FilePages:       4609600 kB
Node 0 Mapped:           285312 kB
Node 0 AnonPages:      81940736 kB
Node 0 Shmem:           1920832 kB
Node 0 KernelStack:       30080 kB
Node 0 PageTables:        21568 kB
Node 0 NFS_Unstable:          0 kB
Node 0 Bounce:                0 kB
Node 0 WritebackTmp:          0 kB
Node 0 Slab:           38034432 kB
Node 0 SReclaimable:   37348672 kB
Node 0 SUnreclaim:       685760 kB
Node 0 AnonHugePages:         0 kB
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0

==> /sys/devices/system/node/node16/meminfo <==
Node 16 MemTotal:       134217728 kB
Node 16 MemFree:         1968000 kB
Node 16 MemUsed:        132249728 kB
Node 16 Active:          3462848 kB
Node 16 Inactive:         637568 kB
Node 16 Active(anon):    2723968 kB
Node 16 Inactive(anon):   376384 kB
Node 16 Active(file):     738880 kB
Node 16 Inactive(file):   261184 kB
Node 16 Unevictable:    79736320 kB
Node 16 Mlocked:        79736320 kB
Node 16 Dirty:              5696 kB
Node 16 Writeback:             0 kB
Node 16 FilePages:       1777600 kB
Node 16 Mapped:           258432 kB
Node 16 AnonPages:      82061376 kB
Node 16 Shmem:            777536 kB
Node 16 KernelStack:       30128 kB
Node 16 PageTables:        23296 kB
Node 16 NFS_Unstable:          0 kB
Node 16 Bounce:                0 kB
Node 16 WritebackTmp:          0 kB
Node 16 Slab:           46718272 kB
Node 16 SReclaimable:   46209408 kB
Node 16 SUnreclaim:       508864 kB
Node 16 AnonHugePages:         0 kB
Node 16 HugePages_Total:     0
Node 16 HugePages_Free:      0
Node 16 HugePages_Surp:      0

==> /sys/devices/system/node/node17/meminfo <==
Node 17 MemTotal:       134217728 kB
Node 17 MemFree:        27353024 kB
Node 17 MemUsed:        106864704 kB
Node 17 Active:          3561152 kB
Node 17 Inactive:         465856 kB
Node 17 Active(anon):    2929472 kB
Node 17 Inactive(anon):   211456 kB
Node 17 Active(file):     631680 kB
Node 17 Inactive(file):   254400 kB
Node 17 Unevictable:    79736320 kB
Node 17 Mlocked:        79736320 kB
Node 17 Dirty:                64 kB
Node 17 Writeback:             0 kB
Node 17 FilePages:       1944000 kB
Node 17 Mapped:           244800 kB
Node 17 AnonPages:      81819520 kB
Node 17 Shmem:           1057920 kB
Node 17 KernelStack:       19024 kB
Node 17 PageTables:        20928 kB
Node 17 NFS_Unstable:          0 kB
Node 17 Bounce:                0 kB
Node 17 WritebackTmp:          0 kB
Node 17 Slab:           21237248 kB
Node 17 SReclaimable:   20763456 kB
Node 17 SUnreclaim:       473792 kB
Node 17 AnonHugePages:         0 kB
Node 17 HugePages_Total:     0
Node 17 HugePages_Free:      0
Node 17 HugePages_Surp:      0

==> /sys/devices/system/node/node1/meminfo <==
Node 1 MemTotal:       134217728 kB
Node 1 MemFree:        16996800 kB
Node 1 MemUsed:        117220928 kB
Node 1 Active:          3046784 kB
Node 1 Inactive:         261184 kB
Node 1 Active(anon):    2621120 kB
Node 1 Inactive(anon):   118656 kB
Node 1 Active(file):     425664 kB
Node 1 Inactive(file):   142528 kB
Node 1 Unevictable:    79736320 kB
Node 1 Mlocked:        79736320 kB
Node 1 Dirty:                64 kB
Node 1 Writeback:             0 kB
Node 1 FilePages:       1141184 kB
Node 1 Mapped:           239232 kB
Node 1 AnonPages:      81904064 kB
Node 1 Shmem:            572992 kB
Node 1 KernelStack:       14640 kB
Node 1 PageTables:        21440 kB
Node 1 NFS_Unstable:          0 kB
Node 1 Bounce:                0 kB
Node 1 WritebackTmp:          0 kB
Node 1 Slab:           32264000 kB
Node 1 SReclaimable:   31815936 kB
Node 1 SUnreclaim:       448064 kB
Node 1 AnonHugePages:         0 kB
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0

# cat /proc/stat
cpu  8242360517 63565 3609240327 122297563406 1571771155 1 466203935 0 0 0
cpu0 210529374 798 71934081 3097761442 37317204 0 1223673 0 0 0
cpu1 237751577 1778 91449130 3019536044 37184211 0 13380931 0 0 0
cpu8 239638291 1167 116001156 2962039241 30843709 0 30211243 0 0 0
cpu9 269755516 1880 113588929 2961750417 32115791 0 19276323 0 0 0
cpu16 228567958 1161 108502269 2992498894 28417901 0 28767726 0 0 0
cpu17 251498095 1495 126969087 2938759279 27389544 0 32721162 0 0 0
cpu24 231150434 852 93015216 3029793639 27119721 0 14603144 0 0 0
cpu25 137531806 921 84615263 3114124739 39851824 0 22232625 0 0 0
cpu32 215124024 673 112114446 2952637984 26308029 1 40111933 0 0 0
cpu33 256551109 1544 111743150 2970181044 27454097 0 22925503 0 0 0
cpu40 177519757 1084 71290902 3111356893 44743118 0 4538259 0 0 0
cpu41 179943473 2499 73650183 3105521300 44969221 0 5622327 0 0 0
cpu48 156108758 1073 67851882 3155392034 34826267 0 2933386 0 0 0
cpu49 199211115 3171 127048927 2898501327 27752960 0 49443002 0 0 0
cpu56 191011211 1078 139642309 2925572937 30258304 0 55128571 0 0 0
cpu57 167942154 2672 74608188 3131763260 35619063 0 5621263 0 0 0
cpu64 160974234 1254 67926108 3132800780 38887343 0 7159878 0 0 0
cpu65 169765793 2285 77772957 3104738892 38344926 0 11262281 0 0 0
cpu72 213819750 1044 152220469 2811436816 22179993 0 67412060 0 0 0
cpu73 155975407 1883 76656689 3140443549 31554885 0 8580692 0 0 0
cpu80 197552796 1017 83003415 3092812771 51384382 0 1539715 0 0 0
cpu81 233223674 1793 94216308 3046515260 52156677 0 1487468 0 0 0
cpu88 184478236 1161 75460945 3119486065 46962061 0 1240959 0 0 0
cpu89 221675711 1830 87628006 3070364778 47567167 0 1216703 0 0 0
cpu96 181927575 1144 73271712 3124649571 47075049 0 1128086 0 0 0
cpu97 221897022 2129 86269696 3072078328 47341000 0 1101766 0 0 0
cpu104 180371966 1302 71958169 3129172289 45687568 0 1052352 0 0 0
cpu105 220363284 2082 85266637 3076223870 45962921 0 1037863 0 0 0
cpu112 174262084 997 68504557 3133664162 48250518 0 1379928 0 0 0
cpu113 219280078 1553 83761582 3075238583 48728204 0 1225697 0 0 0
cpu120 191946710 1252 79339141 3104787232 45251431 0 1237882 0 0 0
cpu121 253583143 3206 98296310 3025628688 44569324 0 1200333 0 0 0
cpu128 185044137 952 74502102 3118945397 43030421 0 1100612 0 0 0
cpu129 245251896 2755 93550425 3042404865 42175494 0 1059500 0 0 0
cpu136 182876887 1242 73482891 3123417113 42424789 0 1045100 0 0 0
cpu137 244936411 2860 93029795 3044197993 41772517 0 1003467 0 0 0
cpu144 181203450 937 72196738 3127510274 42297064 0 1011936 0 0 0
cpu145 246280361 2276 92379381 3044303598 41262658 0 976966 0 0 0
cpu152 179970566 795 71933409 3128582270 42479658 0 996457 0 0 0
cpu153 245859463 1970 92508719 3039980587 42244650 0 1005115 0 0 0
intr 122721035303 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3128896178 15497414 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 1 1 4064409322 1378 656 679 3884828945 415 512 213 3727377568 273 126 276 4245402794 268 426 411 3787527497 670 374 347 3814009309 427 299 447 3008395425 423 308 242 3134785643 210 185 165 3487894371 430 364 166 2679178657 244 429 176 4014512551 279 540 370 3869687906 567 266 141 3849291248 450 283 548 3830969497 365 262 257 3782323862 440 439 369 3948792940 897 822 913 1998444149 1887584750 708 608 571 534 567 363 1979115214 1872338555 390 506 329 485 301 210 1980195003 1867024727 589 613 427 429 568 461 1972705815 1856227061 176 593 414 258 414 585 1099607746 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 456168 46563535 73131454 2972288 1465270 3520195454 0 0 0 0 851 626 462 3914722953 733 558 722 3755857918 388 359 429 4277887902 612 573 656 3816828650 567 373 585 3843613548 210 357 299 3031439073 342 430 206 3159322338 639 357 229 3515604336 823 442 0 218 2700676733 428 287 297 4044798525 207 141 53 3899145730 310 157 292 3878797779 375 299 450 3860069104 172 87 186 3811264907 508 406 422 3974788101 1416 1381 1056 2012495732 1900839748 613 579 355 729 545 781 1993182274 1885567777 96 89 130 134 219 87 1994780799 1880888268 771 301 482 277 0 66 0 0 0 360 619 1988359809 1874857844 93 68 96 163 250 149 0 1064638132 4065362636 616 325 335 3886100618 410 239 281 3728519525 685 578 446 4247118824 439 325 215 3788718525 410 576 157 3816705493 2248 2646 2071 3009306665 1681 1228 1328 3137655351 1240 639 879 3493537243 905 708 808 2682147986 425 332 314 4014257596 314 174 347 3870004085 233 428 77 3849836502 172 270 316 3831413403 463 349 376 3782517237 374 462 307 3941267567 364 311 221 1997533378 1887946177 240 113 74 195 96 240 1977784909 1872167815 150 121 287 235 192 140 1979098358 1866868573 284 260 272 135 53 76 1970244383 1853816186 249 370 143 265 68 24 2173007432 2666694876 2777560602 124 0 0 0 525699536 106 104 
1985531297 1868266744 78 187 171 187 1994301484 1881290442 189 143 70 117 1992805789 1886568679 55 62 15 86 2012789066 1902499895 138 179 83 227 3810789577 288 363 222 449 208 3878610267 242 403 228 196 349 4042437467 135 222 180 346 224 3511947814 298 231 174 469 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3035666189 394 727 0 0 0 0 1026 3841295218 839 3820090696 1153 769 4281643645 1308 1043 3758256793 1289 1933 3916493323 1146 1904414565 4098071720 2660 1904414565 1904414564 1904414567 0 0 1904414559 1904414568 1904414566 1904414563 1904414564 32 1904414585 1904414564 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1904414567 1904414563 1904414562 1904414565 1797 661 768 905 663 563 1383 345 185 3160364646 319 2703000621 257 3898906895 189 3860330792 425 3971267893 60 55 42 91 19 415 21 92 210 4096391705 1099492992 2796586590 3606236351 3465276579 1046525282 4087651648 3872132915 2467534876 0 0 0 0 2896046409 349990600 1874885669 1314459405 216835190 0 0 0 0 0 0 0 0 1099609995 1013707604 1422155890 1528581011 2495664837 2176146310 963948951 1139885918 1298571117 0 0 0 0 0 0 792337502 2307436680 695452964 2979709077 3270874698 3982257419 0 1 1 1 1 1
ctxt 1730101731089
btime 1532589383
processes 419092027
procs_running 5
procs_blocked 0
softirq 224093755226 122 2084931283 198155958 2873897314 2070063624 2790149968 3060296352 3820041910 0 1037788487

Comment 8 Mark Goodwin 2019-09-19 04:02:25 UTC
Charles, I have a fix for this, see upstream PR 748:

  https://github.com/performancecopilot/pcp/pull/748
  "pmdalinux - fix handling of discontiguous NUMA node numbering, plus QA #748".


I don't have a ppc64le NUMA box to test it properly on, though, so I created a fake sysfs root on a 4-node x86_64 NUMA VM and wrote a new PCP QA test to delete node2 and its CPUs and memory, leaving nodes 0, 1 and 3. Would you be able to build this on ppc64le? Or would you need me to build you a test binary?

Pending testing on actual ppc64le NUMA hardware, this has not yet been merged to the master branch.

Regards
-- Mark

Comment 9 Mark Goodwin 2019-09-23 06:01:06 UTC
Now merged into upstream master and will be in PCP-5.0.0

commit fd82613f39f11ce440052735f8b5310a35cbeead (upstream-goodwinos/numa-discontig, numa-discontig)
Author: Mark Goodwin <mgoodwin@redhat.com>
Date:   Wed Sep 18 12:10:43 2019 +1000

    pmdalinux: correctly handle sparse / discontiguous numa nodes
    
    RHBZ#1730492 - Some numa nodes have no instances
    
    The Linux PMDA was incorrectly assuming numa node numbering is sequential
    and 1:1 with internal numa node instance IDs. This assumption is incorrect
    on systems with sparse or discontiguous node numbering (such as some ppc64le
    platforms), leading to missing instances and/or incorrect per-node metric
    values.
    
    Tested and verified by qa/1393.
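
The commit message above describes an assumption that node numbering is sequential. As a rough illustration only (this is not the pmdalinux code, and the function names are hypothetical), the Python sketch below contrasts a probe that counts upward from node0 with a scan that takes the node id from each directory name. The entry list mirrors the node directories reported in Comment 7; "online" and "possible" are assumed extra sysfs entries added to show filtering.

```python
import re

# Matches sysfs-style NUMA node directory names such as "node16".
NODE_NAME = re.compile(r"node(\d+)")


def sequential_probe(entries):
    """Illustrative faulty approach: count up from node0, stop at the first gap."""
    found = []
    node = 0
    while "node%d" % node in entries:
        found.append(node)
        node += 1
    return found


def sparse_scan(entries):
    """Take the node id from every matching name, so gaps do not matter."""
    ids = []
    for name in entries:
        match = NODE_NAME.fullmatch(name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


# Node directories as listed in Comment 7, plus two assumed non-node entries.
entries = ["node0", "node1", "node16", "node17", "online", "possible"]

print(sequential_probe(entries))  # [0, 1]
print(sparse_scan(entries))       # [0, 1, 16, 17]
```

The sequential probe happens to reproduce the reported symptom (only node0 and node1), but that is one possible failure mode under the stated assumption, not a confirmed description of the original defect.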

commit 619c1eb3c5b4af51b03d20dcb89c8dca6bece0f4
Author: Mark Goodwin <mgoodwin@redhat.com>
Date:   Wed Sep 18 11:46:36 2019 +1000

    qa: new test 1393 tests pmdalinux on numa systems with discontiguous nodes
    
    RHBZ#1730492 - Some numa nodes have no instances
    
    This new test verifies linux PMDA numa metrics on systems with
    non-contiguous or sparse numa nodes. Note the BZ is reported on
    a ppc64le system with sparse numa node numbering.
    
    The fake sysfs root (qa/linux/sysfs-numa-001.tgz) was captured
    on a 4 numa node VM and is used by qa/1393 to check various hinv,
    per-cpu and per-node metrics are correct. The qa test then
    deletes node2 from the sysfs fake root, leaving nodes 0, 1 and 3
    (i.e. simulating discontiguous/sparse node numbering) and then
    verifies the expected metric values. The 1393.out qualified output
    shows the correct values - this requires the fix to the linux PMDA
    for RHBZ#1730492 (in the next commit).
    
    modified:   qa/1393
    modified:   qa/1393.out
    modified:   qa/group
    new file:   qa/linux/sysfs-numa-001.tgz
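
The approach described for qa/1393 (start from a 4-node layout, delete node2, check that nodes 0, 1 and 3 remain) can be sketched in Python as follows. This is a simplified stand-in, not the QA test itself: it builds empty directories in a temporary location instead of unpacking qa/linux/sysfs-numa-001.tgz, and the helper names make_fake_sysfs and scan_nodes are hypothetical.

```python
import os
import re
import shutil
import tempfile


def make_fake_sysfs(root, node_ids):
    """Create empty <root>/sys/devices/system/node/nodeN directories."""
    base = os.path.join(root, "sys", "devices", "system", "node")
    for node in node_ids:
        os.makedirs(os.path.join(base, "node%d" % node))
    return base


def scan_nodes(base):
    """Return the sorted node ids found under base, tolerating gaps."""
    ids = []
    for name in os.listdir(base):
        match = re.fullmatch(r"node(\d+)", name)
        if match and os.path.isdir(os.path.join(base, name)):
            ids.append(int(match.group(1)))
    return sorted(ids)


root = tempfile.mkdtemp(prefix="fake-sysfs-")
try:
    base = make_fake_sysfs(root, [0, 1, 2, 3])
    before = scan_nodes(base)
    # Simulate discontiguous numbering by removing node2.
    shutil.rmtree(os.path.join(base, "node2"))
    after = scan_nodes(base)
finally:
    shutil.rmtree(root)

print(before)  # [0, 1, 2, 3]
print(after)   # [0, 1, 3]
```

Building the tree on the fly keeps the sketch self-contained; a captured tarball, as used by the real test, additionally preserves per-node files such as meminfo and the cpu links.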

Comment 10 Mark Goodwin 2019-10-01 05:31:42 UTC
In addition to the changes listed in Comment #9, it turns out some additional changes were needed due to qa/1393 failures on some platforms and/or filesystems:

commit 3cf6b72497cc0d413616dbdd84662acc3d1475a2
Author: Mark Goodwin <mgoodwin@redhat.com>
Date:   Tue Oct 1 15:26:19 2019 +1000

    qa: update and remake qa/1393 to avoid directory traversal differences
    
    RHBZ #1730492
    
    pminfo listings need to be sorted on instance names due to directory
    traversal differences on different platforms and filesystems.

commit c4069a4cc8f811b05aa31432d9d41f2aa77bea0d
Author: Mark Goodwin <mgoodwin@redhat.com>
Date:   Tue Oct 1 15:20:13 2019 +1000

    pmdalinux: use cpu instname, not instid for per-cpu numa stats
    
    RHBZ#1730492
    
    CPU instid does not necessarily match up with cpuid (instname)
    due to directory traversal differences on different platforms
    and filesystems.
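
Both commits above deal with directory traversal order differing between platforms and filesystems. The Python sketch below, which uses a made-up traversal order and hypothetical names, shows why a position in the listing cannot stand in for the CPU id, and how sorting on the numeric part of the instance name gives a stable order. It illustrates the idea only and is not taken from the PCP sources.

```python
import re


def numeric_suffix(name):
    """Return the trailing number of an instance name, e.g. 'cpu16' -> 16."""
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError("no numeric suffix in %r" % name)
    return int(match.group(1))


# Hypothetical order in which a directory traversal might return entries.
listing = ["cpu0", "cpu16", "cpu8", "cpu1"]

# Keying on position ties the result to traversal order.
by_position = dict(enumerate(listing))

# Keying on the id parsed from the name is independent of traversal order.
by_name = {numeric_suffix(name): name for name in listing}

# Sorting on the parsed id gives the same output on every platform.
ordered = sorted(listing, key=numeric_suffix)

print(by_position[1])  # cpu16
print(by_name[1])      # cpu1
print(ordered)         # ['cpu0', 'cpu1', 'cpu8', 'cpu16']
```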

