Bug 1293444

Summary: RFE: need hba and fc target aggregation
Product: [Fedora] Fedora              Reporter: Dwight (Bud) Brown <bubrown>
Component: pcp                        Assignee: Mark Goodwin <mgoodwin>
Status: CLOSED RAWHIDE                QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high                        Docs Contact:
Priority: unspecified
Version: rawhide                      CC: fche, lberk, mbenitez, mgoodwin, nathans, pcp
Target Milestone: ---                 Flags: bubrown: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:                     Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:                             Environment:
Last Closed: 2022-03-24 06:17:41 UTC  Type: Bug
Regression: ---                       Mount Type: ---
Documentation: ---                    CRM:
Verified Versions:                    Category: ---
oVirt Team: ---                       RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---                  Target Upstream Version:
Embargoed:

Description Dwight (Bud) Brown 2015-12-21 18:54:13 UTC
Description of problem:
I have an iostat plug-in that uses sysfs data to aggregate I/O by HBA, FC target port and even by LUN (via wwid), and plots the data for use with performance cases.  Given that PCP is a replacement for sysstat, I need that same functionality in PCP.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Device: rrqm/s   wrqm/s  r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdf      0.00     0.00  0.00  1.00     0.00    16.00    32.00     0.00    1.00   1.00   0.10
sdau     0.00     0.00  0.00  1.00     0.00    16.00    32.00     0.00    1.00   1.00   0.10
sdbi     0.00     0.00  2.00  1.00    32.00    16.00    32.00     0.00    0.33   0.33   0.10
sdbx     0.00     0.00  0.00  1.00     0.00     0.50     1.00     0.00    0.00   0.00   0.00
sddv     0.00     0.00  1.00  1.00     0.50     0.50     1.00     0.00    0.00   0.00   0.00
sdfq     0.00     0.00  1.00  2.00     8.00    16.00    16.00     0.00    1.00   0.67   0.20
sdgh     0.00     0.00  0.00  1.00     0.00     8.00    16.00     0.00    1.00   1.00   0.10
sdgu     0.00     0.00  0.00  1.00     0.00    16.00    32.00     0.00    1.00   1.00   0.10
sdim     0.00     0.00  0.00  1.00     0.00    16.00    32.00     0.00    1.00   1.00   0.10
sdin     0.00     0.00  3.00  1.00    48.00    16.00    32.00     0.00    0.25   0.25   0.10
sdis     0.00     0.00  1.00  1.00    16.00    16.00    32.00     0.00    1.00   1.00   0.20
host0    0.00     0.00  8.00 12.00   104.50   137.00    24.15     0.00    0.60   0.55   0.01

host0 is the scsi0 HBA.

Comment 2 Mark Goodwin 2015-12-21 23:29:35 UTC
Hi Bud, this is a really good RFE and it's been on my pcp-iostat wish list for a long time. PCP has everything needed to do this aggregation using the per-disk metrics (disk.dev.*) as well as the scsi map info in hinv.map.scsi (which is essentially /proc/scsi/scsi for each sd path), e.g.:

$ pminfo -h goody.usersys -f hinv.map.scsi disk.dev.read

hinv.map.scsi
    inst [0 or "scsi0:0:0:0 Direct-Access"] value "sda"
    inst [1 or "scsi1:0:0:0 Direct-Access"] value "sdb"
    inst [2 or "scsi2:0:0:0 Direct-Access"] value "sdc"
    inst [3 or "scsi3:0:0:0 Direct-Access"] value "sdd"

disk.dev.read
    inst [0 or "sda"] value 68211
    inst [1 or "sdb"] value 244414
    inst [2 or "sdc"] value 18829883
    inst [3 or "sdd"] value 15263026

So teaching pcp-iostat to aggregate at various levels is completely feasible.
We could also introduce filtering, for when you're only interested in certain devices (it just avoids piping through grep or whatever, but it's handy).
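
As a concrete illustration, here's a minimal client-side sketch of that join: it parses "pminfo -f" output (as shown above) rather than using the PCP APIs, and sums disk.dev.read per scsi host. This is only a rough assumption of how a tool could do it, not the eventual pcp-iostat implementation:

#!/usr/bin/env python3
# Rough sketch only: aggregate disk.dev.read by scsi host by joining the
# hinv.map.scsi and disk.dev.read instance domains, parsing pminfo -f output.
import re
import subprocess
from collections import defaultdict

def pminfo_insts(metric, host=None):
    """Return {instance name: value string} for a metric via 'pminfo -f'."""
    cmd = ["pminfo", "-f", metric] if host is None else ["pminfo", "-h", host, "-f", metric]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return dict(re.findall(r'inst \[\d+ or "([^"]+)"\] value "?([^"\n]+)"?', out))

# e.g. "scsi0:0:0:0 Direct-Access" -> "sda", and "sda" -> "68211"
scsi_map = pminfo_insts("hinv.map.scsi")
reads = pminfo_insts("disk.dev.read")

per_host = defaultdict(int)
for hbtl, sd in scsi_map.items():
    per_host[hbtl.split(":")[0]] += int(reads.get(sd, 0))   # key is "scsi0", "scsi1", ...

for host, total in sorted(per_host.items()):
    print(host, total)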

We should also do the same for the per-dm metrics, e.g. pcp iostat -x dm will report per-dm devices (including lvm, multipath, etc.). Similar aggregation would be good there too, e.g. for all LVM devices using a particular VG or PV. Bryn has been doing some upstream LVM work in this area too.

Comment 3 Dwight (Bud) Brown 2015-12-22 14:58:01 UTC
"PCP has everything needed to do this aggregation"

It also needs device wwid to do lun aggregation and FC target wwpn to do FC target aggregation.  I'm assuming those are already present too.  

Aggregation by wwid is different from the multipath device stats, often because queue backlog changes the stats of the dmm device vs the actual io stats of the underlying set of sd devices.

Currently
mapdevs -gw > io.map
iostat -tkx 1 | iopseudo -m=io.map -HTL > iostat.log

Host4             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host6             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host7             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host8             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host9             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host10            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
TgtB-63d32        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
TgtA-63d30        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunA-0001         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunA-0000         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunA-0002         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunA-0003         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunB-0004         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunB-0005         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunB-0006         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunB-0007         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
:
.
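
Roughly, the post-processing those scripts do looks something like the following sketch (this is not the actual mapdevs/iopseudo code, just an illustration of the idea): read extended iostat output on stdin and sum the rate columns for all sd devices that map to the same pseudo device (HBA, FC target or LUN). Averaged columns such as await and %util can't simply be summed, so they're omitted here.

#!/usr/bin/env python3
# Illustrative sketch only (not the actual iopseudo script): fold iostat -kx
# rows into pseudo-device rows using an sd-device -> group map.
# Sums over the whole run for simplicity; a real filter would emit per-interval rows.
import sys
from collections import defaultdict

# Stand-in for the io.map file: sd device -> pseudo device (HBA/target/LUN)
DEVMAP = {"sdf": "Host0", "sdau": "Host0", "sdbi": "Host1"}   # example entries only

# Column indices in the 'iostat -kx' header shown earlier:
# Device rrqm/s wrqm/s r/s w/s rkB/s wkB/s ...
SUM_COLS = {"r/s": 3, "w/s": 4, "rkB/s": 5, "wkB/s": 6}

totals = defaultdict(lambda: defaultdict(float))
for line in sys.stdin:
    fields = line.split()
    if not fields or fields[0] not in DEVMAP:
        continue          # skips headers, timestamps and unmapped devices
    group = DEVMAP[fields[0]]
    for name, col in SUM_COLS.items():
        totals[group][name] += float(fields[col])

for group, vals in sorted(totals.items()):
    print(group, " ".join(f"{name}={vals[name]:.2f}" for name in SUM_COLS))

Usage against the pipeline above would be something like: iostat -tkx 1 | python3 iopseudo_sketch.py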

Comment 6 Mark Goodwin 2015-12-23 00:50:58 UTC
(In reply to Dwight (Bud) Brown from comment #3)
> "PCP has everything needed to do this aggregation"

Well, it has enough for pmiostat to implement aggregation for combinations of regexes over host:bus:target:lun or sd or dm devices, but that's only going to account for multipathing by careful choice of devices.

> 
> It also needs device wwid to do lun aggregation and FC target wwpn to do FC
> target aggregation.  I'm assuming those are already present too. 

No, they are not present, and yes, agreed - we need wwid and wwpn maps and I/O stat metrics. I will investigate some more and likely add new hinv.map.* and disk.{wwid,wwpn}.* metrics.

> 
> Aggregation by wwid is different from the multipath device stats, often due
> to the queue backlog changing the stats in dmm device vs actual io stats of
> the underlying set of sd devices.

Wouldn't it be enough to aggregate by specific sd or h:b:t:l paths, if only to compare against the corresponding disk.dm.* multipath device stats?

Comment 7 Dwight (Bud) Brown 2015-12-23 16:56:40 UTC
"wouldn't it be enough to aggregate by specific sd or h:b:t:l paths"

No. As one recent case showed, aggregating by scsi host showed the FC link bandwidth at 50% on each HBA link.  The problem was on the back side of the switch, where there were shared links.  0:0:0:* could be on the same back-side link as 1:0:1:*, or might not be; only by looking at the 0:0:0 FC target port wwpn can traffic to the same port (and thus over the same FC link) be aggregated properly.  So while each front-of-switch (dedicated) link between HBA and switch was at 50%, the back-of-switch (shared) link between switch and FC storage port was doubling up traffic at times and was likely a bottleneck.  Adding in the other hosts sharing the storage controller (that we couldn't see/measure) made this even more likely.

Btw, wwid should be present in sysfs on rhel7.  On rhel6 it's available via multipath -ll, but only if dmm is used as the multipath solution, which isn't always the case, so I have a script that pulls the wwid directly from each device via h:b:t:l.
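
For example, on rhel7 a minimal sketch of pulling that map straight from sysfs (hypothetical, not the actual script) could look like:

#!/usr/bin/env python3
# Sketch (not the actual script): group sd devices by the wwid exposed in
# sysfs on RHEL 7-era kernels; prefix encodings can differ between kernels.
import glob
import os
from collections import defaultdict

paths_by_wwid = defaultdict(list)
for sysdir in glob.glob("/sys/block/sd*"):
    try:
        with open(os.path.join(sysdir, "device", "wwid")) as f:
            wwid = f.read().strip()
    except OSError:
        continue
    paths_by_wwid[wwid].append(os.path.basename(sysdir))

for wwid, paths in sorted(paths_by_wwid.items()):
    print(wwid, " ".join(sorted(paths)))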

But as seen in the output I included, while it's often the case that lun 0 [0:0:0:0] is the same as lun 0 [1:0:0:0], that isn't always so -- the data posted has, for example, lun 53 on one storage target port being the same as lun 55 on a different storage target port.  Only by comparing wwids can this be determined.  If dmm were the only multipath solution deployed by customers, then you could use multipath -ll reports to see each set of paths under the same dmm device... but since customers deploy multiple multipath solutions, we can't depend on that method within support.

Comment 8 Mark Goodwin 2015-12-23 22:00:02 UTC
(In reply to Dwight (Bud) Brown from comment #7)
> "wouldn't it be enough to aggregate by specific sd or h:b:t:l paths"
> 
> no. as one recent case showed, aggregating by scsi host showed the FC link
> bandwidth used at 50% on each hba link.  The problem was on the back side of
> the switch where there was shared links.  0:0:0:* could be the same back
> side link as 1:0:1:* or might not be, only by looking the the 0:0:0 fc
> target port wwpn can traffic to the same port (and thus over the same fc
> link) be aggregated properly.  So while each front of switch (dedicated)
> link between hba and switch was at 50% the back of switch (shared) link
> between switch and FC storage port was doubling up traffic at times and
> likely a bottleneck.  Adding in the other hosts sharing the storage
> controller (that we couldn't see/measure), made this even more likely.

Ok, yep - a typical back-end bottleneck. We used to have an "fcswitch" PCP plugin that provided per-port FC switch metrics: tx, rx and various error stats. It worked with Brocade and a few other switch vendors. We never open sourced it though, and it was a pain to set up because it used the management interface over telnet, which needed a login/password etc., and customers were often reluctant to give out that info or have it configured somewhere in PCP in cleartext.

The fcswitch agent used to make it really easy to identify back-end FC bottlenecks and protocol errors; it helped if the customer could tell you which port was plugged into where (though you can also figure this out with the wwpn and FC topology).

An easier and far more direct solution is to use snmp - all the switches export FC stats over snmp using a standard MIB, and PCP has an snmp gateway plugin (see src/pmdas/snmp in the PCP source). The standard FC switch MIB is defined at https://tools.ietf.org/html/rfc4044 and we can easily enough plug that into PCP's snmp gateway - there's a boatload of useful stats in that MIB. The resulting stats can be monitored with generic PCP clients such as pmchart, pmrep, pmdumptext, etc., or we could tailor a specific tool.

> 
> btw, wwid should be present in sysfs on rhel7.  on rhel6 its available via
> multipath -ll, but only if dmm is used as multipath solution which isn't
> always the case so I have a script to pulls the wwid directly from each
> device via h:b:t:l.

For wwid, we can use code similar to multipath's (see various functions in libmultipath/discovery.c) and then provide hinv.map.wwid or some such metric (much like hinv.map.scsi for h:b:t:l) that maps devices/paths to their wwid. Client tools can then figure out multipaths etc. using that info, much the same as the multipath daemon does.

> 
> But as seen in the output I included, while its often the case that lun 0
> [0:0:0:0] is the same as lun 0 [1:0:0:0], that isn't always the case -- the
> data posted has, for example, lun 53 on one storage target port being the
> same as lun 55 on a different storage target port.  Only by comparing wwids
> can this be determined. ...

yep, unfortunate aliasing like that sometimes happens. The hinv.map.wwid and aggregated disk.wwid.* metrics will all definitely help with this, and pmiostat will be easy to extend, e.g. with "-x wwid" reporting.

Comment 9 Nathan Scott 2016-01-08 00:15:59 UTC
BZ state admin, after following up with Mark.

Comment 10 Mark Goodwin 2016-01-08 00:19:22 UTC
will work on this upstream to begin with

Comment 11 Mark Goodwin 2022-02-24 05:33:03 UTC
returning to this one after a very long hiatus.

/usr/lib/udev/scsi_id --page=0x83 --whitelisted --device=/dev/sdX
returns the wwid for scsi device sdX, matching what's reported by multipath -ll. Or we could get it from /sys/block/sdX/device/wwid (but that has an annoyingly different prefix encoding and so does the vpd_pg83 sysfs file).
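
For instance, a small sketch of building the sd-path to wwid map by driving the scsi_id command quoted above (illustrative only; the PMDA may well read sysfs instead):

#!/usr/bin/env python3
# Sketch: build an sd device -> wwid map by running the scsi_id command quoted
# above for each sd block device. Illustrative only.
import glob
import os
import subprocess
from collections import defaultdict

def wwid_of(device):
    res = subprocess.run(
        ["/usr/lib/udev/scsi_id", "--page=0x83", "--whitelisted",
         "--device=" + device],
        capture_output=True, text=True)
    return res.stdout.strip() or None

paths_by_wwid = defaultdict(list)
for name in (os.path.basename(p) for p in glob.glob("/sys/block/sd*")):
    wwid = wwid_of("/dev/" + name)
    if wwid:
        paths_by_wwid[wwid].append(name)

# Any wwid with more than one path is multipathed across those sd devices.
for wwid, paths in sorted(paths_by_wwid.items()):
    print(wwid, " ".join(sorted(paths)))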

So if the wwid string for each scsi device/path is exported by PCP as a new metric such as disk.dev.wwid, with the scsi device name as the instance (e.g. "sda") and the wwid string for that device as the value, we can determine which scsi paths are multipathed to the same physical device.

Then we can add new aggregated (summed) metrics for all paths to the same wwid as disk.wwid.* (same pmns subtree as the regular disk.dev.* metrics, but with the indom being the wwid).  Tools such as pcp-iostat (which can already do aggregation based on h:b:t:l patterns, but not yet for all paths with the same wwid) can then be taught about the new metrics. Alternatively, pcp-iostat could be taught to do the aggregation client-side by first fetching disk.dev.wwid, so we wouldn't add new metrics to the linux PMDA .. however, those new metrics would be handy for other tools such as pmrep.
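
As a sketch of that client-side alternative (assuming the proposed disk.dev.wwid metric existed, which it doesn't yet at this point), a tool could join it against disk.dev.read and sum per wwid:

#!/usr/bin/env python3
# Hypothetical sketch: client-side per-wwid aggregation, assuming the proposed
# disk.dev.wwid metric exists. Parses 'pminfo -f' output for illustration only.
import re
import subprocess
from collections import defaultdict

def insts(metric):
    out = subprocess.run(["pminfo", "-f", metric],
                         capture_output=True, text=True, check=True).stdout
    return dict(re.findall(r'inst \[\d+ or "([^"]+)"\] value "?([^"\n]+)"?', out))

wwid_map = insts("disk.dev.wwid")   # hypothetical: e.g. "sdb" -> "333333330000007d1"
reads = insts("disk.dev.read")      # e.g. "sdb" -> "244414"

per_wwid = defaultdict(int)
for sd, wwid in wwid_map.items():
    per_wwid[wwid] += int(reads.get(sd, 0))

for wwid, total in sorted(per_wwid.items()):
    print(wwid, total)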

Comment 12 Mark Goodwin 2022-03-06 23:29:51 UTC
Have issued upstream PR https://github.com/performancecopilot/pcp/pull/1551 which adds new metrics disk.wwid.*. These metrics mirror the disk.dev.* metrics tree, but aggregate each metric for all SD paths to the same WWID.

The new per-WWID aggregated metrics can be monitored using a new pmrep config, e.g.

$ pmrep :iostat-multipath-wwid
                      WWID      SCSI Paths     r/s     rkB/s   rrqm/s r_await rareq-sz     w/s     wkB/s   wrqm/s w_await wareq-sz  aqu-sz  %util
10:21:57 2002538878103cb18         nvme0n1     N/A       N/A      N/A     N/A      N/A     N/A       N/A      N/A     N/A      N/A     N/A    N/A
10:21:57 333333330000007d1 sdb sdc sdd sde     N/A       N/A      N/A     N/A      N/A     N/A       N/A      N/A     N/A      N/A     N/A    N/A
10:21:58 2002538878103cb18         nvme0n1    0.00      0.00     0.00    0.00     0.00   34.85    257.87     0.00    0.91     7.40    0.03   3.39
10:21:58 333333330000007d1 sdb sdc sdd sde    1.00      0.00     0.00    2.00     0.00    0.00      0.00     0.00    0.00     0.00    0.00   0.20
10:21:59 2002538878103cb18         nvme0n1    0.00      0.00     0.00    0.00     0.00    0.00      0.00     0.00    0.00     0.00    0.00   0.00
10:21:59 333333330000007d1 sdb sdc sdd sde    2.00      0.00     0.00    1.00     0.00    0.00      0.00     0.00    0.00     0.00    0.00   0.40
10:22:00 2002538878103cb18         nvme0n1    0.00      0.00     0.00    0.00     0.00   53.02    205.06     0.00    1.19     3.87    0.06   4.60
10:22:00 333333330000007d1 sdb sdc sdd sde    0.00      0.00     0.00    0.00     0.00    0.00      0.00     0.00    0.00     0.00    0.00   0.00

I can also add a new "pcp-iostat -x wwid" option that would provide a similar report (if it's deemed useful).
As with all PCP tools, remote host live monitoring and archive replay come as standard.

So far this has only been tested using multi-target scsi_debug (see the 4-way multipath example above). It needs testing and verification on real multipath-capable h/w, e.g. some host with multiple FC or QL HBAs and an FC switch, or dual-ported scsi or similar. Adding NEEDINFO for Bud to see if he could help out with that .. we can supply test builds for any RHEL or Fedora version. Thanks in advance.

Comment 13 Nathan Scott 2022-03-24 06:17:41 UTC
Closing out now that Mark has retired - the code described in #c12 has been merged upstream.