Bug 1293444 - RFE: need hba and fc target aggregation
Status: NEW
Product: Fedora
Classification: Fedora
Component: pcp
Version: rawhide
Severity: high
Assigned To: Mark Goodwin
QA Contact: qe-baseos-tools
Reported: 2015-12-21 13:54 EST by Dwight (Bud) Brown
Modified: 2017-08-23 11:12 EDT (History)

Doc Type: Bug Fix
Type: Bug


Attachments: None
Description Dwight (Bud) Brown 2015-12-21 13:54:13 EST
Description of problem:
I have an iostat plug-in that uses sysfs data to aggregate I/O by HBA, by FC target port, and even by LUN (via wwid), and plots the data for use with performance cases. Given that PCP is a replacement for sysstat, I need that same functionality in PCP.

Additional info:
Device: rrqm/s   wrqm/s  r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdf      0.00     0.00  0.00  1.00     0.00    16.00    32.00     0.00    1.00   1.00   0.10
sdau     0.00     0.00  0.00  1.00     0.00    16.00    32.00     0.00    1.00   1.00   0.10
sdbi     0.00     0.00  2.00  1.00    32.00    16.00    32.00     0.00    0.33   0.33   0.10
sdbx     0.00     0.00  0.00  1.00     0.00     0.50     1.00     0.00    0.00   0.00   0.00
sddv     0.00     0.00  1.00  1.00     0.50     0.50     1.00     0.00    0.00   0.00   0.00
sdfq     0.00     0.00  1.00  2.00     8.00    16.00    16.00     0.00    1.00   0.67   0.20
sdgh     0.00     0.00  0.00  1.00     0.00     8.00    16.00     0.00    1.00   1.00   0.10
sdgu     0.00     0.00  0.00  1.00     0.00    16.00    32.00     0.00    1.00   1.00   0.10
sdim     0.00     0.00  0.00  1.00     0.00    16.00    32.00     0.00    1.00   1.00   0.10
sdin     0.00     0.00  3.00  1.00    48.00    16.00    32.00     0.00    0.25   0.25   0.10
sdis     0.00     0.00  1.00  1.00    16.00    16.00    32.00     0.00    1.00   1.00   0.20
host0    0.00     0.00  8.00 12.00   104.50   137.00    24.15     0.00    0.60   0.55   0.01

host0 is the scsi0 HBA.
Comment 2 Mark Goodwin 2015-12-21 18:29:35 EST
Hi Bud, this is a really good RFE, and it's been on my pcp-iostat wish list to implement for a long time. PCP has everything needed to do this aggregation using the per-disk metrics (disk.dev.*) as well as the SCSI map info in hinv.map.scsi (which is essentially /proc/scsi/scsi for each sd path), e.g.

$ pminfo -h goody.usersys -f hinv.map.scsi disk.dev.read

hinv.map.scsi
    inst [0 or "scsi0:0:0:0 Direct-Access"] value "sda"
    inst [1 or "scsi1:0:0:0 Direct-Access"] value "sdb"
    inst [2 or "scsi2:0:0:0 Direct-Access"] value "sdc"
    inst [3 or "scsi3:0:0:0 Direct-Access"] value "sdd"

disk.dev.read
    inst [0 or "sda"] value 68211
    inst [1 or "sdb"] value 244414
    inst [2 or "sdc"] value 18829883
    inst [3 or "sdd"] value 15263026

So teaching pcp-iostat to aggregate at various levels is completely feasible.
We could also introduce filtering for when you're only interested in certain devices (it just avoids piping through grep or whatever, but it's handy).

We should also do something similar for the per-dm metrics, e.g. pcp iostat -x dm will report per-dm devices (including lvm, multipath, etc.). Similar aggregation would be good there too, e.g. for all LVM devices using a particular VG, or PV, etc. Bryn has been doing some upstream LVM work in this area too.
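As a rough sketch, the HBA-level aggregation described above can be prototyped from the two instance domains shown in the pminfo output. The helper name is hypothetical (not existing PCP code), and it assumes hinv.map.scsi instance names follow the "scsiH:B:T:L ..." pattern seen above:

```python
import re

def aggregate_by_host(scsi_map, disk_values):
    """Sum a per-disk counter (e.g. disk.dev.read) by SCSI host (HBA).

    scsi_map:    hinv.map.scsi-style instances, e.g.
                 {"scsi0:0:0:0 Direct-Access": "sda", ...}
    disk_values: disk.dev.*-style values, e.g. {"sda": 68211, ...}
    Returns      {"scsi0": total, ...}
    """
    totals = {}
    for inst, dev in scsi_map.items():
        # Take the hostN part of the host:bus:target:lun instance name.
        m = re.match(r"(scsi\d+):", inst)
        if m and dev in disk_values:
            host = m.group(1)
            totals[host] = totals.get(host, 0) + disk_values[dev]
    return totals
```

Run against the sample above, this yields one total per HBA, which is exactly the host0-style row in the requested iostat output.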
Comment 3 Dwight (Bud) Brown 2015-12-22 09:58:01 EST
"PCP has everything needed to do this aggregation"

It also needs the device wwid to do LUN aggregation and the FC target wwpn to do FC target aggregation.  I'm assuming those are already present too.

Aggregation by wwid is different from the multipath device stats, often because queue backlog changes the stats in the dmm device versus the actual I/O stats of the underlying set of sd devices.

Currently
mapdevs -gw > io.map
iostat -tkx 1 | iopseudo -m=io.map -HTL > iostat.log

Host4             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host6             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host7             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host8             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host9             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Host10            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
TgtB-63d32        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
TgtA-63d30        0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunA-0001         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunA-0000         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunA-0002         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunA-0003         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunB-0004         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunB-0005         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunB-0006         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
LunB-0007         0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
:
.
Comment 6 Mark Goodwin 2015-12-22 19:50:58 EST
(In reply to Dwight (Bud) Brown from comment #3)
> "PCP has everything needed to do this aggregation"

Well, it has enough for pmiostat to implement aggregation for combinations of regexes over host:bus:target:lun or sd or dm devices, but that's only going to account for multipathing by careful choice of devices.

> 
> It also needs device wwid to do lun aggregation and FC target wwpn to do FC
> target aggregation.  I'm assuming those are already present too. 

No, they are not present, and yes, agreed: we need wwid and wwpn maps and I/O stat metrics. I will investigate some more and likely add new hinv.map.* and disk.{wwid,wwpn}.* metrics.

> 
> Aggregation by wwid is different from the multipath device stats, often due
> to the queue backlog changing the stats in dmm device vs actual io stats of
> the underlying set of sd devices.

Wouldn't it be enough to aggregate by specific sd or h:b:t:l paths, if only to compare against the corresponding disk.dm.* multipath device stats?
Comment 7 Dwight (Bud) Brown 2015-12-23 11:56:40 EST
"wouldn't it be enough to aggregate by specific sd or h:b:t:l paths"

No. As one recent case showed, aggregating by SCSI host showed the FC link bandwidth at 50% on each HBA link.  The problem was on the back side of the switch, where there were shared links.  0:0:0:* could be on the same back-side link as 1:0:1:*, or might not be; only by looking at the 0:0:0 FC target port wwpn can traffic to the same port (and thus over the same FC link) be aggregated properly.  So while each front-of-switch (dedicated) link between HBA and switch was at 50%, the back-of-switch (shared) link between switch and FC storage port was doubling up traffic at times and was likely a bottleneck.  Adding in the other hosts sharing the storage controller (which we couldn't see or measure) made this even more likely.

btw, wwid should be present in sysfs on RHEL 7.  On RHEL 6 it's available via multipath -ll, but only if dmm is used as the multipath solution, which isn't always the case, so I have a script that pulls the wwid directly from each device via h:b:t:l.

But as seen in the output I included, while it's often the case that lun 0 [0:0:0:0] is the same as lun 0 [1:0:0:0], that isn't always the case -- the data posted has, for example, lun 53 on one storage target port being the same as lun 55 on a different storage target port.  Only by comparing wwids can this be determined.  If dmm were the only multipath solution deployed by customers, then you could use multipath -ll reports to see each set of paths under the same dmm device... but since there are multiple multipath solutions deployed by customers, we can't depend on that method within support.
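For reference, a minimal sketch of pulling the wwid directly from sysfs, via the /sys/block/&lt;dev&gt;/device/wwid attribute mentioned above for RHEL 7 (the function name is hypothetical):

```python
import os

def read_wwids(sysfs_block="/sys/block"):
    """Map each block device to its wwid via sysfs, skipping entries
    without a device/wwid attribute (non-SCSI devices, older kernels)."""
    wwids = {}
    for dev in os.listdir(sysfs_block):
        path = os.path.join(sysfs_block, dev, "device", "wwid")
        try:
            with open(path) as f:
                wwids[dev] = f.read().strip()
        except OSError:
            continue  # no wwid attribute for this entry
    return wwids
```

Unlike multipath -ll, this works regardless of which (if any) multipath solution is deployed.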
Comment 8 Mark Goodwin 2015-12-23 17:00:02 EST
(In reply to Dwight (Bud) Brown from comment #7)
> "wouldn't it be enough to aggregate by specific sd or h:b:t:l paths"
> 
> no. as one recent case showed, aggregating by scsi host showed the FC link
> bandwidth used at 50% on each hba link.  The problem was on the back side of
> the switch where there was shared links.  0:0:0:* could be the same back
> side link as 1:0:1:* or might not be, only by looking the the 0:0:0 fc
> target port wwpn can traffic to the same port (and thus over the same fc
> link) be aggregated properly.  So while each front of switch (dedicated)
> link between hba and switch was at 50% the back of switch (shared) link
> between switch and FC storage port was doubling up traffic at times and
> likely a bottleneck.  Adding in the other hosts sharing the storage
> controller (that we couldn't see/measure), made this even more likely.

Ok, yep, a typical back-end bottleneck. We used to have an "fcswitch" PCP plugin that provided per-port FC switch metrics: tx, rx, and various error stats. It worked with Brocade and a few other switch vendors. We never open-sourced it, though, and it was a pain to set up because it used the management interface over telnet, which needed a login/password etc., and customers were often reluctant to give out that info or have it configured somewhere in PCP in cleartext.

The fcswitch agent used to make it really easy to identify back-end FC bottlenecks and protocol errors; it helped if the customer could tell you which port was plugged into where (though you can also figure this out with the wwpn and FC topology).

An easier and far more direct solution is to use SNMP: all the switches export FC stats over SNMP using a standard MIB, and PCP has an SNMP gateway plugin (see src/pmdas/snmp in the PCP source). The standard FC switch MIB is defined at https://tools.ietf.org/html/rfc4044, and we can easily enough plug that into PCP's SNMP gateway; there's a boatload of useful stats in that MIB. The resulting stats can be monitored with generic PCP clients such as pmchart, pmrep, pmdumptext, etc., or we could tailor a specific tool.

> 
> btw, wwid should be present in sysfs on rhel7.  on rhel6 its available via
> multipath -ll, but only if dmm is used as multipath solution which isn't
> always the case so I have a script to pulls the wwid directly from each
> device via h:b:t:l.

For wwid, we can use code similar to multipath's (see various functions in libmultipath/discovery.c) and then provide a hinv.map.wwid or similar metric (much like hinv.map.scsi for h:b:t:l) that maps devices/paths to their wwid. Client tools can then figure out multipaths etc. using that info, much the same as the multipath daemon does.

> 
> But as seen in the output I included, while its often the case that lun 0
> [0:0:0:0] is the same as lun 0 [1:0:0:0], that isn't always the case -- the
> data posted has, for example, lun 53 on one storage target port being the
> same as lun 55 on a different storage target port.  Only by comparing wwids
> can this be determined. ...

Yep, unfortunate aliasing like that sometimes happens. The hinv.map.wwid and aggregated disk.wwid.* metrics will definitely help with this, and pmiostat will be easy to extend, e.g. with "-x wwid" reporting.
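A sketch of what per-wwid (per-LUN) aggregation over such a map could look like, assuming a hypothetical hinv.map.wwid-style dev-to-wwid mapping (the function name is illustrative, not a proposed API):

```python
def aggregate_by_wwid(wwid_map, disk_values):
    """Sum a per-path counter across all sd paths sharing a wwid,
    giving per-LUN totals independent of the multipath stack in use.

    wwid_map:    {"sda": "naa.xxxx", "sdb": "naa.xxxx", ...}
    disk_values: disk.dev.*-style values, e.g. {"sda": 10, ...}
    """
    totals = {}
    for dev, wwid in wwid_map.items():
        if dev in disk_values:
            totals[wwid] = totals.get(wwid, 0) + disk_values[dev]
    return totals
```

Paths with different LUN numbers (e.g. lun 53 on one target port and lun 55 on another) that share a wwid land in the same bucket, which resolves the aliasing described above.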
Comment 9 Nathan Scott 2016-01-07 19:15:59 EST
BZ state admin, after following up with Mark.
Comment 10 Mark Goodwin 2016-01-07 19:19:22 EST
will work on this upstream to begin with
