Bug 1550471
| Summary: | [RHEL-7.5] sos only collect info from the first InfiniBand HCA port | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Honggang LI <honli> | ||||
| Component: | sos | Assignee: | Pavel Moravec <pmoravec> | ||||
| Status: | CLOSED ERRATA | QA Contact: | zguo <zguo> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 7.5 | CC: | agk, bmr, gavin, honli, infiniband-qe, mhradile, mstowell, plambri, pmoravec, rdma-dev-team, sbradley, zguo | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | sos-3.6-1.el7 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-10-30 10:31:19 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
Honggang LI
2018-03-01 09:17:25 UTC
We don't do anything special at all in sos to interrogate particular HCAs or ports. We just run the vanilla infiniband commands:
"ibv_devices",
"ibv_devinfo",
"ibstat",
"ibstatus",
"ibhosts",
"iblinkinfo",
"sminfo",
"perfquery"
It's not clear from the information you have provided which command you are referring to.
If additional options or arguments are required to fetch information for all HCAs and ports then we can add those.
On the other hand, if the above commands are expected to get information for all devices (which seems reasonable since no device was specified) then this is looking more like a bug in the infiniband tools.
(In reply to Bryn M. Reeves from comment #3)
> "ibv_devices",
> "ibv_devinfo",
> "ibstat",
> "ibstatus",
Those four commands report information about *ALL* ports of all HCAs. However, I would suggest running "ibv_devices -v" instead of "ibv_devices".
> "ibhosts",
> "iblinkinfo",
> "sminfo",
> "perfquery"
Those four commands only collect information from the first port.
> It's not clear from the information you have provided which command you are
> referring to.
>
> If additional options or arguments are required to fetch information for all
> HCAs and ports then we can add those.
>
> On the other hand, if the above commands are expected to get information for
> all devices (which seems reasonable since no device was specified) then this
> is looking more like a bug in the infiniband tools.
Yes, I agree. We need to add some switches to those infiniband tools so that they collect info for all ports.
BTW, I can't find any infiniband-diags commands in the sosreport file. That information is _VERY_ useful for getting the topology of the fabric. Is it possible to add such tools?
> run "ibv_devices -v" replace "ibv_devices".
> infiniband-diags commands
No problem: these are both trivial changes to the plugin that we can include in 3.6 (due in rhel-7.6).
We'll hold off on other changes for now until the question over whether to do this in the IB tools or sos is decided.
(In reply to Bryn M. Reeves from comment #5)
> > run "ibv_devices -v" replace "ibv_devices".
> > infiniband-diags commands
>
> No problem: these are both trivial changes to the plugin that we can include
> in 3.6 (due in rhel-7.6).
>
> We'll hold off on other changes for now until the question over whether to
> do this in the IB tools or sos is decided.
Raising the relevant needinfo: waiting for it to be specified what changes are required on the sosreport side, i.e. which particular commands sosreport should newly collect (or which commands it should update, like "ibv_devices -v").
Bouncing the needinfo. A reply would be much welcomed if the BZ is to go into 7.6.
Well, I was bitten by this issue again yesterday, when NetApp filed another multipath-related bug.
> "ibhosts",
> "iblinkinfo",
> "sminfo",
> "perfquery"
We have two options for scanning all active ports.
Option A: Modify all four of those C programs to use "umad_get_cas_names"/"umad_get_ca_portguids"/"umad_get_port" to find all active ports, and then iterate over them.
Option B: Create a Python wrapper that finds all active ports, and then runs those commands for each active port:
iblinkinfo -C <CA> -P <Port NUM>
ibhosts -C <CA> -P <Port NUM>
sminfo -C <CA> -P <Port NUM>
perfquery -C <CA> -P <Port NUM>
It is easy to get the CA list, just read the dir of "/sys/class/infiniband".
[root@rdma-master ~]$ ls /sys/class/infiniband
hfi1_0 hfi1_1 i40iw0 i40iw1 mlx4_0 qib0
It is also easy to get the port list, just read the dir of "/sys/class/infiniband/<CA>/ports/".
[root@rdma-master ~]$ ls /sys/class/infiniband/*/ports/
/sys/class/infiniband/hfi1_0/ports/:
1
/sys/class/infiniband/hfi1_1/ports/:
1
/sys/class/infiniband/i40iw0/ports/:
1
/sys/class/infiniband/i40iw1/ports/:
1
/sys/class/infiniband/mlx4_0/ports/:
1 2
/sys/class/infiniband/qib0/ports/:
1
Now, read the file /sys/class/infiniband/<CA>/ports/<NUM>/state to get the state of each port.
[root@rdma-master ~]$ tail /sys/class/infiniband/*/ports/*/state
==> /sys/class/infiniband/hfi1_0/ports/1/state <==
4: ACTIVE
==> /sys/class/infiniband/hfi1_1/ports/1/state <==
1: DOWN
==> /sys/class/infiniband/i40iw0/ports/1/state <==
1: DOWN
==> /sys/class/infiniband/i40iw1/ports/1/state <==
4: ACTIVE
==> /sys/class/infiniband/mlx4_0/ports/1/state <==
4: ACTIVE
==> /sys/class/infiniband/mlx4_0/ports/2/state <==
4: ACTIVE
==> /sys/class/infiniband/qib0/ports/1/state <==
4: ACTIVE
Is Option B doable?
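The sysfs walk described above (list the CAs, list each CA's ports, read each port's state file) can be sketched in Python as follows. This is only an illustrative sketch, not the patch attached to this bug: the function name `active_ib_ports` and the `sysfs_root` parameter are hypothetical, and the code assumes the state file holds a single line such as "4: ACTIVE" or "1: DOWN", as in the output shown above.

```python
import os


def active_ib_ports(sysfs_root="/sys/class/infiniband"):
    """Return sorted (ca, port) pairs whose sysfs state file reports ACTIVE.

    Hypothetical helper for illustration; sysfs_root is a parameter only so
    that the walk can be tried against a fake directory tree.
    """
    active = []
    if not os.path.isdir(sysfs_root):
        return active  # no RDMA devices, or the sysfs class is absent
    for ca in sorted(os.listdir(sysfs_root)):
        ports_dir = os.path.join(sysfs_root, ca, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir)):
            state_file = os.path.join(ports_dir, port, "state")
            try:
                with open(state_file) as f:
                    state = f.readline().strip()
            except (IOError, OSError):
                continue  # unreadable or missing state file: skip the port
            # Assumed format: "<number>: <NAME>", e.g. "4: ACTIVE".
            name = state.partition(":")[2].strip()
            if name == "ACTIVE":
                active.append((ca, port))
    return active


if __name__ == "__main__":
    for ca, port in active_ib_ports():
        print("%s port %s is active" % (ca, port))
```

On the rdma-master host shown above, such a walk would be expected to skip hfi1_1 and i40iw0, whose only port is DOWN, and to report both ports of mlx4_0.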
B) shouldn't be a problem, but that info can be collected more easily: we don't need to identify and cat the files, we can simply grab the whole /sys/class/infiniband directory ;-) or even just the /sys/class/infiniband/*/ports directories.
Is collecting every file matching the /sys/class/infiniband/*/ports mask sufficient? You can try it by updating /usr/lib/python2.7/site-packages/sos/plugins/infiniband.py to:
self.add_copy_spec([
    "/etc/ofed/openib.conf",
    "/etc/ofed/opensm.conf",
    "/etc/rdma",  # don't forget to add a comma here
    "/sys/class/infiniband/*/ports"  # add this line
])
Is this the only change required, please?
No, it will not work. Option B is not about collecting the /sys/class/infiniband directory. After we get the list of active ports from that directory, we will run those four commands on each active port. For example:
iblinkinfo -C mlx4_0 -P 1
iblinkinfo -C mlx4_0 -P 2
I will create a Python patch for you to review. But I'm not good at Python programming, so the patch will be very ugly.
Created attachment 1416687
Loop all active IB ports
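For illustration, the per-port loop described in this comment might look roughly like the sketch below. It is not the content of attachment 1416687 or of the eventual pull request: the names `PORT_COMMANDS` and `per_port_commands` are hypothetical, and the list of (CA, port) pairs is taken as an input rather than discovered here.

```python
# The four tools that, per this report, only query the first port by default.
PORT_COMMANDS = ("ibhosts", "iblinkinfo", "sminfo", "perfquery")


def per_port_commands(active_ports, commands=PORT_COMMANDS):
    """Return one command line per (CA, port, command) combination.

    active_ports is an iterable of (ca_name, port_number) pairs; how that
    list is obtained is outside the scope of this sketch.
    """
    cmds = []
    for ca, port in active_ports:
        # -C selects the CA and -P the port, as in the examples above.
        opts = "-C %s -P %s" % (ca, port)
        for cmd in commands:
            cmds.append("%s %s" % (cmd, opts))
    return cmds


if __name__ == "__main__":
    for line in per_port_commands([("mlx4_0", 1), ("mlx4_0", 2)]):
        print(line)
```

Inside the sos plugin, the resulting strings would presumably be handed to the plugin's command-collection call instead of being printed. Keeping the command construction in a pure function makes it easy to check without InfiniBand hardware.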
Thanks, with some minor improvements, I created PR https://github.com/sosreport/sos/pull/1262 for the same.
devel_ack+ for 7.6
Hello,
same question here - could you please do OtherQE here as well? :)
(In reply to Pavel Moravec from comment #14)
> Hello,
> same question here - could you please do OtherQE here as well? :)
infiniband-qe@ will take it
[root@rdma-virt-01 infiniband]$ ls
ibhosts_-C_mlx4_0_-P_1 iblinkinfo_-C_mlx4_0_-P_1 ibstat ibv_devices perfquery_-C_mlx4_0_-P_1 sminfo_-C_mlx4_0_-P_1
ibhosts_-C_mlx4_0_-P_2 iblinkinfo_-C_mlx4_0_-P_2 ibstatus ibv_devinfo_-v perfquery_-C_mlx4_0_-P_2 sminfo_-C_mlx4_0_-P_2
[root@rdma-virt-01 infiniband]$ cat sminfo_-C_mlx4_0_-P_1
sminfo: sm lid 2 sm guid 0xf4521403007be131, activity count 12743948 priority 15 state 3 SMINFO_MASTER
[root@rdma-virt-01 infiniband]$ cat sminfo_-C_mlx4_0_-P_2
sminfo: sm lid 34 sm guid 0x1175000078b68e, activity count 1484999 priority 15 state 3 SMINFO_MASTER
[root@rdma-virt-01 infiniband]$ cat ibstat
CA 'mlx4_0'
    CA type: MT4103
    Number of ports: 2
    Firmware version: 2.40.7000
    Hardware version: 0
    Node GUID: 0xe41d2d03001d6790
    System image GUID: 0xe41d2d03001d6793
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 102
        LMC: 1
        SM lid: 2
        Capability mask: 0x02514868
        Port GUID: 0xe41d2d03001d6791
        Link layer: InfiniBand
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 74
        LMC: 1
        SM lid: 34
        Capability mask: 0x02514868
        Port GUID: 0xe41d2d03001d6792
        Link layer: InfiniBand
CA 'mlx4_1'
    CA type: MT4103
    Number of ports: 1
    Firmware version: 2.40.7000
    Hardware version: 0
    Node GUID: 0x0002c90300ee5860
    System image GUID: 0x0002c90300ee5863
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x04010000
        Port GUID: 0x0202c9fffeee5860
        Link layer: Ethernet
[root@rdma-virt-01 infiniband]$ rpm -q sos
sos-3.6-3.el7.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2018:3144