Bug 1341971
| Summary: | oib_utils ERROR: [1180] open_verbs_ctx: failed to find verbs device | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Honggang LI <honli> |
| Component: | opa-fm | Assignee: | Honggang LI <honli> |
| Status: | CLOSED ERRATA | QA Contact: | zguo <zguo> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 7.3 | CC: | ddutile, dledford, honli, jshortt, mschmidt, zguo |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | opa-fm-10.0.1.0-3.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-11-04 03:25:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1274397 | ||
https://github.com/01org/opa-fm/issues/4

Adding "Requires: libhfi1" to the spec file will fix this issue. It is a minor issue.

[root@rdma-qe-14 ~]$ yum install -y opa-fm
[root@rdma-qe-14 ~]$ rpm -q opa-fm
opa-fm-10.0.1.0-3.el7.x86_64
[root@rdma-qe-14 ~]$ rpm -q libhfi1
libhfi1-0.5-23.el7.x86_64
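The spec-file change suggested above might look like this (a sketch; the placement within the opa-fm.spec preamble is an assumption):

```spec
# opa-fm.spec (fragment; exact position among the other tags is hypothetical)
# Pull in the HFI1 user-space verbs provider so the SM can open the device:
Requires: libhfi1
```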
[root@rdma-qe-14 ~]$ systemctl status opafm.service -l
● opafm.service - OPA Fabric Manager
Loaded: loaded (/usr/lib/systemd/system/opafm.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2016-06-21 03:38:34 EDT; 27s ago
Process: 37246 ExecStart=/usr/lib/opa-fm/bin/opafmd -D (code=exited, status=0/SUCCESS)
Main PID: 37247 (opafmd)
CGroup: /system.slice/opafm.service
└─37247 /usr/lib/opa-fm/bin/opafmd -D
Jun 21 03:38:38 rdma-qe-14 fm0_sm[37261]: [pm]: Memory: Pool=658603K
Jun 21 03:38:38 rdma-qe-14 fm0_sm[37261]: [pm]: Using: HFI 1 Port 1 PortGuid 0x00117501016710f0
Jun 21 03:38:38 rdma-qe-14 fm0_sm[37261]: ERROR[RT-Conf (PM)]: PM: pm_conf_server_run: Command server exited with error "No such file or directory" (2)
Jun 21 03:38:38 rdma-qe-14 fm0_sm[37261]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Trap event occurred that requires re-sweep.
Jun 21 03:38:38 rdma-qe-14 fm0_sm[37261]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 1 SWs, 2 HFIs, 2 end ports, 5 total ports, 1 SM(s), 27 packets, 0 retries, 0.033 sec sweep
Jun 21 03:38:38 rdma-qe-14 fm0_sm[37261]: ERROR[RT-Conf server]: SM: pm_conf_server_run: Command server exited with error "No such file or directory" (2)
Jun 21 03:38:39 rdma-qe-14 fm0_sm[37261]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Our SM's port wasn't ready, or went down unexpectedly.
Jun 21 03:38:39 rdma-qe-14 fm0_sm[37261]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 1 SWs, 2 HFIs, 2 end ports, 5 total ports, 1 SM(s), 26 packets, 0 retries, 0.001 sec sweep
Jun 21 03:38:39 rdma-qe-14 fm0_sm[37261]: PROGR[pm]: IF3: rmpp_protocol_init: Allocating context pool with num entries=18432
Jun 21 03:38:39 rdma-qe-14 opafmd[37246]: Instance 0 of SM not running.
[root@rdma-qe-14 ~]$ ibstat
CA 'hfi1_0'
CA type:
Number of ports: 1
Firmware version:
Hardware version: 10
Node GUID: 0x00117501016710f0
System image GUID: 0x00117501016710f0
Port 1:
State: Active --> It's active now
Physical state: LinkUp
Rate: 100
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x00490020
Port GUID: 0x00117501016710f0
Link layer: InfiniBand
But there are still two problems:
1) ERROR[RT-Conf (PM)]: PM: pm_conf_server_run: Command server exited with error "No such file or directory" (2)
2) [root@rdma-qe-14 ~]$ ping 172.31.20.15
PING 172.31.20.15 (172.31.20.15) 56(84) bytes of data.
From 172.31.20.14 icmp_seq=1 Destination Host Unreachable
From 172.31.20.14 icmp_seq=2 Destination Host Unreachable
From 172.31.20.14 icmp_seq=3 Destination Host Unreachable
From 172.31.20.14 icmp_seq=4 Destination Host Unreachable
^C
--- 172.31.20.15 ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4001ms
pipe 4
[root@rdma-qe-15 ~]$ ibstat
CA 'hfi1_0'
CA type:
Number of ports: 1
Firmware version:
Hardware version: 10
Node GUID: 0x0011750101670fb0
System image GUID: 0x0011750101670fb0
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 3
LMC: 0
SM lid: 3
Capability mask: 0x00490020
Port GUID: 0x0011750101670fb0
Link layer: InfiniBand
Hi Honggang,

Could you please take a look? Thanks.

Sorry, this is a known bug. I told you about it when I sent the xhpl/linpack test wiki. I will fix and rebuild opa-fm today.
[root@rdma-qe-14 ~]$ mkdir -p /var/usr/lib/opa-fm/
[root@rdma-qe-14 ~]$ systemctl restart opafm.service
[root@rdma-qe-14 ~]$ systemctl status opafm.service
● opafm.service - OPA Fabric Manager
Loaded: loaded (/usr/lib/systemd/system/opafm.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2016-06-21 04:49:34 EDT; 8s ago
Process: 3243 ExecStopPost=/usr/bin/sleep 5 (code=killed, signal=TERM)
Process: 3233 ExecStop=/usr/lib/opa-fm/bin/opafmd halt (code=exited, status=0/SUCCESS)
Process: 3246 ExecStart=/usr/lib/opa-fm/bin/opafmd -D (code=exited, status=0/SUCCESS)
Main PID: 3247 (opafmd)
CGroup: /system.slice/opafm.service
├─3247 /usr/lib/opa-fm/bin/opafmd -D
└─3259 /usr/lib/opa-fm/runtime/sm -e sm_0
Jun 21 04:49:37 rdma-qe-14 fm0_sm[3259]: PROGR[topology]: SM: topology_main: SM redundancy not available
Jun 21 04:49:38 rdma-qe-14 fm0_sm[3259]: [pm]: Performance Manager starting up. LogLevel: 2 LogMode: 0
Jun 21 04:49:38 rdma-qe-14 fm0_sm[3259]: [pm]: Size Limits: EndNodePorts=9216
Jun 21 04:49:38 rdma-qe-14 fm0_sm[3259]: [pm]: Memory: Pool=658603K
Jun 21 04:49:38 rdma-qe-14 fm0_sm[3259]: [pm]: Using: HFI 1 Port 1 PortGuid 0x00117501016710f0
Jun 21 04:49:39 rdma-qe-14 fm0_sm[3259]: PROGR[pm]: IF3: rmpp_protocol_init: Allocating context pool with num entries=18432
Jun 21 04:49:39 rdma-qe-14 fm0_sm[3259]: [PmEngine]: PM: Engine starting up, will be monitoring and clearing port counters
Jun 21 04:49:39 rdma-qe-14 fm0_sm[3259]: [PmDbsyncThread]: PM: Dbsync thread starting up, will be syncing RAM/Disk PA history data
Jun 21 04:49:40 rdma-qe-14 fm0_sm[3259]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Multicast gr...change.
Jun 21 04:49:40 rdma-qe-14 fm0_sm[3259]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 1 SWs, 2 HFIs, 2 end ports,...c sweep
Hint: Some lines were ellipsized, use -l to show in full.
[root@rdma-qe-14 ~]$ ping -c 3 172.31.20.15
PING 172.31.20.15 (172.31.20.15) 56(84) bytes of data.
64 bytes from 172.31.20.15: icmp_seq=1 ttl=64 time=18.3 ms
64 bytes from 172.31.20.15: icmp_seq=2 ttl=64 time=0.029 ms
64 bytes from 172.31.20.15: icmp_seq=3 ttl=64 time=0.022 ms
--- 172.31.20.15 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.022/6.136/18.358/8.642 ms
[root@rdma-qe-14 ~]$
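If the missing-directory workaround needs to survive reboots before the fixed package lands, a systemd-tmpfiles drop-in could create the directory at boot (a sketch; the file name is hypothetical, the path is taken from the mkdir workaround above):

```
# /etc/tmpfiles.d/opa-fm.conf (hypothetical drop-in)
# d <path> <mode> <user> <group> <age>
d /var/usr/lib/opa-fm 0755 root root -
```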
Move this bug to VERIFIED. Track the new issue in https://bugzilla.redhat.com/show_bug.cgi?id=1348477

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2309.html
Description of problem:

[root@rdma-qe-14 ~]$ /sbin/service opafm status -l
Redirecting to /bin/systemctl status -l opafm.service
● opafm.service - OPA Fabric Manager
Loaded: loaded (/usr/lib/systemd/system/opafm.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2016-06-02 03:05:46 EDT; 4min 42s ago
Process: 1111 ExecStart=/usr/lib/opa-fm/bin/opafmd -D (code=exited, status=0/SUCCESS)
Main PID: 1115 (opafmd)
CGroup: /system.slice/opafm.service
└─1115 /usr/lib/opa-fm/bin/opafmd -D
Jun 02 03:05:45 rdma-qe-14 fm0_sm[1180]: PROGR[main]: SM: [VF:Admin] : Sharing 100% BW among remaining VFs
Jun 02 03:05:45 rdma-qe-14 fm0_sm[1180]: PROGR[main]: SM: [VF:Default] : Base SL:0 Base SC:0 NumScs:1 QOS:0 HP:0
Jun 02 03:05:45 rdma-qe-14 fm0_sm[1180]: PROGR[main]: SM: [VF:Admin] : Base SL:0 Base SC:0 NumScs:1 QOS:0 HP:0
Jun 02 03:05:46 rdma-qe-14 systemd[1]: Started OPA Fabric Manager.
Jun 02 03:05:46 rdma-qe-14 fm0_sm[1180]: oib_utils ERROR: [1180] open_verbs_ctx: failed to find verbs device
Jun 02 03:05:46 rdma-qe-14 fm0_sm[1180]: ERROR[main]: APP: ib_init_devport: Failed to bind to device 1, port 1; status: 5
Jun 02 03:05:46 rdma-qe-14 fm0_sm[1180]: ; MSG:NOTICE|SM:Default SM:port 1|COND:#7 SM shutdown|DETAIL:sm_main: Failed to bind to device; terminating
Jun 02 03:05:46 rdma-qe-14 fm0_sm[1180]: FATAL[main]: SM: sm_main: sm_main: Failed to bind to device; terminating
Jun 02 03:05:46 rdma-qe-14 FATAL:[1180]: sm_main: Failed to bind to device; terminating
Jun 02 03:05:46 rdma-qe-14 opafmd[1111]: Instance 0 of SM not running.
[root@rdma-qe-14 ~]$ ibstat
CA 'hfi1_0'
CA type:
Number of ports: 1
Firmware version:
Hardware version: 10
Node GUID: 0x00117501016710f0
System image GUID: 0x00117501016710f0
Port 1:
State: Initializing <----- should be Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00490020
Port GUID: 0x00117501016710f0
Link layer: InfiniBand

Version-Release number of selected component (if applicable):

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:
The OPA HFI port is in Initializing status.

Expected results:
The port is Active.

Additional info:
Installing libhfi1, the HFI1 user-space driver, fixes this issue.
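As a quick sanity check for this failure mode, one can verify that a userspace verbs device node is visible at all; without the libhfi1 provider library, the SM cannot open the hfi1 device even when the kernel driver is loaded (a sketch; the device path follows the standard libibverbs layout, not anything specific to this host):

```shell
#!/bin/sh
# Check whether any uverbs character device exists. libibverbs (and thus
# opafm's open_verbs_ctx) needs both the /dev node from the kernel driver
# and a matching user-space provider library such as libhfi1.
if ls /dev/infiniband/uverbs* >/dev/null 2>&1; then
    echo "uverbs device node present"
else
    echo "no uverbs device node found"
fi
```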