Bug 1937699
| Summary: | rdma-ndd doesn't reliably initialize the node description of multiple Infiniband devices | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Georg Sauthoff <georg.sauthoff> |
| Component: | rdma-core | Assignee: | Honggang LI <honli> |
| Status: | CLOSED ERRATA | QA Contact: | zguo <zguo> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 7.6 | CC: | cww, dbodnarc, fkrska, infiniband-qe, jreznik, linville, mschmidt, rdma-dev-team, sbroz, zguo |
| Target Milestone: | rc | Keywords: | Triaged, ZStream |
| Target Release: | --- | Flags: | zguo: needinfo-, pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | rdma-core-22.4-6.el7_9 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-04-27 11:36:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Georg Sauthoff
2021-03-11 11:18:15 UTC
Please provide the rdma-ndd log from a machine that always reproduces this issue. You can collect the rdma-ndd log like this:

[root@rdma02 ~]# diff -Nurp /usr/lib/systemd/system/rdma-ndd.service.bak /usr/lib/systemd/system/rdma-ndd.service
--- /usr/lib/systemd/system/rdma-ndd.service.bak 2021-03-11 08:13:38.310702848 -0500
+++ /usr/lib/systemd/system/rdma-ndd.service 2021-03-11 08:38:07.018718538 -0500
@@ -19,6 +19,6 @@ Before=rdma-hw.target
 [Service]
 Type=notify
 Restart=always
-ExecStart=/usr/sbin/rdma-ndd --systemd
+ExecStart=/usr/sbin/rdma-ndd --debug --systemd

 # rdma-ndd is automatically wanted by udev when an RDMA device with a node description is present
[root@rdma02 ~]# systemctl daemon-reload
[root@rdma02 ~]# systemctl enable rdma-ndd.service
[root@rdma02 ~]# reboot

After the system is up again, collect the log with:

[root@rdma02 ~]# journalctl -u rdma-ndd

Ok, I enabled rdma-ndd debug and rebooted the machine. This is the output now:

[root@myshorthostname ~]# cat /sys/class/infiniband/mlx5_*/node_desc
myshorthostname mlx5_0
MT4119 ConnectX5 Mellanox Technologies
[root@myshorthostname ~]# journalctl -u rdma-ndd
-- Logs begin at Thu 2021-03-11 15:06:57 CET, end at Thu 2021-03-11 15:08:30 CET. --
Mar 11 15:07:09 myshorthostname.example.de systemd[1]: Starting RDMA Node Description Daemon...
Mar 11 15:07:09 myshorthostname.example.de rdma-ndd[17411]: Node Descriptor format (%h %d)
Mar 11 15:07:09 myshorthostname.example.de rdma-ndd[17411]: mlx5_0: change (MT4119 ConnectX5 Mellanox Technologies) -> (myshorthostname mlx5_0)
Mar 11 15:07:10 myshorthostname.example.de systemd[1]: Started RDMA Node Description Daemon.

(In reply to Georg Sauthoff from comment #3)
> Ok, I enabled rdma-ndd debug and rebooted the machine. This is the output
> now:
>
> [root@myshorthostname ~]# cat /sys/class/infiniband/mlx5_*/node_desc
> myshorthostname mlx5_0
> MT4119 ConnectX5 Mellanox Technologies
> [root@myshorthostname ~]# journalctl -u rdma-ndd
> -- Logs begin at Thu 2021-03-11 15:06:57 CET, end at Thu 2021-03-11 15:08:30
> CET. --
> Mar 11 15:07:09 myshorthostname.example.de systemd[1]: Starting RDMA Node
> Description Daemon...
> Mar 11 15:07:09 myshorthostname.example.de rdma-ndd[17411]: Node Descriptor
> format (%h %d)
> Mar 11 15:07:09 myshorthostname.example.de rdma-ndd[17411]: mlx5_0: change
> (MT4119 ConnectX5 Mellanox Technologies) -> (myshorthostname mlx5_0)
> Mar 11 15:07:10 myshorthostname.example.de systemd[1]: Started RDMA Node
> Description Daemon.

Please provide an sos report of host myshorthostname.example.de. Thanks.

http://people.redhat.com/honli/.1937699/

Please test this scratch RPM. If the issue persists, we need the sos report too. You need to force-erase the old rdma-core package and install the scratch build like this:

$ rpm -e --nodeps rdma-core
$ rpm -ivh rdma-core-17.2-4.bz1937699.el7_6.x86_64.rpm
$ cp /etc/rdma/rdma.conf.rpmsave /etc/rdma/rdma.conf
$ cp /etc/udev/rules.d/70-persistent-ipoib.rules.rpmsave /etc/udev/rules.d/70-persistent-ipoib.rules

Update /usr/lib/systemd/system/rdma-ndd.service to enable the debug log.

$ systemctl daemon-reload
$ reboot

I've tested your scratch RPM and with it the node description is now reliably set during boot.
That means after the reboot:
cat /sys/class/infiniband/mlx5_*/node_desc
myshorthostname mlx5_0
myshorthostname mlx5_1
journalctl -u rdma-ndd
-- Logs begin at Tue 2021-03-16 15:47:32 CET, end at Tue 2021-03-16 15:56:56 CET. --
Mar 16 15:47:43 myshorthostname systemd[1]: Starting RDMA Node Description Daemon...
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: Node Descriptor format (%h %d)
Mar 16 15:47:43 myshorthostname systemd[1]: Started RDMA Node Description Daemon.
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: mlx5_0: change (MT4119 ConnectX5 Mellanox Technologies) -> (myshorthostname mlx5_0)
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: mlx5_1: change (MT4119 ConnectX5 Mellanox Technologies) -> (myshorthostname mlx5_1)
systemctl status rdma-ndd
● rdma-ndd.service - RDMA Node Description Daemon
Loaded: loaded (/usr/lib/systemd/system/rdma-ndd.service; static; vendor preset: disabled)
Active: active (running) since Tue 2021-03-16 15:47:43 CET; 8min ago
Docs: man:rdma-ndd
Main PID: 16997 (rdma-ndd)
CGroup: /system.slice/rdma-ndd.service
└─16997 /usr/sbin/rdma-ndd --systemd --debug
Mar 16 15:47:43 myshorthostname systemd[1]: Starting RDMA Node Description Daemon...
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: Node Descriptor format (%h %d)
Mar 16 15:47:43 myshorthostname systemd[1]: Started RDMA Node Description Daemon.
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: mlx5_0: change (MT4119 ConnectX5 Mellanox Technologies) -> (myshorthostname mlx5_0)
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: mlx5_1: change (MT4119 ConnectX5 Mellanox Technologies) -> (myshorthostname mlx5_1)
So this fixes the issue.
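The manual check above (every device's node_desc should read "<short hostname> <device>" rather than the adapter's default string) can be scripted. This is a minimal sketch, assuming the `%h %d` descriptor format shown in the log; the function name `check_node_desc` and its two optional arguments are hypothetical and exist only so the check can be pointed at a test directory instead of `/sys/class/infiniband`.

```shell
#!/bin/bash
# Report InfiniBand devices whose node description was not rewritten.
# Assumes the "%h %d" format from the log above: "<short hostname> <device>".

check_node_desc() {
    # $1: directory to scan (hypothetical parameter, defaults to the sysfs path)
    # $2: expected short hostname (defaults to the output of `hostname -s`)
    local base="${1:-/sys/class/infiniband}"
    local host="${2:-$(hostname -s)}"
    local rc=0 dev name desc

    for dev in "$base"/*; do
        # Skip entries without a readable node_desc (also covers an empty directory).
        [ -r "$dev/node_desc" ] || continue
        name="${dev##*/}"
        desc="$(cat "$dev/node_desc")"
        if [ "$desc" = "$host $name" ]; then
            echo "$name: ok ($desc)"
        else
            echo "$name: unexpected ($desc)"
            rc=1
        fi
    done
    # Non-zero exit status if any device still carries another description.
    return "$rc"
}
```

Called without arguments after boot, `check_node_desc || echo "node description not set on all devices"` would flag the failure mode reported here, where mlx5_1 kept its default description. The exact string comparison is deliberately strict and would need adjusting for a different descriptor format.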
(In reply to Georg Sauthoff from comment #6)
> I've tested your scratch RPM and with it the node description is now
> reliably set during boot.

Thanks for testing. I will submit a patch upstream and backport it to RHEL once it is merged. Do you mind if I add 'Reported-by' and 'Tested-by' tags to the patch?

Reported-by: Georg Sauthoff <georg.sauthoff>
Tested-by: Georg Sauthoff <georg.sauthoff>

No, I don't mind.

I opened this PR to address this issue in the upstream repo.
https://github.com/linux-rdma/rdma-core/pull/962

(In reply to Honggang LI from comment #9)
> I opened this PR to address this issue in upstream repo.
>
> https://github.com/linux-rdma/rdma-core/pull/962

The PR has been merged into the upstream repo. Set devel+ flag.

Hi Georg,
This bug was opened against RHEL-7.6. Are you looking for a RHEL-7.6.z fix for it? If yes, please provide a business justification for the RHEL-7.6.z request. BTW, the bug MUST be fixed in RHEL-7.9.z first before we backport the fix to RHEL-7.7.z and RHEL-7.6.z. Thanks.

We don't need the fix for RHEL 7.6 since we will upgrade to 7.9 in the not too distant future and can work around this issue until then. So please let me know when the fix is expected to arrive in RHEL 7.9.

Georg Sauthoff, have you pursued this request through any of the more formal support channels for RHEL? Can you point us at those support requests, for reference?

No, I haven't so far. I wasn't aware that I need to contact some other Red Hat support channel as well. Is there something I need to trigger?

Hi Georg,
Bugzilla is not a support tool. It's a development tool and has no SLAs. Instead, you need to open a support case by logging into the Red Hat Customer Portal (https://access.redhat.com/). If you don't have a login, you'll need to make arrangements with the organization that oversees your company's access to the RH Customer Portal.

I see. I've opened a case that references this bug: Case #02913732

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (rdma-core bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:1396
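To confirm that a host already carries the erratum, the installed rdma-core version can be compared with the "Fixed In Version" field (rdma-core-22.4-6.el7_9). The following is only a sketch: the helper name `version_at_least` is hypothetical, and GNU `sort -V` merely approximates RPM's own version ordering, so treat the result as a hint rather than an authoritative answer.

```shell
#!/bin/bash
# Check whether an rdma-core version-release string is at least the fixed one.
# GNU `sort -V` is used as an approximation of RPM version comparison.

FIXED="22.4-6.el7_9"   # taken from the "Fixed In Version" field of this bug

version_at_least() {
    # $1: installed version-release, $2: minimum required version-release
    # True when $2 sorts first (or equal), i.e. $1 is not older than $2.
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use on a live system (assumes the rpm CLI is available):
#   installed="$(rpm -q --qf '%{VERSION}-%{RELEASE}' rdma-core)"
#   version_at_least "$installed" "$FIXED" && echo "fix present" || echo "fix missing"
```

Scratch builds such as the one tested earlier in this bug do not follow the normal version sequence, so this comparison says nothing meaningful about them.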