Bug 1937699

Summary: rdma-ndd doesn't reliably initialize the node description of multiple Infiniband devices
Product: Red Hat Enterprise Linux 7
Reporter: Georg Sauthoff <georg.sauthoff>
Component: rdma-core
Assignee: Honggang LI <honli>
Status: CLOSED ERRATA
QA Contact: zguo <zguo>
Severity: high
Docs Contact:
Priority: urgent
Version: 7.6
CC: cww, dbodnarc, fkrska, infiniband-qe, jreznik, linville, mschmidt, rdma-dev-team, sbroz, zguo
Target Milestone: rc
Keywords: Triaged, ZStream
Target Release: ---
Flags: zguo: needinfo-, pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rdma-core-22.4-6.el7_9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-04-27 11:36:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Georg Sauthoff 2021-03-11 11:18:15 UTC
Description of problem:
On hosts with multiple Infiniband devices, only the Infiniband node description of the first device is reliably initialized. The node descriptions of the other devices often aren't initialized.

Version-Release number of selected component (if applicable):
rdma-core-17.2-3.el7.x86_64

How reproducible:
On some hosts, 100% of the time.

Steps to Reproduce:
1. Install RHEL 7.6 on a host with multiple Infiniband devices (e.g. a host with a Mellanox ConnectX5 HCA, which presents itself as 2 devices)
2. hostname -s
3. ls /sys/class/infiniband 
4. cat /sys/class/infiniband/*/node_desc

Actual results:
$ hostname -s
myshorthostname
$ ls /sys/class/infiniband
mlx5_0  mlx5_1
$ cat /sys/class/infiniband/*/node_desc
myshorthostname mlx5_0
MT4119 ConnectX5

Expected results:
$ hostname -s
myshorthostname
$ ls /sys/class/infiniband
mlx5_0  mlx5_1
$ cat /sys/class/infiniband/*/node_desc
myshorthostname mlx5_0
myshorthostname mlx5_1

Additional info:
We see this on several hosts with ConnectX5 cards in production.

This seems to be caused by a race condition between the start of rdma-ndd (triggered by the udev rule /usr/lib/udev/rules.d/60-rdma-ndd.rules when the first device shows up) and rdma-ndd actually being initialized and ready to listen for new udev events. Thus there is a time window in which the 2nd device shows up while rdma-ndd.service is already starting but has not yet established its udev monitoring. (Or its udev monitoring is simply broken.)

One workaround is to restart rdma-ndd.service after rdma-hw.target is reached (see the sketch below). Another is to write the node descriptions to /sys/.../node_desc by other means after rdma-hw.target is reached.
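
As a concrete illustration of the first workaround, a small oneshot unit along the following lines would restart rdma-ndd once rdma-hw.target is reached. This is only a sketch; the unit name and file path are made up for illustration and are not shipped by rdma-core:

# /etc/systemd/system/restart-rdma-ndd.service -- hypothetical workaround unit
[Unit]
Description=Restart rdma-ndd after all RDMA hardware is up
After=rdma-hw.target rdma-ndd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl restart rdma-ndd.service

[Install]
WantedBy=multi-user.target

After a 'systemctl daemon-reload' it would be enabled with 'systemctl enable restart-rdma-ndd.service'.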

A proper fix I can think of is to let rdma-ndd re-iterate over all devices **after** it has established its udev monitoring.
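
To illustrate the ordering such a fix needs, here is a minimal libudev sketch, written under the assumption that rdma-ndd uses libudev for its monitoring; it is not the actual upstream patch. The monitor is enabled first and the already-present devices are enumerated afterwards, so a device that appeared before the monitor was ready is still picked up:

/* Minimal sketch (NOT the actual upstream patch): establish udev monitoring
 * first, then enumerate devices that already exist, so that a device which
 * appeared before the monitor was ready still gets its node_desc written. */
#include <stdio.h>
#include <unistd.h>
#include <libudev.h>

static void set_node_desc(const char *ibdev, const char *host)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/class/infiniband/%s/node_desc", ibdev);
    FILE *f = fopen(path, "w");
    if (f) {
        fprintf(f, "%s %s", host, ibdev);   /* "%h %d" format */
        fclose(f);
    }
}

int main(void)
{
    char host[64] = "localhost";
    gethostname(host, sizeof(host) - 1);

    struct udev *udev = udev_new();

    /* 1. Create and enable the monitor *before* scanning existing devices,
     *    so no uevent can be lost in between. */
    struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
    udev_monitor_filter_add_match_subsystem_devtype(mon, "infiniband", NULL);
    udev_monitor_enable_receiving(mon);

    /* 2. Iterate over devices that were already present. */
    struct udev_enumerate *enu = udev_enumerate_new(udev);
    udev_enumerate_add_match_subsystem(enu, "infiniband");
    udev_enumerate_scan_devices(enu);

    struct udev_list_entry *entry;
    udev_list_entry_foreach(entry, udev_enumerate_get_list_entry(enu)) {
        struct udev_device *dev =
            udev_device_new_from_syspath(udev, udev_list_entry_get_name(entry));
        if (dev) {
            set_node_desc(udev_device_get_sysname(dev), host);
            udev_device_unref(dev);
        }
    }
    udev_enumerate_unref(enu);

    /* 3. From here on, wait on udev_monitor_get_fd() and handle devices
     *    reported by udev_monitor_receive_device() (event loop omitted). */
    udev_monitor_unref(mon);
    udev_unref(udev);
    return 0;
}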

Comment 2 Honggang LI 2021-03-11 13:58:38 UTC
Please provide the rdma-ndd log file from a machine that always reproduces this issue.

You can collect the rdma-ndd log like this:

[root@rdma02 ~]# diff -Nurp /usr/lib/systemd/system/rdma-ndd.service.bak /usr/lib/systemd/system/rdma-ndd.service
--- /usr/lib/systemd/system/rdma-ndd.service.bak	2021-03-11 08:13:38.310702848 -0500
+++ /usr/lib/systemd/system/rdma-ndd.service	2021-03-11 08:38:07.018718538 -0500
@@ -19,6 +19,6 @@ Before=rdma-hw.target
 [Service]
 Type=notify
 Restart=always
-ExecStart=/usr/sbin/rdma-ndd --systemd
+ExecStart=/usr/sbin/rdma-ndd --debug --systemd
 
 # rdma-ndd is automatically wanted by udev when an RDMA device with a node description is present



[root@rdma02 ~]# systemctl daemon-reload
[root@rdma02 ~]# systemctl enable rdma-ndd.service
[root@rdma02 ~]# reboot

After system up again, collect log with

[root@rdma02 ~]# journalctl -u rdma-ndd

Comment 3 Georg Sauthoff 2021-03-11 14:12:33 UTC
Ok, I enabled rdma-ndd debug and rebooted the machine. This is the output now:

[root@myshorthostname ~]# cat /sys/class/infiniband/mlx5_*/node_desc
myshorthostname mlx5_0
MT4119 ConnectX5   Mellanox Technologies
[root@myshorthostname ~]# journalctl -u rdma-ndd
-- Logs begin at Thu 2021-03-11 15:06:57 CET, end at Thu 2021-03-11 15:08:30 CET. --
Mar 11 15:07:09 myshorthostname.example.de systemd[1]: Starting RDMA Node Description Daemon...
Mar 11 15:07:09 myshorthostname.example.de rdma-ndd[17411]: Node Descriptor format (%h %d)
Mar 11 15:07:09 myshorthostname.example.de rdma-ndd[17411]: mlx5_0: change (MT4119 ConnectX5   Mellanox Technologies) -> (myshorthostname mlx5_0)
Mar 11 15:07:10 myshorthostname.example.de systemd[1]: Started RDMA Node Description Daemon.

Comment 4 Honggang LI 2021-03-12 08:32:06 UTC
(In reply to Georg Sauthoff from comment #3)
> Ok, I enabled rdma-ndd debug and rebooted the machine. This is the output
> now:
> 
> [root@myshorthostname ~]# cat /sys/class/infiniband/mlx5_*/node_desc
> myshorthostname mlx5_0
> MT4119 ConnectX5   Mellanox Technologies
> [root@myshorthostname ~]# journalctl -u rdma-ndd
> -- Logs begin at Thu 2021-03-11 15:06:57 CET, end at Thu 2021-03-11 15:08:30
> CET. --
> Mar 11 15:07:09 myshorthostname.example.de systemd[1]: Starting RDMA Node
> Description Daemon...
> Mar 11 15:07:09 myshorthostname.example.de rdma-ndd[17411]: Node Descriptor
> format (%h %d)
> Mar 11 15:07:09 myshorthostname.example.de rdma-ndd[17411]: mlx5_0: change
> (MT4119 ConnectX5   Mellanox Technologies) -> (myshorthostname mlx5_0)
> Mar 11 15:07:10 myshorthostname.example.de systemd[1]: Started RDMA Node
> Description Daemon.

Please provide an sos report of host myshorthostname.example.de. Thanks.

Comment 5 Honggang LI 2021-03-12 08:35:43 UTC
http://people.redhat.com/honli/.1937699/

Please test this scratch rpm. If the issue persists, we will need an sos report too.

You need to force-erase the old rdma-core package and install the scratch build like this:

    $  rpm -e --nodeps rdma-core
    $  rpm -ivh rdma-core-17.2-4.bz1937699.el7_6.x86_64.rpm
    $  cp /etc/rdma/rdma.conf.rpmsave /etc/rdma/rdma.conf
    $  cp /etc/udev/rules.d/70-persistent-ipoib.rules.rpmsave /etc/udev/rules.d/70-persistent-ipoib.rules
 
    Update /usr/lib/systemd/system/rdma-ndd.service to enable the debug log (as shown in comment 2).
    
    $ systemctl daemon-reload 
    $ reboot

Comment 6 Georg Sauthoff 2021-03-16 15:00:05 UTC
I've tested your scratch RPM and with it the node descriptions are now reliably set during boot.

That means after the reboot:

cat /sys/class/infiniband/mlx5_*/node_desc
myshorthostname mlx5_0
myshorthostname mlx5_1

journalctl -u rdma-ndd
-- Logs begin at Tue 2021-03-16 15:47:32 CET, end at Tue 2021-03-16 15:56:56 CET. --
Mar 16 15:47:43 myshorthostname systemd[1]: Starting RDMA Node Description Daemon...
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: Node Descriptor format (%h %d)
Mar 16 15:47:43 myshorthostname systemd[1]: Started RDMA Node Description Daemon.
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: mlx5_0: change (MT4119 ConnectX5   Mellanox Technologies) -> (myshorthostname mlx5_0)
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: mlx5_1: change (MT4119 ConnectX5   Mellanox Technologies) -> (myshorthostname mlx5_1)

systemctl status rdma-ndd
● rdma-ndd.service - RDMA Node Description Daemon
   Loaded: loaded (/usr/lib/systemd/system/rdma-ndd.service; static; vendor preset: disabled)
   Active: active (running) since Tue 2021-03-16 15:47:43 CET; 8min ago
     Docs: man:rdma-ndd
 Main PID: 16997 (rdma-ndd)
   CGroup: /system.slice/rdma-ndd.service
           └─16997 /usr/sbin/rdma-ndd --systemd --debug

Mar 16 15:47:43 myshorthostname systemd[1]: Starting RDMA Node Description Daemon...
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: Node Descriptor format (%h %d)
Mar 16 15:47:43 myshorthostname systemd[1]: Started RDMA Node Description Daemon.
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: mlx5_0: change (MT4119 ConnectX5   Mellanox Technologies) -> (myshorthostname mlx5_0)
Mar 16 15:47:43 myshorthostname rdma-ndd[16997]: mlx5_1: change (MT4119 ConnectX5   Mellanox Technologies) -> (myshorthostname mlx5_1)


So this fixes the issue.

Comment 7 Honggang LI 2021-03-16 15:11:39 UTC
(In reply to Georg Sauthoff from comment #6)
> I've tested your scratch RPM and with it the node description is now
> reliably set during boot.

Thanks for testing. I will submit a patch upstream and backport it for RHEL once it is merged upstream.

Do you mind if I add 'Reported-by' and 'Tested-by' tags to the patch?

Reported-by: Georg Sauthoff <georg.sauthoff>
Tested-by: Georg Sauthoff <georg.sauthoff>

Comment 8 Georg Sauthoff 2021-03-16 15:13:12 UTC
No, I don't mind.

Comment 9 Honggang LI 2021-03-17 01:53:46 UTC
I opened this PR to address this issue in the upstream repo.

https://github.com/linux-rdma/rdma-core/pull/962

Comment 10 Honggang LI 2021-03-18 11:01:06 UTC
(In reply to Honggang LI from comment #9)
> I opened this PR to address this issue in the upstream repo.
> 
> https://github.com/linux-rdma/rdma-core/pull/962

The PR has been merged into the upstream repo. Set devel+ flag.

Comment 11 Honggang LI 2021-03-18 11:06:15 UTC
Hi, Georg

This bug was opened against RHEL-7.6. Are you looking for a RHEL-7.6.z fix for it?

If yes, please provide a business justification for the RHEL-7.6.z request.

BTW, the bug MUST be fixed in RHEL-7.9.z first before we backport the fix for RHEL-7.7.z and RHEL-7.6.z.

thanks

Comment 12 Georg Sauthoff 2021-03-18 16:11:19 UTC
We don't need the fix for RHEL 7.6, since we will upgrade to 7.9 in the not-too-distant future and can work around the issue until then.

So please let me know when the fix is expected to arrive in RHEL 7.9.

Comment 14 John W. Linville 2021-04-06 14:27:54 UTC
Georg Sauthoff, have you pursued this request through any of the more formal support channels for RHEL? Can you point us at those support requests, for reference?

Comment 15 Georg Sauthoff 2021-04-06 16:03:01 UTC
No, I haven't, so far.

I wasn't aware that I also needed to contact another Red Hat support channel.


Is there something I need to trigger?

Comment 16 Chris Williams 2021-04-07 12:50:34 UTC
Hi Georg,

Bugzilla is not a support tool. It's a development tool and has no SLAs.
Instead, you need to open a support case by logging into the Red Hat Customer Portal at https://access.redhat.com/.
If you don't have a login, you'll need to make arrangements with the organization that oversees your company's access to the RH Customer Portal.

Comment 17 Georg Sauthoff 2021-04-09 08:29:36 UTC
I see.

I've opened a case that references this bug: Case #02913732

Comment 30 errata-xmlrpc 2021-04-27 11:36:27 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (rdma-core bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1396