Bug 2002512 - [tracker] Failed to start RDMA Node description daemon
Summary: [tracker] Failed to start RDMA Node description daemon
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Micah Abbott
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1946606 2008394
Blocks: 2043061
Reported: 2021-09-09 03:54 UTC by Jatan Malde
Modified: 2022-03-02 20:00 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2043061
Environment:
Last Closed: 2022-01-20 14:58:30 UTC
Target Upstream Version:
Embargoed:
miabbott: needinfo-



Description Jatan Malde 2021-09-09 03:54:44 UTC
OCP Version at Install Time: 4.8
RHCOS Version at Install Time: 4.8
Platform: bare metal
Architecture: x86_64


What are you trying to do? What is your use case?

The customer is trying to install OCP 4.8 using the bare-metal UPI installation method, booting the machines from an iPXE server. The iPXE config is included in later comments.

The machines fail in the initramfs stage on bare-metal HP DL360 hosts with iLO 5.

The boot fails with the following error messages:

~~~
Sep 08 08:22:50 xxxx systemd[1838]: rdma-ndd.service: Failed to execute command: No such file or directory
Sep 08 08:22:50 xxxx systemd[1838]: rdma-ndd.service: Failed at step EXEC spawning /usr/sbin/rdma-ndd: No such file or directory
Sep 08 08:22:51 xxxx systemd[1]: Starting Dracut Emergency Shell...
Sep 08 08:22:51 xxxx systemd[1]: rdma-ndd.service: Main process exited, code=exited, status=203/EXEC
Sep 08 08:22:51 xxxx systemd[1]: rdma-ndd.service: Failed with result 'exit-code'.
Sep 08 08:22:51 xxxx systemd[1]: Failed to start RDMA Node Description Daemon.
Sep 08 08:22:52 xxxx systemd[1]: rdma-ndd.service: Service RestartSec=100ms expired, scheduling restart.
Sep 08 08:22:52 xxxx systemd[1]: rdma-ndd.service: Scheduled restart job, restart counter is at 18.
Sep 08 08:22:52 xxxx systemd[1]: Stopped RDMA Node Description Daemon.
Sep 08 08:22:52 xxxx systemd[1]: Starting RDMA Node Description Daemon...
Sep 08 08:22:52 xxxx systemd[1855]: rdma-ndd.service: Failed to execute command: No such file or directory
Sep 08 08:22:52 xxxx systemd[1855]: rdma-ndd.service: Failed at step EXEC spawning /usr/sbin/rdma-ndd: No such file or directory
~~~
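
As a rough triage aid, the log above can be scanned for units whose binary could not be spawned (the `status=203/EXEC` case). The sketch below is only an illustration: the function name `find_exec_failures` and the regular expression are assumptions written against the exact message wording shown in this report, not part of any systemd tooling.

```python
import re
from collections import Counter

# Matches the "Failed at step EXEC spawning <path>: <reason>" lines from the
# journal excerpt above. The wording is taken from this report; other systemd
# versions may phrase the message differently.
EXEC_FAILURE = re.compile(
    r"(?P<unit>\S+\.service): Failed at step EXEC spawning "
    r"(?P<path>\S+): (?P<reason>.+)$"
)


def find_exec_failures(journal_lines):
    """Count (unit, path, reason) triples for services whose binary failed to exec."""
    failures = Counter()
    for line in journal_lines:
        match = EXEC_FAILURE.search(line.rstrip("\n"))
        if match:
            failures[(match["unit"], match["path"], match["reason"])] += 1
    return failures


if __name__ == "__main__":
    # Two lines copied from the excerpt above; a real run would read a saved journal.
    sample = [
        "Sep 08 08:22:50 xxxx systemd[1838]: rdma-ndd.service: Failed at step EXEC "
        "spawning /usr/sbin/rdma-ndd: No such file or directory",
        "Sep 08 08:22:51 xxxx systemd[1]: Failed to start RDMA Node Description Daemon.",
    ]
    for (unit, path, reason), count in find_exec_failures(sample).items():
        print(f"{unit}: cannot exec {path} ({reason}), seen {count} time(s)")
```

A repeated hit for the same unit and path, as here, suggests the unit file is present in the initramfs while the binary it points to is not.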

If I boot the live ISO of the same version on the same machine, the RDMA service is active and running; a screenshot is attached.

I saw a similar issue in RHEL 8 a few days ago, which could be related:
https://bugzilla.redhat.com/show_bug.cgi?id=1946606


What happened? What went wrong or what did you expect?


What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.

To reproduce the issue:
- Configure an iPXE server with the versions of initrd, rootfs, and kernel listed in the comments below.
- Select a physical host such as an HP DL360.
- Boot the machine in iPXE mode.

Comment 5 Micah Abbott 2021-09-20 18:34:25 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1946606

Specifically, it has been observed on RHEL 8.4 (https://bugzilla.redhat.com/show_bug.cgi?id=1946606#c5) using `rdma-core-32.0-4.el8`, which is included in RHCOS 4.8.

It appears this is fixed in RHEL 8.5, but since RHCOS 4.8 will continue to use RHEL 8.4 EUS content, the fix should be requested to be backported to 8.4.z.

Please follow the z-stream backport request procedures here - https://source.redhat.com/departments/pnt/pnt_cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook

Note: while this problem affects OpenShift, you must use the process for requesting a RHEL backport, as the affected package is in RHEL.

---

While the backport is requested, we'll mark this BZ as a tracking BZ to follow the progress of the updated package in RHCOS.

Comment 7 Micah Abbott 2022-01-20 14:58:30 UTC
This problem was fixed as part of https://bugzilla.redhat.com/show_bug.cgi?id=2019819 in `rdma-core-32.0-5.el8_4`

That version of the package was included as part of RHCOS 410.84.202112162002-0 on Dec 16.
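
For readers unfamiliar with RHCOS build IDs, the following sketch splits one into its parts. It assumes the ID follows the pattern `<OCP stream>.<RHEL stream>.<YYYYMMDDHHMM>-<revision>` with the dots of each stream removed; that pattern and the helper name `parse_rhcos_build` are assumptions for illustration, not something stated in this bug.

```python
import re
from datetime import datetime

# Assumed layout: "410" = OCP 4.10, "84" = RHEL 8.4, then a 12-digit
# build timestamp and a numeric revision after the hyphen.
BUILD_ID = re.compile(
    r"^(?P<ocp>\d{2,})\.(?P<rhel>\d{2,})\.(?P<stamp>\d{12})-(?P<rev>\d+)$"
)


def parse_rhcos_build(build_id):
    """Split an RHCOS build ID into stream names, build time, and revision."""
    match = BUILD_ID.match(build_id)
    if match is None:
        raise ValueError(f"unrecognized RHCOS build ID: {build_id!r}")
    ocp, rhel = match["ocp"], match["rhel"]
    return {
        "ocp_stream": f"{ocp[0]}.{ocp[1:]}",    # "410" -> "4.10"
        "rhel_stream": f"{rhel[0]}.{rhel[1:]}",  # "84"  -> "8.4"
        "built": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "revision": int(match["rev"]),
    }


if __name__ == "__main__":
    info = parse_rhcos_build("410.84.202112162002-0")
    print(info["ocp_stream"], info["rhel_stream"], info["built"].isoformat())
```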

Since this was fixed in RHEL 8.4.z, the fixed package also landed in RHCOS 4.9/4.8/4.7.

I'll create backport BZs to track inclusion of the fixed packages in those releases.
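
To check whether a given node image already carries the fix, the installed `rdma-core` build can be compared against `rdma-core-32.0-5.el8_4`. The sketch below is a simplified stand-in for RPM's version comparison: it ignores epochs and the `~`/`^` special cases, and the helper names `compare_label` and `has_fix` are hypothetical. For anything beyond a quick check, the real RPM tooling is the better authority.

```python
import re

FIXED_NVR = "rdma-core-32.0-5.el8_4"  # fixed build named in this bug


def _segments(label):
    # Break a version or release string into runs of digits or letters;
    # separators such as "." and "_" are dropped.
    return re.findall(r"\d+|[A-Za-z]+", label)


def compare_label(a, b):
    """Return -1, 0, or 1 for a < b, a == b, a > b (simplified RPM ordering)."""
    seg_a, seg_b = _segments(a), _segments(b)
    for x, y in zip(seg_a, seg_b):
        if x.isdigit() and y.isdigit():
            if int(x) != int(y):
                return -1 if int(x) < int(y) else 1
        elif x.isdigit() != y.isdigit():
            # A numeric segment sorts newer than an alphabetic one.
            return 1 if x.isdigit() else -1
        elif x != y:
            return -1 if x < y else 1
    # All shared segments are equal: the label with segments left over is newer.
    return (len(seg_a) > len(seg_b)) - (len(seg_a) < len(seg_b))


def has_fix(nvr, fixed=FIXED_NVR):
    """True if `nvr` (name-version-release) is at least the fixed build."""
    name, version, release = nvr.rsplit("-", 2)
    fixed_name, fixed_version, fixed_release = fixed.rsplit("-", 2)
    if name != fixed_name:
        raise ValueError(f"expected package {fixed_name!r}, got {name!r}")
    result = compare_label(version, fixed_version)
    if result == 0:
        result = compare_label(release, fixed_release)
    return result >= 0


if __name__ == "__main__":
    for nvr in ("rdma-core-32.0-4.el8", "rdma-core-32.0-5.el8_4"):
        print(nvr, "has fix" if has_fix(nvr) else "is affected")
```

On a running node, the installed build could be obtained with `rpm -q rdma-core` and passed to `has_fix` after stripping the trailing architecture suffix.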

