Bug 2118817 - Hostname is not configured during IPI installation of OpenShift 4.10.3 on baremetal when using NMState and static IP config for a bond network interface.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: NetworkManager
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Beniamino Galvani
QA Contact: David Jaša
URL: https://gitlab.freedesktop.org/Networ...
Whiteboard:
Depends On: 2064339
Blocks: 2164816 2152891 2152892 2152895
 
Reported: 2022-08-16 21:08 UTC by Ben Nemec
Modified: 2023-06-12 15:19 UTC (History)

Fixed In Version: NetworkManager-1.40.2-1.el8
Doc Type: Release Note
Doc Text:
Clone Of: 2064339
Clones: 2152891 2152892 2152895 2164816
Environment:
Last Closed: 2023-05-16 09:04:54 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
Node1 - NetworkManager journalctl logs (702.98 KB, text/plain), 2022-09-07 08:32 UTC, Venkat B
Node2 - NetworkManager journalctl logs (849.96 KB, text/plain), 2022-09-07 08:33 UTC, Venkat B
Node3 - NetworkManager journalctl logs (701.10 KB, text/plain), 2022-09-07 08:33 UTC, Venkat B
NM reproducer (2.59 KB, application/x-shellscript), 2022-09-26 08:42 UTC, Beniamino Galvani


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | NMT-187 | 0 | None | None | None | 2023-01-26 15:05:31 UTC
Red Hat Issue Tracker | RHELPLAN-131244 | 0 | None | None | None | 2022-08-16 21:09:24 UTC
Red Hat Product Errata | RHBA-2023:2968 | 0 | None | None | None | 2023-05-16 09:06:21 UTC
freedesktop.org Gitlab | NetworkManager NetworkManager merge_requests 1402 | 0 | None | merged | core: wait for carrier before resolving hostname via DNS | 2022-10-06 12:08:15 UTC

Comment 1 Beniamino Galvani 2022-08-18 07:44:12 UTC
Right, as Ben said, we need NM logs at trace level to understand the cause.

Comment 2 Venkat B 2022-08-18 11:53:41 UTC
Hello, I am facing the same issue, but with the offline Assisted Installer. At first boot of the Discovery ISO, the hosts come up with the hostname 'localhost'. I am seeing this only on HPE DL380 rack-mount servers (it works correctly on other HPE hardware). It is always reproducible; if you want to check this in my environment, we can arrange that.

The workaround that helps here is to restart both the NetworkManager and AI-Agent services on every node.

Comment 3 Beniamino Galvani 2022-09-05 14:57:21 UTC
(In reply to Venkat B from comment #2)
> Hello, I am facing the same issue, but with the offline Assisted Installer.
> At first boot of the Discovery ISO, the hosts come up with the hostname
> 'localhost'. I am seeing this only on HPE DL380 rack-mount servers (it works
> correctly on other HPE hardware). It is always reproducible; if you want to
> check this in my environment, we can arrange that.

It would be helpful if you can reproduce with NM log level set to trace and attach the journal log.
To enable trace log you can follow one of these methods: http://blog.nemebean.com/content/networkmanager-trace-logging

Comment 5 Venkat B 2022-09-07 07:37:17 UTC
Hi bgalvani, unfortunately (http://blog.nemebean.com/content/networkmanager-trace-logging) does not help in this context.
Reason: the link describes either enabling TRACE logging manually:

# echo "[logging]" > /etc/NetworkManager/conf.d/99-trace-logging.conf
# echo "level=TRACE" >> /etc/NetworkManager/conf.d/99-trace-logging.conf
# systemctl restart NetworkManager

OR via Machine Config YAML.

Neither way helps in our case, because this is the first boot of our hardware with the Assisted Installer Discovery Image. At this first boot, machineconfigs are not applied.

Also, the server enters the 'localhost' state while RHCOS is booting to RAM. So making the manual change (creating 99-trace-logging.conf) does not help either.

What are the other options?
Shouldn't this be done via an Ignition config override instead?

Comment 6 Venkat B 2022-09-07 08:31:47 UTC
Hi, I have now used the Ignition config override to inject /etc/NetworkManager/conf.d/99-trace-logging.conf via the Discovery Image ISO. With this, I was able to get the debug traces of the NetworkManager service. I have attached 3 log files, as I did a 3-node OCP installation via the offline Assisted Installer.

I hope this helps?
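
For readers who want to repeat this step, here is a minimal sketch of how such an Ignition override could be generated. It is not the exact override used for the attached logs. The Ignition spec version (3.1.0) and the helper names trace_conf and ignition_override are assumptions for illustration; only the target path and the file contents come from this thread.

```shell
#!/bin/sh
# Sketch: build an Ignition override that writes the NetworkManager
# trace-logging drop-in discussed above.
# Assumption: the installer accepts Ignition spec 3.1.0; adjust as needed.

# Contents of 99-trace-logging.conf, as given in comment 5.
trace_conf() {
    printf '[logging]\nlevel=TRACE\n'
}

# Print the override JSON on stdout (hypothetical helper name).
ignition_override() {
    # Embed the file as a base64 data URL; strip line wraps for portability.
    encoded=$(trace_conf | base64 | tr -d '\n')
    cat <<EOF
{
  "ignition": { "version": "3.1.0" },
  "storage": {
    "files": [
      {
        "path": "/etc/NetworkManager/conf.d/99-trace-logging.conf",
        "mode": 420,
        "overwrite": true,
        "contents": { "source": "data:text/plain;charset=utf-8;base64,${encoded}" }
      }
    ]
  }
}
EOF
}

ignition_override
```

The mode value 420 is decimal for octal 0644. How the resulting JSON is passed to the installer depends on the installer version and is not covered here.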

Comment 7 Venkat B 2022-09-07 08:32:43 UTC
Created attachment 1910044 [details]
Node1 - NetworkManager journalctl logs

Comment 8 Venkat B 2022-09-07 08:33:12 UTC
Created attachment 1910045 [details]
Node2 - NetworkManager journalctl logs

Comment 9 Venkat B 2022-09-07 08:33:41 UTC
Created attachment 1910046 [details]
Node3 - NetworkManager journalctl logs

Comment 10 Beniamino Galvani 2022-09-08 08:10:47 UTC
Hi, the problem here is that the ethernet interfaces take some time to get carrier, and NM tries the reverse DNS lookup on the bond when the ports are not yet active.

--

Initially eno2 and eno3 are waiting for carrier:

  Sep 07 08:08:17 localhost NetworkManager[2869]: <debug> [1662538097.8495] device[5a80e622e02f9ac8] (eno2): add_pending_action (2): 'carrier-wait'
  Sep 07 08:08:17 localhost NetworkManager[2869]: <debug> [1662538097.9872] device[59bdc647cc68ca49] (eno3): add_pending_action (2): 'carrier-wait'

Then bond0 gets configured:

  Sep 07 08:08:18 localhost NetworkManager[2869]: <debug> [1662538098.8029] platform: (bond0) address: adding or updating IPv4 address: 10.13.71.166/27 brd 10.13.71.191 lft forever pref forever ...
  Sep 07 08:08:18 localhost NetworkManager[2869]: <debug> [1662538098.8031] platform: (bond0) route: append     IPv4 route: type unicast 0.0.0.0/0 via 10.13.71.161 dev 10 metric 300 mss 0 rt-src user
  Sep 07 08:08:18 localhost NetworkManager[2869]: <info>  [1662538098.8054] device (bond0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')

The hostname is looked up via DNS using the address configured on bond0:

  Sep 07 08:08:18 localhost NetworkManager[2869]: <trace> [1662538098.8071] device[bd3187c077dd2f01] (bond0): hostname-from-dns: starting lookup for address 10.13.71.166

At this point eno2 and eno3 are not yet attached to the bond and DNS queries are dropped. Eventually the interfaces get carrier, but it's too late.

  Sep 07 08:08:21 localhost NetworkManager[2869]: <info>  [1662538101.6547] device (eno3): carrier: link connected
  Sep 07 08:08:25 localhost NetworkManager[2869]: <info>  [1662538105.1258] device (eno2): carrier: link connected
  Sep 07 08:08:25 localhost NetworkManager[2869]: <info>  [1662538105.1463] device (bond0): carrier: link connected

The query returns "localhost" presumably from the "myhostname" NSS module of glibc:

  Sep 07 08:08:26 localhost NetworkManager[2869]: <debug> [1662538106.8882] device[bd3187c077dd2f01] (bond0): hostname-from-dns: lookup done for 10.13.71.166, result "localhost"

--

I think NM should be improved; first, if there is no carrier, the interface should not be used for reverse DNS lookup. When there is a carrier change, the system hostname should be re-evaluated.

However, the previous point doesn't cover scenarios in which the interface has carrier but (temporarily) no connectivity; this happens for example if an 802.3ad bond is still negotiating LACP, or when the network is not ready yet because e.g. another switch is booting or in the STP forward-delay timeout. The problem is that once the DNS lookup fails, NM never retries (unless addresses or carrier change).

So, I think we should also introduce a retry mechanism to make the lookup more robust. If there is no hostname set, NM should retry the DNS query with an exponential backoff.
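
To make the two ideas above concrete (skip the lookup while there is no carrier, and retry with exponential backoff), here is a rough shell sketch. It is not NetworkManager's implementation, which lives in its C core; all function names are hypothetical, and the attempt count and initial delay in the usage note are arbitrary illustration values.

```shell
#!/bin/sh
# Sketch only: carrier-gated reverse lookup with exponential backoff.

# has_carrier IFACE [SYSFS_ROOT]
# True if the kernel reports link carrier for IFACE. The second argument
# exists only so the check can be pointed at a fake tree for testing.
has_carrier() {
    iface=$1
    root=${2:-/sys/class/net}
    # The read fails if the interface is missing or administratively down.
    state=$(cat "$root/$iface/carrier" 2>/dev/null) || return 1
    [ "$state" = "1" ]
}

# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the delay after each failure.
retry_with_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# resolve_real_hostname IFACE ADDRESS
# Fails without carrier, or if the reverse lookup yields nothing or a
# "localhost" placeholder; otherwise prints the resolved name.
resolve_real_hostname() {
    has_carrier "$1" || return 1
    name=$(getent hosts "$2" | awk '{print $2}')
    case "$name" in
        ""|localhost*) return 1 ;;
    esac
    printf '%s\n' "$name"
}
```

With the values from the log excerpt, `retry_with_backoff 6 1 resolve_real_hostname bond0 10.13.71.166` would wait 1, 2, 4, 8 and 16 seconds between attempts before giving up.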

Comment 15 Ben Nemec 2022-09-23 21:04:23 UTC
Would restarting NetworkManager help or would it go through the exact same sequence of events again? I'm wondering if we could write a little service that would watch for the static ip + localhost scenario and bounce NM if it sees that happen. Or maybe just a reload like we do in the resolv.conf dispatcher script[0] would help?

0: https://github.com/openshift/machine-config-operator/blob/a627415c240b4c7dd2f9e90f659690d9c0f623f3/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml#L101

Comment 16 Beniamino Galvani 2022-09-26 08:42:51 UTC
Created attachment 1914318 [details]
NM reproducer

Comment 17 Beniamino Galvani 2022-09-26 08:48:08 UTC
Hi Ben,

For reference, I attached a script that I use to reproduce the problem. To force a new DNS resolution of the hostname there is no need to restart NetworkManager; the command "nmcli general reload dns-rc" is enough.
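
A minimal sketch of the watcher idea from comment 15, using the reload command above instead of a restart. The function names are hypothetical, and treating an empty name or "localhost.localdomain" as placeholders is an assumption; how the check would be scheduled (timer, dispatcher script) is left open.

```shell
#!/bin/sh
# Sketch: ask NetworkManager to re-resolve the hostname when the node
# is still using a placeholder name.

# True if NAME looks like an unset or placeholder hostname.
is_placeholder_hostname() {
    case "$1" in
        ""|localhost|localhost.localdomain) return 0 ;;
        *) return 1 ;;
    esac
}

# refresh_hostname_if_needed [NAME]
# Checks NAME (default: the current hostname) and triggers the reload
# only when it is a placeholder.
refresh_hostname_if_needed() {
    current=${1:-$(hostname)}
    if is_placeholder_hostname "$current"; then
        nmcli general reload dns-rc
    fi
}
```

Calling `refresh_hostname_if_needed` with no argument checks the running system. This is only a stopgap for versions without the fix.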

Comment 23 David Jaša 2022-12-05 23:16:01 UTC
VERIFIED in NetworkManager-1.40.6-1.el8 using the reproducer by Beniamino; the hostname stays correct.

Automated test will follow later.

Comment 29 errata-xmlrpc 2023-05-16 09:04:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2968

