Bug 2118817

Summary:

Hostname is not configured during IPI installation of OpenShift 4.10.3 on baremetal when using NMState and static IP config for a bond network interface.

Product:

Red Hat Enterprise Linux 8

Reporter:

Ben Nemec <bnemec>

Component:

NetworkManager

Assignee:

Beniamino Galvani <bgalvani>

Status:

CLOSED ERRATA

QA Contact:

David Jaša <djasa>

Severity:

high

Docs Contact:

Priority:

high

Version:

8.4

CC:

alex.birkner, anestero, apjagtap, augol, bgalvani, blitton, bnemec, bzvonar, cldavey, derekh, dphillip, hpokorny, jmalde, lrintel, mflorczy, mko, openshift-bugs-escalate, rbennett, rkhan, sfaye, sronan, sukulkar, till, vbenes, venkatasubramanian.b

Target Milestone:

Keywords:

Triaged, ZStream

Target Release:

---

Flags:

pm-rhel: mirror+

Hardware:

Unspecified

OS:

Unspecified

URL:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager-ci/-/merge_requests/1292

Whiteboard:

Fixed In Version:

NetworkManager-1.40.2-1.el8

Doc Type:

Release Note

Doc Text:

Story Points:

---

Clone Of:

2064339

Clones:

2152891 2152892 2152895 2164816

Environment:

Last Closed:

2023-05-16 09:04:54 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

2064339

Bug Blocks:

2152891, 2152892, 2152895, 2164816

Attachments:

Description	Flags
Node1 - NetworkManager journalctl logs	none
Node2 - NetworkManager journalctl logs	none
Node3 - NetworkManager journalctl logs	none
NM reproducer	none

Comment 1 Beniamino Galvani 2022-08-18 07:44:12 UTC

Right, as Ben said, we need NM logs at trace level to understand the cause.

Comment 2 Venkat B 2022-08-18 11:53:41 UTC

Hello, I am facing the same issue, but with the offline Assisted Installer. At first boot of the Discovery ISO, the hosts come up with the hostname 'localhost'. I am seeing this only on HPE DL380 rack-mount servers (it works properly on other HPE hardware). It is always reproducible; if you want to check this in my environment, we can do so.

The workaround which helps here is to restart both the NetworkManager and AI-Agent service on every node.

Comment 3 Beniamino Galvani 2022-09-05 14:57:21 UTC

(In reply to Venkat B from comment #2)
> Hello, I am facing the same issue, but with Assisted Installer offline. At
> first boot of the Discovery ISO the hosts go into hostname as 'localhost'.
> This I am seeing on HPE DL380 Rack Mount Servers only (its proper on other
> HPE HW). This is reproducible always, in case you want to check this in my
> environment, we can do so.

It would be helpful if you can reproduce with NM log level set to trace and attach the journal log.
To enable trace log you can follow one of these methods: http://blog.nemebean.com/content/networkmanager-trace-logging

Comment 5 Venkat B 2022-09-07 07:37:17 UTC

Hi bgalvani, unfortunately http://blog.nemebean.com/content/networkmanager-trace-logging does not help in this context.
Reason: the link describes either enabling TRACE logging manually =>

# echo "[logging]" > /etc/NetworkManager/conf.d/99-trace-logging.conf
# echo "level=TRACE" >> /etc/NetworkManager/conf.d/99-trace-logging.conf
# systemctl restart NetworkManager

OR via Machine Config YAML.

Neither way helps in our case, because we are doing the first boot of our hardware with the Assisted Installer Discovery Image. At this first boot, machineconfigs are not applied.

Also, the server enters the 'localhost' state while RHCOS is booting to RAM, so making the manual change (creating 99-trace-logging.conf) does not help either.

What are the other options?
Shouldn't this be done via an Ignition config override instead?

Comment 6 Venkat B 2022-09-07 08:31:47 UTC

Hi, I have now used the Ignition config override to inject /etc/NetworkManager/conf.d/99-trace-logging.conf via the Discovery Image ISO. With this I was able to get the debug traces of the NetworkManager service. I have attached 3 log files, as I did a 3-node OCP installation via the offline Assisted Installer.

I hope this helps?
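
For readers who need to do the same, here is a minimal sketch of how such an override could be generated. It is not the exact override used in this report: the Ignition spec version (3.1.0) and the helper name make_trace_override are assumptions, and the version should match whatever the Discovery Image expects. The file content is the same two lines that the manual method writes.

```shell
#!/bin/sh
# Sketch: print an Ignition override that writes the NM trace-logging drop-in.
# Assumption: spec version 3.1.0; use the version your image expects.
# make_trace_override is a hypothetical helper name.

make_trace_override() {
    # Base64-encode the same two lines the manual method writes.
    b64=$(printf '[logging]\nlevel=TRACE\n' | base64 | tr -d '\n')
    cat <<EOF
{
  "ignition": { "version": "3.1.0" },
  "storage": {
    "files": [
      {
        "path": "/etc/NetworkManager/conf.d/99-trace-logging.conf",
        "mode": 420,
        "overwrite": true,
        "contents": { "source": "data:text/plain;charset=utf-8;base64,$b64" }
      }
    ]
  }
}
EOF
}

make_trace_override
```

The mode value 420 is the decimal form of octal 0644. The printed JSON is what would be passed as the override; how it is submitted depends on the installer in use.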

Comment 7 Venkat B 2022-09-07 08:32:43 UTC

Created attachment 1910044
Node1 - NetworkManager journalctl logs

Comment 8 Venkat B 2022-09-07 08:33:12 UTC

Created attachment 1910045
Node2 - NetworkManager journalctl logs

Comment 9 Venkat B 2022-09-07 08:33:41 UTC

Created attachment 1910046
Node3 - NetworkManager journalctl logs

Comment 10 Beniamino Galvani 2022-09-08 08:10:47 UTC

Hi, the problem here is that the ethernet interfaces take some time to get carrier, and NM tries the reverse DNS lookup on the bond when the ports are not yet active.

--

Initially eno2 and eno3 are waiting for carrier:

  Sep 07 08:08:17 localhost NetworkManager[2869]: <debug> [1662538097.8495] device[5a80e622e02f9ac8] (eno2): add_pending_action (2): 'carrier-wait'
  Sep 07 08:08:17 localhost NetworkManager[2869]: <debug> [1662538097.9872] device[59bdc647cc68ca49] (eno3): add_pending_action (2): 'carrier-wait'

Then bond0 gets configured:

  Sep 07 08:08:18 localhost NetworkManager[2869]: <debug> [1662538098.8029] platform: (bond0) address: adding or updating IPv4 address: 10.13.71.166/27 brd 10.13.71.191 lft forever pref forever ...
  Sep 07 08:08:18 localhost NetworkManager[2869]: <debug> [1662538098.8031] platform: (bond0) route: append     IPv4 route: type unicast 0.0.0.0/0 via 10.13.71.161 dev 10 metric 300 mss 0 rt-src user
  Sep 07 08:08:18 localhost NetworkManager[2869]: <info>  [1662538098.8054] device (bond0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')

The hostname is looked up via DNS using the address configured on bond0:

  Sep 07 08:08:18 localhost NetworkManager[2869]: <trace> [1662538098.8071] device[bd3187c077dd2f01] (bond0): hostname-from-dns: starting lookup for address 10.13.71.166

At this point eno2 and eno3 are not yet attached to the bond and DNS queries are dropped. Eventually the interfaces get carrier, but it's too late.

  Sep 07 08:08:21 localhost NetworkManager[2869]: <info>  [1662538101.6547] device (eno3): carrier: link connected
  Sep 07 08:08:25 localhost NetworkManager[2869]: <info>  [1662538105.1258] device (eno2): carrier: link connected
  Sep 07 08:08:25 localhost NetworkManager[2869]: <info>  [1662538105.1463] device (bond0): carrier: link connected

The query returns "localhost" presumably from the "myhostname" NSS module of glibc:

  Sep 07 08:08:26 localhost NetworkManager[2869]: <debug> [1662538106.8882] device[bd3187c077dd2f01] (bond0): hostname-from-dns: lookup done for 10.13.71.166, result "localhost"

--

I think NM should be improved; first, if there is no carrier, the interface should not be used for reverse DNS lookup. When there is a carrier change, the system hostname should be re-evaluated.

However, the previous point doesn't cover scenarios in which the interface has carrier but (temporarily) no connectivity; this happens for example if an 802.3ad bond is still negotiating LACP, or when the network is not ready yet because e.g. another switch is booting or in the STP forward-delay timeout. The problem is that once the DNS lookup fails, NM never retries (unless addresses or carrier change).

So, I think we should also introduce a retry mechanism to make the lookup more robust. If there is no hostname set, NM should retry the DNS query with an exponential backoff.
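
To illustrate the retry idea outside NetworkManager, here is a rough shell sketch of a reverse lookup retried with capped exponential backoff. It is not NetworkManager's implementation; the function names, the 1 s starting delay, the 60 s cap, and the 8 attempts are all assumptions chosen for the example. A result of "localhost" is treated as a failed lookup, matching the symptom in the logs above.

```shell
#!/bin/sh
# Illustrative only: this is not NetworkManager's implementation.
# All function names and timing values are hypothetical.

# Print capped exponential backoff delays, one per line.
# Usage: backoff_delays FIRST_DELAY MAX_DELAY ATTEMPTS
backoff_delays() {
    delay=$1
    i=0
    while [ "$i" -lt "$3" ]; do
        echo "$delay"
        delay=$((delay * 2))
        if [ "$delay" -gt "$2" ]; then delay=$2; fi
        i=$((i + 1))
    done
}

# Reverse-resolve an address through NSS; print the first name found.
resolve_name() {
    getent hosts "$1" | awk '{ print $2; exit }'
}

# Retry until a usable (non-empty, non-"localhost") name is returned.
# SLEEP_CMD can be overridden (for example with ":") to skip real waits.
lookup_with_backoff() {
    for d in $(backoff_delays 1 60 8); do
        name=$(resolve_name "$1")
        case $name in
            ""|localhost|localhost.*) ;;
            *) echo "$name"; return 0 ;;
        esac
        ${SLEEP_CMD:-sleep} "$d"
    done
    return 1
}
```

Calling lookup_with_backoff with the bond address (10.13.71.166 in the logs above) would keep retrying while the ports are still waiting for carrier or LACP, instead of accepting the first "localhost" answer.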

Comment 15 Ben Nemec 2022-09-23 21:04:23 UTC

Would restarting NetworkManager help or would it go through the exact same sequence of events again? I'm wondering if we could write a little service that would watch for the static ip + localhost scenario and bounce NM if it sees that happen. Or maybe just a reload like we do in the resolv.conf dispatcher script[0] would help?

0: https://github.com/openshift/machine-config-operator/blob/a627415c240b4c7dd2f9e90f659690d9c0f623f3/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml#L101

Comment 16 Beniamino Galvani 2022-09-26 08:42:51 UTC

Created attachment 1914318
NM reproducer

Comment 17 Beniamino Galvani 2022-09-26 08:48:08 UTC

Hi Ben,

for reference, I attached a script that I use to reproduce the problem. To force a new DNS resolution of the hostname there is no need to restart NetworkManager; the command "nmcli general reload dns-rc" is enough.
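
As a possible interim workaround along the lines discussed in comment 15, the reload could be run only when the hostname is still the fallback value. The following is a sketch under that assumption; the helper names are hypothetical, and how it gets triggered (a timer, a dispatcher script, or a one-shot service) is left open.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical.

# Succeed if the given hostname looks like the unset/fallback value.
is_fallback_hostname() {
    case $1 in
        ""|localhost|localhost.localdomain) return 0 ;;
        *) return 1 ;;
    esac
}

# Ask NetworkManager to reload its DNS configuration, which according to
# the comment above also forces a new DNS resolution of the hostname.
# Does nothing when a real hostname is already set.
refresh_hostname_if_needed() {
    if is_fallback_hostname "$(hostname)"; then
        nmcli general reload dns-rc
    fi
}
```

Calling refresh_hostname_if_needed after the bond ports have carrier avoids a full NetworkManager restart.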

Comment 23 David Jaša 2022-12-05 23:16:01 UTC

VERIFIED in NetworkManager-1.40.6-1.el8 using the reproducer by Beniamino; the hostname stays correct.

Automated test will follow later.

Comment 29 errata-xmlrpc 2023-05-16 09:04:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2968