Bug 2118817
| Summary: | Hostname is not configured during IPI installation of OpenShift 4.10.3 on baremetal when using NMState and static IP config for a bond network interface. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Ben Nemec <bnemec> |
| Component: | NetworkManager | Assignee: | Beniamino Galvani <bgalvani> |
| Status: | CLOSED ERRATA | QA Contact: | David Jaša <djasa> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 8.4 | CC: | alex.birkner, anestero, apjagtap, augol, bgalvani, blitton, bnemec, bzvonar, cldavey, derekh, dphillip, hpokorny, jmalde, lrintel, mflorczy, mko, openshift-bugs-escalate, rbennett, rkhan, sfaye, sronan, sukulkar, till, vbenes, venkatasubramanian.b |
| Target Milestone: | rc | Keywords: | Triaged, ZStream |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| URL: | https://gitlab.freedesktop.org/NetworkManager/NetworkManager-ci/-/merge_requests/1292 | | |
| Whiteboard: | | | |
| Fixed In Version: | NetworkManager-1.40.2-1.el8 | Doc Type: | Release Note |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2064339 | | |
| : | 2152891 2152892 2152895 2164816 | Environment: | |
| Last Closed: | 2023-05-16 09:04:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2064339 | | |
| Bug Blocks: | 2152891, 2152892, 2152895, 2164816 | | |
| Attachments: | | | |

Comment 1
Beniamino Galvani
2022-08-18 07:44:12 UTC
Hello, I am facing the same issue, but with the offline Assisted Installer. At first boot of the Discovery ISO, the hosts come up with the hostname 'localhost'. I see this only on HPE DL380 rack-mount servers (it works correctly on other HPE hardware). It is always reproducible; if you want to check this in my environment, we can arrange that. The workaround that helps here is to restart both the NetworkManager and AI-Agent services on every node.

(In reply to Venkat B from comment #2)
> Hello, I am facing the same issue, but with the offline Assisted Installer.
> At first boot of the Discovery ISO, the hosts come up with the hostname
> 'localhost'. I see this only on HPE DL380 rack-mount servers (it works
> correctly on other HPE hardware). It is always reproducible; if you want to
> check this in my environment, we can arrange that.

It would be helpful if you could reproduce with the NM log level set to trace and attach the journal log. To enable trace logging you can follow one of these methods:
http://blog.nemebean.com/content/networkmanager-trace-logging

Hi bgalvani, unfortunately http://blog.nemebean.com/content/networkmanager-trace-logging does not help in this context. The link describes two ways of enabling TRACE logging. The first is manual:

    # echo "[logging]" > /etc/NetworkManager/conf.d/99-trace-logging.conf
    # echo "level=TRACE" >> /etc/NetworkManager/conf.d/99-trace-logging.conf
    # systemctl restart NetworkManager

The second is via a Machine Config YAML. Neither way helps in our case, because we are doing the first boot of our hardware with the Assisted Installer Discovery Image. At this first boot, machineconfigs are not applied. Also, the server enters the 'localhost' state while RHCOS is booting to RAM, so making the manual changes (creating 99-trace-logging.conf) does not help either. What are the other options? Shouldn't this be done via an Ignition config override instead?

Hi, I have now used the Ignition config override to inject /etc/NetworkManager/conf.d/99-trace-logging.conf into the Discovery Image ISO. With this I was able to get the debug traces of the NetworkManager service. I have attached three log files, since I did a 3-node OCP installation via the offline Assisted Installer. I hope this helps.

Created attachment 1910044
Node1 - NetworkManager journalctl logs

Created attachment 1910045
Node2 - NetworkManager journalctl logs

Created attachment 1910046
Node3 - NetworkManager journalctl logs

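For readers who want to repeat the Ignition-based approach described above, here is a minimal sketch that generates an override writing the trace-logging drop-in. The file path and contents are taken from this thread. The Ignition spec version (3.1.0) and the function name `build_ignition_override` are assumptions made for illustration; check which spec version your Discovery Image expects, and note that how the override is handed to the Assisted Installer is not shown here.

```python
import json
from urllib.parse import quote

# Path and file body as given in the comments above.
TRACE_CONF_PATH = "/etc/NetworkManager/conf.d/99-trace-logging.conf"
TRACE_CONF_BODY = "[logging]\nlevel=TRACE\n"


def build_ignition_override(spec_version: str = "3.1.0") -> dict:
    """Return an Ignition config dict that writes the trace-logging drop-in.

    spec_version is an assumption; adjust it to what your image supports.
    """
    # Ignition accepts inline file contents as a data URL (RFC 2397),
    # so the body is percent-encoded and prefixed with "data:,".
    source = "data:," + quote(TRACE_CONF_BODY)
    return {
        "ignition": {"version": spec_version},
        "storage": {
            "files": [
                {
                    "path": TRACE_CONF_PATH,
                    "mode": 0o644,  # JSON has no octal; this serializes as 420
                    "overwrite": True,
                    "contents": {"source": source},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Print the JSON document to paste into the override field.
    print(json.dumps(build_ignition_override(), indent=2))
```

Because the file is written by Ignition before NetworkManager starts, no restart of the service is needed for the trace level to take effect on first boot.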
Hi, the problem here is that the ethernet interfaces take some time to get carrier, and NM tries the reverse DNS lookup on the bond when the ports are not yet active.

--

Initially eno2 and eno3 are waiting for carrier:

    Sep 07 08:08:17 localhost NetworkManager[2869]: <debug> [1662538097.8495] device[5a80e622e02f9ac8] (eno2): add_pending_action (2): 'carrier-wait'
    Sep 07 08:08:17 localhost NetworkManager[2869]: <debug> [1662538097.9872] device[59bdc647cc68ca49] (eno3): add_pending_action (2): 'carrier-wait'

Then bond0 gets configured:

    Sep 07 08:08:18 localhost NetworkManager[2869]: <debug> [1662538098.8029] platform: (bond0) address: adding or updating IPv4 address: 10.13.71.166/27 brd 10.13.71.191 lft forever pref forever
    ...
    Sep 07 08:08:18 localhost NetworkManager[2869]: <debug> [1662538098.8031] platform: (bond0) route: append IPv4 route: type unicast 0.0.0.0/0 via 10.13.71.161 dev 10 metric 300 mss 0 rt-src user
    Sep 07 08:08:18 localhost NetworkManager[2869]: <info> [1662538098.8054] device (bond0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')

The hostname is looked up via DNS using the address configured on bond0:

    Sep 07 08:08:18 localhost NetworkManager[2869]: <trace> [1662538098.8071] device[bd3187c077dd2f01] (bond0): hostname-from-dns: starting lookup for address 10.13.71.166

At this point eno2 and eno3 are not yet attached to the bond and DNS queries are dropped. Eventually the interfaces get carrier, but it's too late.

    Sep 07 08:08:21 localhost NetworkManager[2869]: <info> [1662538101.6547] device (eno3): carrier: link connected
    Sep 07 08:08:25 localhost NetworkManager[2869]: <info> [1662538105.1258] device (eno2): carrier: link connected
    Sep 07 08:08:25 localhost NetworkManager[2869]: <info> [1662538105.1463] device (bond0): carrier: link connected

The query returns "localhost", presumably from the "myhostname" NSS module of glibc:

    Sep 07 08:08:26 localhost NetworkManager[2869]: <debug> [1662538106.8882] device[bd3187c077dd2f01] (bond0): hostname-from-dns: lookup done for 10.13.71.166, result "localhost"

--

I think NM should be improved. First, if there is no carrier, the interface should not be used for reverse DNS lookup. When there is a carrier change, the system hostname should be re-evaluated.

However, the previous point doesn't cover scenarios in which the interface has carrier but (temporarily) no connectivity; this happens for example if an 802.3ad bond is still negotiating LACP, or when the network is not ready yet because e.g. another switch is booting or in the STP forward-delay timeout. The problem is that once the DNS lookup fails, NM never retries (unless addresses or carrier change). So, I think we should also introduce a retry mechanism to make the lookup more robust. If there is no hostname set, NM should retry the DNS query with an exponential backoff.

Would restarting NetworkManager help, or would it go through the exact same sequence of events again? I'm wondering if we could write a little service that would watch for the static IP + localhost scenario and bounce NM if it sees that happen. Or maybe just a reload, like we do in the resolv.conf dispatcher script[0], would help?

0: https://github.com/openshift/machine-config-operator/blob/a627415c240b4c7dd2f9e90f659690d9c0f623f3/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml#L101

Created attachment 1914318
NM reproducer

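The two improvements proposed in the analysis above (skip the lookup while there is no carrier, and retry with exponential backoff while no usable hostname is found) can be illustrated with a short sketch. This is not NetworkManager's implementation, which is written in C; the function names, delays, and attempt limit are hypothetical and chosen only to show the control flow.

```python
import socket
import time
from typing import Callable, Optional


def reverse_lookup(address: str) -> Optional[str]:
    """Reverse-resolve an address; return None on failure or a useless name."""
    try:
        name = socket.gethostbyaddr(address)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    # A "localhost" answer (as seen in the log above) counts as no answer.
    return None if name.split(".")[0] == "localhost" else name


def lookup_with_backoff(
    address: str,
    has_carrier: Callable[[], bool],
    lookup: Callable[[str], Optional[str]] = reverse_lookup,
    sleep: Callable[[float], None] = time.sleep,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    max_attempts: int = 8,
) -> Optional[str]:
    """Retry the reverse lookup, doubling the wait after each failed attempt."""
    delay = initial_delay
    for attempt in range(max_attempts):
        # Only query DNS when the link is up; otherwise just wait and retry.
        if has_carrier():
            name = lookup(address)
            if name is not None:
                return name
        if attempt < max_attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return None
```

The carrier check and the resolver are passed in as callables so the loop can be exercised without real network state; in the scenario from the logs, the first attempts would be skipped until eno2 and eno3 report carrier.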
Hi Ben, for reference, I attached a script that I use to reproduce the problem. To force a new DNS resolution of the hostname there is no need to restart NetworkManager; the command "nmcli general reload dns-rc" is enough.

VERIFIED in NetworkManager-1.40.6-1.el8 using the reproducer by Beniamino; the hostname stays correct. An automated test will follow later.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2023:2968
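
On systems that do not yet have the fixed NetworkManager build, the "little service" idea from the thread could be approximated as follows. This is a rough sketch under stated assumptions: the helper names are hypothetical, it only checks whether the current hostname still looks unset (it does not verify that a static IP is configured), and the only external command used is the `nmcli general reload dns-rc` invocation quoted above.

```python
import socket
import subprocess
from typing import Callable

# Command quoted in the thread; it forces a new DNS resolution of the hostname.
RELOAD_CMD = ("nmcli", "general", "reload", "dns-rc")


def hostname_is_unset(hostname: str) -> bool:
    """Return True if the hostname is empty or a 'localhost' placeholder."""
    short = hostname.split(".")[0].lower()
    return short in ("", "localhost")


def reload_if_unset(
    get_hostname: Callable[[], str] = socket.gethostname,
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Run the reload command once if the hostname still looks unset.

    Returns True when the reload was triggered, False otherwise.
    """
    if not hostname_is_unset(get_hostname()):
        return False
    run(list(RELOAD_CMD), check=True)
    return True


if __name__ == "__main__":
    print("reload triggered" if reload_if_unset() else "hostname already set")
```

A caller would typically invoke `reload_if_unset()` periodically (for example from a timer unit) until it returns False; the scheduling is left out of this sketch.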