Description of problem:
In my current test with the 'latest' assisted-service operator version from today, May 12th, 9:00 UTC, 5 out of 6 servers don't get hostnames during the initial boot from the minimal-iso via virtual media. As a result, the servers register with the hostname 'localhost' and the agent validation fails.

Setting the hostname for each server via
ssh core@<serverIP> sudo hostnamectl set-hostname <a valid hostname>
and restarting agent.service lets the installation process continue (restarting the agent service may not be required).

Hardware: Dell PowerEdge R340, iDRAC 9, 4.40.00.00
Target cluster version: 4.8.0-fc-3

How reproducible:
- Create cluster resources on the assisted-service side
- Boot servers from the minimal-iso

Actual results:
5 out of 6 servers have the hostname localhost

Expected results:
All servers have a different hostname (typically following the dhcp-<part of the server IP>.<basedomain> pattern), so that the agent validation works

Additional info:
Attached are two boot log files, one from a working and one from a failing server. https://bugzilla.redhat.com/show_bug.cgi?id=1956360 may or may not be related.
Is there a different hostname configured in the Agent Spec? If none is defined, the host can come up with the hostname localhost.
@otuchfel we need to understand whether this is indeed a problem, or just that the host is not assigned a hostname via DHCP. Anyway, there is an easy workaround: explicitly set the hostnames via our API. There is no need to ssh into the hosts.
Regarding whether this is just a host not being assigned a hostname via DHCP: the servers are part of the lab setup, where the DHCP setup has not changed since the initial setup in January. The servers get rebooted/rebuilt several times per week, and so far they have come up with an automatically assigned hostname (dhcp-X-Y_Z.whateverthebasedomain). This is not the case when I use the assisted service operator with a 4.8.0-fc target cluster. I'm not saying this is a big problem, but it is certainly an unexpected and unwanted change of behaviour. What I'm hoping to get from this BZ is at least some hint of what changed to cause this different behaviour (installer change? 4.8.0-fc-3 bug? RHEL 8.4?). Also: would you be able to point out what setting the hostname via the API would look like?
You can set a desired hostname different from "localhost" via the UI by clicking a hostname on the Host discovery page and modifying the Requested hostname field. Let me know if doing it via an API is more convenient for you - I will write a script that does it.
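For the operator (Kubernetes API) flow, one way to set the hostname without the UI is to patch the hostname field of the Agent resource mentioned above. The following is only a sketch: it assumes the Agent custom resource (agents.agent-install.openshift.io) exposes spec.hostname, and the agent name, namespace and hostname are placeholders, not values from this bug. It only builds and prints the oc command; it does not run it.

```python
import json
import shlex
from typing import List


def agent_hostname_patch_command(agent_name: str, namespace: str, hostname: str) -> List[str]:
    """Build an 'oc patch' command that sets spec.hostname on an Agent resource.

    Assumption: the Agent CR has a spec.hostname field, as hinted at by the
    "Agent Spec" remark in this thread.
    """
    if hostname == "localhost":
        # 'localhost' is exactly the value the agent validation rejects.
        raise ValueError("hostname must not be 'localhost'")
    patch = json.dumps({"spec": {"hostname": hostname}})
    return [
        "oc", "patch", "agents.agent-install.openshift.io", agent_name,
        "-n", namespace, "--type", "merge", "-p", patch,
    ]


# Placeholder values for illustration only.
cmd = agent_hostname_patch_command("example-agent", "example-namespace", "worker-0.example.com")
print(shlex.join(cmd))
```

The printed command can be reviewed and then run by hand once per Agent.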
@vemporop it is true that this can be configured by the user. However, the bug still stands. There is DHCP that is sending hostnames to the hosts, and 1 of the hosts does not get/accept the hostname and uses localhost instead
Please set level=TRACE in the [logging] section of /etc/NetworkManager/NetworkManager.conf, reboot and attach logs again for the machine that fails to obtain a hostname. If rebooting is not possible because this happens during installation, only restart the NetworkManager service.
Another BZ that might be related https://bugzilla.redhat.com/show_bug.cgi?id=1929160, but it seems we need NM logs at TRACE level to tell for sure.
@vemporop I guess we need to instruct @uschlute how to set this flag so that he can recreate this
It's explained in comment 10: https://bugzilla.redhat.com/show_bug.cgi?id=1959842#c10

To enable TRACE: /etc/NetworkManager/NetworkManager.conf already has a line #level=TRACE; it just needs to be uncommented.

To restart the NetworkManager service:
sudo systemctl restart NetworkManager
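If the edit has to be repeated on several hosts, the uncommenting step can be scripted. The sketch below works on the file contents as a string and assumes the file has a [logging] section containing a commented-out #level=TRACE line, as described above; the sample text is illustrative and not copied from a real host.

```python
import re


def enable_trace_logging(conf_text: str) -> str:
    """Uncomment '#level=TRACE' inside the [logging] section, leave the rest as is."""
    out = []
    in_logging = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track which section we are in.
            in_logging = stripped == "[logging]"
        elif in_logging and re.fullmatch(r"#\s*level\s*=\s*TRACE", stripped):
            line = "level=TRACE"
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative sample, assumed to resemble the shipped configuration file.
SAMPLE_CONF = "[main]\n#plugins=ifcfg-rh\n\n[logging]\n#level=TRACE\n#domains=ALL\n"
UPDATED_CONF = enable_trace_logging(SAMPLE_CONF)
print(UPDATED_CONF)
```

Writing the result back to /etc/NetworkManager/NetworkManager.conf requires root, and NetworkManager must be restarted afterwards for the new log level to take effect.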
@vemporop I attached three NetworkManager trace log files: two from servers that came up with the hostname localhost and one from a server with a regular hostname, all booted from the same ISO at the same time. The servers were not rebooted; only NetworkManager was restarted.
Thank you very much, Ulrich! @bgalvani PTAL at the trace logs attached by Ulrich Schlueter. Since it's a live ISO, servers could not be rebooted, only the NetworkManager was restarted.
I think the problem is the following. There are two interfaces, enp3s0f0 and enp3s0f1, both with IPv4 and IPv6 default routes; it seems that the DNS server doesn't return a hostname for the IPv6 address that is assigned to the interface (which makes sense, since the address is assigned via SLAAC).

If there are multiple interfaces active, NM starts with the ones having the best (lowest-metric) default route, and tries a reverse lookup on the address. When the interface with IPv6 is the first in the list (because it activated earlier), the DNS server returns no result. However, /etc/nsswitch.conf contains:

hosts: files dns myhostname

so the glibc resolver calls the "myhostname" module, which returns "localhost". NM then uses "localhost" as the hostname, without trying other interfaces.

A proper fix for this would be [1], i.e. to spawn a helper that forces glibc to use only the "dns" NSS plugin.

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/877

I think what changed compared to 8.3 is that on 8.3 an interface with the best-default-route IPv4 address would always be tried before one with a best IPv6 address; now, the one activated earlier wins. So an alternative would be to restore the preference for IPv4; this would work for the network scenario in this bz (where the hostname can only be resolved via IPv4), but probably not for others.

Another alternative would be to ignore the "localhost" hostname when it's returned from the lookup, and switch to the next interface.
@bgalvani is there a way to force NM to try all interfaces through external configuration (I mean setting some flag in the NM config files), or is it purely an NM code fix?
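The selection logic described in this comment can be modelled in a few lines. This is a toy model, not NetworkManager code: the addresses, the PTR table and all function names are made up for illustration, and each candidate carries a single address, whereas the real interfaces have both IPv4 and IPv6 addresses. It shows why an earlier-activated IPv6 candidate yields "localhost", and how the three options discussed (dns-only lookup, IPv4 preference, skipping "localhost") each change the outcome.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    iface: str
    address: str
    family: int            # 4 or 6
    activation_order: int  # lower means the interface activated earlier


# Hypothetical reverse-DNS data: only the IPv4 address has a PTR record.
PTR_RECORDS: Dict[str, str] = {"192.0.2.21": "dhcp-2-21.example.com"}


def reverse_lookup(address: str, nss_modules: Sequence[str]) -> Optional[str]:
    """Toy stand-in for the resolver walking the nsswitch 'hosts' modules."""
    for module in nss_modules:
        if module == "dns" and address in PTR_RECORDS:
            return PTR_RECORDS[address]
        if module == "myhostname":
            return "localhost"  # synthesized answer, as described in the comment
    return None


def pick_hostname(
    candidates: Sequence[Candidate],
    nss_modules: Sequence[str] = ("files", "dns", "myhostname"),
    prefer_ipv4: bool = False,
    skip_localhost: bool = False,
) -> Optional[str]:
    """Return the first usable reverse-lookup result, in candidate order."""
    def sort_key(c: Candidate):
        # With prefer_ipv4, IPv6 candidates sort after all IPv4 ones.
        return (prefer_ipv4 and c.family == 6, c.activation_order)

    for c in sorted(candidates, key=sort_key):
        name = reverse_lookup(c.address, nss_modules)
        if name is None or (skip_localhost and name == "localhost"):
            continue
        return name  # the first result wins, even if it is "localhost"
    return None


CANDIDATES = [
    Candidate("enp3s0f0", "2001:db8::21", 6, activation_order=0),
    Candidate("enp3s0f1", "192.0.2.21", 4, activation_order=1),
]

print(pick_hostname(CANDIDATES))                        # behaviour reported in this bz
print(pick_hostname(CANDIDATES, nss_modules=("dns",)))  # dns-only helper
print(pick_hostname(CANDIDATES, prefer_ipv4=True))      # IPv4 preference as on 8.3
print(pick_hostname(CANDIDATES, skip_localhost=True))   # ignore "localhost"
```

In this model only the first call returns "localhost"; the other three reach the IPv4 candidate and return its PTR name.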
(In reply to yevgeny shnaidman from comment #21)
> @bgalvani is there a way to force NM to try all interface by
> external configuration ( i mean setting some flag in the NM config files),

No, NM is already trying all interfaces; the problem is that the first one returns a bogus result ("localhost") and NM stops there.

By the way, looking better at logs I see something strange. The kernel command line "ip=dhcp,dhcp6" is passed to configure networking in initrd. After switch root, when NM starts again the generated connection is no longer present and instead NM creates "default" connections for each interface found. This happens because the NetworkManager-config-server package is not installed, which sets "no-auto-default=*" to avoid the generation of "Wired Connection $x" connections.

It seems that the connection generated in initrd is lost when switching to the real root. Do you know why?

Is the intended behavior to activate all interfaces with both IPv4 and IPv6 autoconfiguration?

> or it is purely NM code fix?

Yes, I think the fix should be in NM.
(In reply to Beniamino Galvani from comment #22)
> (In reply to yevgeny shnaidman from comment #21)
> > @bgalvani is there a way to force NM to try all interface by
> > external configuration ( i mean setting some flag in the NM config files),
>
> No, NM is already trying all interfaces; the problem is that the first
> one returns a bogus result ("localhost") and NM stops there.
>
> By the way, looking better at logs I see something strange. The
> kernel command line "ip=dhcp,dhcp6" is passed to configure networking
> in initrd. After switch root, when NM starts again the generated
> connection is no longer present and instead NM creates "default"
> connections for each interface found. This happens because the
> NetworkManager-config-server package is not installed, which sets
> "no-auto-default=*" to avoid the generation of "Wired Connection $x"
> connections.
>
> It seems that the connection generated in initrd is lost when
> switching to the real root. Do you know why?
>
> Is the intended behavior to activate all interfaces with both IPv4 and
> IPv6 autoconfiguration?
>
> > or it is purely NM code fix?
>
> Yes, I think the fix should be in NM.

I see that NM is using 'Wired Connection' (meaning the default one) in initrd as well. Am I missing something? Are there any other nmconnection files?
In initrd NM is using the connection created by nm-initrd-generator:

policy: auto-activating connection 'Wired Connection' (7a87eb61-96a1-4f26-885c-f0e0b9fe2280)
policy: auto-activating connection 'Wired Connection' (7a87eb61-96a1-4f26-885c-f0e0b9fe2280)

In real root that connection is not present, and instead default-ethernet connections are created. They have a slightly different name and different UUIDs:

settings: (eno1): created default wired connection 'Wired connection 1'
settings: (eno2): created default wired connection 'Wired connection 2'
settings: (enp3s0f0): created default wired connection 'Wired connection 3'
settings: (enp3s0f1): created default wired connection 'Wired connection 4'

policy: auto-activating connection 'Wired connection 3' (511ecb79-8204-306e-9d53-37db64112a85)
policy: auto-activating connection 'Wired connection 4' (94fea715-43f3-30e1-8ee6-7e1bfee19c12)
(In reply to Beniamino Galvani from comment #24)
> In initrd NM is using the connection created by nm-initrd-generator:
>
> policy: auto-activating connection 'Wired Connection'
> (7a87eb61-96a1-4f26-885c-f0e0b9fe2280)
> policy: auto-activating connection 'Wired Connection'
> (7a87eb61-96a1-4f26-885c-f0e0b9fe2280)
>
> In real root that connection is not present, and instead default-ethernet
> connections are created. They have a slightly different name and different
> UUIDs:
>
> settings: (eno1): created default wired connection 'Wired connection 1'
> settings: (eno2): created default wired connection 'Wired connection 2'
> settings: (enp3s0f0): created default wired connection 'Wired connection 3'
> settings: (enp3s0f1): created default wired connection 'Wired connection 4'
>
> policy: auto-activating connection 'Wired connection 3'
> (511ecb79-8204-306e-9d53-37db64112a85)
> policy: auto-activating connection 'Wired connection 4'
> (94fea715-43f3-30e1-8ee6-7e1bfee19c12)

I think that in initrd the default.nmconnection is created by nm-initrd-generator, but when the switch to the real root is executed, coreos-copy-firstboot-network.service removes the default connection, which is why it is not present (at least this is my theory).
Bug https://bugzilla.redhat.com/show_bug.cgi?id=1970335 was created to track the effort in NetworkManager.
In RHEL 8.5 this problem is fixed by performing the DNS resolution in a way such that synthesized results (like 'localhost') are avoided. To do so, NetworkManager spawns a helper binary which configures the libc resolver to use only the 'dns' NSS module, and not the 'myhostname' one (see https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/601 ) For RHEL 8.4 this change is too invasive and so an alternative fix is to restore the preference for IPv4, as it was in RHEL 8.3. This seems anyway the right thing to do because it helps to guarantee predictability when there are multiple devices that autoconnect at boot.
Looks good. I tested twice, and in both cases all 6 cluster servers came up with proper hostnames (dhcp-10.xx.xx.xxx).

[core@dhcp-1-145-250 ~]$ sudo journalctl | grep 1959842
Jul 09 12:44:29 localhost NetworkManager[1381]: <info> [1625834669.6524] NetworkManager (version 1.30.0-9.1.bz1959842.el8_4) is starting... (for the first time)
Jul 09 12:44:29 localhost NetworkManager[1381]: <info> [1625834669.9740] Loaded device plugin: NMOvsFactory (/usr/lib64/NetworkManager/1.30.0-9.1.bz1959842.el8_4/libnm-device-plugin-ovs.so)
Jul 09 12:44:29 localhost NetworkManager[1381]: <info> [1625834669.9753] Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/1.30.0-9.1.bz1959842.el8_4/libnm-device-plugin-team.so)
Jul 09 12:44:29 localhost NetworkManager[1381]: <info> [1625834669.9775] settings: Loaded settings plugin: ifcfg-rh ("/usr/lib64/NetworkManager/1.30.0-9.1.bz1959842.el8_4/libnm-settings-plugin-ifcfg-rh.so")
@bgalvani FYI the fix seems to work. Tested with a custom build of RHCOS 48.84. Do you have an ETA for a release version?