Bug 1959842 - Localhost hostname after booting minimum-iso
Summary: Localhost hostname after booting minimum-iso
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: vemporop
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Platform
Depends On:
Blocks:
Reported: 2021-05-12 13:23 UTC by Ulrich Schlueter
Modified: 2021-10-11 07:43 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-11 07:43:02 UTC
Target Upstream Version:
Embargoed:


Description Ulrich Schlueter 2021-05-12 13:23:50 UTC
Description of problem:

In my current test with the 'latest' assisted-service operator version from today, May 12th, 9:00 UTC, 5 out of 6 servers don't get hostnames during the initial boot from the minimal-iso via virtual media. As a result, the servers register with hostname 'localhost' and the agent validation fails.

Setting the hostname for each server via
ssh core@<serverIP> sudo hostnamectl set-hostname <a valid hostname>
and restarting the agent.service lets the installation process continue.
(Restarting the agent service may not be required.)
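
When several servers are affected, the manual workaround above could be scripted. The following is only a minimal sketch: the function name `build_hostname_commands`, the IP-to-hostname mapping and the sample address are hypothetical and not part of the assisted installer. It only builds the ssh command lines and does not execute them.

```python
import shlex


def build_hostname_commands(hosts, restart_agent=True):
    """Return one ssh argv list per server for the manual workaround.

    `hosts` maps a server IP to the hostname to set (hypothetical input).
    """
    commands = []
    for ip, hostname in hosts.items():
        remote = f"sudo hostnamectl set-hostname {shlex.quote(hostname)}"
        if restart_agent:
            # Restarting agent.service may not be required.
            remote += " && sudo systemctl restart agent.service"
        commands.append(["ssh", f"core@{ip}", remote])
    return commands


if __name__ == "__main__":
    # Example address from a documentation range, not the real lab network.
    for argv in build_hostname_commands({"192.0.2.11": "worker-0.example.com"}):
        print(" ".join(shlex.quote(part) for part in argv))
```

Each returned list could be passed to `subprocess.run` once the mapping has been checked by hand.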

Hardware: Dell PowerEdge R340, iDRAC 9, 4.40.00.00

Target cluster version 4.8.0-fc-3

How reproducible:

- Create cluster resources on the assisted-service side
- Boot servers from minimum-iso



Actual results:

5 out of 6 servers have hostname localhost

Expected results:

All servers have a different hostname (typically following the dhcp-<part of the server IP>.<basedomain> pattern), so that the agent validation works

Additional info:

Comment 3 Ulrich Schlueter 2021-05-12 13:28:24 UTC
Attached two boot log files, one from a working server and one from a failing server.

https://bugzilla.redhat.com/show_bug.cgi?id=1956360 may or may not be related.

Comment 4 Michael Filanov 2021-05-13 14:30:12 UTC
Is there a different hostname configured in the Agent Spec?
If no hostname is defined, the host can come up with a localhost hostname.

Comment 5 Ronnie Lazar 2021-05-13 16:26:30 UTC
@otuchfel we need to understand whether this is indeed a problem, or just that the host is not assigned a hostname via DHCP.

Anyway, there is an easy workaround to explicitly set hostnames via our API.
No need to ssh into the hosts

Comment 6 Ulrich Schlueter 2021-05-14 14:22:31 UTC
Regarding whether this is just a host not being assigned a hostname via DHCP: the servers are part of a lab setup where the DHCP configuration has not changed since the initial setup in January. The servers get rebooted/rebuilt several times per week, and so far they came up with an automatically assigned hostname (dhcp-X-Y_Z.whateverthebasedomain). This is not the case when I use the assisted service operator with a 4.8.0-fc target cluster. I'm not saying this is a big problem, but it is certainly an unexpected and unwanted change of behaviour. What I'm hoping to get from this BZ is at least some hint of what changed to cause this different behaviour (installer change? 4.8.0-fc-3 bug? RHEL 8.4?)

Also: could you point out what setting the hostname via the API would look like?

Comment 7 vemporop 2021-05-20 12:57:33 UTC
You can set a desired hostname different from "localhost" via the UI by clicking a hostname on the Host discovery page and modifying the Requested hostname field.

Let me know if doing it via an API is more convenient for you - I will write a script that does it.

Comment 8 Ronnie Lazar 2021-05-20 13:50:33 UTC
@vemporop it is true that this can be configured by the user.
However, the bug still stands. There is DHCP that is sending hostnames to the hosts, and 1 of the hosts does not get/accept the hostname and uses localhost instead

Comment 10 Beniamino Galvani 2021-05-24 14:44:36 UTC
Please set level=TRACE in the [logging] section of /etc/NetworkManager/NetworkManager.conf, reboot and attach logs again for the machine that fails to obtain a hostname. If rebooting is not possible because this happens during installation, only restart the NetworkManager service.

Comment 11 vemporop 2021-05-24 15:09:09 UTC
Another BZ that might be related https://bugzilla.redhat.com/show_bug.cgi?id=1929160, but it seems we need NM logs at TRACE level to tell for sure.

Comment 12 Ronnie Lazar 2021-05-24 15:21:33 UTC
@vemporop I guess we need to instruct @uschlute how to set this flag so that he can recreate this

Comment 13 vemporop 2021-05-24 15:30:52 UTC
It's explained in comment 10 https://bugzilla.redhat.com/show_bug.cgi?id=1959842#c10

To enable TRACE:
/etc/NetworkManager/NetworkManager.conf already has a line #level=TRACE, it just needs to be uncommented.

To restart the NetworkManager service: 
sudo systemctl restart NetworkManager
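
For several machines, the uncommenting step could be automated. This is a minimal sketch, assuming the file contains the commented line exactly as described above; the function name `enable_trace` and the sample config text are illustrative. It works on the file contents as a string, so writing the result back (as root) and restarting NetworkManager remain separate steps.

```python
import re


def enable_trace(conf_text):
    """Uncomment a '#level=TRACE' line in NetworkManager.conf text.

    Lines that are already uncommented, and all other lines, are left as-is.
    """
    return re.sub(r"(?m)^[ \t]*#[ \t]*(level=TRACE)[ \t]*$", r"\1", conf_text)


if __name__ == "__main__":
    # Sample content is an assumption, not a copy of the real file.
    sample = "[main]\n\n[logging]\n#level=TRACE\n#domains=ALL\n"
    print(enable_trace(sample))
```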

Comment 17 Ulrich Schlueter 2021-05-31 12:01:54 UTC
@vemporop I attached three log files with NetworkManager traces: two from servers which came up with hostname localhost and one from a server with a regular hostname, all booted from the same ISO at the same time. The servers were not rebooted; only NetworkManager was restarted.

Comment 18 vemporop 2021-05-31 12:11:12 UTC
Thank you very much, Ulrich! 

@bgalvani PTAL at the trace logs attached by Ulrich Schlueter. Since it's a live ISO, servers could not be rebooted, only the NetworkManager was restarted.

Comment 19 Beniamino Galvani 2021-05-31 15:14:33 UTC
I think the problem is the following. There are two interfaces
enp3s0f0 and enp3s0f1, both with IPv4 and IPv6 default routes; it
seems that the DNS server doesn't return a hostname for the IPv6
address that is assigned to the interface (which makes sense since the
address is assigned via SLAAC).

If there are multiple interfaces active, NM starts with the ones
having the best (smallest metric) default route, and tries a reverse
lookup on the address.

When the interface with IPv6 is the first in the list (because it
was activated earlier), the DNS server returns no result. However,
/etc/nsswitch.conf contains:

  hosts:      files dns myhostname

so the glibc resolver calls the "myhostname" module which returns
"localhost". NM then uses "localhost" as the hostname, without trying
other interfaces.

A proper fix for this would be [1] i.e. to spawn a helper that forces
glibc to use only the "dns" NSS plugin.

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/877

I think what changed compared to 8.3 is that on 8.3 an interface with
the best-default-route IPv4 address would always be tried before one
with a best IPv6 address; now, the one activated earlier wins.

So an alternative would be to restore the preference for IPv4; this
would work for the network scenario in this bz (where the hostname can
only be resolved via IPv4), but probably not for others.

Another alternative would be to ignore the "localhost" hostname when
it's returned from lookup, and switch to the next interface.
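
The behaviour described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not NetworkManager or glibc code, the function names and addresses are made up, the "files" module is modelled as having no matching entry, and "myhostname" is modelled as always answering "localhost" as described above.

```python
def nss_reverse_lookup(address, dns_ptr_records,
                       modules=("files", "dns", "myhostname")):
    """Toy reverse lookup that walks the NSS 'hosts' modules in order."""
    for module in modules:
        if module == "dns" and address in dns_ptr_records:
            return dns_ptr_records[address]
        if module == "myhostname":
            # Synthesized answer when nothing else matched.
            return "localhost"
        # "files" is modelled as having no matching entry.
    return None


def pick_hostname(addresses, dns_ptr_records, skip_localhost=False,
                  modules=("files", "dns", "myhostname")):
    """Try addresses in order and stop at the first accepted result."""
    for address in addresses:
        name = nss_reverse_lookup(address, dns_ptr_records, modules)
        if name is None:
            continue
        if skip_localhost and name == "localhost":
            # Alternative fix: ignore "localhost" and try the next interface.
            continue
        return name
    return None


if __name__ == "__main__":
    ptr = {"10.0.0.5": "dhcp-0-5.example.com"}  # only IPv4 resolves
    v6_first = ["2001:db8::5", "10.0.0.5"]
    print(pick_hostname(v6_first, ptr))                       # failing case
    print(pick_hostname(v6_first, ptr, skip_localhost=True))  # skip localhost
    print(pick_hostname(v6_first, ptr, modules=("dns",)))     # dns-only helper
```

In this model the IPv6-first ordering yields "localhost", while skipping "localhost" or restricting the lookup to the "dns" module lets the IPv4 address supply the hostname.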

Comment 21 yevgeny shnaidman 2021-05-31 17:57:02 UTC
@bgalvani is there a way to force NM to try all interfaces via external configuration (I mean setting some flag in the NM config files), or is it purely an NM code fix?

Comment 22 Beniamino Galvani 2021-06-03 06:58:15 UTC
(In reply to yevgeny shnaidman from comment #21)
> @bgalvani is there a way to force NM to try all interface by
> external configuration ( i mean setting some flag in the NM config files),

No, NM is already trying all interfaces; the problem is that the first
one returns a bogus result ("localhost") and NM stops there.

By the way, looking more closely at the logs I see something strange. The
kernel command line "ip=dhcp,dhcp6" is passed to configure networking
in initrd. After switch root, when NM starts again the generated
connection is no longer present and instead NM creates "default"
connections for each interface found. This happens because the
NetworkManager-config-server package is not installed, which sets
"no-auto-default=*" to avoid the generation of "Wired Connection $x"
connections.

It seems that the connection generated in initrd is lost when
switching to the real root. Do you know why?

Is the intended behavior to activate all interfaces with both IPv4 and
IPv6 autoconfiguration?

> or it is purely NM code fix?

Yes, I think the fix should be in NM.

Comment 23 yevgeny shnaidman 2021-06-03 08:07:12 UTC
(In reply to Beniamino Galvani from comment #22)
> (In reply to yevgeny shnaidman from comment #21)
> > @bgalvani is there a way to force NM to try all interface by
> > external configuration ( i mean setting some flag in the NM config files),
> 
> No, NM is already trying all interfaces; the problem is that the first
> one returns a bogus result ("localhost") and NM stops there.
> 
> By the way, looking better at logs I see something strange. The
> kernel command line "ip=dhcp,dhcp6" is passed to configure networking
> in initrd. After switch root, when NM starts again the generated
> connection is no longer present and instead NM creates "default"
> connections for each interface found. This happens because the
> NetworkManager-config-server package is not installed, which sets
> "no-auto-default=*" to avoid the generation of "Wired Connection $x"
> connections.
> 
> It seems that the connection generated in initrd is lost when
> switching to the real root. Do you know why?
> 
> Is the intended behavior to activate all interfaces with both IPv4 and
> IPv6 autoconfiguration?
> 
> > or it is purely NM code fix?
> 
> Yes, I think the fix should be in NM.

I see that NM is using Wired Connection (meaning default) in initrd also. Am I missing something? Are there any other nmconnection files?

Comment 24 Beniamino Galvani 2021-06-03 08:39:04 UTC
In initrd NM is using the connection created by nm-initrd-generator:

  policy: auto-activating connection 'Wired Connection' (7a87eb61-96a1-4f26-885c-f0e0b9fe2280)
  policy: auto-activating connection 'Wired Connection' (7a87eb61-96a1-4f26-885c-f0e0b9fe2280)

In real root that connection is not present, and instead default-ethernet connections are created. They have a slightly different name and different UUIDs:

  settings: (eno1): created default wired connection 'Wired connection 1'
  settings: (eno2): created default wired connection 'Wired connection 2'
  settings: (enp3s0f0): created default wired connection 'Wired connection 3'
  settings: (enp3s0f1): created default wired connection 'Wired connection 4'

  policy: auto-activating connection 'Wired connection 3' (511ecb79-8204-306e-9d53-37db64112a85)
  policy: auto-activating connection 'Wired connection 4' (94fea715-43f3-30e1-8ee6-7e1bfee19c12)

Comment 25 yevgeny shnaidman 2021-06-03 08:54:29 UTC
(In reply to Beniamino Galvani from comment #24)
> In initrd NM is using the connection created by nm-initrd-generator:
> 
>   policy: auto-activating connection 'Wired Connection'
> (7a87eb61-96a1-4f26-885c-f0e0b9fe2280)
>   policy: auto-activating connection 'Wired Connection'
> (7a87eb61-96a1-4f26-885c-f0e0b9fe2280)
> 
> In real root that connection is not present, and instead default-ethernet
> connections are created. They have a slightly different name and different
> UUIDs:
> 
>   settings: (eno1): created default wired connection 'Wired connection 1'
>   settings: (eno2): created default wired connection 'Wired connection 2'
>   settings: (enp3s0f0): created default wired connection 'Wired connection 3'
>   settings: (enp3s0f1): created default wired connection 'Wired connection 4'
> 
>   policy: auto-activating connection 'Wired connection 3'
> (511ecb79-8204-306e-9d53-37db64112a85)
>   policy: auto-activating connection 'Wired connection 4'
> (94fea715-43f3-30e1-8ee6-7e1bfee19c12)

I think that in initrd the default.nmconnection is created by the nm-initrd-generator, but when the switch is executed, coreos-copy-firstboot-network.service removes the default connection, so that's why it is not present (at least this is my theory).

Comment 26 Gris Ge 2021-06-10 09:40:55 UTC
Bug https://bugzilla.redhat.com/show_bug.cgi?id=1970335 has been created to track the effort in NetworkManager.

Comment 27 Beniamino Galvani 2021-06-21 07:44:29 UTC
In RHEL 8.5 this problem is fixed by performing the DNS resolution in
a way such that synthesized results (like 'localhost') are avoided. To
do so, NetworkManager spawns a helper binary which configures the libc
resolver to use only the 'dns' NSS module, and not the 'myhostname'
one (see https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/601 )

For RHEL 8.4 this change is too invasive and so an alternative fix is
to restore the preference for IPv4, as it was in RHEL 8.3. This seems
the right thing to do anyway because it helps to guarantee
predictability when there are multiple devices that autoconnect at
boot.
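
The IPv4 preference can be pictured as a sort order over the candidate addresses. This is a hedged sketch, not NetworkManager code: the `Candidate` type, its field names and the exact sort key (IPv4 first, then default-route metric, then activation order) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    interface: str
    family: int      # 4 or 6
    metric: int      # default-route metric, smaller is better
    activation: int  # order in which the device activated


def order_with_ipv4_preference(candidates):
    """Sort so IPv4 candidates are tried before IPv6 ones.

    Within a family, a smaller metric wins, then earlier activation.
    """
    return sorted(candidates,
                  key=lambda c: (c.family != 4, c.metric, c.activation))


if __name__ == "__main__":
    candidates = [
        Candidate("enp3s0f1", 6, 100, 0),  # activated first, IPv6
        Candidate("enp3s0f0", 4, 101, 1),  # activated later, IPv4
    ]
    for c in order_with_ipv4_preference(candidates):
        print(c.interface, f"IPv{c.family}")
```

With such an ordering the result no longer depends on which device happens to activate first, which is the predictability argument made above.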

Comment 33 Ulrich Schlueter 2021-07-09 12:56:39 UTC
Looks good; tested twice, and in both cases the 6 cluster servers came up with proper hostnames (dhcp-10.xx.xx.xxx)


[core@dhcp-1-145-250 ~]$ sudo journalctl | grep 1959842
Jul 09 12:44:29 localhost NetworkManager[1381]: <info>  [1625834669.6524] NetworkManager (version 1.30.0-9.1.bz1959842.el8_4) is starting... (for the first time)
Jul 09 12:44:29 localhost NetworkManager[1381]: <info>  [1625834669.9740] Loaded device plugin: NMOvsFactory (/usr/lib64/NetworkManager/1.30.0-9.1.bz1959842.el8_4/libnm-device-plugin-ovs.so)
Jul 09 12:44:29 localhost NetworkManager[1381]: <info>  [1625834669.9753] Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/1.30.0-9.1.bz1959842.el8_4/libnm-device-plugin-team.so)
Jul 09 12:44:29 localhost NetworkManager[1381]: <info>  [1625834669.9775] settings: Loaded settings plugin: ifcfg-rh ("/usr/lib64/NetworkManager/1.30.0-9.1.bz1959842.el8_4/libnm-settings-plugin-ifcfg-rh.so")

Comment 34 vemporop 2021-07-09 13:07:35 UTC
@bgalvani FYI the fix seems to work. Tested with a custom build of RHCOS 48.84. Do you have an ETA for a release version?

