Bug 1929160 - Node registered as localhost.localdomain in a baremetal IPI cluster
Summary: Node registered as localhost.localdomain in a baremetal IPI cluster
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ben Nemec
QA Contact: Amit Ugol
URL:
Whiteboard:
Duplicates: 1950763 1990369
Depends On:
Blocks:
 
Reported: 2021-02-16 10:30 UTC by Sabina Aledort
Modified: 2023-09-18 00:24 UTC
CC List: 16 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-09 21:23:54 UTC
Target Upstream Version:
Embargoed:


Attachments
journalctl.4.8.log (4.99 MB, text/plain) - 2021-04-22 10:18 UTC, Sabina Aledort
journalctl_local_host_bm.log (10.66 MB, text/plain) - 2021-05-13 09:37 UTC, Sabina Aledort
node-valid-hostname.service..bm.log (1.91 KB, text/plain) - 2021-05-13 09:38 UTC, Sabina Aledort
NetworkManager-wait-online.service.bm.log (1.66 KB, text/plain) - 2021-05-13 09:41 UTC, Sabina Aledort
MachineConfig workaround (3.35 KB, text/plain) - 2021-05-18 13:43 UTC, Sabina Aledort
base64 MC script (623 bytes, text/plain) - 2021-05-18 13:45 UTC, Sabina Aledort
workaround service log (1.74 KB, text/plain) - 2021-05-18 13:46 UTC, Sabina Aledort


Links
Github openshift machine-config-operator pull 2488 (Merged): Bug 1929160: Add NetworkManager reload to resolv-prepender (last updated 2022-11-07 14:48:03 UTC)

Description Sabina Aledort 2021-02-16 10:30:23 UTC
Version: 4.7.0-0.nightly-2021-02-13-071408

$ openshift-install version
4.7.0-0.nightly-2021-02-13-071408

Platform:
baremetal

Please specify: IPI

What happened?

When installing a new baremetal IPI cluster, one of the masters came up as localhost.localdomain.

[root@cnfdd5-installer ~]# oc get node
NAME                                             STATUS   ROLES            AGE   VERSION
cnfdd5.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           9h    v1.20.0+ba45583
cnfdd6.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           9h    v1.20.0+ba45583
cnfdd7.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           9h    v1.20.0+ba45583
dhcp19-17-116.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   10h   v1.20.0+ba45583
dhcp19-17-117.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   10h   v1.20.0+ba45583
localhost.localdomain                            Ready    master,virtual   10h   v1.20.0+ba45583

What did you expect to happen?

All 3 masters should appear with their proper FQDN.

Comment 2 Dmitry Tantsur 2021-02-22 13:18:40 UTC
I see the correct host name received by ironic-inspector:

> 'system_vendor': {'product_name': 'KVM', 'serial_number': '', 'manufacturer': 'Red Hat'}, 'boot': {'current_boot_mode': 'bios', 'pxe_interface': 'aa:aa:aa:aa:aa:02'}, 'hostname': 'dhcp19-17-115.clus2.t5g.lab.eng.bos.redhat.com'}

No mentions of localhost.localdomain in the ironic-inspector logs. The problem must be higher up the stack.

Is this problem consistently reproducible? Do you have a reproducer?

Comment 3 Dmitry Tantsur 2021-02-22 13:29:25 UTC
There are a lot of error messages in the main logs, but I don't have enough knowledge to understand them. Since, at first impression at least, the bare metal components appear to be working normally, I'm passing this to the Metal Installer team for further triage.

Comment 4 Steven Hardy 2021-02-23 17:21:25 UTC
To debug this further, can you please provide the full journal log from the node that registered as localhost.localdomain?

This should tell us if we timed out waiting for a valid hostname via DHCP/DNS.

When logging into the node via SSH, is the hostname set correctly?
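For reference, a minimal sketch of how that data can be collected on the affected node (this assumes SSH access as the core user; these are standard systemd/journal commands, not anything specific to this bug):

$ hostnamectl                                                            # static vs. transient hostname
$ journalctl -b -u NetworkManager --no-pager | grep -i set-hostname      # when and what NM set the hostname to
$ journalctl -b --no-pager > /tmp/journal-$(hostname).log                # full journal to attach here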

Comment 5 Ben Nemec 2021-02-23 18:21:52 UTC
Looking through the pod logs for one of our services, it appears the node did eventually get a hostname, but it took significantly longer than the other nodes that registered normally. We'll still need journal logs to see why it took so long, but that supports the idea that it just took too long for the hostname to be assigned.

Comment 6 Sabina Aledort 2021-03-01 15:59:37 UTC
Hey,

I don't have a way to reproduce this. It didn't happen when redeploying the cluster, so I can't provide more logs.

Comment 7 Riccardo Pittau 2021-03-02 17:05:01 UTC
Please reopen this if you see it happening again, and provide more logs if possible. Thanks!

Comment 8 Sabina Aledort 2021-03-08 10:57:08 UTC
Hi,

It happened again with the custom version registry.ci.openshift.org/rhcos-devel/rhel4784:4.7.0-rc.2

[root@cnfdd3-installer ~]# oc get node -o wide
NAME                                             STATUS   ROLES            AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION          CONTAINER-RUNTIME
cnfdd3.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           16h   v1.20.0+bd9e442   10.19.16.100   <none>        Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)   4.18.0-287.el8.x86_64   cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52
dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   16h   v1.20.0+bd9e442   10.19.17.118   <none>        Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)   4.18.0-287.el8.x86_64   cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52
dhcp19-17-128.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   16h   v1.20.0+bd9e442   10.19.17.128   <none>        Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)   4.18.0-287.el8.x86_64   cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52
dhcp19-17-199.clus2.t5g.lab.eng.bos.redhat.com   Ready    worker           16h   v1.20.0+bd9e442   10.19.17.199   <none>        Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)   4.18.0-287.el8.x86_64   cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52
localhost                                        Ready    master,virtual   16h   v1.20.0+bd9e442   10.19.17.102   <none>        Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)   4.18.0-287.el8.x86_64   cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52

[root@cnfdd3-installer ~]# oc version
Client Version: 4.7.0-rc.2
Server Version: 4.7.0-rc.2
Kubernetes Version: v1.20.0+bd9e442

must-gather log can be downloaded from:
https://drive.google.com/file/d/1bNOWwQojhLO_jgM64FV-vK6BNFj9aJva/view?usp=sharing

The cluster is up, so I can get more logs or share access.

Comment 9 Ben Nemec 2021-03-08 22:17:05 UTC
The main thing we need is the full journal log, so we can see the timing of the hostname changes. If you can give me access to the cluster, that would work as well.

Comment 11 Ben Nemec 2021-03-10 16:19:27 UTC
Okay, I see what happened, but I don't know why the behavior differed on the one node. We might have to talk to the NM people.

On the broken node, it lost its hostname when ens4 was disconnected:

Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1329]: + nmcli device disconnect ens4
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info>  [1615137235.4265] device (ens4): state change: activated -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info>  [1615137235.4290] manager: NetworkManager state is now CONNECTED_LOCAL
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info>  [1615137235.4299] audit: op="device-disconnect" interface="ens4" ifindex=3 pid=1501 uid=0 result="success"
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info>  [1615137235.4301] device (ens4): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info>  [1615137235.4421] dhcp4 (ens4): canceled DHCP transaction
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info>  [1615137235.4421] dhcp4 (ens4): state changed bound -> done
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info>  [1615137235.4452] policy: set-hostname: set hostname to 'localhost' (from address lookup)
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1329]: Device 'ens4' successfully disconnected.
Mar 07 17:13:55 localhost systemd-hostnamed[1309]: Changed host name to 'localhost'

On a working node, that doesn't happen:

Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1335]: + nmcli device disconnect ens4
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info>  [1615137236.5215] device (ens4): state change: activated -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info>  [1615137236.5223] manager: NetworkManager state is now CONNECTED_LOCAL
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info>  [1615137236.5226] audit: op="device-disconnect" interface="ens4" ifindex=3 pid=1497 uid=0 result="success"
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info>  [1615137236.5232] device (ens4): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: Created slice machine.slice.
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: Started libpod-conmon-73a8431e8ca15e3224a6a37cc045ef0ff048ebef3b571d2a831326f5327a8023.scope.
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info>  [1615137236.5323] dhcp4 (ens4): canceled DHCP transaction
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info>  [1615137236.5324] dhcp4 (ens4): state changed bound -> done
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: tmp-crun.b5jjEX.mount: Succeeded.
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1335]: Device 'ens4' successfully disconnected.
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1335]: + nmcli connection show ovs-if-phys0
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: Started libcrun container.
[the hostname never goes back to localhost]

I thought maybe one was getting the hostname from DHCP and the other from a DNS reverse lookup, but they both log the same thing when setting the hostname:
NetworkManager[1303]: <info>  [1615137234.8826] policy: set-hostname: set hostname to 'dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com' (from address lookup)
and
NetworkManager[1309]: <info>  [1615137236.0151] policy: set-hostname: set hostname to 'dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com' (from address lookup)

I do see a difference in what the dispatcher scripts are getting though:
Mar 07 17:13:54 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com root[1327]: Hostname changed: localhost
Mar 07 17:13:54 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[1313]: <13>Mar  7 17:13:54 root: Hostname changed: localhost
vs.
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com root[1334]: Hostname changed: dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[1319]: <13>Mar  7 17:13:56 root: Hostname changed: dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com

I'm a bit confused why the broken node is getting a localhost hostname when it clearly has a hostname at that point. Maybe it got an empty hostname, which triggered https://github.com/openshift/machine-config-operator/blob/82868e63176fee2bc806c1deb308ed1fc8965d84/templates/common/on-prem/files/NetworkManager-mdns-hostname.yaml#L13 ?

I don't think this is the underlying problem, though, because this dispatcher script only deals with mdns-publisher. At worst it would cause mdns-publisher to hang waiting on its init container (which is, in fact, happening on the broken node). I think what we need to figure out is why the one node drops its hostname when configure-ovs.sh runs and moves the configuration to the bridge.
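For context, the linked dispatcher file is not reproduced in this bug; the pattern it implements is roughly the following hedged sketch (the placeholder check and the mdns-publisher signalling are illustrative only, not the actual file contents):

# NetworkManager dispatcher scripts receive the interface as $1 and the event as $2.
if [ "$2" = "hostname" ]; then
    current=$(hostname)
    # Skip empty or placeholder hostnames; mdns-publisher waits until a real name exists.
    if [ -z "$current" ] || [ "$current" = "localhost" ] || [ "$current" = "localhost.localdomain" ]; then
        exit 0
    fi
    # ...otherwise record the new hostname so the mdns-publisher init container can proceed...
fi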

I grabbed journal logs from the bad node and one of the good ones if anyone wants to look at them. I think they're a bit big to attach to the bz.

Comment 12 Beniamino Galvani 2021-03-10 17:17:34 UTC
> Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info>  [1615137235.4452] policy: set-hostname: set hostname to 'localhost' (from address lookup)
> Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1329]: Device 'ens4' successfully disconnected.

Here the reverse address lookup of the address present on one interface returns 'localhost', but it's hard to tell why. Would it be possible to have NM logs at trace level?
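For completeness, one hedged way to get trace-level NM logs without rebuilding images (the MCO-based change mentioned in the next comment achieves the same thing at deploy time):

$ nmcli general logging level TRACE domains ALL      # raise the log level at runtime

# or persistently, surviving reboots:
$ cat <<'EOF' > /etc/NetworkManager/conf.d/99-trace-logging.conf
[logging]
level=TRACE
domains=ALL
EOF
$ systemctl restart NetworkManager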

Comment 13 Ben Nemec 2021-03-10 18:09:22 UTC
Sabina, would you be able to deploy this environment with the MCO change in https://github.com/cybertron/machine-config-operator/commit/144e0db6a393cd93982b61c05e68ef95a944d95b ? That will enable the NetworkManager trace logs.

Let me know if you need me to build a release or MCO image for it.

Comment 14 Sabina Aledort 2021-03-15 12:17:06 UTC
(In reply to Ben Nemec from comment #13)
> Sabina, would you be able to deploy this environment with the MCO change in
> https://github.com/cybertron/machine-config-operator/commit/
> 144e0db6a393cd93982b61c05e68ef95a944d95b ? That will enable the
> NetworkManager trace logs.
> 
> Let me know if you need me to build a release or MCO image for it.

I recreated the cluster with the MCO change.

[root@cnfdd3-installer ~]# oc get node
NAME                                             STATUS   ROLES            AGE    VERSION
cnfdd3.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           60m    v1.20.0+bd9e442
dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   164m   v1.20.0+bd9e442
dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   164m   v1.20.0+bd9e442
dhcp19-17-82.clus2.t5g.lab.eng.bos.redhat.com    Ready    worker           71m    v1.20.0+bd9e442
localhost                                        Ready    master,virtual   159m   v1.20.0+bd9e442

Comment 16 Beniamino Galvani 2021-03-16 20:43:53 UTC
Hi, I have analyzed the log from "Mar 15 09:30:15" to "Mar 15 09:30:19"
and there seems to be a race condition there.

After ens4 gets added to the ovs bridge and the ovs interface gets an
address via DHCP, NM tries to resolve the address on br-ex to get a
hostname. However at that time resolv.conf doesn't contain the new
nameservers because NM is configured with rc=unmanaged and so the DNS
resolution fails.

resolv.conf is updated later by the "30-resolv-prepender" dispatcher
script on the "up" event; however NM doesn't know that it should retry
the reverse DNS lookup. Therefore, the hostname stays set to
"localhost".

I have to do some test about this, but it's possible that if the
dispatcher script sends a SIGHUP to NM, that will trigger a DNS
reconfiguration which will also cause a new DNS lookup to
start. Probably this commit [1] needs to be backported, to restart
the lookup even if resolv.conf is unmanaged.

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/1c0932a6e66880f5b4c92fcd2d13cbba29238a14
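A minimal sketch of that idea, for illustration only (the actual change proposed in the next comment landed in the MCO resolv-prepender, PR 2488): after the "30-resolv-prepender" dispatcher script rewrites /etc/resolv.conf, nudge NM so it re-runs the reverse DNS lookup.

# appended at the end of the dispatcher script's "up" handling (sketch only):
systemctl reload NetworkManager          # roughly equivalent to sending SIGHUP
# or, bypassing systemd:
# kill -HUP "$(pidof NetworkManager)"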

Comment 17 Ben Nemec 2021-03-24 21:11:13 UTC
Okay, I've written a patch to reload NetworkManager: https://github.com/openshift/machine-config-operator/pull/2488

I think that should accomplish the same thing as SIGHUP, but let me know if I'm mistaken. It works in my environment, but I've never run into this bug so that doesn't mean a whole lot.

Sabina, could you try deploying with that patch in your environment where this reproduces? Thanks.

Comment 18 Sabina Aledort 2021-03-25 13:53:27 UTC
(In reply to Ben Nemec from comment #17)
> Okay, I've written a patch to reload NetworkManager:
> https://github.com/openshift/machine-config-operator/pull/2488
> 
> I think that should accomplish the same thing as SIGHUP, but let me know if
> I'm mistaken. It works in my environment, but I've never run into this bug
> so that doesn't mean a whole lot.
> 
> Sabina, could you try deploying with that patch in your environment where
> this reproduces? Thanks.

I recreated the cluster with the NetworkManager patch and it looks good.

[root@cnfdd3-installer ~]# oc get node
NAME                                             STATUS   ROLES            AGE     VERSION
cnfdd3.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           4m10s   v1.20.0+ba45583
dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   37m     v1.20.0+ba45583
dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   37m     v1.20.0+ba45583
dhcp19-17-128.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   38m     v1.20.0+ba45583
dhcp19-17-147.clus2.t5g.lab.eng.bos.redhat.com   Ready    worker           18m     v1.20.0+ba45583

Comment 19 Ben Nemec 2021-03-25 15:24:30 UTC
Great, thanks!

Comment 20 Arda Guclu 2021-04-20 16:17:57 UTC
*** Bug 1950763 has been marked as a duplicate of this bug. ***

Comment 21 Sabina Aledort 2021-04-22 10:17:19 UTC
Hi, 

I just created a cluster with the patch (https://github.com/openshift/machine-config-operator/pull/2488) on a 4.8 nightly, but hit the localhost issue again.
Attached is the journalctl log from the localhost node.

[root@cnfdd5-installer ~]# oc get node
NAME                                             STATUS   ROLES    AGE   VERSION
dhcp19-17-116.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   23m   v1.21.0-rc.0+3ced7a9
dhcp19-17-117.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   24m   v1.21.0-rc.0+3ced7a9
localhost                                        Ready    master   19m   v1.21.0-rc.0+3ced7a9

[root@cnfdd5-installer ~]# oc version
Client Version: 4.8.0-0.nightly-2021-04-22-013545
Server Version: 4.8.0-0.nightly-2021-04-22-013545
Kubernetes Version: v1.21.0-rc.0+3ced7a9

Comment 22 Sabina Aledort 2021-04-22 10:18:12 UTC
Created attachment 1774447 [details]
journalctl.4.8.log

Comment 23 Yuval Kashtan 2021-04-27 07:01:04 UTC
According to the above, shouldn't this BZ be moved back to ASSIGNED status?

Comment 24 Ben Nemec 2021-04-27 15:29:52 UTC
Hmm, I see that the reload happened, but it doesn't seem to have set the hostname:

Apr 22 09:29:23 localhost nm-dispatcher[1318]: NM resolv-prepender: Prepending 'nameserver 10.19.17.115' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)
Apr 22 09:29:23 localhost systemd[1]: Reloading Network Manager.
Apr 22 09:29:23 localhost NetworkManager[1304]: <info>  [1619083763.6915] audit: op="reload" arg="0" pid=2991 uid=0 result="success"
Apr 22 09:29:23 localhost NetworkManager[1304]: <info>  [1619083763.6919] config: signal: SIGHUP (no changes from disk)
Apr 22 09:29:23 localhost systemd[1]: Reloaded Network Manager.

Beniamino, is it possible that the "no changes from disk" part would have prevented NM from re-doing the lookup? If that's not it, I assume you'll need trace logs to investigate further.

Comment 25 Beniamino Galvani 2021-04-28 16:53:11 UTC
I think I know what the problem is. Upon a reload/SIGHUP, with
dns=none the dns-manager currently doesn't emit a CONFIG_CHANGED signal
and therefore NMPolicy doesn't restart the DNS lookup.

The solution would be for SIGHUP to always force a restart of the DNS
lookup, and possibly also to add a new reload flag (or Unix signal) to
explicitly retry resolving the hostname.

I'll try to prepare a patch for that.

Comment 27 Yuval Kashtan 2021-05-03 19:25:41 UTC
Is there a downstream BZ that can be marked as blocking #1929160?

Comment 28 Yuval Kashtan 2021-05-04 13:36:58 UTC
We are still hitting this issue.
I think it should be moved back to 'ASSIGNED'.

Comment 32 Ben Nemec 2021-05-04 20:39:01 UTC
At this point in the cycle I think we need to consider workarounds until the NM fix is available. Even once that merges it will take some time to show up in our images. One option is to send the hostname in the DHCP response so DNS is not used to set the hostname at all. That requires the deployment infrastructure to be changed, but it should entirely eliminate this.

I'm unsure what else we could safely do. I assume a complete restart of NM would force the hostname lookup again, but triggering that from a dispatcher script run by NM seems a bit dangerous. Maybe we could schedule a restart for after the script has finished?
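A hedged sketch of what "schedule a restart for after the script has finished" could look like (illustrative only; this was not the approach ultimately taken): use a transient systemd timer so the restart runs outside the dispatcher's own execution.

# from inside the dispatcher script: do not restart NM directly, queue it instead
systemd-run --on-active=5s --unit=nm-deferred-restart \
    systemctl restart NetworkManager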

Comment 33 Sabina Aledort 2021-05-05 11:10:02 UTC
This is also happening with BM worker nodes after reboot. The node comes up from the reboot with the 'localhost' hostname.
https://bugzilla.redhat.com/show_bug.cgi?id=1956360

Comment 34 Yuval Kashtan 2021-05-10 16:19:14 UTC
@Ben Nemec
We tried that workaround, and the problem persists. :-(

Comment 35 Yuval Kashtan 2021-05-10 16:25:29 UTC
@Ben Nemec
We tried that workaround (DHCP option 12, so DHCP sends the hostname), and the problem persists. :-(

Comment 36 Ben Nemec 2021-05-11 16:47:47 UTC
(In reply to Yuval Kashtan from comment #35)
> @Ben Nemec
> We tried that workaround (DHCP option 12, so DHCP sends the hostname), and
> the problem persists. :-(

In that case there's a different issue because DNS shouldn't be involved at all. Can you provide logs from a test run with the hostname provided by DHCP?

Comment 37 Sabina Aledort 2021-05-13 09:36:03 UTC
Attached is the log from a BM node that comes up from a reboot as localhost.
The hostname is set in the DHCP configuration:

host cnfdd5.clus2.t5g.lab.eng.bos.redhat.com {
                hardware ethernet 0c:42:a1:55:e4:ce;
                # hardware ethernet 0c:42:a1:55:e4:cf;
                # hardware ethernet 40:a6:b7:17:57:80;
                # hardware ethernet 40:a6:b7:17:57:81;
                # hardware ethernet 40:a6:b7:17:43:d0;
                # hardware ethernet 40:a6:b7:17:43:d1;
                fixed-address cnfdd5.clus2.t5g.lab.eng.bos.redhat.com;
        }


[root@cnfdd5-installer ~]# oc get node
NAME                                               STATUS                        ROLES               AGE    VERSION
cnfdd5-master-0.clus2.t5g.lab.eng.bos.redhat.com   Ready                         master,virtual      142m   v1.21.0-rc.0+41625cd
cnfdd5-master-1.clus2.t5g.lab.eng.bos.redhat.com   Ready                         master,virtual      141m   v1.21.0-rc.0+41625cd
cnfdd5-master-2.clus2.t5g.lab.eng.bos.redhat.com   Ready                         master,virtual      142m   v1.21.0-rc.0+41625cd
cnfdd5.clus2.t5g.lab.eng.bos.redhat.com            NotReady,SchedulingDisabled   worker,worker-cnf   85m    v1.21.0-rc.0+41625cd
cnfdd7.clus2.t5g.lab.eng.bos.redhat.com            Ready                         worker,worker-cnf   95m    v1.21.0-rc.0+41625cd
cnfdd8.clus2.t5g.lab.eng.bos.redhat.com            Ready                         worker              92m    v1.21.0-rc.0+41625cd
[root@cnfdd5-installer ~]# ssh core.t5g.lab.eng.bos.redhat.com
Red Hat Enterprise Linux CoreOS 48.84.202105121453-0
  Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html

---
Last login: Thu May 13 09:29:07 2021 from 10.19.17.175
[systemd]
Failed Units: 2
  NetworkManager-wait-online.service
  node-valid-hostname.service
[core@localhost ~]$

Comment 38 Sabina Aledort 2021-05-13 09:37:32 UTC
Created attachment 1782672 [details]
journalctl_local_host_bm.log

Comment 39 Sabina Aledort 2021-05-13 09:38:46 UTC
Created attachment 1782673 [details]
node-valid-hostname.service..bm.log

Comment 40 Sabina Aledort 2021-05-13 09:41:37 UTC
Created attachment 1782686 [details]
NetworkManager-wait-online.service.bm.log

Comment 41 Yuval Kashtan 2021-05-18 10:02:33 UTC
Hi Beth/ Beniamino
Has anyone tried to test whether the PR mentioned at https://bugzilla.redhat.com/show_bug.cgi?id=1929160#c26
actually resolves the issue here?

Comment 42 Sabina Aledort 2021-05-18 13:41:23 UTC
Hi,

We were able to apply a workaround that sets the hostname of a node if it came up as 'localhost'.
We added a MachineConfig that runs before kubelet and sets the correct hostname on the node if it got 'localhost'.
We were able to get a proper, full CI run with that. The nodes are rebooted several times during our CI run, so we could see the hostname being fixed in the log.

Attached are the MachineConfig, the base64 script, and the service log from the node.
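Since the attachments themselves are not inlined here, a rough sketch of what such a pre-kubelet hostname fixer can look like (assumptions: IPv4, the node's address sits on the default-route interface, and reverse DNS returns the FQDN; the attached script may differ):

#!/bin/bash
# Sketch only -- the real workaround is in the attached base64 MC script.
set -eu
current=$(hostnamectl --transient)
# Nothing to do if the node already has a real hostname.
if [ "$current" != "localhost" ] && [ "$current" != "localhost.localdomain" ]; then
    exit 0
fi
# Address on the default-route interface (assumption: IPv4).
addr=$(ip -4 route get 1.1.1.1 | awk '{for (i=1; i<NF; i++) if ($i == "src") {print $(i+1); exit}}')
# Reverse DNS lookup via getent, then set the hostname explicitly.
fqdn=$(getent hosts "$addr" | awk '{print $2; exit}')
if [ -n "${fqdn:-}" ]; then
    hostnamectl set-hostname "$fqdn"
fi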

Comment 43 Sabina Aledort 2021-05-18 13:43:47 UTC
Created attachment 1784451 [details]
MachineConfig workaround

Comment 44 Sabina Aledort 2021-05-18 13:45:20 UTC
Created attachment 1784452 [details]
base64 MC script

Comment 45 Sabina Aledort 2021-05-18 13:46:20 UTC
Created attachment 1784453 [details]
workaround service log

Comment 46 Ben Nemec 2021-05-18 14:45:36 UTC
I don't think the DHCP configuration is correct. In the journal I see: "set hostname to 'cnfdd5.clus2.t5g.lab.eng.bos.redhat.com' (from address lookup)". The "(from address lookup)" part suggests it is still using DNS. I think the problem is that fixed-address is not the correct way to specify the hostname. It looks like you need:

option host-name "cnfdd5.clus2.t5g.lab.eng.bos.redhat.com";

in the host block for the node. Can you try that and see if it still happens?

Comment 47 Beth White 2021-06-09 11:57:10 UTC
@ykashtan is this issue now resolved with the workaround provided?

Comment 48 Yuval Kashtan 2021-06-09 12:00:39 UTC
As we wrote above, we are using our own workaround (as stated in comment #42), which works for us.

But I guess we still need something that is part of the product and supported.

Comment 50 Ben Nemec 2021-08-31 17:39:42 UTC
The new nmcli command is now available in our host images, so we should be able to write a patch to fix this now. However, I don't think it's going to make 4.9 code freeze. We should be able to backport a fix though since the necessary functionality will be available in 4.9 images.

Comment 51 Yuval Kashtan 2021-08-31 18:10:40 UTC
IMHO it's extremely important that we do; the workaround MC is not trivial.

Comment 52 Ben Nemec 2021-09-14 20:13:22 UTC
I have updated the linked PR to use the new nmcli command. Can someone test this in an environment where the bug reproduced to see if it fixes the problem? Thanks.

Comment 53 Yuval Kashtan 2021-09-14 21:01:40 UTC
This just happened again in a different lab,
so I guess we can easily reproduce it (and it still happens in our d/s CI system).

Keeping the needinfo for now, as we are going on holidays here in Israel until the end of the month, so I don't think I'll be able to test it earlier.

Comment 54 Ben Nemec 2021-10-14 14:40:42 UTC
*** Bug 1990369 has been marked as a duplicate of this bug. ***

Comment 55 Ben Nemec 2021-10-15 16:01:33 UTC
Hi Yuval, now that everyone is back in the office, would you have a chance to test the patch for this bug?

Comment 56 Yuval Kashtan 2021-12-08 08:17:14 UTC
I've tried to reproduce the issue several times with the latest 4.10 nightly,
and I can't.

I don't know if it is because of one of the latest NetworkManager patches in RHEL or because something changed in my lab network environment.

I will remove the workaround from all our deployments, including CI (which is running in the same lab),
so we'll know if it ever returns.

Comment 57 Yuval Kashtan 2022-01-31 08:31:25 UTC
This is still happening in the latest 4.10.

I just got this again after applying an MC to the masters;
one of the masters rebooted as localhost, wreaking havoc in the cluster. :-(

Comment 58 Ben Nemec 2022-02-01 15:50:47 UTC
That's because https://github.com/openshift/machine-config-operator/pull/2488 hasn't merged. We either need to verify that it fixes the problem or just merge it and see if this keeps happening.

Comment 61 Denis Ollier 2022-02-21 09:34:24 UTC
Hi,

Although I'm not sure if it's the same issue, I'm observing the same symptoms when trying to deploy a BM-IPI cluster using the OVNKubernetes backend (required for dual-stack IPv4/IPv6 support).

All worker nodes are registering as `localhost.localdomain` and their CSRs are never approved.

I'm facing this issue with OCP 4.10 and 4.11 nightlies using OVNKubernetes.

OCP 4.9 + OVNKubernetes is working fine and OCP 4.10/4.11 + OpenShiftSDN is also working properly. 

Could you please assist?

Comment 63 Ben Nemec 2022-02-22 22:29:00 UTC
Where are the hostnames coming from in your environment? DHCPv4, DHCPv6, reverse DNS on v4 or v6?

The CNI plugin has little to no influence on the hostname, so I suspect the reason OSDN works is that it only supports IPv4, and adding IPv6 to the environment is breaking something. There are known issues with reverse DNS lookup on IPv6, so that's my best guess as to what might be going on here.

If you're not setting hostnames via DHCP then I suggest doing that. Reverse DNS has proven to be an issue, and while there are some fixes on the way, I don't know when they will land in our images; there are also additional problems with no identified fix yet.

Comment 68 Victor Voronkov 2022-03-10 07:25:03 UTC
While the bugfix is included in the accepted release 4.11.0-0.nightly-2022-02-16-085151,
the issue remains:

[root@cnfdf07-installer ~]# oc version
Client Version: 4.11.0-0.nightly-2022-02-16-211105
Server Version: 4.11.0-0.nightly-2022-02-16-211105
Kubernetes Version: v1.23.3+f14faf2

[root@cnfdf07-installer ~]# oc get node
NAME                                           STATUS   ROLES    AGE   VERSION
dhcp-8-34-208.telco5gran.eng.rdu2.redhat.com   Ready    master   64m   v1.23.3+2e8bad7
dhcp-8-34-219.telco5gran.eng.rdu2.redhat.com   Ready    master   63m   v1.23.3+2e8bad7
localhost.localdomain                          Ready    master   59m   v1.23.3+2e8bad7

Comment 69 Bob Fournier 2022-03-10 12:17:31 UTC
Reassigning to Ben as he's been working on this.

Comment 70 Ben Nemec 2022-03-14 16:11:27 UTC
Can we capture trace logs again as described in https://bugzilla.redhat.com/show_bug.cgi?id=1929160#c13? We're most likely going to need them to debug this with the NM team.

Comment 71 Yuval Kashtan 2022-03-14 20:22:49 UTC
and I'll delegate to my team member

Comment 72 Ben Nemec 2022-03-21 22:03:56 UTC
Can we get some follow-up on this? We have a report in https://bugzilla.redhat.com/show_bug.cgi?id=2058030 that this fixed part of the problem, and it would be nice to get it backported. We need to figure out why verification failed first, though.

Comment 73 Yuval Kashtan 2022-03-21 22:39:57 UTC
The env where we had 100% success in recreating the issue is currently in use for another urgent issue.
Once that is resolved, I'll get hold of it and recreate the issue for you,
so you can examine and test it as you see fit.

Comment 74 Michael Gourin 2022-06-01 13:59:33 UTC
Got the same issue with 4.9.35.

Comment 79 Red Hat Bugzilla 2023-09-18 00:24:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

