Version: 4.7.0-0.nightly-2021-02-13-071408

$ openshift-install version
4.7.0-0.nightly-2021-02-13-071408

Platform: baremetal
Please specify: IPI

What happened?
When installing a new baremetal IPI cluster, one of the masters came up as localhost.localdomain.

[root@cnfdd5-installer ~]# oc get node
NAME                                             STATUS   ROLES            AGE   VERSION
cnfdd5.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           9h    v1.20.0+ba45583
cnfdd6.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           9h    v1.20.0+ba45583
cnfdd7.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           9h    v1.20.0+ba45583
dhcp19-17-116.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   10h   v1.20.0+ba45583
dhcp19-17-117.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   10h   v1.20.0+ba45583
localhost.localdomain                            Ready    master,virtual   10h   v1.20.0+ba45583

What did you expect to happen?
All 3 masters should appear with their proper FQDN.
I see the correct host name received by ironic-inspector:

> 'system_vendor': {'product_name': 'KVM', 'serial_number': '', 'manufacturer': 'Red Hat'}, 'boot': {'current_boot_mode': 'bios', 'pxe_interface': 'aa:aa:aa:aa:aa:02'}, 'hostname': 'dhcp19-17-115.clus2.t5g.lab.eng.bos.redhat.com'}

No mentions of localhost.localdomain in the ironic-inspector logs. The problem must be higher up the stack.

Is this problem consistently reproducible? Do you have a reproducer?
There are a lot of error messages in the main logs, but I don't have enough knowledge to interpret them. Since, at least at first impression, the bare metal components are working normally, I'm passing this to the Metal Installer team for further triaging.
To debug this further, can you please provide the full journal log from the node that registered as localhost.localdomain? That should tell us whether we timed out waiting for a valid hostname via DHCP/DNS. When you log into the node via SSH, is the hostname set correctly?
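For reference, grabbing the data we need directly from the node would look something like this (a sketch; substitute the real node address):

$ ssh core@<node-address>
$ hostnamectl                                    # current static/transient hostname
$ sudo journalctl -b > journal.log               # full journal for the current boot
$ sudo journalctl -b -u NetworkManager -u systemd-hostnamed   # just the hostname-related units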
Looking through the pod logs for one of our services, it appears the node did eventually get a hostname, but it took significantly longer than the other nodes that registered normally. We'll still need journal logs to see why it took so long, but that supports the idea that it just took too long for the hostname to be assigned.
Hey, I don't have a way to reproduce this. When redeploying the cluster it didn't happen, so I can't provide more logs.
Please reopen this if you see it happening again, and provide more logs if possible. Thanks!
Hi,
It happened again with the custom version registry.ci.openshift.org/rhcos-devel/rhel4784:4.7.0-rc.2

[root@cnfdd3-installer ~]# oc get node -o wide
NAME  STATUS  ROLES  AGE  VERSION  INTERNAL-IP  EXTERNAL-IP  OS-IMAGE  KERNEL-VERSION  CONTAINER-RUNTIME
cnfdd3.clus2.t5g.lab.eng.bos.redhat.com  Ready  worker  16h  v1.20.0+bd9e442  10.19.16.100  <none>  Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)  4.18.0-287.el8.x86_64  cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52
dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com  Ready  master,virtual  16h  v1.20.0+bd9e442  10.19.17.118  <none>  Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)  4.18.0-287.el8.x86_64  cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52
dhcp19-17-128.clus2.t5g.lab.eng.bos.redhat.com  Ready  master,virtual  16h  v1.20.0+bd9e442  10.19.17.128  <none>  Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)  4.18.0-287.el8.x86_64  cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52
dhcp19-17-199.clus2.t5g.lab.eng.bos.redhat.com  Ready  worker  16h  v1.20.0+bd9e442  10.19.17.199  <none>  Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)  4.18.0-287.el8.x86_64  cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52
localhost  Ready  master,virtual  16h  v1.20.0+bd9e442  10.19.17.102  <none>  Red Hat Enterprise Linux CoreOS 47.84.202102161611-0 (Ootpa)  4.18.0-287.el8.x86_64  cri-o://1.20.0-0.rhaos4.7.gitfdbdf43.el8.52

[root@cnfdd3-installer ~]# oc version
Client Version: 4.7.0-rc.2
Server Version: 4.7.0-rc.2
Kubernetes Version: v1.20.0+bd9e442

must-gather log can be downloaded from: https://drive.google.com/file/d/1bNOWwQojhLO_jgM64FV-vK6BNFj9aJva/view?usp=sharing

The cluster is up, so I can get more logs or share access.
The main thing we need to see is the full journal log, so we can check the timing of the hostname changes. If you can give me access to the cluster, that would work as well.
Okay, I see what happened, but I don't know why the behavior differed on the one node. We might have to talk to the NM people.

On the broken node, it lost its hostname when ens4 was disconnected:

Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1329]: + nmcli device disconnect ens4
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info> [1615137235.4265] device (ens4): state change: activated -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info> [1615137235.4290] manager: NetworkManager state is now CONNECTED_LOCAL
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info> [1615137235.4299] audit: op="device-disconnect" interface="ens4" ifindex=3 pid=1501 uid=0 result="success"
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info> [1615137235.4301] device (ens4): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info> [1615137235.4421] dhcp4 (ens4): canceled DHCP transaction
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info> [1615137235.4421] dhcp4 (ens4): state changed bound -> done
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info> [1615137235.4452] policy: set-hostname: set hostname to 'localhost' (from address lookup)
Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1329]: Device 'ens4' successfully disconnected.
Mar 07 17:13:55 localhost systemd-hostnamed[1309]: Changed host name to 'localhost'

On a working node, that doesn't happen:

Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1335]: + nmcli device disconnect ens4
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info> [1615137236.5215] device (ens4): state change: activated -> deactivating (reason 'user-requested', sys-iface-state: 'managed')
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info> [1615137236.5223] manager: NetworkManager state is now CONNECTED_LOCAL
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info> [1615137236.5226] audit: op="device-disconnect" interface="ens4" ifindex=3 pid=1497 uid=0 result="success"
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info> [1615137236.5232] device (ens4): state change: deactivating -> disconnected (reason 'user-requested', sys-iface-state: 'managed')
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: Created slice machine.slice.
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: Started libpod-conmon-73a8431e8ca15e3224a6a37cc045ef0ff048ebef3b571d2a831326f5327a8023.scope.
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info> [1615137236.5323] dhcp4 (ens4): canceled DHCP transaction
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1309]: <info> [1615137236.5324] dhcp4 (ens4): state changed bound -> done
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: tmp-crun.b5jjEX.mount: Succeeded.
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1335]: Device 'ens4' successfully disconnected.
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1335]: + nmcli connection show ovs-if-phys0
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: Started libcrun container.

[the hostname never goes back to localhost]

I thought maybe one was getting the address from DHCP and the other from DNS reverse lookup, but they both log the same thing when setting the hostname:

NetworkManager[1303]: <info> [1615137234.8826] policy: set-hostname: set hostname to 'dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com' (from address lookup)

and

NetworkManager[1309]: <info> [1615137236.0151] policy: set-hostname: set hostname to 'dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com' (from address lookup)

I do see a difference in what the dispatcher scripts are getting, though:

Mar 07 17:13:54 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com root[1327]: Hostname changed: localhost
Mar 07 17:13:54 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[1313]: <13>Mar 7 17:13:54 root: Hostname changed: localhost

vs.

Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com root[1334]: Hostname changed: dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com
Mar 07 17:13:56 dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com nm-dispatcher[1319]: <13>Mar 7 17:13:56 root: Hostname changed: dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com

I'm a bit confused why the broken node is getting a localhost hostname when it clearly has a hostname at that point. Maybe it got an empty hostname, which triggered https://github.com/openshift/machine-config-operator/blob/82868e63176fee2bc806c1deb308ed1fc8965d84/templates/common/on-prem/files/NetworkManager-mdns-hostname.yaml#L13 ?

I don't think this is the underlying problem though, because this dispatcher script only deals with mdns-publisher. At worst it would cause mdns-publisher to hang waiting on its init container (which is, in fact, happening on the broken node).

I think what we need to figure out is why the one node is dropping its hostname when configure-ovs.sh runs and moves the configuration to the bridge.

I grabbed journal logs from the bad node and one of the good ones if anyone wants to look at them. I think they're a bit big to attach to the bz.
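For context on those "Hostname changed:" lines: NM dispatcher scripts are called with the interface as $1 and the event as $2, and for hostname changes the event is "hostname" (with an empty interface argument). A hook of that shape looks roughly like this; this is a simplified illustration, not the actual NetworkManager-mdns-hostname script from the MCO:

#!/bin/bash
# Simplified illustration of an NM dispatcher hook reacting to hostname events.
# $1 = interface (empty for hostname events), $2 = event name
if [ "$2" = "hostname" ]; then
    current=$(hostname)
    logger "Hostname changed: ${current}"
    # A hook like the one linked above would presumably skip localhost/empty
    # hostnames, which is why mdns-publisher can end up waiting on its init
    # container when the node never gets a real name.
fi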
> Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com NetworkManager[1303]: <info> [1615137235.4452] policy: set-hostname: set hostname to 'localhost' (from address lookup)
> Mar 07 17:13:55 dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com configure-ovs.sh[1329]: Device 'ens4' successfully disconnected.

Here the reverse address lookup of the address present on one interface returns 'localhost', but it's hard to tell why. Would it be possible to have NM logs at trace level?
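In case it's useful, NM trace logging can be turned on at runtime or persistently via a drop-in (the drop-in filename below is just an example):

# Runtime, effective until the next NetworkManager restart:
nmcli general logging level TRACE domains ALL

# Persistent, e.g. in /etc/NetworkManager/conf.d/99-trace-logging.conf:
#   [logging]
#   level=TRACE
#   domains=ALL
# ...followed by: systemctl restart NetworkManager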
Sabina, would you be able to deploy this environment with the MCO change in https://github.com/cybertron/machine-config-operator/commit/144e0db6a393cd93982b61c05e68ef95a944d95b ? That will enable the NetworkManager trace logs. Let me know if you need me to build a release or MCO image for it.
(In reply to Ben Nemec from comment #13)
> Sabina, would you be able to deploy this environment with the MCO change in
> https://github.com/cybertron/machine-config-operator/commit/144e0db6a393cd93982b61c05e68ef95a944d95b ?
> That will enable the NetworkManager trace logs.
>
> Let me know if you need me to build a release or MCO image for it.

I recreated the cluster with the MCO change.

[root@cnfdd3-installer ~]# oc get node
NAME                                             STATUS   ROLES            AGE    VERSION
cnfdd3.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           60m    v1.20.0+bd9e442
dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   164m   v1.20.0+bd9e442
dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   164m   v1.20.0+bd9e442
dhcp19-17-82.clus2.t5g.lab.eng.bos.redhat.com    Ready    worker           71m    v1.20.0+bd9e442
localhost                                        Ready    master,virtual   159m   v1.20.0+bd9e442
Hi,
I have analyzed the log from "Mar 15 09:30:15" to "Mar 15 09:30:19" and there seems to be a race condition there.

After ens4 gets added to the ovs bridge and the ovs interface gets an address via DHCP, NM tries to resolve the address on br-ex to get a hostname. However, at that time resolv.conf doesn't contain the new nameservers because NM is configured with rc=unmanaged, and so the DNS resolution fails. resolv.conf is updated later by the "30-resolv-prepender" dispatcher script on the "up" event; however, NM doesn't know that it should retry the reverse DNS lookup. Therefore, the hostname stays set to "localhost".

I still have to do some testing on this, but it's possible that if the dispatcher script sends a SIGHUP to NM, that will trigger a DNS reconfiguration which will also cause a new DNS lookup to start. Probably this commit [1] needs to be backported, to restart the lookup even if resolv.conf is unmanaged.

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/1c0932a6e66880f5b4c92fcd2d13cbba29238a14
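For illustration, the nudge being discussed would amount to something like this at the end of the resolv-prepender dispatcher script (a sketch only, not the actual MCO script; whether a plain SIGHUP is sufficient is exactly what needs testing):

#!/bin/bash
# After prepending the new nameservers to /etc/resolv.conf, poke
# NetworkManager so it re-reads its configuration. "systemctl reload
# NetworkManager" is the unit-level way to deliver the equivalent of SIGHUP.
systemctl reload NetworkManager
# Low-level equivalent:
#   kill -HUP "$(pidof NetworkManager)"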
Okay, I've written a patch to reload NetworkManager: https://github.com/openshift/machine-config-operator/pull/2488 I think that should accomplish the same thing as SIGHUP, but let me know if I'm mistaken. It works in my environment, but I've never run into this bug so that doesn't mean a whole lot. Sabina, could you try deploying with that patch in your environment where this reproduces? Thanks.
(In reply to Ben Nemec from comment #17)
> Okay, I've written a patch to reload NetworkManager:
> https://github.com/openshift/machine-config-operator/pull/2488
>
> I think that should accomplish the same thing as SIGHUP, but let me know if
> I'm mistaken. It works in my environment, but I've never run into this bug
> so that doesn't mean a whole lot.
>
> Sabina, could you try deploying with that patch in your environment where
> this reproduces? Thanks.

I recreated the cluster with the NetworkManager patch and it looks good.

[root@cnfdd3-installer ~]# oc get node
NAME                                             STATUS   ROLES            AGE     VERSION
cnfdd3.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker           4m10s   v1.20.0+ba45583
dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   37m     v1.20.0+ba45583
dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   37m     v1.20.0+ba45583
dhcp19-17-128.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   38m     v1.20.0+ba45583
dhcp19-17-147.clus2.t5g.lab.eng.bos.redhat.com   Ready    worker           18m     v1.20.0+ba45583
Great, thanks!
*** Bug 1950763 has been marked as a duplicate of this bug. ***
Hi,
I just created a cluster with the patch (https://github.com/openshift/machine-config-operator/pull/2488) in a 4.8 nightly, but hit the localhost issue again. Attached is the journalctl log from the localhost node.

[root@cnfdd5-installer ~]# oc get node
NAME                                             STATUS   ROLES    AGE   VERSION
dhcp19-17-116.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   23m   v1.21.0-rc.0+3ced7a9
dhcp19-17-117.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   24m   v1.21.0-rc.0+3ced7a9
localhost                                        Ready    master   19m   v1.21.0-rc.0+3ced7a9

[root@cnfdd5-installer ~]# oc version
Client Version: 4.8.0-0.nightly-2021-04-22-013545
Server Version: 4.8.0-0.nightly-2021-04-22-013545
Kubernetes Version: v1.21.0-rc.0+3ced7a9
Created attachment 1774447 [details] journalctl.4.8.log
According to the above, shouldn't this BZ move back to ASSIGNED status?
Hmm, I see that the reload happened, but it doesn't seem to have set the hostname:

Apr 22 09:29:23 localhost nm-dispatcher[1318]: NM resolv-prepender: Prepending 'nameserver 10.19.17.115' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)
Apr 22 09:29:23 localhost systemd[1]: Reloading Network Manager.
Apr 22 09:29:23 localhost NetworkManager[1304]: <info> [1619083763.6915] audit: op="reload" arg="0" pid=2991 uid=0 result="success"
Apr 22 09:29:23 localhost NetworkManager[1304]: <info> [1619083763.6919] config: signal: SIGHUP (no changes from disk)
Apr 22 09:29:23 localhost systemd[1]: Reloaded Network Manager.

Beniamino, is it possible that the "no changes from disk" part would have prevented NM from re-doing the lookup? If that's not it, I assume you'll need trace logs to investigate further.
I think I know what the problem is. Upon a reload/SIGHUP, currently with dns=none the dns-manager doesn't emit a CONFIG_CHANGED signal, and therefore NMPolicy doesn't restart the DNS lookup. The solution would be for SIGHUP to always force a restart of the DNS lookup, and possibly we could also add a new reload flag (or unix signal) to explicitly retry resolving the hostname. I'll try to prepare a patch for that.
Merge request: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/832
Is there a downstream BZ that can be marked as blocking #1929160?
We are still hitting this issue; I think it should be moved back to 'ASSIGNED'.
At this point in the cycle I think we need to consider workarounds until the NM fix is available. Even once that merges it will take some time to show up in our images. One option is to send the hostname in the DHCP response so DNS is not used to set the hostname at all. That requires the deployment infrastructure to be changed, but it should entirely eliminate this. I'm unsure what else we could safely do. I assume a complete restart of NM would force lookup of the hostname again, but triggering that from a dispatcher script run by NM seems a bit dangerous. Maybe we could schedule a restart for after the script has finished?
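To illustrate that last idea: instead of restarting NM from inside its own dispatcher hook, the hook could hand the restart off to a transient systemd timer so it fires after the script has exited. A rough sketch, not a vetted workaround:

#!/bin/bash
# Sketch: defer a NetworkManager restart so it does not happen while this
# dispatcher script (which NM itself runs) is still executing.
# --on-active=30 creates a transient timer that fires 30 seconds from now.
systemd-run --on-active=30 --unit=nm-deferred-restart \
    systemctl restart NetworkManager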
This is also happening with BM worker nodes after reboot. The node is coming up from the reboot with 'localhost' hostname. https://bugzilla.redhat.com/show_bug.cgi?id=1956360
@Ben Nemec
We tried that workaround (DHCP option 12, so DHCP sends the hostname), and the problem persists. :-(
(In reply to Yuval Kashtan from comment #35)
> @Ben Nemec
> We tried that workaround (DHCP option 12, so DHCP sends the hostname), and
> the problem persists. :-(

In that case there's a different issue, because DNS shouldn't be involved at all. Can you provide logs from a test run with the hostname provided by DHCP?
Attached is the log from a BM node that's coming up from a reboot as localhost. The hostname is set in the DHCP:

host cnfdd5.clus2.t5g.lab.eng.bos.redhat.com {
  hardware ethernet 0c:42:a1:55:e4:ce;
  # hardware ethernet 0c:42:a1:55:e4:cf;
  # hardware ethernet 40:a6:b7:17:57:80;
  # hardware ethernet 40:a6:b7:17:57:81;
  # hardware ethernet 40:a6:b7:17:43:d0;
  # hardware ethernet 40:a6:b7:17:43:d1;
  fixed-address cnfdd5.clus2.t5g.lab.eng.bos.redhat.com;
}

[root@cnfdd5-installer ~]# oc get node
NAME                                               STATUS                        ROLES               AGE    VERSION
cnfdd5-master-0.clus2.t5g.lab.eng.bos.redhat.com   Ready                         master,virtual      142m   v1.21.0-rc.0+41625cd
cnfdd5-master-1.clus2.t5g.lab.eng.bos.redhat.com   Ready                         master,virtual      141m   v1.21.0-rc.0+41625cd
cnfdd5-master-2.clus2.t5g.lab.eng.bos.redhat.com   Ready                         master,virtual      142m   v1.21.0-rc.0+41625cd
cnfdd5.clus2.t5g.lab.eng.bos.redhat.com            NotReady,SchedulingDisabled   worker,worker-cnf   85m    v1.21.0-rc.0+41625cd
cnfdd7.clus2.t5g.lab.eng.bos.redhat.com            Ready                         worker,worker-cnf   95m    v1.21.0-rc.0+41625cd
cnfdd8.clus2.t5g.lab.eng.bos.redhat.com            Ready                         worker              92m    v1.21.0-rc.0+41625cd

[root@cnfdd5-installer ~]# ssh core@cnfdd5.clus2.t5g.lab.eng.bos.redhat.com
Red Hat Enterprise Linux CoreOS 48.84.202105121453-0
  Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead, make
configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html

---
Last login: Thu May 13 09:29:07 2021 from 10.19.17.175
[systemd]
Failed Units: 2
  NetworkManager-wait-online.service
  node-valid-hostname.service
[core@localhost ~]$
Created attachment 1782672 [details] journalctl_local_host_bm.log
Created attachment 1782673 [details] node-valid-hostname.service..bm.log
Created attachment 1782686 [details] NetworkManager-wait-online.service.bm.log
Hi Beth / Beniamino, has anyone tried to test whether the PR mentioned at https://bugzilla.redhat.com/show_bug.cgi?id=1929160#c26 actually resolves the issue here?
Hi,
We were able to apply a workaround that sets the hostname of a node in case it came up as 'localhost'. We set up a MachineConfig that runs before kubelet and sets the right hostname on the node if it got 'localhost'. We got a proper, full CI run with that. The nodes are rebooted several times during our CI run, so we were able to see the hostname fix in the log.

Attached are the MachineConfig, the base64 script, and the service log from the node.
Created attachment 1784451 [details] MachineConfig workaround
Created attachment 1784452 [details] base64 MC script
Created attachment 1784453 [details] workaround service log
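For anyone else hitting this before a product fix lands, the general shape of such a pre-kubelet hostname fix is roughly the following; this is a hypothetical sketch, not the attached MachineConfig, and the real script may source the hostname differently:

#!/bin/bash
# Sketch: if the node booted as localhost, recover the FQDN and set it before
# kubelet starts (run from a systemd unit ordered Before=kubelet.service).
current=$(hostname)
if [ "$current" = "localhost" ] || [ "$current" = "localhost.localdomain" ]; then
    # One possible source: reverse lookup of the node's primary source address.
    ip=$(ip -4 route get 1.1.1.1 2>/dev/null | sed -n 's/.* src \([0-9.]*\).*/\1/p')
    fqdn=$(getent hosts "$ip" | awk '{print $2; exit}')
    if [ -n "$fqdn" ] && [ "$fqdn" != "localhost" ]; then
        hostnamectl set-hostname "$fqdn"
    fi
fi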
I don't think the DHCP configuration is correct. In the journal I see:

  "set hostname to 'cnfdd5.clus2.t5g.lab.eng.bos.redhat.com' (from address lookup)"

The "(from address lookup)" part suggests it is still using DNS. I think the problem is that fixed-address is not the correct way to specify the hostname. It looks like you need:

  option host-name "cnfdd5.clus2.t5g.lab.eng.bos.redhat.com";

in the host block for the node. Can you try that and see if it still happens?
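Applied to the host block quoted earlier, that would look something like this (a sketch of the dhcpd.conf change, extending the existing entry; adjust MACs and names to your setup):

host cnfdd5.clus2.t5g.lab.eng.bos.redhat.com {
  hardware ethernet 0c:42:a1:55:e4:ce;
  fixed-address cnfdd5.clus2.t5g.lab.eng.bos.redhat.com;
  # Hand the hostname to the client directly (DHCP option 12) so it no
  # longer depends on a reverse DNS lookup.
  option host-name "cnfdd5.clus2.t5g.lab.eng.bos.redhat.com";
}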
@ykashtan is this issue now resolved with the workaround provided?
As we wrote above, we are using our own workaround (as stated in comment #42), which works for us, but I guess we still need something that is part of the product and supported.
The new nmcli command is now available in our host images, so we should be able to write a patch to fix this now. However, I don't think it's going to make 4.9 code freeze. We should be able to backport a fix though since the necessary functionality will be available in 4.9 images.
IMHO it's extremely important that we do. The workaround MC is not trivial.
I have updated the linked PR to use the new nmcli command. Can someone test this in an environment where the bug reproduced to see if it fixes the problem? Thanks.
This just happened again in a different lab, so I guess we can easily reproduce it (and it still happens in our downstream CI system). Keeping the needinfo for now, as we are going on holidays here in Israel until the end of the month, so I don't think I'll be able to test it earlier.
*** Bug 1990369 has been marked as a duplicate of this bug. ***
Hi Yuval, now that everyone is back in the office would you have a chance to test the patch for this bug?
I've tried to reproduce the issue several times with the latest 4.10 nightly and I can't. I don't know if it is because of one of the latest NetworkManager patches in RHEL or because something changed in my lab network environment. I will remove the workaround from all our deployments, including CI (which is running in the same lab), so we'll know if it ever returns.
This is still happening in the latest 4.10. I just got it again after applying an MC to the masters: one of the masters rebooted as localhost, wreaking havoc in the cluster. :-(
That's because https://github.com/openshift/machine-config-operator/pull/2488 hasn't merged. We either need to verify that it fixes the problem or just merge it and see if this keeps happening.
Hi,
Although I'm not sure if it's the same issue, I'm observing the same symptoms when trying to deploy a BM IPI cluster using the OVNKubernetes backend (required for dual-stack IPv4/IPv6 support). All worker nodes are registering as `localhost.localdomain` and their CSRs are never approved.

I'm facing this issue with OCP 4.10 and 4.11 nightlies using OVNKubernetes. OCP 4.9 + OVNKubernetes is working fine, and OCP 4.10/4.11 + OpenShiftSDN is also working properly. Could you please assist?
Where are the hostnames coming from in your environment? DHCPv4, DHCPv6, reverse DNS on v4 or v6? The CNI plugin has little to no influence on the hostname, so I suspect the reason OSDN works is that it only supports IPv4, and adding IPv6 to the environment is breaking something. There are known issues with reverse DNS lookup on IPv6, so that's my best guess as to what might be going on here.

If you're not setting hostnames via DHCP, then I suggest doing that. Reverse DNS has proven to be an issue, and while there are some fixes on the way, I don't know when they will land in our images, and there are additional problems with no identified fix yet.
While the bugfix is included in accepted release 4.11.0-0.nightly-2022-02-16-085151, the issue remains:

[root@cnfdf07-installer ~]# oc version
Client Version: 4.11.0-0.nightly-2022-02-16-211105
Server Version: 4.11.0-0.nightly-2022-02-16-211105
Kubernetes Version: v1.23.3+f14faf2

[root@cnfdf07-installer ~]# oc get node
NAME                                           STATUS   ROLES    AGE   VERSION
dhcp-8-34-208.telco5gran.eng.rdu2.redhat.com   Ready    master   64m   v1.23.3+2e8bad7
dhcp-8-34-219.telco5gran.eng.rdu2.redhat.com   Ready    master   63m   v1.23.3+2e8bad7
localhost.localdomain                          Ready    master   59m   v1.23.3+2e8bad7
Reassigning to Ben as he's been working on this.
Can we capture trace logs again as described in https://bugzilla.redhat.com/show_bug.cgi?id=1929160#c13 ? We're most likely going to need them to debug this with the NM team.
And I'll delegate this to my team member.
Can we get some followup on this? We have a report that this fixed part of the problem in https://bugzilla.redhat.com/show_bug.cgi?id=2058030 and it would be nice to get it backported. We need to figure out why verification failed first though.
The env where we had 100% success in recreating the issue is currently in use for another urgent issue. Once that is resolved, I'll get hold of it and recreate the issue for you, so you can examine and test it as you see fit.
Got the same issue on 4.9.35.