Bug 1874869 - node registered as localhost.localdomain
Summary: node registered as localhost.localdomain
Keywords:
Status: CLOSED DUPLICATE of bug 1879156
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Brad P. Crochet
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-02 13:12 UTC by Yuval Kashtan
Modified: 2020-09-29 11:55 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-22 13:47:11 UTC
Target Upstream Version:
Embargoed:



Description Yuval Kashtan 2020-09-02 13:12:54 UTC
Description of problem:
When installing a new bare-metal IPI cluster, the masters either don't come up or register as localhost.localdomain, and the deployment is blocked.

Version-Release number of selected component (if applicable):
4.6+

How reproducible:
It happens consistently, but only in specific environments.


Steps to Reproduce:
1. run the installer
2. wait for masters to appear

Actual results:
```
[root@cnfd1-installer ~]# oc get node
NAME                    STATUS     ROLES    AGE     VERSION
localhost.localdomain   NotReady   master   9m45s   v1.19.0-rc.2+aaf4ce1-dirty
```


Expected results:
All three masters should appear with their proper FQDNs.


Additional info:
I believe this is due to a slow DHCP response triggering some kind of NetworkManager-related race condition.
Nevertheless, /etc/systemd/system/kubelet.service.d/20-nodenet.conf contains the correct IP for the node.
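For reference, that drop-in is a small systemd override that pins the kubelet's node IP. The contents below are only an illustrative sketch (the IP is a placeholder, not taken from this environment):
```
# Illustrative only: inspect the generated drop-in on an affected master.
$ cat /etc/systemd/system/kubelet.service.d/20-nodenet.conf
[Service]
Environment="KUBELET_NODE_IP=192.0.2.10"
```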

Comment 1 Sabina Aledort 2020-09-10 12:27:12 UTC
We are still facing this issue. I tried to deploy a cluster with 4.6.0-0.nightly-2020-09-09-083207 and the node name is still localhost.localdomain:

[root@cnfd1-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-09-09-083207
Kubernetes Version: v1.19.0-rc.2+068702d
[root@cnfd1-installer ~]# oc get node
NAME                    STATUS     ROLES            AGE   VERSION
localhost.localdomain   NotReady   master,virtual   22h   v1.19.0-rc.2+068702d

Comment 2 Antoni Segura Puimedon 2020-09-11 14:32:51 UTC
Is this happening with IPv6 on either of the networks? Or is this an IPv4 only deployment? I'm asking because nowadays we use the NetworkManager internal IPv6 client, which does not parse the IPV6 hostname DHCP option.
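
A couple of illustrative checks to answer this on an affected node (the connection name br-ex and the grep patterns are assumptions, not taken from this environment):
```
# Show whether IPv4/IPv6 addressing is enabled on the bridge connection
$ nmcli connection show br-ex | grep -E 'ipv[46]\.method'
# Show the effective NetworkManager configuration, including the dhcp= backend
$ NetworkManager --print-config | grep -A5 '^\[main\]'
```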

Comment 3 Ben Nemec 2020-09-11 19:28:39 UTC
This is IPv4 and seems to be different from the issue that was breaking our IPv6 deployments.

I forgot to update here with the results of my investigation, but this is what I found:

I can see what's happening, but I have no idea why. I see the same flow as before: the node boots, gets a hostname, configure-ovs.sh unconfigures the interface, the node loses its hostname, configure-ovs brings up br-ex, and the node gets an IP again. The difference is that after it DHCPs on br-ex it doesn't get a hostname again. It then sits there for five minutes waiting for node-valid-hostname, which eventually times out; the subsequent services start up, and a couple of minutes later it finally gets a hostname again.

That 6+ minute delay is too long to be explained just by slow DHCP or rDNS. I think we may need to talk to the NM team about what's going on here.
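
One illustrative way to line up the hostname changes against the DHCP and configure-ovs events on an affected master (unit names as referenced above; the grep patterns are guesses):
```
# Correlate NetworkManager and node-valid-hostname activity with hostname changes
$ journalctl -u NetworkManager -u node-valid-hostname.service --no-pager \
    | grep -iE 'hostname|br-ex|timed out'
# Show the static vs. transient hostname the node ended up with
$ hostnamectl status
```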

The next step is going to be deploying in this environment with trace logging enabled in NM. They're going to ask us for that anyway when we raise this to them. I've pushed an MCO patch[0] to enable trace logging and I believe Yuval is going to deploy with it.

0: https://github.com/cybertron/machine-config-operator/tree/nm-trace
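
For anyone else reproducing this, NetworkManager trace logging can also be enabled without a custom MCO build; a minimal sketch (the conf.d file name is an arbitrary choice):
```
# Runtime only (does not survive a NetworkManager restart)
$ nmcli general logging level TRACE domains ALL
# Persistent: drop in a config fragment and restart NetworkManager
$ cat /etc/NetworkManager/conf.d/99-trace-logging.conf
[logging]
level=TRACE
domains=ALL
```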

Comment 4 Yuval Kashtan 2020-09-15 07:04:35 UTC
This is IPv4.
Trying with that nm-trace build didn't reproduce the issue.

Comment 5 Brad P. Crochet 2020-09-17 12:51:05 UTC
4.6.0-0.nightly-2020-09-09-083207 is no longer available. Is this happening consistently with a newer, available build on the affected environments?

Comment 6 Brad P. Crochet 2020-09-21 19:03:07 UTC
Please try to reproduce on a current build. We believe this to be fixed by https://github.com/openshift/machine-config-operator/pull/2094. If you are not able to reproduce this, I propose this be marked as a duplicate of BZ #1879156.

Comment 7 Ben Bennett 2020-09-22 13:47:11 UTC

*** This bug has been marked as a duplicate of bug 1879156 ***

Comment 8 Sabina Aledort 2020-09-23 11:05:32 UTC
I deployed the same cluster with 4.6.0-0.nightly-2020-09-21-081745 and it seems to work fine.

