Description of problem:

When installing a new bare-metal IPI cluster, the masters don't come up, or show as localhost.localdomain, and the deployment is blocked.

Version-Release number of selected component (if applicable):

4.6+

How reproducible:

It happens consistently, but only in specific environments.

Steps to Reproduce:
1. Run the installer
2. Wait for the masters to appear

Actual results:
```
[root@cnfd1-installer ~]# oc get node
NAME                    STATUS     ROLES    AGE     VERSION
localhost.localdomain   NotReady   master   9m45s   v1.19.0-rc.2+aaf4ce1-dirty
```

Expected results:

All 3 masters should appear with their proper FQDNs.

Additional info:

I believe this is due to a slow DHCP response triggering some kind of NetworkManager-related race condition. Nevertheless, /etc/systemd/system/kubelet.service.d/20-nodenet.conf contains the correct IP for the node.
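For reference, that drop-in is a small systemd unit override that pins the kubelet to the node's machine-network IP, which is why the kubelet can come up even while the hostname is still wrong. On the affected node it looked roughly like this (the IP address here is illustrative, not the real one):

```
# /etc/systemd/system/kubelet.service.d/20-nodenet.conf
# Written at boot to pin the kubelet's node IP, independent of
# whatever the transient hostname happens to be at the time.
[Service]
Environment="KUBELET_NODE_IP=192.0.2.10"
```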
We are still facing this issue. I tried to deploy a cluster with 4.6.0-0.nightly-2020-09-09-083207 and the node name is localhost.localdomain:

```
[root@cnfd1-installer ~]# oc version
Client Version: 4.6.0-0.nightly-2020-09-09-083207
Kubernetes Version: v1.19.0-rc.2+068702d
[root@cnfd1-installer ~]# oc get node
NAME                    STATUS     ROLES            AGE   VERSION
localhost.localdomain   NotReady   master,virtual   22h   v1.19.0-rc.2+068702d
```
Is this happening with IPv6 on either of the networks, or is this an IPv4-only deployment? I'm asking because nowadays we use the NetworkManager internal DHCPv6 client, which does not parse the IPv6 hostname DHCP option.
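As an aside: if the internal client turned out to be the culprit, one possible workaround would be to switch NetworkManager back to the external dhclient via a conf.d drop-in. This is only a sketch of the standard NM knob, not something we currently ship (the MCO would have to lay this file down on the nodes):

```
# /etc/NetworkManager/conf.d/00-dhcp-client.conf
# Sketch: use the external dhclient instead of NM's internal DHCP
# client, which (per the comment above) skips the DHCPv6 hostname option.
[main]
dhcp=dhclient
```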
This is IPv4 and seems to be different from the issue that was breaking our IPv6 deployments. I forgot to update here with the results of my investigation, but this is what I found:

I can see what's happening, but I have no idea why. I see the same flow as before: the node boots and gets a hostname, configure-ovs.sh unconfigures the interface and the node loses its hostname, configure-ovs brings up br-ex, and the node gets an IP again. The difference is that after br-ex DHCPs, the node doesn't get a hostname again. It sits there for five minutes waiting for node-valid-hostname, which eventually times out; the subsequent services start up, and then a couple of minutes later it gets a hostname again. That 6+ minute delay is too long to be explained by slow DHCP or rDNS alone.

I think we need to talk to the NetworkManager team about what's going on here. The next step is deploying in this environment with trace logging enabled in NM; they're going to ask us for that anyway when we raise this to them. I've pushed an MCO patch [0] to enable trace logging, and I believe Yuval is going to deploy with it. A sketch of the logging drop-in is below.

0: https://github.com/cybertron/machine-config-operator/tree/nm-trace
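For anyone who wants to capture the same trace without the patched MCO, the drop-in is roughly the standard NetworkManager logging stanza (the exact file name/path used by the branch above may differ):

```
# /etc/NetworkManager/conf.d/99-trace-logging.conf
# Raise NM logging to TRACE for all domains so the DHCP and hostname
# transitions show up in the journal (journalctl -u NetworkManager).
[logging]
level=TRACE
domains=ALL
```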
This is IPv4. Trying with that nm-trace build didn't reproduce the issue.
4.6.0-0.nightly-2020-09-09-083207 is no longer available. Is this happening consistently with a newer, available build in the affected environments?
Please try to reproduce on a current build. We believe this to be fixed by https://github.com/openshift/machine-config-operator/pull/2094. If you are not able to reproduce this, I propose this be marked as a duplicate of BZ #1879156.
*** This bug has been marked as a duplicate of bug 1879156 ***
I deployed the same cluster with 4.6.0-0.nightly-2020-09-21-081745 and it seems to work fine.