Description of problem:
The IPI installer of OCP 4.5 does not add worker nodes to the cluster. The worker nodes are running as OpenStack instances but fail to join the OpenShift cluster.
The installation stalls at the point where the provisioned worker nodes should be added to the cluster. Listing the nodes shows only master nodes; listing the machines shows all the nodes, but the worker nodes stay in the "Provisioned" phase. CSRs are generated only for master nodes. Some hours after the installation, the CSRs for the workers are generated automatically, and once they are manually approved, the workers join the cluster.
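To spot worker CSRs that are waiting for approval, the output of `oc get csr -o json` can be filtered for requests that carry neither an Approved nor a Denied condition. The following is a minimal sketch in Python; the function name and the sample data are made up for illustration, and it assumes the JSON follows the usual Kubernetes list layout (`items[].metadata.name`, `items[].status.conditions[].type`).

```python
def pending_csr_names(csr_list):
    """Return the names of CSRs that have not been approved or denied.

    csr_list is the parsed JSON of a CSR list (assumed layout:
    items[].metadata.name and items[].status.conditions[].type).
    """
    pending = []
    for item in csr_list.get("items", []):
        # A pending CSR has an empty status, i.e. no conditions at all.
        conditions = item.get("status", {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            pending.append(item["metadata"]["name"])
    return pending


# Hypothetical sample data, not taken from the affected cluster.
example = {
    "items": [
        {"metadata": {"name": "csr-worker-0"}, "status": {}},
        {"metadata": {"name": "csr-master-0"},
         "status": {"conditions": [{"type": "Approved"}]}},
    ]
}
print(pending_csr_names(example))
```

Each returned name could then be approved by hand with `oc adm certificate approve <name>`, which is the manual step described above.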
The installer causes a discrepancy between the /etc/resolv.conf files on masters and workers.
On masters, there are three entries: the default DNS VIP (126.96.36.199) plus the external DNS servers. On workers, there are only two entries, both of them external DNS servers given in install-config.yaml.
An nslookup of the internal API server from masters and workers presents two cases:
Case 1: The first nameserver (188.8.131.52) is not consulted and resolution occurs through the second DNS server (10.255.0.10). The name resolves to the IP address seen in the kubelet logs of the worker nodes.
Case 2: The nameserver to use is explicitly set to 184.108.40.206. The name resolves correctly to the API VIP (220.127.116.11).
It was confirmed that resolution of the internal API name on the worker nodes goes through an external DNS server, and hence the workers are unable to reach the API server. After manually editing resolv.conf on all the nodes so that the DNS VIP is the first nameserver, the CSRs were generated and the workers were added.
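The resolv.conf comparison can be scripted so it is easy to repeat on every node. The sketch below, in Python, extracts the `nameserver` entries in order and checks whether a given address comes first. The function names and the 192.0.2.x addresses are placeholders for illustration only, not the values from this cluster.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order they are listed."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped[0] in "#;":
            continue
        parts = stripped.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def is_first_nameserver(resolv_conf_text, expected_first):
    """True if expected_first is the first nameserver entry."""
    servers = nameservers(resolv_conf_text)
    return bool(servers) and servers[0] == expected_first


# Placeholder contents, modelled on the master and worker layouts described above.
master_conf = "nameserver 192.0.2.5\nnameserver 192.0.2.53\nnameserver 192.0.2.54\n"
worker_conf = "nameserver 192.0.2.53\nnameserver 192.0.2.54\n"

print(is_first_nameserver(master_conf, "192.0.2.5"))
print(is_first_nameserver(worker_conf, "192.0.2.5"))
```

On a real node the text would come from reading /etc/resolv.conf, and `expected_first` would be whichever resolver the node is supposed to consult first.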
Steps to Reproduce:
1. Run the OSP IPI Installer of OCP 4.5.2
2. Wait for it to complete
3. Check the output of `oc get nodes` and the status of the worker VMs in the Horizon portal.
Actual results:
Master nodes install and the bootstrap machine is destroyed, but the worker nodes are not added to the cluster.
Complete installation logs: http://pastebin.test.redhat.com/889026
The difference in DNS configuration between workers and masters prevents the workers from making requests to the API server and being added to the cluster.
The CSRs for workers must be generated right after the installation and must be automatically approved.
The worker nodes should normally point at themselves as the first resolver. Could you provide the must-gather archive so we can look at what happens?
At first glance, nothing out of the ordinary pops up from the provided must-gather archive. Unfortunately, it doesn't include debugging information from the compute nodes since they didn't join the cluster. Could you confirm that all services are running as expected on the compute nodes, in particular the nm-dispatcher scripts?
Please provide the output of `sudo journalctl | grep nm-dispatcher` from the worker nodes.
I have a hunch this could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1853298. In which case, we would need to backport https://github.com/openshift/machine-config-operator/pull/1773 to 4.5.
> Please provide the output of `sudo journalctl | grep nm-dispatcher` from the worker nodes.
We might be able to link this bug to another one for which the fix exists in 4.6. If this is the case, we can backport.
A fix for this may have landed with https://bugzilla.redhat.com/show_bug.cgi?id=1870285 .
Can you confirm if the problem is still present with a newer 4.5 release?
This is very likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1870285, so I'm marking it as such. The fix should be available in the upcoming 4.5.8 release; please re-open the BZ if it doesn't solve your issue.
*** This bug has been marked as a duplicate of bug 1870285 ***