Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1862017

Summary: Worker nodes not getting added in IPI installation of OCP 4.5 on OSP 13 due to mismatch in DNS config
Product: OpenShift Container Platform
Reporter: Yash Chouksey <ychoukse>
Component: Installer
Assignee: Martin André <maandre>
Installer sub component: OpenShift on OpenStack
QA Contact: David Sanz <dsanzmor>
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
CC: cswanson, maandre, m.andre, nm-s, pprinett, uday-pratap-singh
Version: 4.5
Keywords: UpcomingSprint
Target Milestone: ---
Flags: ychoukse: needinfo-
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2020-08-28 16:01:30 UTC
Type: Bug
Regression: ---

Description Yash Chouksey 2020-07-30 07:34:59 UTC
Description of problem:

The IPI installer of OCP 4.5 does not add worker nodes to the cluster. The worker nodes are running as OpenStack instances but fail to join the OpenShift cluster.
The installation stalls at the point where the provisioned worker nodes should be added to the cluster. Listing the nodes shows only the master nodes; listing the machines shows all nodes, but the workers stay in the "Provisioned" phase. CSRs are generated only for the master nodes. Some hours after the installation, CSRs for the workers are eventually generated automatically, and once they are approved manually the workers join the cluster.


The installer is causing a discrepancy between the /etc/resolv.conf files on masters and workers.
On the masters there are three entries: the default DNS VIP (200.0.0.6) followed by the external DNSes. On the workers there are only two entries, the external DNSes given in install-config.yaml.

An nslookup of the internal API server from the masters and workers shows two cases:
Case 1: The first nameserver (200.0.0.6) is not consulted and resolution goes through the second DNS (10.255.0.10). The name resolves to the IP address seen in the kubelet logs of the worker nodes.
Case 2: When the nameserver is explicitly set to 200.0.0.6, the name resolves correctly to the API VIP (200.0.0.5).
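The two cases above can be reproduced with nslookup. A minimal sketch, assuming a placeholder internal API hostname (the `api-int.<cluster>.<base-domain>` name follows the usual OpenShift convention and is not taken from this report):

```shell
# Placeholder internal API hostname; substitute your cluster's own
# api-int.<cluster>.<base-domain> name.
API_INT='api-int.mycluster.example.com'
DNS_VIP='200.0.0.6'    # DNS VIP from this report's environment

# Case 1: default resolution follows the resolv.conf order; on workers
# an external DNS answers, returning the wrong address:
#   nslookup "$API_INT"

# Case 2: query the DNS VIP directly; this resolves correctly to the
# API VIP (200.0.0.5):
#   nslookup "$API_INT" "$DNS_VIP"
```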

It was confirmed that resolution of the internal API on the worker nodes goes through an external DNS, which is why they cannot reach the API server. After manually editing resolv.conf on all nodes to prepend the DNS VIP as the first nameserver, the CSRs were generated and the workers were added.
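The manual workaround above can be sketched as a small shell snippet. This is a hedged illustration, not a supported fix: the DNS VIP (200.0.0.6) and first external resolver (10.255.0.10) come from this report's environment, the second external resolver is a made-up placeholder, and for safety the example operates on a scratch copy rather than the real /etc/resolv.conf (on a node you would target /etc/resolv.conf with sudo).

```shell
# Illustrative only: prepend the DNS VIP as the first nameserver, as in
# the manual workaround above. We operate on a scratch copy; on a real
# worker node this would be /etc/resolv.conf, edited with sudo.
RESOLV_CONF="$(mktemp)"
DNS_VIP='200.0.0.6'   # DNS VIP from this report's environment

# Reproduce the broken worker state: only the external resolvers.
# (10.255.0.10 is from the report; 10.255.0.11 is a placeholder.)
printf 'nameserver 10.255.0.10\nnameserver 10.255.0.11\n' > "$RESOLV_CONF"

# Prepend the DNS VIP unless it is already the first entry.
if [ "$(head -n 1 "$RESOLV_CONF")" != "nameserver $DNS_VIP" ]; then
    sed -i "1i nameserver $DNS_VIP" "$RESOLV_CONF"
fi

cat "$RESOLV_CONF"

# Once the workers can reach the API server, approve the pending CSRs
# from a host with cluster-admin access:
#   oc get csr -o name | xargs oc adm certificate approve
```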


How reproducible:

Steps to Reproduce:
1. Run the OSP IPI Installer of OCP 4.5.2 
2. Wait for it to complete
3. Check the output of # oc get nodes and the status of the worker VMs in the Horizon portal.
4. Actual: the master nodes install and the bootstrap machine is destroyed, but the worker nodes are not added to the cluster.

Complete installation logs: http://pastebin.test.redhat.com/889026


Actual results:

A difference in DNS configuration between workers and masters prevents the workers from making requests to the API server and being added to the cluster.

Expected results:

The CSRs for workers must be generated right after the installation and must be automatically approved.

Comment 3 Martin André 2020-07-30 15:15:39 UTC
The worker nodes should normally point at themselves as the first resolver. Could you provide the must-gather archive so we can look at what happens?

Comment 11 Martin André 2020-08-19 12:09:47 UTC
At first glance, nothing out of the ordinary pops up from the provided must-gather archive. Unfortunately, it doesn't include debugging information from the compute nodes since they didn't join the cluster. Could you confirm that all services are running as expected on the compute nodes, in particular the nm-dispatcher scripts?

Please provide the output of `sudo journalctl | grep nm-dispatcher` from the worker nodes.

I have a hunch this could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1853298. In which case, we would need to backport https://github.com/openshift/machine-config-operator/pull/1773 to 4.5.

Comment 12 Pierre Prinetti 2020-08-20 14:35:57 UTC
> Please provide the output of `sudo journalctl | grep nm-dispatcher` from the worker nodes.

We might be able to link this bug to another one for which the fix exists in 4.6. If this is the case, we can backport.

Comment 13 Pierre Prinetti 2020-08-27 14:46:39 UTC
A fix for this may have landed with https://bugzilla.redhat.com/show_bug.cgi?id=1870285 .

Can you confirm if the problem is still present with a newer 4.5 release?

Comment 14 Martin André 2020-08-28 16:01:30 UTC
This is very likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1870285, I'm marking it as such. The fix should be available in the upcoming 4.5.8 release, please re-open the BZ if it doesn't solve your issue.

*** This bug has been marked as a duplicate of bug 1870285 ***