Bug 1862017 - Worker nodes not getting added in IPI installation of OCP 4.5 on OSP 13 due to mismatch in DNS config
Keywords:
Status: CLOSED DUPLICATE of bug 1870285
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Martin André
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-30 07:34 UTC by Yash Chouksey
Modified: 2020-09-04 02:17 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-28 16:01:30 UTC
Target Upstream Version:
ychoukse: needinfo-



Description Yash Chouksey 2020-07-30 07:34:59 UTC
Description of problem:

The IPI installer of OCP 4.5 does not add worker nodes to the cluster. The worker nodes are running as OpenStack instances but fail to join the OpenShift cluster.
The installation stalls at the point where the provisioned worker nodes should be added to the cluster. Listing the nodes shows only the master nodes; listing the machines shows all the nodes, but the worker machines stay in the "Provisioned" phase. CSRs are generated only for the master nodes. Some hours into the installation, the CSRs for the workers are generated automatically, and once they are manually approved the workers join the cluster.
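
To make the CSR observation above easier to check, here is a minimal sketch that lists the CSRs still waiting for a decision. It assumes the input is the JSON printed by `oc get csr -o json`; the function name `pending_csr_names` and the sample CSR names are hypothetical and not taken from this report.

```python
import json
from typing import List


def pending_csr_names(csr_list_json: str) -> List[str]:
    """Return the names of CSRs that have neither been approved nor denied."""
    data = json.loads(csr_list_json)
    names = []
    for item in data.get("items", []):
        # A pending CSR has an empty status, i.e. no conditions at all.
        conditions = item.get("status", {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            names.append(item["metadata"]["name"])
    return names


# Hypothetical sample data shaped like `oc get csr -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "csr-master-0"},
         "status": {"conditions": [{"type": "Approved"}]}},
        {"metadata": {"name": "csr-worker-0"}, "status": {}},
    ]
})

print(pending_csr_names(sample))
```

Each name returned this way could then be approved by hand with `oc adm certificate approve <name>`, which corresponds to the manual approval described above.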


The installer produces different /etc/resolv.conf files on masters and workers.
On the masters there are three nameserver entries: the DNS VIP provided by default (200.0.0.6) plus the external DNS servers. On the workers there are only two entries, both of them external DNS servers given in install-config.yaml.

Running nslookup for the internal API server name from masters and workers gives two cases:
Case 1: With no nameserver specified, the first nameserver (200.0.0.6) is not used and resolution goes through the second DNS server (10.255.0.10). The name resolves to the IP address seen in the kubelet logs of the worker nodes.
Case 2: With 200.0.0.6 given explicitly as the nameserver, the name resolves correctly to the API VIP (200.0.0.5).

It was confirmed that resolution of the internal API name on the worker nodes goes through an external DNS server, and hence the workers are unable to reach the API server. After manually editing resolv.conf on all the nodes to put the DNS VIP first in the nameserver list, the CSRs were generated and the workers were added.
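
The manual workaround can be sketched as follows: read the nameserver list from a resolv.conf and return a copy with the DNS VIP moved to the front. This is only an illustration of the check and the edit, not the installer's or the machine-config-operator's own logic. The helper names, the `search` domain and the second external address 10.255.0.11 are hypothetical; 200.0.0.6 and 10.255.0.10 are the addresses quoted in this report.

```python
from typing import List

DNS_VIP = "200.0.0.6"  # DNS VIP from this report


def nameservers(resolv_conf: str) -> List[str]:
    """Return the nameserver addresses in the order they appear."""
    result = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            result.append(fields[1])
    return result


def with_first_nameserver(resolv_conf: str, vip: str) -> str:
    """Return resolv.conf text in which `vip` is the first nameserver entry."""
    kept = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver" and fields[1] == vip:
            continue  # drop any existing entry for the VIP to avoid duplicates
        kept.append(line)
    # Insert the VIP just before the first remaining nameserver line.
    for i, line in enumerate(kept):
        if line.split()[:1] == ["nameserver"]:
            kept.insert(i, f"nameserver {vip}")
            break
    else:
        kept.append(f"nameserver {vip}")
    return "\n".join(kept) + "\n"


# Hypothetical worker resolv.conf with only external DNS servers.
worker = "search example.test\nnameserver 10.255.0.10\nnameserver 10.255.0.11\n"

fixed = with_first_nameserver(worker, DNS_VIP)
print(nameservers(worker))
print(nameservers(fixed))
```

A worker whose first entry from `nameservers()` is not the DNS VIP matches the broken state described above.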


How reproducible:

Steps to Reproduce:
1. Run the OSP IPI Installer of OCP 4.5.2 
2. Wait for it to complete
3. Check the output of `oc get nodes` and the status of the worker VMs in the Horizon portal.
4. Observe that the master nodes install and the bootstrap machine is destroyed, but the worker nodes are not added to the cluster.

Complete installation logs: http://pastebin.test.redhat.com/889026


Actual results:

The difference in DNS configuration between workers and masters prevents the workers from making requests to the API server and from being added to the cluster.

Expected results:

The CSRs for workers must be generated right after the installation and must be automatically approved.

Comment 3 Martin André 2020-07-30 15:15:39 UTC
The worker nodes should normally point at themselves as the first resolver. Could you provide the must-gather archive so we can look at what happens?

Comment 11 Martin André 2020-08-19 12:09:47 UTC
At first glance, nothing out of the ordinary pops up from the provided must-gather archive. Unfortunately, it doesn't include debugging information from the compute nodes since they didn't join the cluster. Could you confirm that all services are running as expected on the compute nodes, in particular the nm-dispatcher scripts?

Please provide the output of `sudo journalctl | grep nm-dispatcher` from the worker nodes.

I have a hunch this could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1853298, in which case we would need to backport https://github.com/openshift/machine-config-operator/pull/1773 to 4.5.

Comment 12 Pierre Prinetti 2020-08-20 14:35:57 UTC
> Please provide the output of `sudo journalctl | grep nm-dispatcher` from the worker nodes.

We might be able to link this bug to another one for which the fix exists in 4.6. If this is the case, we can backport.

Comment 13 Pierre Prinetti 2020-08-27 14:46:39 UTC
A fix for this may have landed with https://bugzilla.redhat.com/show_bug.cgi?id=1870285 .

Can you confirm if the problem is still present with a newer 4.5 release?

Comment 14 Martin André 2020-08-28 16:01:30 UTC
This is very likely a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1870285, I'm marking it as such. The fix should be available in the upcoming 4.5.8 release, please re-open the BZ if it doesn't solve your issue.

*** This bug has been marked as a duplicate of bug 1870285 ***

