Bug 1779866 - When installing OpenShift via IPI on GCP, compute nodes fail to join the cluster
Summary: When installing OpenShift via IPI on GCP, compute nodes fail to join the cluster
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Alexander Demicev
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-12-04 21:27 UTC by rvanderp
Modified: 2023-09-18 00:18 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-21 09:48:15 UTC
Target Upstream Version:
Embargoed:


Attachments
machine-controller log (68.99 KB, text/plain)
2019-12-09 03:16 UTC, Nobuhiro Sue
machine-approver-logs (152.85 KB, text/plain)
2019-12-09 16:39 UTC, jim conallen

Description rvanderp 2019-12-04 21:27:26 UTC
Description of problem:
When installing OpenShift via IPI on GCP, compute nodes fail to join the cluster.


Version-Release number of the following components:
4.2.8

How reproducible:
consistently

Steps to Reproduce:
1. Initiate an IPI installation on GCP 
2. Compute nodes generate numerous CSRs
3. Approving the CSRs does not result in nodes joining the cluster

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
The machine controller claims to not be able to find the machine, but the machine in question exists in GCP.
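
For anyone triaging the same symptom, one way to see which Machines have not been linked to a Node is to inspect the Machine objects directly. The following is a minimal sketch, not part of the original report: it assumes the output of `oc get machines -n openshift-machine-api -o json` has been saved to a file, and it treats a Machine without `status.nodeRef` as one whose node has not joined. The function name `machines_without_nodes` is hypothetical.

```python
def machines_without_nodes(machine_list):
    """Return the names of Machine objects whose status has no nodeRef.

    machine_list is the parsed JSON of a Machine list (a dict with "items").
    """
    missing = []
    for machine in machine_list.get("items", []):
        status = machine.get("status") or {}
        # A Machine that has been linked to a Node carries status.nodeRef.
        if not status.get("nodeRef"):
            missing.append(machine["metadata"]["name"])
    return missing


# Usage (assumes machines.json was produced beforehand with
# `oc get machines -n openshift-machine-api -o json > machines.json`):
#
#   import json
#   with open("machines.json") as f:
#       print(machines_without_nodes(json.load(f)))
```

A Machine that appears in this list while its VM exists in GCP would match the behaviour described above, although the sketch alone cannot say why the link is missing.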

Comment 3 Abhinav Dahiya 2019-12-05 17:56:16 UTC
cluster machine approver is responsible for approving CSRs

Comment 4 Alberto 2019-12-05 18:02:06 UTC
can you please share machine-controller and machine-approver logs?

Comment 6 jim conallen 2019-12-05 20:49:01 UTC
I am experiencing this exact same issue as well. I can supply full must-gather data if you like.

Comment 7 Nobuhiro Sue 2019-12-09 03:16:23 UTC
Created attachment 1643154
machine-controller log

Comment 8 Nobuhiro Sue 2019-12-09 03:18:01 UTC
Comment on attachment 1643154
machine-controller log

I ran into the same issue. Attached the machine-controller log.

Comment 9 Alberto 2019-12-09 08:35:38 UTC
Machines seem ok. Can you please share cluster-machine-approver logs?

Comment 10 jim conallen 2019-12-09 16:39:23 UTC
Created attachment 1643354
machine-approver-logs

Here are the machine-approver logs from the instance that I have.

Comment 11 jim conallen 2019-12-10 19:40:30 UTC
Using the 4.2.10 installer and a different data center (us-east4) is working for me.

When I tried the 4.2.9 installer in us-central1, it failed too. So at this point I don't know which of the two things that I changed did the trick.

Comment 12 rvanderp 2019-12-16 13:38:55 UTC
We tried to install with 4.2.10 and it did not resolve the issue in our case. The cluster seems to be unable to resolve routes, and we think this is because the compute nodes have not yet joined the cluster.

      message: 'RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.x.x.
        on 172.30.0.10:53: no such host'
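
A quick way to check whether a route hostname resolves at all is a plain resolver lookup. This is only a sketch under stated assumptions: it queries the resolver of the machine running the script, not the cluster DNS service at 172.30.0.10 named in the message, and the helper name `resolves` is hypothetical. The hostname is left as a parameter because the one in the message above is redacted.

```python
import socket


def resolves(hostname):
    """Return True if the local resolver can resolve hostname, else False."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised for "no such host" and other resolver failures.
        return False
```

Calling `resolves()` with the full `oauth-openshift.apps...` hostname of the cluster would show whether the name is missing for clients outside the cluster as well.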

Comment 13 Christoph Blecker 2019-12-24 03:07:59 UTC
I've been able to confirm this behaviour too when provisioning OCP 4.2.z clusters on GCP.

When provisioning a 4.1.z cluster with default settings, I encounter an issue where the machines will provision, but the nodes will never join the cluster.

I have attempted 7 failed installations with the following configurations:
- 4.1.2 in us-central1
- 4.1.12 in us-central1
- 4.1.12 in europe-west2

I was able to successfully provision a 4.1.12 cluster in us-east4 once, but on a second attempt I encountered the same failure as in the other regions.

The specific symptoms I see are:
- Installer runs and bootstrap phase completes
- Masters are provisioned and added to the cluster
- A MachineSet and Machines are created, which provision the GCP VMs for nodes
- These VMs never successfully join the cluster as nodes
- CSRs start piling up, very quickly into the hundreds, which then causes the machine-approver to stop approving CSRs altogether.
- This is shown in the machine-approver logs
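
The CSR pile-up described in the symptoms above can be measured from the output of `oc get csr -o json`. The sketch below is illustrative only: it counts CSRs that carry neither an `Approved` nor a `Denied` condition and compares the count against a limit. The `max_delta` parameter and both function names are hypothetical; the actual limit applied by the machine-approver is not stated in this report.

```python
def pending_csrs(csr_list):
    """Return the names of CSRs with neither an Approved nor a Denied condition.

    csr_list is the parsed JSON of a CertificateSigningRequest list.
    """
    pending = []
    for csr in csr_list.get("items", []):
        conditions = (csr.get("status") or {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            pending.append(csr["metadata"]["name"])
    return pending


def too_many_pending(csr_list, machine_count, max_delta):
    """True when pending CSRs exceed machine_count + max_delta.

    max_delta is a hypothetical threshold chosen by the caller.
    """
    return len(pending_csrs(csr_list)) > machine_count + max_delta
```

With hundreds of pending CSRs and only a handful of Machines, `too_many_pending` returns True for any small `max_delta`, which is consistent with an approver that gives up once the backlog grows too large.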

Comment 15 Alberto 2020-01-02 09:08:57 UTC
> I've been able to confirm this behaviour too when provisioning OCP 4.2.z clusters on GCP.

can you please leave aside 4.1 for now and share the exact version of 4.2.z where this happens and share the must gather logs and also particularly the machine approver logs for that cluster? That way we can narrow down the root problem for which this ticket was created, break it into separate problems if needed, fix it in 4.2, and then see whether anything needs to be backported to 4.1.

In https://bugzilla.redhat.com/show_bug.cgi?id=1779866#c2 I can't see machine approver logs, and I'm not sure https://bugzilla.redhat.com/show_bug.cgi?id=1779866#c10 belongs to the original cluster for which this ticket was reported, since the logs seem ok.

Comment 16 Christoph Blecker 2020-01-03 15:55:39 UTC
(In reply to Alberto from comment #15)
> > I've been able to confirm this behaviour too when provisioning OCP 4.2.z clusters on GCP.
> 
> can you please leave aside 4.1 for now and share the exact version of 4.2.z
> where this happens and share the must gather logs and also particularly the
> machine approver logs for that cluster? 

I'm sorry Alberto, but I misspoke on version numbers in https://bugzilla.redhat.com/show_bug.cgi?id=1779866#c13. I've verified that the cluster versions I was attempting to install are actually 4.2.2 and 4.2.12, not 4.1.z.

The machine approver logs I uploaded in https://bugzilla.redhat.com/show_bug.cgi?id=1779866#attach_1647438 are from a 4.2.12 cluster.

Comment 19 Alberto 2020-02-21 09:48:15 UTC
Hey Christoph, we have not been able to reproduce this so far. Please reopen if this is still relevant.

Comment 20 Red Hat Bugzilla 2023-09-18 00:18:58 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

