Description of problem:
When installing OpenShift via IPI on GCP, compute nodes fail to join the cluster.
Version-Release number of the following components:
Steps to Reproduce:
1. Initiate an IPI installation on GCP
2. Compute nodes generate numerous CSRs
3. Approving the CSRs does not result in nodes joining the cluster
The machine controller claims to not be able to find the machine, but the machine in question exists in GCP.
The cluster-machine-approver is responsible for approving CSRs.
Can you please share the machine-controller and machine-approver logs?
I am experiencing this exact issue as well. I can supply full must-gather data if you like.
Created attachment 1643154 [details]
Comment on attachment 1643154 [details]
I ran into the same issue. Attached is the machine-controller log.
Machines seem ok. Can you please share cluster-machine-approver logs?
Created attachment 1643354 [details]
Here are the machine-approver logs from the instance that I have.
Using the 4.2.10 installer with a different region (us-east4) is working for me.
When I tried the 4.2.9 installer in us-central1, it failed too. So at this point I don't know which of the two changes made the difference.
We tried to install with 4.2.10 and it did not resolve the issue in our case. The cluster seems to be unable to resolve routes, and we think this is because the compute nodes have not yet joined the cluster.
message: 'RouteHealthDegraded: failed to GET route: dial tcp: lookup oauth-openshift.apps.x.x.
on 172.30.0.10:53: no such host'
I've been able to confirm this behaviour too when provisioning OCP 4.2.z clusters on GCP.
When provisioning a 4.1.z cluster with default settings, I encounter an issue where the machines will provision, but the nodes will never join the cluster.
I have attempted 7 failed installations with the following configurations:
- 4.1.2 in us-central1
- 4.1.12 in us-central1
- 4.1.12 in europe-west2
I was able to successfully provision a 4.1.12 cluster in us-east4 once, but when attempting it a second time, I encountered the same failure as in the other regions.
The specific symptoms I see are:
- Installer runs and bootstrap phase completes
- Masters are provisioned and added to the cluster
- A MachineSet and Machines are created, which provision the GCP VMs for nodes
- These VMs never successfully join the cluster as nodes
- CSRs start piling up, very quickly into the hundreds, which then causes the machine-approver to stop approving CSRs altogether.
- This is shown in the machine-approver logs
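To gauge how quickly CSRs are piling up, the pending ones can be counted from the JSON that `oc get csr -o json` prints. The following is only a minimal sketch, not something taken from this cluster: the function name `pending_csr_names` is hypothetical, and it assumes a CSR counts as pending when its status carries neither an `Approved` nor a `Denied` condition.

```python
def pending_csr_names(csr_list):
    """Return the names of CSRs that have not been approved or denied.

    csr_list is the parsed JSON of a CertificateSigningRequestList,
    for example the output of `oc get csr -o json`.
    Assumption: a CSR without an Approved or Denied condition is pending.
    """
    pending = []
    for item in csr_list.get("items", []):
        # status or conditions may be missing or null on a fresh CSR
        status = item.get("status") or {}
        conditions = status.get("conditions") or []
        decided = any(
            cond.get("type") in ("Approved", "Denied") for cond in conditions
        )
        if not decided:
            pending.append(item["metadata"]["name"])
    return pending
```

For example, saving the `oc get csr -o json` output to a file, loading it with `json.load`, and passing the result to `pending_csr_names` gives a list whose length can be compared between runs to see whether the backlog is growing.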
> I've been able to confirm this behaviour too when provisioning OCP 4.2.z clusters on GCP.
Can you please leave aside 4.1 for now, share the exact 4.2.z version where this happens, and share the must-gather logs, particularly the machine-approver logs for that cluster? That way we can narrow down the root problem for which this ticket was created, break it down into separate problems if needed, fix it in 4.2, and then see whether anything needs to be backported to 4.1.
In https://bugzilla.redhat.com/show_bug.cgi?id=1779866#c2 I can't see the machine-approver logs, and I'm not sure https://bugzilla.redhat.com/show_bug.cgi?id=1779866#c10 belongs to the original cluster for which this ticket was reported, since the logs seem OK.
(In reply to Alberto from comment #15)
> > I've been able to confirm this behaviour too when provisioning OCP 4.2.z clusters on GCP.
> can you please leave aside 4.1 for now and share the exact version of 4.2.z
> where this happens and share the must gather logs and also particularly the
> machine approver logs for that cluster?
I'm sorry Alberto, but I misspoke on version numbers in https://bugzilla.redhat.com/show_bug.cgi?id=1779866#c13. I've verified that the cluster versions I was attempting to install are actually 4.2.2 and 4.2.12, not 4.1.z.
The machine approver logs I uploaded in https://bugzilla.redhat.com/show_bug.cgi?id=1779866#attach_1647438 are from a 4.2.12 cluster.
Hey Christoph, we have not been able to reproduce this so far. Please reopen if this is still relevant.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days