Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1810674

Summary: When installing OpenShift on GCP, cluster fails to initialize
Product: OpenShift Container Platform Reporter: DC <dcook>
Component: Cloud Compute Assignee: Danil Grigorev <dgrigore>
Cloud Compute sub component: Other Providers QA Contact: Jianwei Hou <jhou>
Status: CLOSED EOL Docs Contact:
Severity: medium    
Priority: unspecified CC: agarcial, dcook, jupittma, rvanderp
Version: 4.2.z   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-12 13:04:15 UTC Type: Bug
Attachments:
Description                 Flags
machine-approver logs       none
machine-approver log file   none

Description DC 2020-03-05 17:29:57 UTC
Description of problem:
When installing OpenShift 4.2 on GCP, bootstrap process completes but cluster and operators fail to initialize.

Version-Release number of selected component (if applicable):
4.2.21

How reproducible:
This is consistent

Steps to Reproduce:
1.  Create GCP project as described here:  https://docs.openshift.com/container-platform/4.2/installing/installing_gcp/installing-gcp-account.html
2.  Create OpenShift cluster with default configuration properties as described here:  https://docs.openshift.com/container-platform/4.2/installing/installing_gcp/installing-gcp-default.html
3.

Actual results:
Cluster does not initialize

time="2020-03-05T09:28:07-06:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.21: 99% complete"
time="2020-03-05T09:30:37-06:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring"
time="2020-03-05T09:34:22-06:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.21: 99% complete"
time="2020-03-05T09:36:52-06:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring"
time="2020-03-05T09:39:48-06:00" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring"


Expected results:
OpenShift cluster with three masters and three workers

Additional info:
All GCP Compute Engine instances and related infrastructure are apparently created.

$ oc --config kubeconfig get nodes
NAME                                                            STATUS   ROLES    AGE    VERSION
thing-9fpxc-m-0.us-central1-a.c.my-big-ocp42-project.internal   Ready    master   142m   v1.14.6+47933cbcc
thing-9fpxc-m-1.us-central1-b.c.my-big-ocp42-project.internal   Ready    master   142m   v1.14.6+47933cbcc
thing-9fpxc-m-2.us-central1-c.c.my-big-ocp42-project.internal   Ready    master   143m   v1.14.6+47933cbcc


$ oc --config kubeconfig get pods
No resources found.

$ oc --config kubeconfig get machine -n openshift-machine-api
NAME                    STATE     TYPE            REGION        ZONE            AGE
thing-9fpxc-m-0         RUNNING   n1-standard-4   us-central1   us-central1-a   142m
thing-9fpxc-m-1         RUNNING   n1-standard-4   us-central1   us-central1-b   142m
thing-9fpxc-m-2         RUNNING   n1-standard-4   us-central1   us-central1-c   142m
thing-9fpxc-w-a-k787d   RUNNING   n1-standard-4   us-central1   us-central1-a   140m
thing-9fpxc-w-b-lq2lm   RUNNING   n1-standard-4   us-central1   us-central1-b   140m
thing-9fpxc-w-c-659j5   RUNNING   n1-standard-4   us-central1   us-central1-c   140m

Comment 1 DC 2020-03-10 13:35:03 UTC
Hi Alberto, does your note mean that the target release for the IPI deployment on GCP is OpenShift 4.5?

Comment 2 Alberto 2020-03-10 13:46:06 UTC
Hey, GCP has been supported since 4.2. We always set an issue's target release to the latest version of the product so that a possible bug is fixed, validated, and backported appropriately to any lower release.

Comment 3 DC 2020-03-10 13:51:05 UTC
Thank you for the clarification.  Any suggestions on how I can resolve the issue I am facing now?

Comment 4 Alberto 2020-03-10 14:00:18 UTC
Need to figure out why your machines are not becoming nodes. Can you share cluster-machine-approver logs?
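
A minimal sketch of one way to see which machines have not become nodes, assuming (as the `oc get nodes` output in the description suggests) that a node name is the machine name followed by a DNS suffix. The function name `machines_without_nodes` is hypothetical and not part of any OpenShift tooling.

```shell
# Print every machine name that has no matching node name.
# Assumption: a node is named "<machine>" or "<machine>.<dns-suffix>".
machines_without_nodes() {
  machines="$1"   # whitespace-separated machine names
  nodes="$2"      # whitespace-separated node names
  for m in $machines; do
    found=no
    for n in $nodes; do
      case "$n" in
        "$m"|"$m".*) found=yes ;;
      esac
    done
    if [ "$found" = no ]; then
      echo "$m"
    fi
  done
}

# Possible usage against a live cluster (not run here):
#   machines=$(oc --config kubeconfig get machine -n openshift-machine-api -o name | sed 's|.*/||')
#   nodes=$(oc --config kubeconfig get nodes -o name | sed 's|.*/||')
#   machines_without_nodes "$machines" "$nodes"
#   oc --config kubeconfig get csr   # pending or missing CSRs hint at why a machine never joined
```

With the listings in the description, this would be expected to report the three `-w-` machines, since only the masters appear as nodes.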

Comment 5 DC 2020-03-10 23:05:34 UTC
Hi there, 
Attached is my cluster-machine-approver log.

I got it via this call:  oc logs machine-approver-5fdfd5595b-d6hvk -n openshift-cluster-machine-approver

I hope that was correct.

Comment 6 DC 2020-03-10 23:07:21 UTC
Created attachment 1669128
machine-approver logs

Requested  openshift-cluster-machine-approver logs

Comment 7 DC 2020-03-10 23:10:35 UTC
Created attachment 1669129
machine-approver log file

Comment 8 rvanderp 2020-04-01 13:30:24 UTC
We were able to complete the installation by using one of the masters as a bastion node and ssh'ing to the compute nodes from there.  Once on the compute nodes, we used hostnamectl to set the hostname in accordance with the node's corresponding machine object.  Once we did that, CSRs were generated and approved for the nodes.  I think the root cause here is that the compute nodes came up with a hostname of localhost.  This is being looked at in BZ 1811827.

https://bugzilla.redhat.com/show_bug.cgi?id=1811827
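
The workaround above could be sketched roughly as follows. This is an illustration under assumptions, not the exact commands that were run: the helper name `print_hostname_fix` is hypothetical, the `core` login user and the use of `ssh -J` for the bastion hop are assumed, and the comment does not say whether the short machine name or a fully qualified name was set, so the hostname is left as a caller-supplied argument. The helper only prints the command so it can be reviewed first.

```shell
# Print (do not run) the command that sets a compute node's hostname,
# hopping through a master that acts as the bastion.
# Arguments are placeholders supplied by the caller:
#   $1 = address of the master used as bastion
#   $2 = internal address of the compute node
#   $3 = hostname to set, matching the node's machine object
print_hostname_fix() {
  master="$1"
  worker="$2"
  new_hostname="$3"
  printf 'ssh -J core@%s core@%s sudo hostnamectl set-hostname %s\n' \
    "$master" "$worker" "$new_hostname"
}

# Afterwards, CSRs for the node should show up (not run here):
#   oc --config kubeconfig get csr
# In this report they were approved once generated; if any stay Pending,
# `oc adm certificate approve <csr-name>` approves one manually.
```

Printing the command instead of executing it keeps the sketch safe to try and makes the chosen hostname visible before anything changes on the node.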

Comment 10 Alberto 2020-04-03 08:34:42 UTC
> I think the root cause here is the compute nodes came up with a hostname of localhost.  This is being looked at in BZ 1811827.
We need to understand why this happens occasionally on GCP. We haven't been able to reproduce this so far.

Comment 12 DC 2020-04-14 03:15:22 UTC
Hi Alberto,
Do you need additional information from me on this?  If so, what can I send you?

This happens consistently every time I try an install on GCP.

Comment 13 Alberto 2020-05-12 13:04:15 UTC
This seems to be a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1809345.
Let's track it there and consider any backport after it passes QE validation.

Closing as a duplicate gives a Bugzilla error, so I am selecting EOL instead.