Bug 1879826

Summary: [gcp] Machine stuck at 'Provisioned' phase
Product: OpenShift Container Platform
Reporter: liujia <jiajliu>
Component: Machine Config Operator
Assignee: Yu Qi Zhang <jerzhang>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Michael Nguyen <mnguyen>
Severity: high
Priority: high
Version: 4.5
CC: amurdaca, jerzhang, jiajliu, m.andre, mgugino, mnguyen, walters, wjiang, zhsun
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Clone Of: 1859428
Last Closed: 2020-11-24 03:27:09 UTC
Bug Depends On: 1859428

Comment 1 Yu Qi Zhang 2020-09-17 21:41:27 UTC
So I took a look at the must-gather. There appear to have been two requests for worker configs at some point, but no workers ever joined the cluster. I don't think this is exactly https://bugzilla.redhat.com/show_bug.cgi?id=1870343 because the machine-config-daemon-pull.service was added in 4.6.
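
For context, the worker config requests show up in the machine-config-server pod logs; a generic way to pull them (a sketch, assuming the standard MCO pod label):

      # Each booting worker fetches its Ignition config from the MCS,
      # so the requests appear in these pod logs.
      oc -n openshift-machine-config-operator logs \
          -l k8s-app=machine-config-server --tail=100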

I actually don't see any errors in the MCO namespace. It appears to me that the install has a lot of other failures:

From your clusterversion:
      Some cluster operators are still updating: authentication, console,
      csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator,
      monitoring

And in clusteroperators:
      IngressStateEndpointsDegraded: No endpoints found for oauth-server
      RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.auto-jiajliu-618212.qe.gcp.devcluster.openshift.com: []
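
Both of the above can be re-checked with standard commands (nothing here is specific to this cluster):

      # Overall install/upgrade status and any blocking conditions
      oc get clusterversion version -o yaml
      # Per-operator Available/Progressing/Degraded state
      oc get clusteroperators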

How reproducible is this? It seems to me that your cluster installation failed long before it got to the worker boot.
 
If you think the other errors aren't the root cause and want to check the MCO, the must-gather doesn't capture enough: I would need either journal or console logs from the failed worker nodes, if they are directly accessible.
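
If the instances are reachable at all, on GCP the serial console output can usually be grabbed like this (a sketch; instance name, zone, and project are placeholders):

      # Dump the boot console of a stuck worker instance
      gcloud compute instances get-serial-port-output <instance-name> \
          --zone <zone> --project <project>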

Comment 2 liujia 2020-09-18 01:02:50 UTC
> I don't think this is exactly https://bugzilla.redhat.com/show_bug.cgi?id=1870343 because the machine-config-daemon-pull.service was added in 4.6.
This bug looks more like bug https://bugzilla.redhat.com/show_bug.cgi?id=1859428 (marked as a duplicate of #1870343), so I cloned this from #1859428. If it's not the same issue from the MCO side, let's change it back to the machine-config side for further debugging.

> I actually don't see any errors in the MCO namespace. It appears to me that the install has a lot of other failures:
Yeah, the installation cannot finish; many operators report errors because no worker nodes can be scheduled.

> How reproducible is this? It seems to me that your cluster installation failed long before it got to the worker boot.
I tried twice yesterday; both installations failed with the same error.
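
FWIW, the stuck state is visible from the machine-api side with generic commands (shown for reference, not output from this cluster):

      # Workers stay in the 'Provisioned' phase and never show up as nodes
      oc -n openshift-machine-api get machines
      oc get nodes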

Comment 3 Yu Qi Zhang 2020-09-18 14:34:34 UTC
Ah, I see. Can you set up an env where you are able to access the workers? I.e., could you check whether the instances themselves have booted, and whether you can get console/journal logs? They never joined the cluster, but it looks like they did make requests to the server at some point, so I can look into that for you if you are able to get journal logs from the worker nodes.
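
For reference, a sketch of what to collect once a worker is reachable over SSH (unit names assume 4.5-era RHCOS, so treat them as assumptions):

      # On the worker (ssh core@<worker-ip>):
      # full journal for the current boot
      journalctl -b --no-pager > worker-journal.log
      # first-boot config application and kubelet startup
      journalctl -b -u machine-config-daemon-firstboot.service -u kubelet.service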

Comment 4 Joel Speed 2020-11-13 13:11:38 UTC
@jiajliu You moved this to the Cloud Compute team without clear reasoning.

> They never joined the cluster but it looks like they did make requests to the server

Reading through this thread and based on the comment highlighted above, it appears Machine API is working as expected. Is there something you wanted us to check out?

I think this should be assigned back to the MCO team.

Comment 5 liujia 2020-11-16 02:12:25 UTC
(In reply to Joel Speed from comment #4) 
> Reading through this thread and based on the comment highlighted above, it
> appears Machine API is working as expected. Is there something you wanted us
> to check out?
> 
> I think this should be assigned back to the MCO team

I moved it to Cloud based on comment 1 and comment 2. My original issue is the same failure as #1859428, but I'm not sure which component the root cause belongs to now. Since #1859428 has been moved to MCO, let's move this one to MCO as well for further debugging.

Comment 6 liujia 2020-11-16 02:13:43 UTC
> Ah, I see. Can you set up an env where you are able to access the workers? I.e., could you check whether the instances themselves have booted, and whether you can get console/journal logs? They never joined the cluster, but it looks like they did make requests to the server at some point, so I can look into that for you if you are able to get journal logs from the worker nodes.

@Yu Qi Zhang I missed your last comment; I will try to reproduce the issue today. Based on the above comment, I have just moved the bug back to MCO to stay aligned with #1859428. Feel free to update the component after your debugging. Thanks!

Comment 7 liujia 2020-11-16 03:30:37 UTC
@Yu Qi Zhang
I just tried to reproduce it with the v4.5.10 and the latest v4.5.19 payloads. Both installations succeeded, so I think this is not 100% reproducible. If the original must-gather info in the description (still available) is not enough to debug the issue, I suggest closing the bug as WORKSFORME and reopening it when we hit it again. How does that sound?

Comment 8 Yu Qi Zhang 2020-11-24 03:27:09 UTC
Yes, that sounds good. If we ever hit this again, please attach the console logs for the nodes. Thanks!