So I took a look at the must-gather. There appear to have been two requests for worker configs at some point, but no workers ever joined the cluster. I don't think this is exactly https://bugzilla.redhat.com/show_bug.cgi?id=1870343, because the machine-config-daemon-pull.service was added in 4.6. I actually don't see any errors in the MCO namespace.

It appears to me that the install has a lot of other failures. From your clusterversion:

Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring

And in clusteroperators:

IngressStateEndpointsDegraded: No endpoints found for oauth-server
RouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.auto-jiajliu-618212.qe.gcp.devcluster.openshift.com: []

How reproducible is this? It seems to me that your cluster installation failed long before it got to the worker boot. If you think the other errors aren't the root cause and want to check the MCO, the must-gather doesn't capture enough. I would need either journal or console logs from the failed worker nodes, if they are directly accessible.
> I don't think this is exactly https://bugzilla.redhat.com/show_bug.cgi?id=1870343 because the machine-config-daemon-pull.service was added in 4.6.

This bug looks more like https://bugzilla.redhat.com/show_bug.cgi?id=1859428 (marked as a duplicate of #1870343), so I cloned it from #1859428. If it's not the same issue from the MCO side, let's change it back to machine-config for further debugging.

> I actually don't see any errors in the MCO namespace. It appears to me that the install has a lot of other failures:

Yeah, the installation cannot finish; many operators report errors because no worker nodes can be scheduled.

> How reproducible is this? It seems to me that your cluster installation failed long before it got to the worker boot.

I tried twice yesterday, and both installations failed with the same error.
Ah, I see. Can you set up an env where you are able to access the workers? I.e., could you check whether the instances themselves have booted, and whether you can get console/journal logs? They never joined the cluster, but it looks like they did make requests to the server at some point, so I can look into that for you if you are able to get journal logs from the worker nodes.
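Since this cluster is on GCP (going by the oauth route hostname), one way to check whether the workers booted and to grab their console logs is via the gcloud CLI. A rough sketch, assuming you have gcloud access to the cluster's project; `<project>`, `<zone>`, and `<instance>` are placeholders, and the `worker` name filter is an assumption about the installer's instance naming:

```shell
# List the instances so we can see whether the workers were created and are RUNNING.
gcloud compute instances list --project <project> --filter="name~'worker'"

# Dump the serial console output of a worker that never joined the cluster;
# this usually shows Ignition/boot failures even when SSH is unavailable.
gcloud compute instances get-serial-port-output <instance> \
    --zone <zone> --project <project> > worker-console.log
```

If the node booted far enough to be reachable over SSH, `journalctl` output from the node itself would be even more useful.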
@jiajliu You moved this to the Cloud Compute team without clear reasoning.

> They never joined the cluster but it looks like they did make requests to the server

Reading through this thread, and based on the comment highlighted above, it appears the Machine API is working as expected. Is there something you wanted us to check? I think this should be assigned back to the MCO team.
(In reply to Joel Speed from comment #4)
> Reading through this thread and based on the comment highlighted above, it
> appears Machine API is working as expected. Is there something you wanted us
> to check out?
>
> I think this should be assigned back to the MCO team

I moved it back to Cloud based on comment 1 and comment 2. My original issue is the same failure as #1859428, but I'm not sure which component the root cause belongs to now. Since #1859428 has now moved to MCO, let's move this to MCO as well for further debugging.
> Ah I see, can you set up an env where you are able to access the workers? i.e. could you check whether the instances themselves have booted, and whether you can get console/journal logs? They never joined the cluster but it looks like they did make requests to the server at some point so I can look into that for you if you are able to get journal logs from worker nodes.

@Yu Qi Zhang I missed your last comment; I will try to reproduce the issue today. Based on the comment above, I just moved the bug back to MCO to match #1859428. Feel free to update the component after your debugging. Thx!
@Yu Qi Zhang I just tried to reproduce it with the v4.5.10 and latest v4.5.19 payloads. Both installations succeeded, so I don't think this is 100% reproducible. If the original must-gather info in the description (still available) is not enough to debug the issue, I suggest closing the bug as WORKSFORME and reopening it if we hit it again. WDYT?
Yes, that sounds good. If we ever hit this again, please attach the console logs for the nodes. Thanks!