Version:

[fedora@preserve-jiwei ~]$ openshift-install version
openshift-install 4.9.0-0.nightly-2021-09-27-105859
built from commit 4adb828b31a7cccbf4dfda8a4b65bfdb91f7e32a
release image registry.ci.openshift.org/ocp/release@sha256:7f8a62e4579f36d8e128d52f4d97a067f1c9d82c54043cb3aebb41fec7ff6082
release architecture amd64
[fedora@preserve-jiwei ~]$

Platform: GCP

Please specify:
* IPI (automated install with `openshift-install`. If you don't know, then it's IPI)

What happened?
Worker nodes failed to launch, and the installation ultimately failed.

What did you expect to happen?
Worker nodes should be launched successfully and the installation should succeed.

How to reproduce it (as minimally and precisely as possible)?
FYI the corresponding QE test case is at https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-32001.
Run an IPI install using an 'install-config.yaml' with a platform.gcp section like the one below, where 2 additional lines are inserted:

platform:
  gcp:
    projectID: openshift-qe
    region: us-central1
    licenses:
    - https://compute.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx

Anything else we need to know?
The sourceImage should be something like 'https://compute.googleapis.com/compute/v1/projects/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image', rather than 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'.

[fedora@preserve-jiwei ~]$ oc get nodes
NAME                                              STATUS   ROLES    AGE   VERSION
jiwei-bb-9sgxx-master-0.c.openshift-qe.internal   Ready    master   38m   v1.22.0-rc.0+af080cb
jiwei-bb-9sgxx-master-1.c.openshift-qe.internal   Ready    master   38m   v1.22.0-rc.0+af080cb
jiwei-bb-9sgxx-master-2.c.openshift-qe.internal   Ready    master   38m   v1.22.0-rc.0+af080cb
[fedora@preserve-jiwei ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          43m     Unable to apply 4.9.0-0.nightly-2021-09-27-105859: some cluster operators have not yet rolled out
[fedora@preserve-jiwei ~]$ oc get co | grep -Ev 'True False False'
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.9.0-0.nightly-2021-09-27-105859   False       False         True       37m     OAuthServerRouteEndpointAccessibleControllerAvailable: route.route.openshift.io "oauth-openshift" not found...
console          4.9.0-0.nightly-2021-09-27-105859   False       False         True       25m     RouteHealthAvailable: console route is not admitted
image-registry                                       False       True          True       27m     Available: The deployment does not have available replicas...
ingress                                              False       True          True       26m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
monitoring                                           False       True          True       20m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network          4.9.0-0.nightly-2021-09-27-105859   True        True          False      37m     Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
[fedora@preserve-jiwei ~]$
[fedora@preserve-jiwei ~]$ oc logs machine-api-controllers-6944c86995-6gmg8 -n openshift-machine-api -c machine-controller | grep error
E0929 11:06:28.670672 1 actuator.go:53] jiwei-bb-9sgxx-worker-a-6gn95 error: jiwei-bb-9sgxx-worker-a-6gn95: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
W0929 11:06:28.671846 1 controller.go:366] jiwei-bb-9sgxx-worker-a-6gn95: failed to create machine: jiwei-bb-9sgxx-worker-a-6gn95: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
I0929 11:06:28.671958 1 controller.go:470] Actuator returned invalid configuration error: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
I0929 11:06:28.672078 1 recorder.go:104] controller-runtime/manager/events "msg"="Warning" "message"="jiwei-bb-9sgxx-worker-a-6gn95: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jiwei-bb-9sgxx-worker-a-6gn95","uid":"b6ada9d1-71a2-4f48-8982-dd308ab3a8bd","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"9730"} "reason"="FailedCreate"
E0929 11:06:49.202333 1 actuator.go:53] jiwei-bb-9sgxx-worker-b-dnsvj error: jiwei-bb-9sgxx-worker-b-dnsvj: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
W0929 11:06:49.203363 1 controller.go:366] jiwei-bb-9sgxx-worker-b-dnsvj: failed to create machine: jiwei-bb-9sgxx-worker-b-dnsvj: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
I0929 11:06:49.203491 1 controller.go:470] Actuator returned invalid configuration error: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
I0929 11:06:49.203573 1 recorder.go:104] controller-runtime/manager/events "msg"="Warning" "message"="jiwei-bb-9sgxx-worker-b-dnsvj: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jiwei-bb-9sgxx-worker-b-dnsvj","uid":"d7fe5320-a482-4245-9bf6-5b462e47ff24","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"9743"} "reason"="FailedCreate"
E0929 11:13:09.220953 1 leaderelection.go:330] error retrieving resource lock openshift-machine-api/cluster-api-provider-gcp-leader: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps cluster-api-provider-gcp-leader)
[fedora@preserve-jiwei ~]$
Upgraded a cluster from 4.3.40->4.4.33->4.5.41->4.6.46->4.7.32->4.8.13->4.9.0-0.nightly-2021-09-29-172320 and then scaled up a machineset; hit the same issue.

$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun9-z8mz5-m-0         Running   n1-standard-4   us-central1   us-central1-a   23h
zhsun9-z8mz5-m-1         Running   n1-standard-4   us-central1   us-central1-b   23h
zhsun9-z8mz5-m-2         Running   n1-standard-4   us-central1   us-central1-c   23h
zhsun9-z8mz5-w-a-r2f8l   Running   n1-standard-4   us-central1   us-central1-a   20h
zhsun9-z8mz5-w-b-lc8hm   Running   n1-standard-4   us-central1   us-central1-b   13h
zhsun9-z8mz5-w-c-gxz78   Failed                                                  5m6s
zhsun9-z8mz5-w-c-pj9dw   Failed                                                  5m6s
zhsun9-z8mz5-w-c-r4cjs   Running   n1-standard-4   us-central1   us-central1-c   20h
zhsun9-z8mz5-w-c-r8cd5   Failed                                                  5m6s
zhsun9-z8mz5-w-f-fbrqp   Running   n1-standard-4   us-central1   us-central1-f   133m
zhsun9-z8mz5-w-f-wsxp7   Failed                                                  24m

$ oc edit machine zhsun9-z8mz5-w-c-gxz78
status:
  conditions:
  - lastTransitionTime: "2021-09-30T03:53:57Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  errorMessage: 'error launching instance: googleapi: Error 400: Invalid value for field ''resource.disks[0].initializeParams.sourceImage'': ''https://compute.googleapis.com/compute/v1/openshift-qe/global/images/zhsun9-z8mz5-rhcos-image''. The URL is malformed., invalid'
  errorReason: InvalidConfiguration
  lastUpdated: "2021-09-30T03:53:57Z"
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2021-09-30T03:53:57Z"
      lastTransitionTime: "2021-09-30T03:53:57Z"
      message: 'googleapi: Error 400: Invalid value for field ''resource.disks[0].initializeParams.sourceImage'': ''https://compute.googleapis.com/compute/v1/openshift-qe/global/images/zhsun9-z8mz5-rhcos-image''. The URL is malformed., invalid'
      reason: MachineCreationFailed
      status: "False"
      type: MachineCreated
The URL is missing a "projects" segment between v1 and openshift-qe, based on https://cloud.google.com/compute/docs/reference/rest/v1/images/list#http-request
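To make the difference between the two URLs concrete, here is a minimal Go sketch that checks whether an image URL follows the /compute/v1/projects/<project>/global/images/<image> layout. The checkImageURL helper is hypothetical and written only for illustration; it is not part of the installer or of cluster-api-provider-gcp.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// checkImageURL is a hypothetical helper. It returns an error when a
// Compute Engine image URL does not follow the layout
// /compute/v1/projects/<project>/global/images/<image>.
func checkImageURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse %q: %w", raw, err)
	}
	seg := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(seg) < 3 || seg[0] != "compute" || seg[1] != "v1" {
		return fmt.Errorf("unexpected path %q", u.Path)
	}
	// This is the check that the failing sourceImage in this report trips.
	if seg[2] != "projects" {
		return fmt.Errorf("missing \"projects\" segment after v1 in %q", raw)
	}
	if len(seg) != 7 || seg[4] != "global" || seg[5] != "images" {
		return fmt.Errorf("unexpected path %q", u.Path)
	}
	return nil
}

func main() {
	// Both URLs are taken from this report.
	urls := []string{
		"https://compute.googleapis.com/compute/v1/projects/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image",
		"https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image",
	}
	for _, raw := range urls {
		if err := checkImageURL(raw); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("ok:", raw)
	}
}
```

The check is deliberately strict about the segment count so that the six-segment URL from the machine-controller log is rejected while the seven-segment form is accepted.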
I'm investigating... It seems like the installer generates correct URLs for images: https://github.com/openshift/installer/blob/master/pkg/asset/rhcos/image.go#L103
So something must be happening after that. Could you please provide must-gather output for this issue?
Mike, it seems this bug is the same as https://bugzilla.redhat.com/show_bug.cgi?id=2009127#c1
The must-gather has been provided.
PR: https://github.com/openshift/cluster-api-provider-gcp/pull/175/files
(In reply to sunzhaohua from comment #1)
Can you please share machineset and machines (running and failed) manifests?
Looking at the code, I don't understand why this test was passing before. This line, https://github.com/openshift/cluster-api-provider-gcp/blob/release-4.9/pkg/cloud/gcp/actuators/machine/reconciler.go#L74, has been there for at least the last year, and the related installer parts that I'm aware of have not changed for quite a while either. This needs investigation.
OK, the base path was changed inside the Google SDK, so my fix should be valid. Previous OCP versions should not be affected. Evidence can be found in the diff:
git diff --output=diff c6faa4bae2ca201573c628e92b112971833284e7~1..HEAD vendor/google.golang.org/api/compute/v1/compute-gen.go
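A rough Go sketch of how a base path change like this could produce the malformed URL. The two base path values are assumptions inferred from the error message in this report, not copied from the vendored compute-gen.go, and both helper functions are hypothetical. The second helper shows one way to avoid depending on the SDK constant; it is not necessarily what the linked PR does.

```go
package main

import "fmt"

// Assumed values, inferred from the error message in this report. The real
// constants live in the vendored compute-gen.go and may differ.
const (
	oldBasePath = "https://compute.googleapis.com/compute/v1/projects/"
	newBasePath = "https://compute.googleapis.com/compute/v1/"
)

// joinWithBasePath is a hypothetical helper that mimics building an image
// URL by appending the project and image name to an SDK-provided base path.
func joinWithBasePath(basePath, project, image string) string {
	return fmt.Sprintf("%s%s/global/images/%s", basePath, project, image)
}

// imageURL is a hypothetical helper that spells out the "projects/" segment
// itself, so the result does not depend on what the base path ends with.
func imageURL(project, image string) string {
	return fmt.Sprintf("https://compute.googleapis.com/compute/v1/projects/%s/global/images/%s", project, image)
}

func main() {
	project, image := "openshift-qe", "jiwei-bb-9sgxx-rhcos-image"

	fmt.Println("old base path:", joinWithBasePath(oldBasePath, project, image))
	fmt.Println("new base path:", joinWithBasePath(newBasePath, project, image))
	fmt.Println("explicit:     ", imageURL(project, image))
}
```

Under these assumptions, the concatenation against the shorter base path yields exactly the sourceImage value that the GCP API rejected, which would also explain why code that had not changed for a long time started failing only after the vendored SDK was updated.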
I changed the target release so that the patch can be backported to 4.9 using your existing automation.
*** Bug 2009127 has been marked as a duplicate of this bug. ***
Verified.
clusterversion: 4.10.0-0.nightly-2021-10-07-212540

Upgraded from 4.9.0-rc.1 to 4.10.0-0.nightly-2021-10-07-212540; the upgrade was successful. After the upgrade, a machine could be created successfully.

$ oc get machine
NAME                             PHASE     TYPE            REGION        ZONE            AGE
zhsun1081-j96wp-master-0         Running   n1-standard-4   us-central1   us-central1-a   129m
zhsun1081-j96wp-master-1         Running   n1-standard-4   us-central1   us-central1-b   129m
zhsun1081-j96wp-master-2         Running   n1-standard-4   us-central1   us-central1-c   129m
zhsun1081-j96wp-worker-a-z2r2p   Running   n1-standard-4   us-central1   us-central1-a   122m
zhsun1081-j96wp-worker-b-vdbkz   Running   n1-standard-4   us-central1   us-central1-b   122m
zhsun1081-j96wp-worker-c-4b74m   Running   n1-standard-4   us-central1   us-central1-c   3m13s
zhsun1081-j96wp-worker-c-f9wfn   Running   n1-standard-4   us-central1   us-central1-c   122m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056