Bug 2009111 - [IPI-on-GCP] 'Install a cluster with nested virtualization enabled' failed due to unable to launch compute instances
Summary: [IPI-on-GCP] 'Install a cluster with nested virtualization enabled' failed du...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: dmoiseev
QA Contact: sunzhaohua
URL:
Whiteboard:
Duplicates: 2009127
Depends On:
Blocks: 2009738
Reported: 2021-09-30 01:03 UTC by Jianli Wei
Modified: 2022-04-11 08:33 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Due to a backward-incompatible change in the Google Cloud SDK, the machine controller could not create machines because the resulting image URL was incorrect. The image URL logic has been repaired to match the latest Google SDK changes.
Clone Of:
Clones: 2009738
Environment:
Last Closed: 2022-03-10 16:14:44 UTC
Target Upstream Version:
Embargoed:



Links
System                  ID                                           Private  Priority  Status  Summary                                                     Last Updated
Github                  openshift cluster-api-provider-gcp pull 175  0        None      open    Bug 2009111: Try fix broken path defaulting for disk image  2021-09-30 20:22:15 UTC
Red Hat Product Errata  RHSA-2022:0056                               0        None      None    None                                                        2022-03-10 16:15:18 UTC

Description Jianli Wei 2021-09-30 01:03:49 UTC
Version:

[fedora@preserve-jiwei ~]$ openshift-install version
openshift-install 4.9.0-0.nightly-2021-09-27-105859
built from commit 4adb828b31a7cccbf4dfda8a4b65bfdb91f7e32a
release image registry.ci.openshift.org/ocp/release@sha256:7f8a62e4579f36d8e128d52f4d97a067f1c9d82c54043cb3aebb41fec7ff6082
release architecture amd64
[fedora@preserve-jiwei ~]$

Platform: GCP

Please specify:
* IPI (automated install with `openshift-install`. If you don't know, then it's IPI)

What happened?

Worker nodes failed to launch, and the installation ultimately failed.

What did you expect to happen?

Worker nodes should be launched successfully and installation should succeed. 

How to reproduce it (as minimally and precisely as possible)?

FYI the corresponding QE test case is at https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-32001. 

Run an IPI install using an 'install-config.yaml' whose platform.gcp section looks like the one below, where the 2 additional 'licenses' lines are inserted:

platform:
  gcp:
    projectID: openshift-qe
    region: us-central1
    licenses:
    - https://compute.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx

Anything else we need to know?

The sourceImage should be something like 'https://compute.googleapis.com/compute/v1/projects/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image', rather than 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. 

[fedora@preserve-jiwei ~]$ oc get nodes
NAME                                              STATUS   ROLES    AGE   VERSION
jiwei-bb-9sgxx-master-0.c.openshift-qe.internal   Ready    master   38m   v1.22.0-rc.0+af080cb
jiwei-bb-9sgxx-master-1.c.openshift-qe.internal   Ready    master   38m   v1.22.0-rc.0+af080cb
jiwei-bb-9sgxx-master-2.c.openshift-qe.internal   Ready    master   38m   v1.22.0-rc.0+af080cb
[fedora@preserve-jiwei ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          43m     Unable to apply 4.9.0-0.nightly-2021-09-27-105859: some cluster operators have not yet rolled out
[fedora@preserve-jiwei ~]$ oc get co | grep -Ev 'True        False         False'
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-09-27-105859   False       False         True       37m     OAuthServerRouteEndpointAccessibleControllerAvailable: route.route.openshift.io "oauth-openshift" not found...
console                                    4.9.0-0.nightly-2021-09-27-105859   False       False         True       25m     RouteHealthAvailable: console route is not admitted
image-registry                                                                 False       True          True       27m     Available: The deployment does not have available replicas...
ingress                                                                        False       True          True       26m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
monitoring                                                                     False       True          True       20m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.9.0-0.nightly-2021-09-27-105859   True        True          False      37m     Deployment "openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
[fedora@preserve-jiwei ~]$ 
[fedora@preserve-jiwei ~]$ oc logs machine-api-controllers-6944c86995-6gmg8 -n openshift-machine-api -c machine-controller | grep error
E0929 11:06:28.670672       1 actuator.go:53] jiwei-bb-9sgxx-worker-a-6gn95 error: jiwei-bb-9sgxx-worker-a-6gn95: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
W0929 11:06:28.671846       1 controller.go:366] jiwei-bb-9sgxx-worker-a-6gn95: failed to create machine: jiwei-bb-9sgxx-worker-a-6gn95: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
I0929 11:06:28.671958       1 controller.go:470] Actuator returned invalid configuration error: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
I0929 11:06:28.672078       1 recorder.go:104] controller-runtime/manager/events "msg"="Warning"  "message"="jiwei-bb-9sgxx-worker-a-6gn95: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jiwei-bb-9sgxx-worker-a-6gn95","uid":"b6ada9d1-71a2-4f48-8982-dd308ab3a8bd","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"9730"} "reason"="FailedCreate"
E0929 11:06:49.202333       1 actuator.go:53] jiwei-bb-9sgxx-worker-b-dnsvj error: jiwei-bb-9sgxx-worker-b-dnsvj: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
W0929 11:06:49.203363       1 controller.go:366] jiwei-bb-9sgxx-worker-b-dnsvj: failed to create machine: jiwei-bb-9sgxx-worker-b-dnsvj: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
I0929 11:06:49.203491       1 controller.go:470] Actuator returned invalid configuration error: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid
I0929 11:06:49.203573       1 recorder.go:104] controller-runtime/manager/events "msg"="Warning"  "message"="jiwei-bb-9sgxx-worker-b-dnsvj: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image'. The URL is malformed., invalid" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jiwei-bb-9sgxx-worker-b-dnsvj","uid":"d7fe5320-a482-4245-9bf6-5b462e47ff24","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"9743"} "reason"="FailedCreate"
E0929 11:13:09.220953       1 leaderelection.go:330] error retrieving resource lock openshift-machine-api/cluster-api-provider-gcp-leader: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps cluster-api-provider-gcp-leader)
[fedora@preserve-jiwei ~]$

Comment 1 sunzhaohua 2021-09-30 04:06:22 UTC
Upgraded a cluster along 4.3.40->4.4.33->4.5.41->4.6.46->4.7.32->4.8.13->4.9.0-0.nightly-2021-09-29-172320, then scaled up a machineset and hit the same issue.

$ oc get machine
NAME                     PHASE     TYPE            REGION        ZONE            AGE
zhsun9-z8mz5-m-0         Running   n1-standard-4   us-central1   us-central1-a   23h
zhsun9-z8mz5-m-1         Running   n1-standard-4   us-central1   us-central1-b   23h
zhsun9-z8mz5-m-2         Running   n1-standard-4   us-central1   us-central1-c   23h
zhsun9-z8mz5-w-a-r2f8l   Running   n1-standard-4   us-central1   us-central1-a   20h
zhsun9-z8mz5-w-b-lc8hm   Running   n1-standard-4   us-central1   us-central1-b   13h
zhsun9-z8mz5-w-c-gxz78   Failed                                                  5m6s
zhsun9-z8mz5-w-c-pj9dw   Failed                                                  5m6s
zhsun9-z8mz5-w-c-r4cjs   Running   n1-standard-4   us-central1   us-central1-c   20h
zhsun9-z8mz5-w-c-r8cd5   Failed                                                  5m6s
zhsun9-z8mz5-w-f-fbrqp   Running   n1-standard-4   us-central1   us-central1-f   133m
zhsun9-z8mz5-w-f-wsxp7   Failed                                                  24m

$ oc edit machine zhsun9-z8mz5-w-c-gxz78
status:
  conditions:
  - lastTransitionTime: "2021-09-30T03:53:57Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  errorMessage: 'error launching instance: googleapi: Error 400: Invalid value for
    field ''resource.disks[0].initializeParams.sourceImage'': ''https://compute.googleapis.com/compute/v1/openshift-qe/global/images/zhsun9-z8mz5-rhcos-image''.
    The URL is malformed., invalid'
  errorReason: InvalidConfiguration
  lastUpdated: "2021-09-30T03:53:57Z"
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2021-09-30T03:53:57Z"
      lastTransitionTime: "2021-09-30T03:53:57Z"
      message: 'googleapi: Error 400: Invalid value for field ''resource.disks[0].initializeParams.sourceImage'':
        ''https://compute.googleapis.com/compute/v1/openshift-qe/global/images/zhsun9-z8mz5-rhcos-image''.
        The URL is malformed., invalid'
      reason: MachineCreationFailed
      status: "False"
      type: MachineCreated

Comment 2 sunzhaohua 2021-09-30 04:10:40 UTC
The URL is missing a "projects" segment between v1 and openshift-qe, based on https://cloud.google.com/compute/docs/reference/rest/v1/images/list#http-request
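
The observation above can be sketched as a small Go check. This is only an illustration, not code from cluster-api-provider-gcp: the helper name hasProjectsSegment and the computeV1Prefix constant are assumptions made for the example, and the two URLs are the ones quoted in this report.

```go
package main

import (
	"fmt"
	"strings"
)

// computeV1Prefix is the API root that appears in the URLs quoted in this bug.
const computeV1Prefix = "https://compute.googleapis.com/compute/v1/"

// hasProjectsSegment (hypothetical helper) reports whether the path that
// follows the API root starts with "projects/<project>/".
func hasProjectsSegment(imageURL string) bool {
	if !strings.HasPrefix(imageURL, computeV1Prefix) {
		return false
	}
	rest := strings.TrimPrefix(imageURL, computeV1Prefix)
	parts := strings.SplitN(rest, "/", 3)
	return len(parts) == 3 && parts[0] == "projects" && parts[1] != ""
}

func main() {
	expected := "https://compute.googleapis.com/compute/v1/projects/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image"
	reported := "https://compute.googleapis.com/compute/v1/openshift-qe/global/images/jiwei-bb-9sgxx-rhcos-image"
	fmt.Println("expected URL ok:", hasProjectsSegment(expected))
	fmt.Println("reported URL ok:", hasProjectsSegment(reported))
}
```

A check like this only looks at the shape of the path; it does not confirm that the project or image actually exists.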

Comment 3 Mike Fedosin 2021-09-30 10:28:24 UTC
I'm investigating... It seems like the installer generates correct URLs for images: https://github.com/openshift/installer/blob/master/pkg/asset/rhcos/image.go#L103
So something happens after that.

Could you please provide a must-gather output for this issue?

Comment 4 sunzhaohua 2021-09-30 11:08:48 UTC
Mike, this bug seems to be the same as https://bugzilla.redhat.com/show_bug.cgi?id=2009127#c1. A must-gather is provided.

Comment 6 dmoiseev 2021-09-30 16:09:31 UTC
(In reply to sunzhaohua from comment #1)

Can you please share machineset and machines (running and failed) manifests?

Comment 7 dmoiseev 2021-09-30 16:16:50 UTC
Looking at the code, I don't understand why this test was passing before.

This line
https://github.com/openshift/cluster-api-provider-gcp/blob/release-4.9/pkg/cloud/gcp/actuators/machine/reconciler.go#L74 has been there for at least the last year, and the related installer parts that I'm aware of have not changed for quite a while either. This needs investigation.

Comment 8 dmoiseev 2021-09-30 16:34:27 UTC
OK, the base path was changed inside the Google SDK, so my fix should be valid. Previous OCP versions should not be affected.
Evidence can be found in the diff:

git diff --output=diff c6faa4bae2ca201573c628e92b112971833284e7~1..HEAD vendor/google.golang.org/api/compute/v1/compute-gen.go
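
One way a base path change can produce the malformed URL is sketched below in Go. This is an illustration under stated assumptions, not the actual reconciler code or the actual patch: oldBasePath is assumed to have ended in "projects/", newBasePath is inferred from the URL in the error message, and joinLegacy and joinFixed are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Assumption: the older SDK base path already ended in "projects/".
	oldBasePath = "https://www.googleapis.com/compute/v1/projects/"
	// Inferred from the URL in the error message reported in this bug.
	newBasePath = "https://compute.googleapis.com/compute/v1/"
)

// joinLegacy (hypothetical) appends "<project>/global/images/<image>" to the
// base path, relying on the base path to supply the "projects/" segment.
func joinLegacy(basePath, project, image string) string {
	return basePath + project + "/global/images/" + image
}

// joinFixed (hypothetical) adds "projects/" itself unless the base path
// already ends with it, so it yields a valid path with either base path.
func joinFixed(basePath, project, image string) string {
	if !strings.HasSuffix(basePath, "projects/") {
		basePath += "projects/"
	}
	return basePath + project + "/global/images/" + image
}

func main() {
	project, image := "openshift-qe", "jiwei-bb-9sgxx-rhcos-image"
	fmt.Println("legacy join, old base:", joinLegacy(oldBasePath, project, image))
	fmt.Println("legacy join, new base:", joinLegacy(newBasePath, project, image))
	fmt.Println("fixed join,  new base:", joinFixed(newBasePath, project, image))
}
```

With the new base path, the legacy join reproduces the URL from the error message, while the fixed join yields the form that Comment 2 describes as correct.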

Comment 10 dmoiseev 2021-09-30 20:19:39 UTC
I changed the target release so that the patch can be backported to 4.9 using your existing automation.

Comment 13 dmoiseev 2021-10-06 12:50:34 UTC
*** Bug 2009127 has been marked as a duplicate of this bug. ***

Comment 14 sunzhaohua 2021-10-08 04:12:21 UTC
verified
clusterversion: 4.10.0-0.nightly-2021-10-07-212540

Upgraded from 4.9.0-rc.1 to 4.10.0-0.nightly-2021-10-07-212540; the upgrade was successful. After the upgrade, machines could be created successfully.

$ oc get machine
NAME                             PHASE     TYPE            REGION        ZONE            AGE
zhsun1081-j96wp-master-0         Running   n1-standard-4   us-central1   us-central1-a   129m
zhsun1081-j96wp-master-1         Running   n1-standard-4   us-central1   us-central1-b   129m
zhsun1081-j96wp-master-2         Running   n1-standard-4   us-central1   us-central1-c   129m
zhsun1081-j96wp-worker-a-z2r2p   Running   n1-standard-4   us-central1   us-central1-a   122m
zhsun1081-j96wp-worker-b-vdbkz   Running   n1-standard-4   us-central1   us-central1-b   122m
zhsun1081-j96wp-worker-c-4b74m   Running   n1-standard-4   us-central1   us-central1-c   3m13s
zhsun1081-j96wp-worker-c-f9wfn   Running   n1-standard-4   us-central1   us-central1-c   122m

Comment 17 errata-xmlrpc 2022-03-10 16:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

