Bug 1898194 - GCP: can't install on custom machine types
Summary: GCP: can't install on custom machine types
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jeremiah Stuever
QA Contact: To Hung Sze
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-16 15:58 UTC by Manuel Dewald
Modified: 2021-02-24 15:34 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:33:36 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 4386 0 None closed Bug 1898194: installconfig/gcp/validation: handle custom machine types 2021-02-17 19:44:51 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:34:06 UTC

Description Manuel Dewald 2020-11-16 15:58:26 UTC
Version:

$ openshift-install version
openshift-v4.7.0-0.nightly-2020-11-16-091948-nightly

Platform: GCP

Please specify:
* IPI

What happened?

Installing a cluster on GCP with custom machine types no longer works (it worked in 4.6).
This breaks running OSD clusters on 4.7, including the osde2e tests.

> controlPlane.platform.gcp.type: Invalid value: \"custom-4-16384\": instance type custom-4-16384 not found, compute[0].platform.gcp.type: Invalid value: \"custom-4-16384\": instance type custom-4-16384 not found

custom-4-16384 is a valid custom machine type. See also https://cloud.google.com/compute/docs/instances/creating-instance-with-custom-machine-type#gcloud

It seems a new name-based validation of machine types was introduced in https://github.com/openshift/installer/commit/0d74861bd7fa7cb28cde3f79fa63dc43704b69b2.


What did you expect to happen?

successful cluster installation

How to reproduce it (as minimally and precisely as possible)?

using the OCM cli:

> ocm create cluster --version 4.7.0-0.nightly-2020-11-16-091948-nightly mdewald-47-gcp --provider gcp --region us-east1



installer log:

time="2020-11-16T15:56:18Z" level=debug msg="OpenShift Installer v4.7.0"
time="2020-11-16T15:56:18Z" level=debug msg="Built from commit 8d9d7cb16bd681f5ff5ff1b22305d3a6d5466529"
time="2020-11-16T15:56:18Z" level=debug msg="Fetching Master Machines..."
time="2020-11-16T15:56:18Z" level=debug msg="Loading Master Machines..."
time="2020-11-16T15:56:18Z" level=debug msg="  Loading Cluster ID..."
time="2020-11-16T15:56:18Z" level=debug msg="    Loading Install Config..."
time="2020-11-16T15:56:18Z" level=debug msg="      Loading SSH Key..."
time="2020-11-16T15:56:18Z" level=debug msg="      Loading Base Domain..."
time="2020-11-16T15:56:18Z" level=debug msg="        Loading Platform..."
time="2020-11-16T15:56:18Z" level=debug msg="      Loading Cluster Name..."
time="2020-11-16T15:56:18Z" level=debug msg="        Loading Base Domain..."
time="2020-11-16T15:56:18Z" level=debug msg="        Loading Platform..."
time="2020-11-16T15:56:18Z" level=debug msg="      Loading Pull Secret..."
time="2020-11-16T15:56:18Z" level=debug msg="      Loading Platform..."
time="2020-11-16T15:56:18Z" level=info msg="Credentials loaded from environment variable \"GOOGLE_CREDENTIALS\", file \"/.gcp/osServiceAccount.json\""
time="2020-11-16T15:56:23Z" level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": [controlPlane.platform.gcp.type: Invalid value: \"custom-4-16384\": instance type custom-4-16384 not found, compute[0].platform.gcp.type: Invalid value: \"custom-4-16384\": instance type custom-4-16384 not found]"

Comment 1 To Hung Sze 2020-11-16 16:11:10 UTC
A simpler way to test this manually:
Create an install-config with: openshift-install create install-config --dir <install-dir>
Edit install-config.yaml and replace the {} after GCP with:
  type: custom-4-16384

Then run: openshift-install create manifests --dir <install-dir>

4.7 fails with an error like this:
FATAL failed to fetch Master Machines: failed to load asset "Install Config": [controlPlane.platform.gcp.type: Invalid value: "custom-4-16384": instance type custom-4-16384 not found, compute[0].platform.gcp.type: Invalid value: "custom-4-16384": instance type custom-4-16384 not found] 

Also, 'gcloud compute machine-types list' doesn't seem to return custom types.

Comment 4 Jeremiah Stuever 2020-11-16 18:01:03 UTC
We will need to add special handling for custom machine types. They are named using a specific format, [type-]custom-<cpu>-<memory>, and default to N1 when no type is specified.
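
To illustrate the naming format described above, here is a minimal Go sketch that splits a name into its parts and applies the N1 default. The type customType, the helper parseCustomType, and the regular expression are hypothetical names made up for this example. They are not taken from the installer's code, and the actual fix may handle these names differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// customType holds the parts of a custom machine type name.
// Hypothetical type used only for this illustration.
type customType struct {
	Family   string // machine family prefix, e.g. "n1"
	VCPUs    int    // number of vCPUs taken from the name
	MemoryMB int    // memory taken from the name
}

// customTypeRE matches [type-]custom-<cpu>-<memory>.
var customTypeRE = regexp.MustCompile(`^(?:([a-z0-9]+)-)?custom-(\d+)-(\d+)$`)

// parseCustomType reports whether name follows the custom naming format
// and, if it does, returns its parts. A missing prefix defaults to "n1".
func parseCustomType(name string) (customType, bool) {
	m := customTypeRE.FindStringSubmatch(name)
	if m == nil {
		return customType{}, false
	}
	family := m[1]
	if family == "" {
		family = "n1" // no type given: default to N1
	}
	cpus, err := strconv.Atoi(m[2])
	if err != nil {
		return customType{}, false
	}
	mem, err := strconv.Atoi(m[3])
	if err != nil {
		return customType{}, false
	}
	return customType{Family: family, VCPUs: cpus, MemoryMB: mem}, true
}

func main() {
	names := []string{"custom-4-16384", "n1-custom-4-16384", "n1-standard-4", "custom-2"}
	for _, name := range names {
		ct, ok := parseCustomType(name)
		fmt.Printf("%-20s custom=%v %+v\n", name, ok, ct)
	}
}
```

A name that does not match, such as n1-standard-4, is simply reported as not custom, so a caller could fall back to looking it up as a predefined machine type.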

Comment 7 Jeremiah Stuever 2020-11-19 00:40:54 UTC
This can be tested by using the following custom machine types in the control plane section of the install-config:

No errors as these are valid:
n1-standard-4, custom-4-16384, n1-custom-4-16384

Google API 404 (not found):
n1-dne-4, custom-2, custom-a, custom-2-b, n1-custom-2, n1-custom-a, n1-custom-2-b
controlPlane.platform.gcp.type: Internal error: googleapi: Error 404: The resource 'projects/myproject/zones/us-west1-b/machineTypes/custom-2' was not found, notFound

Google API 400 (invalidResourceUsage):
custom-4-16383, n1-custom-4-16383, custom-3-16384, n1-custom-3-16384
controlPlane.platform.gcp.type: Internal error: googleapi: Error 400: Invalid resource usage: 'Memory should be a multiple of 256MiB, while 16383MiB is requested'., invalidResourceUsage
controlPlane.platform.gcp.type: Internal error: googleapi: Error 400: Invalid resource usage: 'Number of vCPUs should be multiple of 2 if greater than 2, while 3 is requested'., invalidResourceUsage

Invalid memory and CPU:
n1-standard-2, custom-2-7680, n1-custom-2-7680
controlPlane.platform.gcp.type: Invalid value: "custom-2-7680": instance type does not meet minimum resource requirements of 4 vCPUs
controlPlane.platform.gcp.type: Invalid value: "custom-2-7680": instance type does not meet minimum resource requirements of 15360 MB Memory
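
The rules visible in the error messages above can be restated as a small local check. The following Go sketch is an illustration only: checkCustomResources and its constants are hypothetical, and the thresholds are copied from the messages in this comment. In the cases listed above the multiple-of-256 and even-vCPU errors come back from the Google API rather than from a local check, so this sketch may report more problems at once than the installer does.

```go
package main

import "fmt"

// Control plane minimums as quoted in the error messages above.
const (
	minVCPUs    = 4
	minMemoryMB = 15360
)

// checkCustomResources is a hypothetical helper that collects every rule
// violated by the given vCPU count and memory size.
func checkCustomResources(vcpus, memoryMB int) []error {
	var errs []error
	if memoryMB%256 != 0 {
		errs = append(errs, fmt.Errorf("memory should be a multiple of 256MiB, while %dMiB is requested", memoryMB))
	}
	if vcpus > 2 && vcpus%2 != 0 {
		errs = append(errs, fmt.Errorf("number of vCPUs should be a multiple of 2 if greater than 2, while %d is requested", vcpus))
	}
	if vcpus < minVCPUs {
		errs = append(errs, fmt.Errorf("instance type does not meet minimum resource requirements of %d vCPUs", minVCPUs))
	}
	if memoryMB < minMemoryMB {
		errs = append(errs, fmt.Errorf("instance type does not meet minimum resource requirements of %d MB Memory", minMemoryMB))
	}
	return errs
}

func main() {
	// vCPU and memory pairs taken from the test cases listed above.
	cases := [][2]int{{4, 16384}, {4, 16383}, {3, 16384}, {2, 7680}}
	for _, c := range cases {
		errs := checkCustomResources(c[0], c[1])
		fmt.Printf("custom-%d-%d: %d problem(s)\n", c[0], c[1], len(errs))
		for _, err := range errs {
			fmt.Println("  ", err)
		}
	}
}
```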

Comment 9 To Hung Sze 2020-11-24 19:11:18 UTC
Verified with openshift-install-linux-4.7.0-0.nightly-2020-11-23-074526

FATAL failed to fetch Master Machines: failed to load asset "Install Config": compute[0].platform.gcp.type: Invalid value: "n1-custom-4-16383": instance type n1-custom-4-16383 not found

FATAL failed to fetch Master Machines: failed to load asset "Install Config": compute[0].platform.gcp.type: Invalid value: "n1-dn2-4": instance type n1-dn2-4 not found

FATAL failed to fetch Master Machines: failed to load asset "Install Config": [controlPlane.platform.gcp.type: Invalid value: "n1-standard-2": instance type does not meet minimum resource requirements of 4 vCPUs, controlPlane.platform.gcp.type: Invalid value: "n1-standard-2": instance type does not meet minimum resource requirements of 15360 MB Memory]

Comment 10 To Hung Sze 2020-12-03 14:37:02 UTC
Added automation for these changes in OCP-36886.

Comment 13 errata-xmlrpc 2021-02-24 15:33:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

