Bug 1878108

Summary: OCP 4.6 installation fails in CPU quota check for OSD on GCP
Product: OpenShift Container Platform Reporter: Manuel Dewald <mdewald>
Component: Installer Assignee: Abhinav Dahiya <adahiya>
Installer sub component: openshift-installer QA Contact: To Hung Sze <tsze>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: adahiya, gshereme, jeder, lseelye, mdewald, mwoodson, rrackow, yanyang
Version: 4.6 Keywords: Regression, ServiceDeliveryBlocker
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:40:10 UTC Type: Bug

Description Manuel Dewald 2020-09-11 10:55:12 UTC
Description of problem:

When installing OCP 4.6 in OSD, the quota check returns:

failed to generate asset \\\"Platform Quota Check\\\": error(MissingQuota): compute.googleapis.com/cpus is not available in us-east1 because The required number of resources (114688) is more than the limit of 2400\"\nlevel=fatal msg=\"bootstrap host address and at least one control plane host address must be provided\"\n" installID=cz4gg5hr 

How reproducible:

Install an OCP cluster on GCP with an install config like the following:


compute:
- name: worker
  platform:
    gcp:
      osDisk:
        DiskSizeGB: 0
        DiskType: ""
      type: custom-4-16384
  replicas: 4
controlPlane:
  name: master
  platform:
    gcp:
      osDisk:
        DiskSizeGB: 0
        DiskType: ""
      type: custom-4-16384
  replicas: 3
kind: InstallConfig
metadata:
  creationTimestamp: null
  labels:
    api.openshift.com/environment: integration
    api.openshift.com/id: some-id
    api.openshift.com/managed: "true"
    api.openshift.com/name: some-name
    hive.openshift.io/cluster-type: managed
  name: gshereme-test1
  namespace: uhc-integration-1fkg61fq9oieabah0b1i03k2es3l1mgs
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 10.0.0.0/16
  machineNetwork:
  - cidr: 10.0.0.0/16
  serviceNetwork:
  - 172.30.0.0/16
platform:
  gcp:
    projectID: some-project
    region: us-east1
pullSecret: ""
sshKey: ssh-rsa.....


Actual results:

failed to generate asset \\\"Platform Quota Check\\\": error(MissingQuota): compute.googleapis.com/cpus is not available in us-east1 because The required number of resources (114688) is more than the limit of 2400\"\nlevel=fatal msg=\"bootstrap host address and at least one control plane host address must be provided\"\n" installID=cz4gg5hr 


Expected results:

The cluster installs successfully

Potential cause:
The cause is most likely in the installer quota check:
https://github.com/openshift/installer/blob/5972c875c3ef1cd13c43c52b7cda2660efdda1b3/pkg/asset/quota/gcp/gcp.go#L164

We use custom machine types (name: custom-4-16384).
The quota check assumes the second number in the string is the number of CPUs, but in our case it is not. It extracts this second number as the CPU count here: https://github.com/openshift/installer/blob/5972c875c3ef1cd13c43c52b7cda2660efdda1b3/pkg/asset/quota/gcp/gcp.go#L199

This value is then multiplied by the number of machines (compute: 4, controlPlane: 3, so 7 in total), which results in 7 * 16384 = 114688 CPUs.
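
The arithmetic can be reproduced with a small sketch. This is not the installer's code: `naiveCPUs` is a hypothetical stand-in that assumes the CPU count is the last dash-separated token of the machine type name, which is the behaviour suspected above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// naiveCPUs is a hypothetical helper (not taken from the installer). It
// treats the last dash-separated token of the machine type as the vCPU count.
func naiveCPUs(machineType string) (int64, error) {
	parts := strings.Split(machineType, "-")
	return strconv.ParseInt(parts[len(parts)-1], 10, 64)
}

func main() {
	// Replica counts from the install config above.
	const workers, masters = 4, 3

	perMachine, err := naiveCPUs("custom-4-16384")
	if err != nil {
		panic(err)
	}
	// 16384 is the memory size, not the CPU count, so the total is inflated.
	fmt.Println(perMachine * (workers + masters)) // prints 114688
}
```

With the intended value of 4 vCPUs per machine, the same 7 machines would need only 28 CPUs, well under the limit of 2400 reported in the error.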

Comment 2 Greg Sheremeta 2020-09-11 11:22:23 UTC
The default GCP machine names look like this: m2-ultramem-208
where Abhinav's assumption in machineTypeToQuota() holds up. 208 is the vcpu count.

But GCP can have custom types too, and we use these extensively in OSD.
custom-4-16384

Comment 3 Rick Rackow 2020-09-11 15:41:04 UTC
preserving information from Slack:
GCP allows custom machine type names that don't start with the machine family, and then assumes "N1" as the family.

```
$ gcloud beta compute instances create example-instance-test --machine-type custom-4-3840
Created [https://www.googleapis.com/compute/beta/projects/innate-attic-182119/zones/us-east1-b/instances/example-instance-test].
NAME                   ZONE        MACHINE_TYPE               PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP    STATUS
example-instance-test  us-east1-b  custom (4 vCPU, 3.75 GiB)               10.142.0.4   34.75.155.249  RUNNING
```

```
$ gcloud beta compute instances create example-instance-test-1 --machine-type test-custom-4-3840
ERROR: (gcloud.beta.compute.instances.create) Could not fetch machine type:
 - The resource 'projects/innate-attic-182119/zones/us-east1-b/machineTypes/test-custom-4-3840' was not found
```

That causes a problem when the name of the machine type is analyzed here: https://github.com/openshift/installer/blob/287658271951b5f8dbf1a77c77ff1557d81c5931/pkg/asset/quota/gcp/gcp.go#L199
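
One way to make the name easier to analyze would be to expand the short alias before parsing it. The following is only a sketch under that assumption: `canonicalMachineType` is a hypothetical helper, not something the installer or gcloud provides, and it covers only the N1 default observed above.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalMachineType is a hypothetical helper. It prepends the N1 family
// to names of the form custom-<vCPUs>-<memory> and leaves every other name
// unchanged.
func canonicalMachineType(name string) string {
	if strings.HasPrefix(name, "custom-") {
		return "n1-" + name
	}
	return name
}

func main() {
	fmt.Println(canonicalMachineType("custom-4-3840"))    // prints n1-custom-4-3840
	fmt.Println(canonicalMachineType("n1-custom-4-3840")) // prints n1-custom-4-3840
	fmt.Println(canonicalMachineType("m2-ultramem-208"))  // prints m2-ultramem-208
}
```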

Comment 4 To Hung Sze 2020-09-17 18:07:18 UTC
@mdewald, did you create the custom type: custom-4-16384?
If yes, could you please include more details about the type?
I want to add a test case to capture this change and reflect what you have / had.
Thanks in advance.

Comment 6 Greg Sheremeta 2020-09-17 19:17:04 UTC
custom-4-16384 is a shortened alias for n1-custom-4-16384, which is a built-in GCP machine type.

It means 4 vCPUs and 16384 MB of memory.

https://cloud.google.com/compute/docs/machine-types#custom_machine_types
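
Given that naming scheme, a parser could look for the `custom` token and read the vCPU count from the token right after it. The sketch below is an illustration only and not the fix that was merged: `vCPUs` is a hypothetical function, and its fallback for predefined types assumes that the name ends in the vCPU count, as in m2-ultramem-208.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vCPUs is a hypothetical helper (not the installer's implementation). It
// returns the vCPU count encoded in a GCP machine type name.
func vCPUs(machineType string) (int64, error) {
	parts := strings.Split(machineType, "-")
	for i, p := range parts {
		// Custom types look like [family-]custom-<vCPUs>-<memoryMB>.
		if p == "custom" {
			if i+2 >= len(parts) {
				return 0, fmt.Errorf("malformed custom machine type %q", machineType)
			}
			return strconv.ParseInt(parts[i+1], 10, 64)
		}
	}
	// Assumption: predefined types end in the vCPU count. Names that do not
	// end in a number make ParseInt return an error.
	return strconv.ParseInt(parts[len(parts)-1], 10, 64)
}

func main() {
	for _, name := range []string{"custom-4-16384", "n1-custom-4-16384", "m2-ultramem-208"} {
		n, err := vCPUs(name)
		if err != nil {
			panic(err)
		}
		fmt.Println(name, n)
	}
}
```

Under these assumptions both custom-4-16384 and n1-custom-4-16384 yield 4, and m2-ultramem-208 still yields 208.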

Comment 9 To Hung Sze 2020-09-21 19:27:21 UTC
Using 4.6.0-0.nightly-2020-09-21-114202 and
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    gcp:
      osDisk:
        DiskSizeGB: 0
        DiskType: ""
      type: custom-4-16384
  replicas: 3
(same for control plane)
I am able to bring up a cluster, and the machine (as shown in the web console) has the correct type:
Instance Type
custom-4-16384

From gcp console:
Machine type
custom (4 vCPUs, 16 GB memory)

Comment 14 errata-xmlrpc 2020-10-27 16:40:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196