Bug 1876680

Summary: Incorrect validation of resources during installation.
Product: OpenShift Container Platform Reporter: Sudarshan Chaudhari <suchaudh>
Component: Cloud Compute Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: aarapov, adahiya, jhou, miyadav, zhsun
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:38:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sudarshan Chaudhari 2020-09-07 23:10:29 UTC
Description of problem:

While testing the OCP 4.6 vSphere IPI installation, we configured the instances with resources below the recommended values to see whether the installation would complete.

The installer created the master VMs in vSphere.

NAME                      STATUS   ROLES    AGE    VERSION
ocp46ipi-mnj6b-master-0   Ready    master   144m   v1.19.0-rc.2+514f31a
ocp46ipi-mnj6b-master-1   Ready    master   143m   v1.19.0-rc.2+514f31a
ocp46ipi-mnj6b-master-2   Ready    master   143m   v1.19.0-rc.2+514f31a

The installation then failed because ingress and oauth were waiting for worker nodes that were never created.

# oc get events -n openshift-machine-api
3s          Warning   ReconcileError      machineset/ocp46ipi-mnj6b-worker                    failed to sync machines: admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 50: diskGiB is below minimum value (120); admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 50: diskGiB is below minimum value (120); admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 50: diskGiB is below minimum value (120); admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 50: diskGiB is below minimum value (120)

How reproducible:
Every time

Install-config:
~~~~~~
apiVersion: v1
baseDomain: ocp.gsslab.pnq2.redhat.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    vsphere:
      osDisk:
        diskSizeGB: 50
      cpus: 6
      memoryMB: 16384
      coresPerSocket: 2
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere:
      osDisk:
        diskSizeGB: 50
      cpus: 4
      memoryMB: 12288
      coresPerSocket: 2
  replicas: 3
metadata:
  creationTimestamp: null
  name: ocp46ipi
networking:
  clusterNetwork:
  - cidr: 13.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.168.3.0/24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 140.30.0.0/16
fips: true
platform:
  vsphere:
    apiVIP: 192.168.3.8
    cluster: OCP-Cluster
    datacenter: OCP-DC
    defaultDatastore: vsanDatastore
    ingressVIP: 192.168.3.9
    network: DSwitch-vSAN
    password: xxxxxxxxxxxx
    username: xxxxxxxxx
    vCenter: vsphere7.ocp.gsslab.pnq2.redhat.com
publish: External
~~~~~~


Actual results:
The installer created the resources, and then the worker MachineSet failed to add the workers.

Expected results:
The installer should not proceed if the provided resources are less than what is recommended.

Additional info:
Based on the investigation, openshift-installer created the VMs in vSphere, but the scale-up through the MachineSet failed validation. This would result in post-install management errors.
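
A minimal sketch of the kind of pre-flight check the expected result asks for, run against the install-config before any VM is created. This is an illustration only and not the installer's actual code: the function name `undersized_pools` is hypothetical, the threshold of 120 is taken from the webhook message above, and `diskSizeGB` is assumed to be comparable to the webhook's `diskGiB`.

```python
# Hypothetical pre-flight check; not the installer's actual code.
# Threshold taken from the webhook message quoted in this report.
MIN_DISK_GIB = 120


def undersized_pools(install_config):
    """Return (pool name, diskSizeGB) for each vSphere pool below MIN_DISK_GIB."""
    # Collect the compute pools and the single controlPlane pool.
    pools = list(install_config.get("compute", []))
    if "controlPlane" in install_config:
        pools.append(install_config["controlPlane"])

    found = []
    for pool in pools:
        vsphere = pool.get("platform", {}).get("vsphere", {})
        size = vsphere.get("osDisk", {}).get("diskSizeGB")
        # Pools that do not set diskSizeGB are skipped, not flagged.
        if size is not None and size < MIN_DISK_GIB:
            found.append((pool.get("name"), size))
    return found


# Trimmed-down version of the install-config from this report.
config = {
    "compute": [
        {"name": "worker", "platform": {"vsphere": {"osDisk": {"diskSizeGB": 50}}}}
    ],
    "controlPlane": {
        "name": "master",
        "platform": {"vsphere": {"osDisk": {"diskSizeGB": 50}}},
    },
}
print(undersized_pools(config))  # [('worker', 50), ('master', 50)]
```

With a check like this, both pools in the reported install-config would be flagged up front, instead of the masters being created and the workers being rejected later.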

Comment 1 Abhinav Dahiya 2020-09-08 16:56:53 UTC
The machine-api operator added this validation for minimum disk size in 4.6, after the API had already gone GA in 4.5, so this validation broke backwards compatibility. The machine-api team needs to back this validation out to a warning. I saw something over the weekend that they may be able to use: https://kubernetes.io/blog/2020/09/03/warnings/
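
To illustrate the difference between rejecting a request and admitting it with a warning, here is a minimal sketch. It is not the machine-api webhook code: `validate_disk` and its `strict` flag are hypothetical, and the message wording is modeled on the denial and warning texts quoted elsewhere in this report.

```python
# Hypothetical sketch; not the machine-api webhook implementation.
MIN_DISK_GIB = 120  # minimum quoted in the webhook messages in this report


def validate_disk(disk_gib, strict):
    """Return (allowed, warnings, errors) for a providerSpec.diskGiB value.

    strict=True mimics the behaviour that denied the request;
    strict=False admits the request and only reports a warning.
    """
    if disk_gib >= MIN_DISK_GIB:
        return True, [], []
    if strict:
        msg = (
            f"providerSpec.diskGiB: Invalid value: {disk_gib}: "
            f"diskGiB is below minimum value ({MIN_DISK_GIB})"
        )
        return False, [], [msg]
    msg = (
        f"providerSpec.diskGiB: {disk_gib} is less than the recommended "
        f"minimum ({MIN_DISK_GIB}): nodes may fail to start if disk size is too low"
    )
    return True, [msg], []


print(validate_disk(50, strict=True))   # request denied, one error
print(validate_disk(50, strict=False))  # request allowed, one warning
```

In warning mode, a MachineSet that was valid under the 4.5 API is still admitted, which is the backwards-compatible behaviour requested here.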

Comment 2 Alberto 2020-09-22 14:51:44 UTC
*** Bug 1861974 has been marked as a duplicate of this bug. ***

Comment 5 Milind Yadav 2020-09-25 12:41:25 UTC
Verified on cluster version 4.6.0-0.nightly-2020-09-24-235241.

Steps:
1. Created a machineset with default values; see https://gist.github.com/miyadav/ab17954c1075db1533eee13f0ffda58c

[miyadav@miyadav vsphere]$ oc create -f defaultvsp.yaml --config vsp
Flag --config has been deprecated, use --kubeconfig instead
W0925 14:49:48.285805 24972 warnings.go:67] providerSpec.numCPUs: 0 is less than the minimum value (2): the minimum value will be used instead
W0925 14:49:48.285861 24972 warnings.go:67] providerSpec.memoryMiB: 0 is less than the recommended minimum value (2048): nodes may not boot correctly
W0925 14:49:48.285865 24972 warnings.go:67] providerSpec.diskGiB: 0 is less than the recommended minimum (120): nodes may fail to start if disk size is too low
machineset.machine.openshift.io/zhsunvs-r7cfl-worker-default created

 

But the machine failed to provision, with the event below:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 5m33s vspherecontroller zhsunvs-r7cfl-worker-default-5bqs9: reconciler failed to Create machine: error getting disk spec for "": can't resize template disk down, initial capacity is larger: 16777216KiB > 0KiB
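
For reference, the numbers in this event work out as follows: 16777216 KiB is 16 GiB (the template disk), and a diskGiB of 0 is 0 KiB, so the clone would have to shrink the disk. A minimal sketch of that comparison, with a hypothetical helper `check_clone_disk` that is not the vSphere controller's actual code:

```python
# Hypothetical helper; not the vSphere controller's actual code.
KIB_PER_GIB = 1024 * 1024


def check_clone_disk(template_kib, requested_gib):
    """Return the requested size in KiB, or raise if it would shrink the template disk."""
    requested_kib = requested_gib * KIB_PER_GIB
    if template_kib > requested_kib:
        raise ValueError(
            "can't resize template disk down, initial capacity is larger: "
            f"{template_kib}KiB > {requested_kib}KiB"
        )
    return requested_kib


print(16777216 // KIB_PER_GIB)           # 16 (template disk size in GiB)
print(check_clone_disk(16777216, 120))   # 125829120 (120 GiB in KiB)

try:
    check_clone_disk(16777216, 0)        # diskGiB left at its default of 0
except ValueError as err:
    print(err)
```

Under these assumptions, any diskGiB below 16 would hit the same error with this template, which is why the admitted-with-warning default of 0 still fails at provisioning time.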

2. Modified the machineset as per the warnings:

[miyadav@miyadav vsphere]$ oc get machines --config vsp
Flag --config has been deprecated, use --kubeconfig instead
NAME                                  PHASE         TYPE   REGION   ZONE   AGE
zhsunvs-r7cfl-master-0                Running                              7h7m
zhsunvs-r7cfl-master-1                Running                              7h7m
zhsunvs-r7cfl-master-2                Running                              7h7m
zhsunvs-r7cfl-worker-defaulti-r9q8p   Provisioned                          27m
zhsunvs-r7cfl-worker-vk776            Running                              6h55m
zhsunvs-r7cfl-worker1-9tk7m           Running                              4h9m
zhsunvs-r7cfl-worker1-dbfms           Running                              3h19m
zhsunvs-r7cfl-worker1-fxxfr           Running                              3h19m

[miyadav@miyadav vsphere]$ oc get machines --config vsp
Flag --config has been deprecated, use --kubeconfig instead
NAME                                  PHASE         TYPE   REGION   ZONE   AGE
zhsunvs-r7cfl-master-0                Running                              7h14m
zhsunvs-r7cfl-master-1                Running                              7h14m
zhsunvs-r7cfl-master-2                Running                              7h14m
zhsunvs-r7cfl-worker-defaulti-r9q8p   Provisioned                          34m
zhsunvs-r7cfl-worker-vk776            Running                              7h2m
zhsunvs-r7cfl-worker1-9tk7m           Running                              4h16m
zhsunvs-r7cfl-worker1-dbfms           Running                              3h27m
zhsunvs-r7cfl-worker1-fxxfr           Running                              3h27m
[miyadav@miyadav vsphere]$ 


But if I set memoryMiB to 10248 instead of 2048, it works:
[miyadav@miyadav vsphere]$ oc create -f defaultvsp.yaml --config vsp
Flag --config has been deprecated, use --kubeconfig instead
W0925 17:05:47.468749 4025 warnings.go:67] providerSpec.numCPUs: 0 is less than the minimum value (2): the minimum value will be used instead
machineset.machine.openshift.io/zhsunvs-r7cfl-worker-default created


[miyadav@miyadav vsphere]$ oc get machines --config vsp
Flag --config has been deprecated, use --kubeconfig instead
NAME                                 PHASE          TYPE   REGION   ZONE   AGE
zhsunvs-r7cfl-master-0               Running                               6h17m
zhsunvs-r7cfl-master-1               Running                               6h17m
zhsunvs-r7cfl-master-2               Running                               6h17m
zhsunvs-r7cfl-worker-default-j69q2   Provisioning                          14s
zhsunvs-r7cfl-worker-vk776           Running                               6h5m
zhsunvs-r7cfl-worker1-9tk7m          Running                               3h19m
zhsunvs-r7cfl-worker1-dbfms          Running                               149m
zhsunvs-r7cfl-worker1-fxxfr          Running                               149m


[miyadav@miyadav vsphere]$ oc get machines --config vsp
Flag --config has been deprecated, use --kubeconfig instead
NAME                                 PHASE          TYPE   REGION   ZONE   AGE
zhsunvs-r7cfl-master-0               Running                               6h23m
zhsunvs-r7cfl-master-1               Running                               6h23m
zhsunvs-r7cfl-master-2               Running                               6h23m
zhsunvs-r7cfl-worker-default-j69q2   Running                               6m2s
zhsunvs-r7cfl-worker-vk776           Running                               6h10m
zhsunvs-r7cfl-worker1-9tk7m          Running                               3h24m
zhsunvs-r7cfl-worker1-dbfms          Running                               155m
zhsunvs-r7cfl-worker1-fxxfr          Running                               155m


[miyadav@miyadav vsphere]$ oc get nodes --config vsp
Flag --config has been deprecated, use --kubeconfig instead
NAME                                 STATUS   ROLES    AGE     VERSION
zhsunvs-r7cfl-master-0               Ready    master   6h21m   v1.19.0+8a39924
zhsunvs-r7cfl-master-1               Ready    master   6h21m   v1.19.0+8a39924
zhsunvs-r7cfl-master-2               Ready    master   6h21m   v1.19.0+8a39924
zhsunvs-r7cfl-worker-default-j69q2   Ready    worker   2m11s   v1.19.0+8a39924
zhsunvs-r7cfl-worker-vk776           Ready    worker   6h7m    v1.19.0+8a39924
zhsunvs-r7cfl-worker1-9tk7m          Ready    worker   3h22m   v1.19.0+8a39924
zhsunvs-r7cfl-worker1-dbfms          Ready    worker   152m    v1.19.0+8a39924
zhsunvs-r7cfl-worker1-fxxfr          Ready    worker   152m    v1.19.0+8a39924

The nodes came up in Ready status, and the machines in Running status, for those warning values.

Additional info:
The only hiccup I am seeing is this warning:
W0925 15:52:22.833707 29301 warnings.go:67] providerSpec.memoryMiB: 0 is less than the recommended minimum value (2048): nodes may not boot correctly
The recommended value is not correct; it worked with 10248 MiB for us. Could we improve on that value, that is, suggest some other recommendation instead of a value that may not work?

Moving to VERIFIED for now; we can review whether to create a new bug to track this later.

Comment 8 errata-xmlrpc 2020-10-27 16:38:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196