Bug 2117687

Summary: SNO cluster install fails when compute.replicas is missing from install-config.yaml
Product: OpenShift Container Platform Reporter: Richard Su <rwsu>
Component: Installer Assignee: Pawan Pinjarkar <ppinjark>
Installer sub component: Agent based installation QA Contact: Manoj Hans <mhans>
Status: RELEASE_PENDING --- Docs Contact:
Severity: medium    
Priority: high CC: asegurap, mhans, ppinjark, zbitter
Version: 4.11   
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: Type: Bug
Attachments:
Description          Flags
install-config.yaml  none
agent-config.yaml    none

Description Richard Su 2022-08-11 15:45:49 UTC
Created attachment 1904978 [details]
install-config.yaml

Description:

When controlPlane.replicas = 1 and compute.replicas is missing from install-config.yaml, assisted-service validation fails because it expects additional worker nodes and requires api_vip and ingress_vip to be set. It does not recognize that I'm trying to install a single-node OpenShift (SNO) cluster. It recognizes the configuration as SNO only when compute.replicas is explicitly set to 0.

We should add a validation to warn users that compute.replicas needs to be set to 0 if controlPlane.replicas = 1.
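
A minimal sketch of the validation proposed here, written in Go because the installer is a Go project. The MachinePool and InstallConfig types and the validateSNO function are hypothetical, simplified stand-ins for illustration; they are not the installer's real types, and the error wording is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// MachinePool is a hypothetical, simplified machine pool. Replicas is a
// pointer so that "missing from the YAML" (nil) differs from an explicit 0.
type MachinePool struct {
	Name     string
	Replicas *int64
}

// InstallConfig is a hypothetical, trimmed-down install-config.
type InstallConfig struct {
	ControlPlane *MachinePool
	Compute      []MachinePool
}

// validateSNO returns an error when controlPlane.replicas is 1 but
// compute.replicas is not explicitly 0 in every compute pool.
func validateSNO(ic *InstallConfig) error {
	if ic == nil || ic.ControlPlane == nil || ic.ControlPlane.Replicas == nil {
		return nil // not enough information to treat this as SNO
	}
	if *ic.ControlPlane.Replicas != 1 {
		return nil // not a single-node request
	}
	if len(ic.Compute) == 0 {
		return errors.New("compute section is missing: set compute.replicas to 0 when controlPlane.replicas is 1")
	}
	for _, pool := range ic.Compute {
		if pool.Replicas == nil {
			return fmt.Errorf("compute pool %q has no replicas value: set it to 0 when controlPlane.replicas is 1", pool.Name)
		}
		if *pool.Replicas != 0 {
			return fmt.Errorf("compute pool %q has replicas %d: it must be 0 when controlPlane.replicas is 1", pool.Name, *pool.Replicas)
		}
	}
	return nil
}

func main() {
	one, zero := int64(1), int64(0)
	controlPlane := &MachinePool{Name: "master", Replicas: &one}

	// compute is missing altogether: the check reports an error.
	fmt.Println(validateSNO(&InstallConfig{ControlPlane: controlPlane}))

	// compute.replicas is explicitly 0: the check passes and returns nil.
	fmt.Println(validateSNO(&InstallConfig{
		ControlPlane: controlPlane,
		Compute:      []MachinePool{{Name: "worker", Replicas: &zero}},
	}))
}
```

Running the check on the user's input before any defaults are applied is a design choice in this sketch: it lets the message say that the value is missing rather than report a defaulted number.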

Steps to reproduce:

1. Create agent.iso using install-config.yaml and agent-config.yaml
2. Deploy a SNO cluster using agent.iso
3. Run openshift-install agent wait-for install-complete

Expected:

Cluster installation is successful

Actual:

Validation fails

[rwsu@hardprov-fx2-22 openshift-installer]$ ./openshift-install agent wait-for install-complete
INFO Waiting for cluster install to initialize. Sleeping for 30 seconds 
INFO Waiting for cluster install to initialize. Sleeping for 30 seconds 
INFO Waiting for cluster install to initialize. Sleeping for 30 seconds 
INFO Cluster is not ready for install. Check host validations 
WARNING Cluster has stopped installing... working to recover installation 
WARNING Cluster has stopped installing... working to recover installation 
WARNING Cluster has stopped installing... working to recover installation 
INFO Checking for validation failures ---------------------------------------------- 
ERROR Validation failure found for cluster          category=hosts-data label=all-hosts-are-ready-to-install message=The cluster has hosts that are not ready to install.
ERROR Validation failure found for cluster          category=hosts-data label=sufficient-masters-count message=Clusters must have exactly 3 dedicated masters and if workers are added, there should be at least 2 workers. Please check your configuration and add or remove hosts as to meet the above requirement.
ERROR Validation failure found for cluster          category=network label=Machine CIDR message=The Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs.
ERROR Validation failure found for cluster          category=network label=api-vip-defined message=The API virtual IP is undefined and must be provided.
ERROR Validation failure found for cluster          category=network label=ingress-vip-defined message=The Ingress virtual IP is undefined and must be provided.
INFO Checking for validation failures ---------------------------------------------- 
ERROR Validation failure found for cluster          category=hosts-data label=all-hosts-are-ready-to-install message=The cluster has hosts that are not ready to install.
ERROR Validation failure found for cluster          category=hosts-data label=sufficient-masters-count message=Clusters must have exactly 3 dedicated masters and if workers are added, there should be at least 2 workers. Please check your configuration and add or remove hosts as to meet the above requirement.
ERROR Validation failure found for cluster          category=network label=Machine CIDR message=The Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs.
ERROR Validation failure found for cluster          category=network label=api-vip-defined message=The API virtual IP is undefined and must be provided.
ERROR Validation failure found for cluster          category=network label=ingress-vip-defined message=The Ingress virtual IP is undefined and must be provided.
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=DNS wildcard not configured message=Parse error for domain name resolutions result
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=Machine CIDR message=Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=NTP synchronization message=Host couldn't synchronize with any NTP server
INFO Checking for validation failures ---------------------------------------------- 
ERROR Validation failure found for cluster          category=hosts-data label=all-hosts-are-ready-to-install message=The cluster has hosts that are not ready to install.
ERROR Validation failure found for cluster          category=hosts-data label=sufficient-masters-count message=Clusters must have exactly 3 dedicated masters and if workers are added, there should be at least 2 workers. Please check your configuration and add or remove hosts as to meet the above requirement.
ERROR Validation failure found for cluster          category=network label=Machine CIDR message=The Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs.
ERROR Validation failure found for cluster          category=network label=api-vip-defined message=The API virtual IP is undefined and must be provided.
ERROR Validation failure found for cluster          category=network label=ingress-vip-defined message=The Ingress virtual IP is undefined and must be provided.
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=Machine CIDR message=Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=NTP synchronization message=Host couldn't synchronize with any NTP server

Comment 1 Richard Su 2022-08-11 15:46:32 UTC
Created attachment 1904979 [details]
agent-config.yaml

Comment 2 Pawan Pinjarkar 2022-08-11 19:03:38 UTC
PR https://github.com/openshift/installer/pull/6223

Comment 5 Manoj Hans 2022-09-29 12:23:35 UTC
It still fails when compute.replicas is missing from install-config.yaml. The installer now panics while loading the install config:

DEBUG OpenShift Installer unreleased-master-7004-g1fb1397635c89ff8b3645fed4c4c264e4119fa84-dirty 
DEBUG Built from commit 1fb1397635c89ff8b3645fed4c4c264e4119fa84 
DEBUG Fetching Agent Installer ISO...              
DEBUG Loading Agent Installer ISO...               
DEBUG   Loading Agent Installer Ignition...        
DEBUG     Loading Agent Manifests...               
DEBUG       Loading Agent PullSecret...            
DEBUG         Loading Install Config...            
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x321d83b]

goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/agent.(*OptionalInstallConfig).validateSNOConfiguration(0x2?, 0xc000fc7400)
	/home/mhans/installer/pkg/asset/agent/installconfig.go:169 +0x81b
github.com/openshift/installer/pkg/asset/agent.(*OptionalInstallConfig).validateInstallConfig(0xc00123ede8?, 0x1ab1aa00?)
	/home/mhans/installer/pkg/asset/agent/installconfig.go:107 +0x1a5
github.com/openshift/installer/pkg/asset/agent.(*OptionalInstallConfig).Load(0xc000437f00, {0x1ab1aa00, 0xc00083d2b0})
	/home/mhans/installer/pkg/asset/agent/installconfig.go:62 +0x45
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000b1e510, {0x1ab21ed0, 0xc000437d80}, {0xc000335e18, 0x8})
	/home/mhans/installer/pkg/asset/store/store.go:264 +0x2b2
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000b1e510, {0x1ab21ff0, 0xc000141c10}, {0xc000335de6, 0x6})
	/home/mhans/installer/pkg/asset/store/store.go:247 +0xc05
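
The report does not show the code at installconfig.go:169, so the exact cause of the panic is not known from this thread. One plausible cause, stated here as an assumption, is that an omitted replicas field is represented as a nil pointer and is dereferenced without a check. The sketch below shows a nil-safe way to read such a field; MachinePool, replicasOrDefault and totalReplicas are hypothetical names, not the installer's code.

```go
package main

import "fmt"

// MachinePool is a hypothetical, simplified machine pool.
type MachinePool struct {
	Name     string
	Replicas *int64 // nil when replicas is omitted from the YAML
}

// replicasOrDefault dereferences p only when it is non-nil.
func replicasOrDefault(p *int64, def int64) int64 {
	if p == nil {
		return def
	}
	return *p
}

// totalReplicas sums replicas across pools without dereferencing a nil pointer.
// Ranging over a nil or empty slice is safe and yields 0.
func totalReplicas(pools []MachinePool, def int64) int64 {
	var total int64
	for _, pool := range pools {
		total += replicasOrDefault(pool.Replicas, def)
	}
	return total
}

func main() {
	zero := int64(0)

	// replicas omitted inside an existing compute pool
	fmt.Println(totalReplicas([]MachinePool{{Name: "worker"}}, 3)) // prints 3

	// compute section missing altogether
	fmt.Println(totalReplicas(nil, 3)) // prints 0

	// replicas explicitly set to 0
	fmt.Println(totalReplicas([]MachinePool{{Name: "worker", Replicas: &zero}}, 3)) // prints 0
}
```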

Comment 6 Pawan Pinjarkar 2022-10-05 13:59:36 UTC
With PR https://github.com/openshift/installer/pull/6462, when compute.replicas is missing from install-config.yaml, the validation message will be:

FATAL failed to fetch Agent Installer ISO: failed to load asset "Install Config": invalid install-config configuration: Compute.Replicas: Required value: Total number of Compute.Replicas must be 0 for none platform. Found 3


The installer's default install-config settings set Compute.Replicas to 3, hence "Found 3" in the error message.
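
The ordering described here, defaults first and validation second, can be sketched as follows. This is an illustration under assumptions, not the installer's implementation: the types, setDefaults and validateNonePlatform are hypothetical, and only the default of 3 and the message wording are taken from this comment.

```go
package main

import "fmt"

// MachinePool and InstallConfig are hypothetical, simplified types.
type MachinePool struct {
	Name     string
	Replicas *int64
}

type InstallConfig struct {
	Compute []MachinePool
}

// defaultComputeReplicas is the default of 3 mentioned in this comment.
const defaultComputeReplicas int64 = 3

// setDefaults adds a worker pool and a replica count when the user omitted them.
func setDefaults(ic *InstallConfig) {
	if len(ic.Compute) == 0 {
		ic.Compute = []MachinePool{{Name: "worker"}}
	}
	for i := range ic.Compute {
		if ic.Compute[i].Replicas == nil {
			r := defaultComputeReplicas
			ic.Compute[i].Replicas = &r
		}
	}
}

// validateNonePlatform runs after defaulting, so a missing value is seen as 3.
// The message wording is copied from the FATAL line quoted in this comment.
func validateNonePlatform(ic *InstallConfig) error {
	var total int64
	for _, pool := range ic.Compute {
		if pool.Replicas != nil {
			total += *pool.Replicas
		}
	}
	if total != 0 {
		return fmt.Errorf("Total number of Compute.Replicas must be 0 for none platform. Found %d", total)
	}
	return nil
}

func main() {
	ic := &InstallConfig{} // compute is missing altogether
	setDefaults(ic)
	fmt.Println(validateNonePlatform(ic))
}
```

Because the check sees the defaulted value, the user gets "Found 3" even though no 3 appears in their file.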

Sample install config 1: Compute is missing altogether

apiVersion: v1
baseDomain: test.metalkube.org
controlPlane: 
  hyperthreading: Enabled 
  name: master
  replicas: 1 
metadata:
  namespace: cluster-0
  name: ostest 
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14 
    hostPrefix: 23 
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.122.0/23
  serviceNetwork: 
  - 172.30.0.0/16
platform:
  none: {}
fips: false 
pullSecret: 
sshKey: 



Sample install config 2: Only Compute.Replicas is missing

apiVersion: v1
baseDomain: test.metalkube.org
compute: 
- hyperthreading: Enabled 
  name: worker
controlPlane: 
  hyperthreading: Enabled 
  name: master
metadata:
  namespace: cluster-0
  name: ostest 
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14 
    hostPrefix: 23 
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.122.0/23
  serviceNetwork: 
  - 172.30.0.0/16
platform:
  none: {}
fips: false 
pullSecret: 
sshKey:

Comment 7 Manoj Hans 2022-10-10 12:36:21 UTC
The bug has been verified on the master branch. It works as expected.