Bug 2117687 - SNO cluster install fails when compute.replicas is missing from install-config.yaml
Keywords:
Status: RELEASE_PENDING
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Pawan Pinjarkar
QA Contact: Manoj Hans
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-08-11 15:45 UTC by Richard Su
Modified: 2023-03-02 18:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments
install-config.yaml (429 bytes, text/plain)
2022-08-11 15:45 UTC, Richard Su
agent-config.yaml (895 bytes, text/plain)
2022-08-11 15:46 UTC, Richard Su


Links
System: Github
ID: openshift installer pull 6223
Private: 0
Priority: None
Status: open
Summary: Bug 2117687: compute.replicas must be 0 for SNO
Last Updated: 2022-08-17 02:27:57 UTC

Description Richard Su 2022-08-11 15:45:49 UTC
Created attachment 1904978
install-config.yaml

Description:

When controlPlane.replicas is 1 and compute.replicas is missing from install-config.yaml, assisted-service validation fails: it looks for additional worker nodes and expects api_vip and ingress_vip to be set. It doesn't recognize that I'm trying to install a SNO cluster. It treats the cluster as SNO only when compute.replicas is explicitly set to 0.

We should add a validation that warns users that compute.replicas must be set to 0 when controlPlane.replicas is 1.
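The proposed rule can be sketched in Go as follows. This is a minimal illustration, not the installer's actual code: the types machinePool and installConfig, the function validateSNO, and the helper int64Ptr are hypothetical stand-ins, and a nil Replicas pointer is assumed to represent a field that is absent from the YAML.

```go
package main

import (
	"errors"
	"fmt"
)

// machinePool is a hypothetical, trimmed-down stand-in for an install-config
// machine pool. A nil Replicas means the field was absent from the YAML.
type machinePool struct {
	Name     string
	Replicas *int64
}

// installConfig is a hypothetical subset of install-config.yaml.
type installConfig struct {
	ControlPlane *machinePool
	Compute      []machinePool
}

// validateSNO returns an error when controlPlane.replicas is 1 but
// compute.replicas is not explicitly set to 0.
func validateSNO(c installConfig) error {
	cp := c.ControlPlane
	if cp == nil || cp.Replicas == nil || *cp.Replicas != 1 {
		return nil // not a single-node request, so the rule does not apply
	}
	if len(c.Compute) == 0 {
		return errors.New("compute section is missing; compute.replicas must be 0 when controlPlane.replicas is 1")
	}
	for _, p := range c.Compute {
		if p.Replicas == nil {
			return fmt.Errorf("pool %q has no replicas value; compute.replicas must be 0 when controlPlane.replicas is 1", p.Name)
		}
		if *p.Replicas != 0 {
			return fmt.Errorf("pool %q has replicas=%d; compute.replicas must be 0 when controlPlane.replicas is 1", p.Name, *p.Replicas)
		}
	}
	return nil
}

// int64Ptr is a small helper for building pointer-valued fields.
func int64Ptr(v int64) *int64 { return &v }

func main() {
	missing := installConfig{ControlPlane: &machinePool{Name: "master", Replicas: int64Ptr(1)}}
	explicit := installConfig{
		ControlPlane: &machinePool{Name: "master", Replicas: int64Ptr(1)},
		Compute:      []machinePool{{Name: "worker", Replicas: int64Ptr(0)}},
	}
	fmt.Println("compute missing:", validateSNO(missing))
	fmt.Println("compute.replicas=0:", validateSNO(explicit))
}
```

The sketch rejects both a missing compute section and a compute pool without a replicas value, which are the two cases this report is about.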

Steps to reproduce:

1. Create agent.iso using install-config.yaml and agent-config.yaml
2. Deploy a SNO cluster using agent.iso
3. openshift-install agent wait-for install-complete

Expected:

Cluster installation is successful

Actual:

Validation fails

[rwsu@hardprov-fx2-22 openshift-installer]$ ./openshift-install agent wait-for install-complete
INFO Waiting for cluster install to initialize. Sleeping for 30 seconds 
INFO Waiting for cluster install to initialize. Sleeping for 30 seconds 
INFO Waiting for cluster install to initialize. Sleeping for 30 seconds 
INFO Cluster is not ready for install. Check host validations 
WARNING Cluster has stopped installing... working to recover installation 
WARNING Cluster has stopped installing... working to recover installation 
WARNING Cluster has stopped installing... working to recover installation 
INFO Checking for validation failures ---------------------------------------------- 
ERROR Validation failure found for cluster          category=hosts-data label=all-hosts-are-ready-to-install message=The cluster has hosts that are not ready to install.
ERROR Validation failure found for cluster          category=hosts-data label=sufficient-masters-count message=Clusters must have exactly 3 dedicated masters and if workers are added, there should be at least 2 workers. Please check your configuration and add or remove hosts as to meet the above requirement.
ERROR Validation failure found for cluster          category=network label=Machine CIDR message=The Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs.
ERROR Validation failure found for cluster          category=network label=api-vip-defined message=The API virtual IP is undefined and must be provided.
ERROR Validation failure found for cluster          category=network label=ingress-vip-defined message=The Ingress virtual IP is undefined and must be provided.
INFO Checking for validation failures ---------------------------------------------- 
ERROR Validation failure found for cluster          category=hosts-data label=all-hosts-are-ready-to-install message=The cluster has hosts that are not ready to install.
ERROR Validation failure found for cluster          category=hosts-data label=sufficient-masters-count message=Clusters must have exactly 3 dedicated masters and if workers are added, there should be at least 2 workers. Please check your configuration and add or remove hosts as to meet the above requirement.
ERROR Validation failure found for cluster          category=network label=Machine CIDR message=The Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs.
ERROR Validation failure found for cluster          category=network label=api-vip-defined message=The API virtual IP is undefined and must be provided.
ERROR Validation failure found for cluster          category=network label=ingress-vip-defined message=The Ingress virtual IP is undefined and must be provided.
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=DNS wildcard not configured message=Parse error for domain name resolutions result
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=Machine CIDR message=Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=NTP synchronization message=Host couldn't synchronize with any NTP server
INFO Checking for validation failures ---------------------------------------------- 
ERROR Validation failure found for cluster          category=hosts-data label=all-hosts-are-ready-to-install message=The cluster has hosts that are not ready to install.
ERROR Validation failure found for cluster          category=hosts-data label=sufficient-masters-count message=Clusters must have exactly 3 dedicated masters and if workers are added, there should be at least 2 workers. Please check your configuration and add or remove hosts as to meet the above requirement.
ERROR Validation failure found for cluster          category=network label=Machine CIDR message=The Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs.
ERROR Validation failure found for cluster          category=network label=api-vip-defined message=The API virtual IP is undefined and must be provided.
ERROR Validation failure found for cluster          category=network label=ingress-vip-defined message=The Ingress virtual IP is undefined and must be provided.
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=Machine CIDR message=Machine Network CIDR is undefined; the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs
ERROR Validation failure found for control1.ostest.test.metalkube.org  category=network label=NTP synchronization message=Host couldn't synchronize with any NTP server

Comment 1 Richard Su 2022-08-11 15:46:32 UTC
Created attachment 1904979
agent-config.yaml

Comment 2 Pawan Pinjarkar 2022-08-11 19:03:38 UTC
PR https://github.com/openshift/installer/pull/6223

Comment 5 Manoj Hans 2022-09-29 12:23:35 UTC
It still fails when compute.replicas is missing from install-config.yaml.

DEBUG OpenShift Installer unreleased-master-7004-g1fb1397635c89ff8b3645fed4c4c264e4119fa84-dirty 
DEBUG Built from commit 1fb1397635c89ff8b3645fed4c4c264e4119fa84 
DEBUG Fetching Agent Installer ISO...              
DEBUG Loading Agent Installer ISO...               
DEBUG   Loading Agent Installer Ignition...        
DEBUG     Loading Agent Manifests...               
DEBUG       Loading Agent PullSecret...            
DEBUG         Loading Install Config...            
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x321d83b]

goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/agent.(*OptionalInstallConfig).validateSNOConfiguration(0x2?, 0xc000fc7400)
	/home/mhans/installer/pkg/asset/agent/installconfig.go:169 +0x81b
github.com/openshift/installer/pkg/asset/agent.(*OptionalInstallConfig).validateInstallConfig(0xc00123ede8?, 0x1ab1aa00?)
	/home/mhans/installer/pkg/asset/agent/installconfig.go:107 +0x1a5
github.com/openshift/installer/pkg/asset/agent.(*OptionalInstallConfig).Load(0xc000437f00, {0x1ab1aa00, 0xc00083d2b0})
	/home/mhans/installer/pkg/asset/agent/installconfig.go:62 +0x45
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000b1e510, {0x1ab21ed0, 0xc000437d80}, {0xc000335e18, 0x8})
	/home/mhans/installer/pkg/asset/store/store.go:264 +0x2b2
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000b1e510, {0x1ab21ff0, 0xc000141c10}, {0xc000335de6, 0x6})
	/home/mhans/installer/pkg/asset/store/store.go:247 +0xc05
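
The trace above ends in validateSNOConfiguration with a nil pointer dereference. One plausible cause, which this report does not confirm, is dereferencing an optional replicas pointer that stays nil when the field is omitted from install-config.yaml. A minimal Go sketch of a guarded read follows; machinePool and replicasOf are hypothetical names used only for illustration.

```go
package main

import "fmt"

// machinePool is a hypothetical stand-in for a machine pool whose replicas
// count is optional: nil means the field was not present in the YAML.
type machinePool struct {
	Name     string
	Replicas *int64
}

// replicasOf reads the replicas count without risking a nil dereference.
// The boolean is false when the pool or its replicas value is absent.
func replicasOf(p *machinePool) (int64, bool) {
	if p == nil || p.Replicas == nil {
		return 0, false
	}
	return *p.Replicas, true
}

func main() {
	worker := &machinePool{Name: "worker"} // replicas omitted
	if n, ok := replicasOf(worker); ok {
		fmt.Printf("pool %q has %d replicas\n", worker.Name, n)
	} else {
		fmt.Printf("pool %q has no replicas value; report a validation error instead of dereferencing\n", worker.Name)
	}
}
```

Returning a second boolean lets the caller distinguish an explicit 0 from a missing value, which matters here because only an explicit 0 is treated as SNO.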

Comment 6 Pawan Pinjarkar 2022-10-05 13:59:36 UTC
With PR https://github.com/openshift/installer/pull/6462, the validation message when compute.replicas is missing from install-config.yaml will be:

FATAL failed to fetch Agent Installer ISO: failed to load asset "Install Config": invalid install-config configuration: Compute.Replicas: Required value: Total number of Compute.Replicas must be 0 for none platform. Found 3


The installer's default install-config settings set Compute.Replicas to 3, hence the "Found 3" in the error message.

Sample install config 1: Compute is missing altogether

apiVersion: v1
baseDomain: test.metalkube.org
controlPlane: 
  hyperthreading: Enabled 
  name: master
  replicas: 1 
metadata:
  namespace: cluster-0
  name: ostest 
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14 
    hostPrefix: 23 
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.122.0/23
  serviceNetwork: 
  - 172.30.0.0/16
platform:
  none: {}
fips: false 
pullSecret: 
sshKey: 



Sample install config 2: Only Compute.Replicas is missing

apiVersion: v1
baseDomain: test.metalkube.org
compute: 
- hyperthreading: Enabled 
  name: worker
controlPlane: 
  hyperthreading: Enabled 
  name: master
metadata:
  namespace: cluster-0
  name: ostest 
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14 
    hostPrefix: 23 
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.122.0/23
  serviceNetwork: 
  - 172.30.0.0/16
platform:
  none: {}
fips: false 
pullSecret: 
sshKey:
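
Why both samples report "Found 3" can be illustrated with a short Go sketch. It assumes that defaults are applied before validation, as Comment 6 describes, and takes the default of 3 from that comment. The names config, pool, applyDefaults, and validateNonePlatform are hypothetical and do not mirror the installer's real code; the error text only imitates the message quoted above.

```go
package main

import "fmt"

// defaultComputeReplicas is the default described in Comment 6.
const defaultComputeReplicas int64 = 3

// pool and config are hypothetical, minimal stand-ins for install-config types.
type pool struct {
	Name     string
	Replicas *int64
}

type config struct {
	Platform string
	Compute  []pool
}

// applyDefaults fills in a worker pool and a replicas count when either
// is missing, so later validation always sees a concrete number.
func applyDefaults(c *config) {
	if len(c.Compute) == 0 {
		c.Compute = []pool{{Name: "worker"}}
	}
	for i := range c.Compute {
		if c.Compute[i].Replicas == nil {
			v := defaultComputeReplicas
			c.Compute[i].Replicas = &v
		}
	}
}

// validateNonePlatform rejects any non-zero total of compute replicas
// when the platform is "none".
func validateNonePlatform(c config) error {
	if c.Platform != "none" {
		return nil
	}
	var total int64
	for _, p := range c.Compute {
		if p.Replicas != nil {
			total += *p.Replicas
		}
	}
	if total != 0 {
		return fmt.Errorf("Compute.Replicas: Required value: Total number of Compute.Replicas must be 0 for none platform. Found %d", total)
	}
	return nil
}

func main() {
	sample1 := config{Platform: "none"}                                  // compute missing altogether
	sample2 := config{Platform: "none", Compute: []pool{{Name: "worker"}}} // only replicas missing
	for _, c := range []config{sample1, sample2} {
		applyDefaults(&c)
		fmt.Println(validateNonePlatform(c))
	}
}
```

In this sketch both samples fail the same way because defaulting erases the difference between a missing compute section and a missing replicas field before the check runs.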

Comment 7 Manoj Hans 2022-10-10 12:36:21 UTC
Bug has been verified with master branch. It's working as expected.

