Bug 2005440

Summary: Dual-stack KubeAPI multi-node cluster with single Machine Network does not fail validation
Product: Red Hat Advanced Cluster Management for Kubernetes
Component: Infrastructure Operator
Reporter: Mat Kowalski <mko>
Assignee: Mat Kowalski <mko>
Docs Contact: Christopher Dawson <cdawson>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Version: rhacm-2.4
Target Release: rhacm-2.5
Target Milestone: ---
Keywords: Triaged
CC: asegurap, ccrum, fpercoco, mfilanov, trwest, yfirst
Hardware: Unspecified
OS: Unspecified
Whiteboard: AI-Team-Platform
Fixed In Version: OCP-Metal-v1.0.27.0
Clones: 2009760 (view as bug list)
Last Closed: 2022-10-03 20:18:56 UTC
Type: Bug
Bug Blocks: 2009760, 2013207

Description Mat Kowalski 2021-09-17 16:28:32 UTC
+++ Scenario

* dual-stack
* KubeAPI
* multi-node
* 2 Cluster Networks (IPv4 + IPv6)
* 2 Service Networks (IPv4 + IPv6)
* 1 Machine Network (IPv4)
* 1 API and 1 Ingress VIP (IPv4)

+++ Current state

The cluster (i.e. the ACI resource) is flapping in "preparing for installation"

+++ Desired state

The cluster should fail validation

+++ Internal info

What I have observed is that we try to generate the ignition, but because the OCP Installer has a validator for the network configuration, it does not allow us to create a correct one. We do not handle the error coming from ignition.go in any meaningful way, so the cluster goes back to "preparing for installation" and the process starts again.

We should either create a new validator, or plug logic into an existing one, that checks the network configuration (i.e. alignment of all the *-networks) and fails before anything is passed to ignition.go.
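Such a validator could be sketched along these lines (a minimal illustration, not the actual assisted-service code; the function names `validateNetworkFamilies`, `dualStack`, and `isIPv6` are hypothetical): when the cluster or service networks span both address families, the machine networks must too, and the check fails before anything reaches ignition.go.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// isIPv6 reports whether a CIDR string belongs to the IPv6 family.
func isIPv6(cidr string) bool {
	ip, _, err := net.ParseCIDR(cidr)
	if err != nil {
		return strings.Contains(cidr, ":")
	}
	return ip.To4() == nil
}

// dualStack reports whether the CIDR list contains both an IPv4
// and an IPv6 network.
func dualStack(cidrs []string) bool {
	var v4, v6 bool
	for _, c := range cidrs {
		if isIPv6(c) {
			v6 = true
		} else {
			v4 = true
		}
	}
	return v4 && v6
}

// validateNetworkFamilies fails early when the cluster or service
// networks are dual-stack but the machine networks are not, mirroring
// the "Expected 2 machine networks, found 1" error of the update flow.
func validateNetworkFamilies(clusterNets, serviceNets, machineNets []string) error {
	if (dualStack(clusterNets) || dualStack(serviceNets)) && !dualStack(machineNets) {
		return fmt.Errorf("expected 2 machine networks (IPv4 + IPv6), found %d", len(machineNets))
	}
	return nil
}

func main() {
	// The scenario from this bug: dual-stack cluster/service networks,
	// single IPv4 machine network.
	err := validateNetworkFamilies(
		[]string{"10.128.0.0/14", "fd01::/48"},
		[]string{"172.30.0.0/16", "fd02::/112"},
		[]string{"192.168.111.0/24"},
	)
	fmt.Println(err) // prints: expected 2 machine networks (IPv4 + IPv6), found 1
}
```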

Comment 4 Mat Kowalski 2021-10-08 12:20:43 UTC
There seem to be two scenarios for this validator and only one of them fails. It looks like the flow for cluster creation bypasses the validator function, while the flow for cluster update is correct.

+++ Scenario 1

Initial ACI contains
* 2 Cluster Networks (IPv4 + IPv6)
* 2 Service Networks (IPv4 + IPv6)
* 1 Machine Network (IPv4)

In this case the validation does not fail and the cluster ends up in

```
  - lastProbeTime: "2021-10-08T12:19:37Z"
    lastTransitionTime: "2021-10-08T12:19:37Z"
    message: SyncOK
    reason: SyncOK
    status: "True"
    type: SpecSynced
  - lastProbeTime: "2021-10-08T12:19:37Z"
    lastTransitionTime: "2021-10-08T12:19:37Z"
    message: 'The cluster''s validations are failing: Clusters must have exactly 3
      dedicated masters. Please either add hosts, or disable the worker host,Hosts
      have not been discovered yet,Hosts have not been discovered yet,Hosts have not
      been discovered yet'
    reason: ValidationsFailing
    status: "False"
    type: Validated
```

+++ Scenario 2

Initial ACI contains
* 2 Cluster Networks (IPv4 + IPv6)
* 2 Service Networks (IPv4 + IPv6)
* 2 Machine Networks (IPv4 + IPv6)

An updated ACI contains
* 2 Cluster Networks (IPv4 + IPv6)
* 2 Service Networks (IPv4 + IPv6)
* 1 Machine Network (IPv4)

In this case the validation fails and we see

```
  - lastProbeTime: "2021-10-08T12:16:37Z"
    lastTransitionTime: "2021-10-08T12:16:37Z"
    message: 'The Spec could not be synced due to an input error: Expected 2 machine
      networks, found 1'
    reason: InputError
    status: "False"
    type: SpecSynced
```

Comment 5 Mat Kowalski 2021-10-08 13:04:10 UTC
Creating a cluster via ACI does not include Machine Networks in `params.NewClusterParams`; the object looks like this:

```
time="2021-10-08T12:58:12Z" level=info msg="CHOCOBOMB: Creating cluster with params: &{AdditionalNtpSource:<nil> BaseDNSDomain:hive.example.com ClusterNetworkCidr:<nil> ClusterNetworkHostPrefix:0 ClusterNetworks:[0xc0025f9830 0xc0025f9860] CPUArchitecture: DiskEncryption:<nil> HighAvailabilityMode:<nil> HTTPProxy:<nil> HTTPSProxy:<nil> Hyperthreading:<nil> IngressVip:192.168.111.101 MachineNetworks:[] Name:0xc001fcd710 NetworkType:<nil> NoProxy:<nil> OcpReleaseImage: OlmOperators:[] OpenshiftVersion:0xc001fcd720 Platform:<nil> PullSecret:0xc001fcd730 SchedulableMasters:<nil> ServiceNetworkCidr:<nil> ServiceNetworks:[0xc0015f2ae0 0xc0015f2b00] SSHPublicKey:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC1b/IibQkel9sU5OYuNkoL3qda0vzgx2Sb2lmF5hFsZ3L2D+w+Ixkwjw1g0jQAsQ+00rlKYgdxVmUWYpGE2ZKLQ75kHzs4qChupTMb1rJL5YH8xVeKuCN86WkW2rn5vT7gY8r+m/odCBkL4WQDxGVXdHcevhO6klehsb2PdhqKkbm+xNMrHSOWOnxbV2O7U4VdWgHMcPt9vlSf4ewNHMNer0cTmmqIIg9Lqbp5p8zcM20uSdMQBjar+A2PHu29CyjqVMczu7S6G/DLbTG4GnovcPJwOiNUgOLEt13kNLRbODXl610DmESS4Si4bAZvi555fXmoAgrW4uLCZ8zOEgMaz+G6yhcMqJ47WjznhbJRJeWmqz3pjd+252SCrznAmXrbD/mpjYZulDLPIejENJzd7LRBp3DBDQtgrWeP+04CosNYD2vXWV+Xlofd/uSdVzyY+kKkuatGx7R13PHK+WlgxW3albEPEgz8T+3IRKNNfDmwtEem6R0KAhTuC0volGk= root.lab.eng.rdu2.redhat.com UserManagedNetworking:0xc0018d5ba9 VipDhcpAllocation:0xc0018d5ba8}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:477" cluster_id=ba34e502-a244-4406-961e-a9fd78a3c0fa go-id=592 pkg=Inventory request_id=f5508ad2-c4a7-4734-b036-cf9fffd8db9d

[...]

time="2021-10-08T12:58:12Z" level=info msg="ClusterDeployment Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).Reconcile" file="/go/src/github.com/openshift/origin/internal/controller/controllers/clusterdeployments_controller.go:117" cluster_deployment=dual-aci cluster_deployment_namespace=assisted-installer-2 go-id=592 request_id=ed803828-2f16-476a-a868-f1bab0f5864e
time="2021-10-08T12:58:12Z" level=info msg="update cluster ba34e502-a244-4406-961e-a9fd78a3c0fa with params: &{AdditionalNtpSource:<nil> APIVip:0xc001b2a820 APIVipDNSName:<nil> BaseDNSDomain:<nil> ClusterNetworkCidr:<nil> ClusterNetworkHostPrefix:<nil> ClusterNetworks:[] DiskEncryption:<nil> HTTPProxy:<nil> HTTPSProxy:<nil> Hyperthreading:<nil> IngressVip:<nil> MachineNetworkCidr:<nil> MachineNetworks:[0xc001c09dc0] Name:<nil> NetworkType:0xc001b2a810 NoProxy:<nil> OlmOperators:[] Platform:<nil> PullSecret:<nil> SchedulableMasters:<nil> ServiceNetworkCidr:<nil> ServiceNetworks:[] SSHPublicKey:<nil> UserManagedNetworking:<nil> VipDhcpAllocation:<nil>}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).v2UpdateClusterInternal" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2375" go-id=592 pkg=Inventory request_id=ed803828-2f16-476a-a868-f1bab0f5864e
time="2021-10-08T12:58:12Z" level=info msg="CHOCOBOMB: Updating cluster with params: &{AdditionalNtpSource:<nil> APIVip:0xc001b2a820 APIVipDNSName:<nil> BaseDNSDomain:<nil> ClusterNetworkCidr:<nil> ClusterNetworkHostPrefix:<nil> ClusterNetworks:[] DiskEncryption:<nil> HTTPProxy:<nil> HTTPSProxy:<nil> Hyperthreading:<nil> IngressVip:<nil> MachineNetworkCidr:<nil> MachineNetworks:[0xc001c09dc0] Name:<nil> NetworkType:0xc001b2a810 NoProxy:<nil> OlmOperators:[] Platform:<nil> PullSecret:<nil> SchedulableMasters:<nil> ServiceNetworkCidr:<nil> ServiceNetworks:[] SSHPublicKey:<nil> UserManagedNetworking:<nil> VipDhcpAllocation:<nil>}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).validateAndUpdateClusterParams" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2153" go-id=592 pkg=Inventory request_id=ed803828-2f16-476a-a868-f1bab0f5864e
time="2021-10-08T12:58:12Z" level=info msg="Updated clusterDeployment assisted-installer-2/dual-aci" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).updateIfNeeded" file="/go/src/github.com/openshift/origin/internal/controller/controllers/clusterdeployments_controller.go:739" agent_cluster_install=dual-aci agent_cluster_install_namespace=assisted-installer-2 cluster_deployment=dual-aci cluster_deployment_namespace=assisted-installer-2 go-id=592 request_id=ed803828-2f16-476a-a868-f1bab0f5864e
time="2021-10-08T12:58:12Z" level=info msg="ClusterDeployment Reconcile ended" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).Reconcile.func1" file="/go/src/github.com/openshift/origin/internal/controller/controllers/clusterdeployments_controller.go:114" agent_cluster_install=dual-aci agent_cluster_install_namespace=assisted-installer-2 cluster_deployment=dual-aci cluster_deployment_namespace=assisted-installer-2 go-id=592 request_id=ed803828-2f16-476a-a868-f1bab0f5864e
```
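The log above shows why the two flows diverge: the creation request arrives with `MachineNetworks:[]`, while the update request carries one entry. A count-based check that only fires when machine networks were supplied at all would therefore pass on creation and fail on update. The sketch below illustrates that gap (a hypothetical `naiveCheck`, not the actual assisted-service logic):

```go
package main

import "fmt"

// naiveCheck mirrors a count-based validation that is skipped when no
// machine networks were supplied: an empty list (the creation flow) is
// silently accepted, while a single entry (the update flow) is rejected
// for a dual-stack cluster.
func naiveCheck(machineNets []string, wantDualStack bool) error {
	if len(machineNets) == 0 {
		return nil // creation flow: nothing to compare, check is skipped
	}
	if wantDualStack && len(machineNets) != 2 {
		return fmt.Errorf("expected 2 machine networks, found %d", len(machineNets))
	}
	return nil
}

func main() {
	// Creation flow: MachineNetworks:[] -> no error, validator bypassed.
	fmt.Println(naiveCheck([]string{}, true)) // prints: <nil>
	// Update flow: one IPv4 machine network -> rejected.
	fmt.Println(naiveCheck([]string{"192.168.111.0/24"}, true)) // prints: expected 2 machine networks, found 1
}
```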

Comment 6 Mike Ng 2021-10-12 20:24:50 UTC
G2Bsync 941430821 comment 
 CrystalChun Tue, 12 Oct 2021 20:04:17 UTC 
 G2Bsync 
PRs are already merged
https://github.com/openshift/assisted-service/pull/2731
https://github.com/openshift/assisted-service/pull/2660

Comment 7 nshidlin 2022-04-13 07:31:17 UTC
Verified using ACM 2.5.0-DOWNSTREAM-2022-04-11-09-21-38

On creation of a dual-stack ACI with a single machineNetwork:

```
oc get agentclusterinstalls.extensions.hive.openshift.io spoke-0 -o json | jq '.spec.networking'
{
  "clusterNetwork": [
    {
      "cidr": "10.128.0.0/14",
      "hostPrefix": 23
    },
    {
      "cidr": "fd01::/48",
      "hostPrefix": 64
    }
  ],
  "machineNetwork": [
    {
      "cidr": "fd2e:6f44:5dd8:5::/64"
    }
  ],
  "serviceNetwork": [
    "172.30.0.0/16",
    "fd02::/112"
  ]
}
```

The spec is not synced, with a clear message:
```
oc get agentclusterinstalls.extensions.hive.openshift.io spoke-0 -o json | jq '.status.conditions | map(select(.type=="SpecSynced"))'

[
  {
    "lastProbeTime": "2022-04-13T07:08:04Z",
    "lastTransitionTime": "2022-04-13T07:08:04Z",
    "message": "The Spec could not be synced due to an input error: Expected 2 machine networks, found 1",
    "reason": "InputError",
    "status": "False",
    "type": "SpecSynced"
  }
]
```

Patching in the missing machine CIDR allows the ACI to sync.

Updating the ACI to remove the second machine CIDR brings the ACI back to the not-synced state, so this flow did not regress.