Bug 2005440 - Dual-stack KubeAPI multi-node cluster with single Machine Network does not fail validation
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Infrastructure Operator
Version: rhacm-2.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: rhacm-2.5
Assignee: Mat Kowalski
QA Contact: Christopher Dawson
URL:
Whiteboard: AI-Team-Platform
Depends On:
Blocks: 2009760 2013207
Reported: 2021-09-17 16:28 UTC by Mat Kowalski
Modified: 2022-10-03 20:18 UTC
CC: 6 users

Fixed In Version: OCP-Metal-v1.0.27.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2009760
Environment:
Last Closed: 2022-10-03 20:18:56 UTC
Target Upstream Version:
Embargoed:




Links
* Github open-cluster-management backlog issue 16749 (last updated 2021-10-01 17:59:21 UTC)
* Github openshift/assisted-service pull 2660 (Merged): Bug 2007106: Extend dual-stack network subnet validations (last updated 2021-10-08 12:27:46 UTC)
* Github openshift/assisted-service pull 2731 (open): [WIP] Bug 2005440: Use current and new cluster data for dual-stack check (last updated 2021-10-08 14:29:17 UTC)
* Red Hat Issue Tracker MGMTBUGSM-41 (last updated 2022-02-04 06:54:49 UTC)

Description Mat Kowalski 2021-09-17 16:28:32 UTC
+++ Scenario

* dual-stack
* KubeAPI
* multi-node
* 2 Cluster Networks (IPv4 + IPv6)
* 2 Service Networks (IPv4 + IPv6)
* 1 Machine Network (IPv4)
* 1 API and 1 Ingress VIP (IPv4)

+++ Current state

The cluster (i.e. the ACI resource) keeps flapping in the "preparing for installation" state

+++ Desired state

The cluster should instead fail validation

+++ Internal info

What I have observed is that we try to generate the ignition config, but because the OCP Installer has a validator for the network configuration, it does not allow us to create a correct one. We do not handle the error coming from ignition.go in any meaningful way, so we go back to "preparing for installation" and the process starts again.

We should either create a new validator, or plug logic into an existing one, that checks the network configuration (i.e. the alignment of all the *-networks) and fails before we even try to pass anything to ignition.go.
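As a rough sketch of what such a validator could check (all names here are hypothetical, not the actual assisted-service code): a dual-stack cluster should declare exactly one IPv4 and one IPv6 subnet for each network type, IPv4 first. The error wording mirrors the "Expected 2 machine networks, found 1" message seen later in this bug.

```go
package main

import (
	"fmt"
	"net"
)

// isIPv4CIDR reports whether the CIDR string parses as an IPv4 network.
func isIPv4CIDR(cidr string) bool {
	ip, _, err := net.ParseCIDR(cidr)
	if err != nil {
		return false
	}
	return ip.To4() != nil
}

// validateDualStack checks that a network list for a dual-stack cluster
// contains exactly one IPv4 entry followed by one IPv6 entry.
func validateDualStack(kind string, cidrs []string) error {
	if len(cidrs) != 2 {
		return fmt.Errorf("Expected 2 %s networks, found %d", kind, len(cidrs))
	}
	if !isIPv4CIDR(cidrs[0]) || isIPv4CIDR(cidrs[1]) {
		return fmt.Errorf("%s networks must be ordered IPv4 first, then IPv6", kind)
	}
	return nil
}

func main() {
	// The failing scenario from this bug: a single IPv4 machine network.
	fmt.Println(validateDualStack("machine", []string{"192.168.111.0/24"}))
	// A valid dual-stack pair passes.
	fmt.Println(validateDualStack("machine", []string{"192.168.111.0/24", "fd2e:6f44:5dd8:5::/64"}))
}
```

Running the check this way, before anything is handed to ignition.go, would surface the misconfiguration as a validation failure instead of a retry loop.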

Comment 4 Mat Kowalski 2021-10-08 12:20:43 UTC
There seem to be two scenarios for this validator, and only one of them fails. It looks like the flow for cluster creation bypasses the validator function, while the flow for cluster update behaves correctly.

+++ Scenario 1

Initial ACI contains
* 2 Cluster Networks (IPv4 + IPv6)
* 2 Service Networks (IPv4 + IPv6)
* 1 Machine Network (IPv4)

In this case the validation does not fail and the cluster reports

```
  - lastProbeTime: "2021-10-08T12:19:37Z"
    lastTransitionTime: "2021-10-08T12:19:37Z"
    message: SyncOK
    reason: SyncOK
    status: "True"
    type: SpecSynced
  - lastProbeTime: "2021-10-08T12:19:37Z"
    lastTransitionTime: "2021-10-08T12:19:37Z"
    message: 'The cluster''s validations are failing: Clusters must have exactly 3
      dedicated masters. Please either add hosts, or disable the worker host,Hosts
      have not been discovered yet,Hosts have not been discovered yet,Hosts have not
      been discovered yet'
    reason: ValidationsFailing
    status: "False"
    type: Validated
```

+++ Scenario 2

Initial ACI contains
* 2 Cluster Networks (IPv4 + IPv6)
* 2 Service Networks (IPv4 + IPv6)
* 2 Machine Networks (IPv4 + IPv6)

An updated ACI contains
* 2 Cluster Networks (IPv4 + IPv6)
* 2 Service Networks (IPv4 + IPv6)
* 1 Machine Network (IPv4)

In this case the validation fails and we see

```
  - lastProbeTime: "2021-10-08T12:16:37Z"
    lastTransitionTime: "2021-10-08T12:16:37Z"
    message: 'The Spec could not be synced due to an input error: Expected 2 machine
      networks, found 1'
    reason: InputError
    status: "False"
    type: SpecSynced
```

Comment 5 Mat Kowalski 2021-10-08 13:04:10 UTC
Creating a cluster via an ACI does not include Machine Networks in `params.NewClusterParams`; the object looks like this:

```
time="2021-10-08T12:58:12Z" level=info msg="CHOCOBOMB: Creating cluster with params: &{AdditionalNtpSource:<nil> BaseDNSDomain:hive.example.com ClusterNetworkCidr:<nil> ClusterNetworkHostPrefix:0 ClusterNetworks:[0xc0025f9830 0xc0025f9860] CPUArchitecture: DiskEncryption:<nil> HighAvailabilityMode:<nil> HTTPProxy:<nil> HTTPSProxy:<nil> Hyperthreading:<nil> IngressVip:192.168.111.101 MachineNetworks:[] Name:0xc001fcd710 NetworkType:<nil> NoProxy:<nil> OcpReleaseImage: OlmOperators:[] OpenshiftVersion:0xc001fcd720 Platform:<nil> PullSecret:0xc001fcd730 SchedulableMasters:<nil> ServiceNetworkCidr:<nil> ServiceNetworks:[0xc0015f2ae0 0xc0015f2b00] SSHPublicKey:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC1b/IibQkel9sU5OYuNkoL3qda0vzgx2Sb2lmF5hFsZ3L2D+w+Ixkwjw1g0jQAsQ+00rlKYgdxVmUWYpGE2ZKLQ75kHzs4qChupTMb1rJL5YH8xVeKuCN86WkW2rn5vT7gY8r+m/odCBkL4WQDxGVXdHcevhO6klehsb2PdhqKkbm+xNMrHSOWOnxbV2O7U4VdWgHMcPt9vlSf4ewNHMNer0cTmmqIIg9Lqbp5p8zcM20uSdMQBjar+A2PHu29CyjqVMczu7S6G/DLbTG4GnovcPJwOiNUgOLEt13kNLRbODXl610DmESS4Si4bAZvi555fXmoAgrW4uLCZ8zOEgMaz+G6yhcMqJ47WjznhbJRJeWmqz3pjd+252SCrznAmXrbD/mpjYZulDLPIejENJzd7LRBp3DBDQtgrWeP+04CosNYD2vXWV+Xlofd/uSdVzyY+kKkuatGx7R13PHK+WlgxW3albEPEgz8T+3IRKNNfDmwtEem6R0KAhTuC0volGk= root.lab.eng.rdu2.redhat.com UserManagedNetworking:0xc0018d5ba9 VipDhcpAllocation:0xc0018d5ba8}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:477" cluster_id=ba34e502-a244-4406-961e-a9fd78a3c0fa go-id=592 pkg=Inventory request_id=f5508ad2-c4a7-4734-b036-cf9fffd8db9d

[...]

time="2021-10-08T12:58:12Z" level=info msg="ClusterDeployment Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).Reconcile" file="/go/src/github.com/openshift/origin/internal/controller/controllers/clusterdeployments_controller.go:117" cluster_deployment=dual-aci cluster_deployment_namespace=assisted-installer-2 go-id=592 request_id=ed803828-2f16-476a-a868-f1bab0f5864e
time="2021-10-08T12:58:12Z" level=info msg="update cluster ba34e502-a244-4406-961e-a9fd78a3c0fa with params: &{AdditionalNtpSource:<nil> APIVip:0xc001b2a820 APIVipDNSName:<nil> BaseDNSDomain:<nil> ClusterNetworkCidr:<nil> ClusterNetworkHostPrefix:<nil> ClusterNetworks:[] DiskEncryption:<nil> HTTPProxy:<nil> HTTPSProxy:<nil> Hyperthreading:<nil> IngressVip:<nil> MachineNetworkCidr:<nil> MachineNetworks:[0xc001c09dc0] Name:<nil> NetworkType:0xc001b2a810 NoProxy:<nil> OlmOperators:[] Platform:<nil> PullSecret:<nil> SchedulableMasters:<nil> ServiceNetworkCidr:<nil> ServiceNetworks:[] SSHPublicKey:<nil> UserManagedNetworking:<nil> VipDhcpAllocation:<nil>}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).v2UpdateClusterInternal" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2375" go-id=592 pkg=Inventory request_id=ed803828-2f16-476a-a868-f1bab0f5864e
time="2021-10-08T12:58:12Z" level=info msg="CHOCOBOMB: Updating cluster with params: &{AdditionalNtpSource:<nil> APIVip:0xc001b2a820 APIVipDNSName:<nil> BaseDNSDomain:<nil> ClusterNetworkCidr:<nil> ClusterNetworkHostPrefix:<nil> ClusterNetworks:[] DiskEncryption:<nil> HTTPProxy:<nil> HTTPSProxy:<nil> Hyperthreading:<nil> IngressVip:<nil> MachineNetworkCidr:<nil> MachineNetworks:[0xc001c09dc0] Name:<nil> NetworkType:0xc001b2a810 NoProxy:<nil> OlmOperators:[] Platform:<nil> PullSecret:<nil> SchedulableMasters:<nil> ServiceNetworkCidr:<nil> ServiceNetworks:[] SSHPublicKey:<nil> UserManagedNetworking:<nil> VipDhcpAllocation:<nil>}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).validateAndUpdateClusterParams" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2153" go-id=592 pkg=Inventory request_id=ed803828-2f16-476a-a868-f1bab0f5864e
time="2021-10-08T12:58:12Z" level=info msg="Updated clusterDeployment assisted-installer-2/dual-aci" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).updateIfNeeded" file="/go/src/github.com/openshift/origin/internal/controller/controllers/clusterdeployments_controller.go:739" agent_cluster_install=dual-aci agent_cluster_install_namespace=assisted-installer-2 cluster_deployment=dual-aci cluster_deployment_namespace=assisted-installer-2 go-id=592 request_id=ed803828-2f16-476a-a868-f1bab0f5864e
time="2021-10-08T12:58:12Z" level=info msg="ClusterDeployment Reconcile ended" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).Reconcile.func1" file="/go/src/github.com/openshift/origin/internal/controller/controllers/clusterdeployments_controller.go:114" agent_cluster_install=dual-aci agent_cluster_install_namespace=assisted-installer-2 cluster_deployment=dual-aci cluster_deployment_namespace=assisted-installer-2 go-id=592 request_id=ed803828-2f16-476a-a868-f1bab0f5864e
```
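The empty `MachineNetworks:[]` in the create-time params above hints at why the create flow slips past the check: a validator that only inspects the networks it was given has nothing to inspect. A minimal Go sketch of that failure mode (hypothetical names, not the actual assisted-service logic):

```go
package main

import "fmt"

// checkMachineNetworks illustrates how a length check that only fires on
// non-empty input lets the create flow through: an empty slice (as in the
// create-time params) skips validation entirely, while the update flow,
// which supplies one network, is caught.
func checkMachineNetworks(machineNetworks []string) error {
	if len(machineNetworks) == 0 {
		// Create flow: nothing supplied, so the check never fires.
		return nil
	}
	// Update flow: a partial dual-stack configuration is rejected here.
	if len(machineNetworks) != 2 {
		return fmt.Errorf("Expected 2 machine networks, found %d", len(machineNetworks))
	}
	return nil
}

func main() {
	fmt.Println(checkMachineNetworks([]string{}))                   // create flow: no error
	fmt.Println(checkMachineNetworks([]string{"192.168.111.0/24"})) // update flow: error
}
```

This matches the behaviour described in comment 4: Scenario 1 (create) passes, Scenario 2 (update) fails with "Expected 2 machine networks, found 1".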

Comment 6 Mike Ng 2021-10-12 20:24:50 UTC
G2Bsync of comment 941430821 by CrystalChun, Tue, 12 Oct 2021 20:04:17 UTC:
PRs are already merged
https://github.com/openshift/assisted-service/pull/2731
https://github.com/openshift/assisted-service/pull/2660

Comment 7 nshidlin 2022-04-13 07:31:17 UTC
Verified using ACM 2.5.0-DOWNSTREAM-2022-04-11-09-21-38

On creation of a dual-stack ACI with a single machineNetwork:

```
oc get agentclusterinstalls.extensions.hive.openshift.io spoke-0 -o json | jq '.spec.networking'
{
  "clusterNetwork": [
    {
      "cidr": "10.128.0.0/14",
      "hostPrefix": 23
    },
    {
      "cidr": "fd01::/48",
      "hostPrefix": 64
    }
  ],
  "machineNetwork": [
    {
      "cidr": "fd2e:6f44:5dd8:5::/64"
    }
  ],
  "serviceNetwork": [
    "172.30.0.0/16",
    "fd02::/112"
  ]
}
```

The spec is not synced, with a clear message:

```
oc get agentclusterinstalls.extensions.hive.openshift.io spoke-0 -o json | jq '.status.conditions | map(select(.type=="SpecSynced"))'
[
  {
    "lastProbeTime": "2022-04-13T07:08:04Z",
    "lastTransitionTime": "2022-04-13T07:08:04Z",
    "message": "The Spec could not be synced due to an input error: Expected 2 machine networks, found 1",
    "reason": "InputError",
    "status": "False",
    "type": "SpecSynced"
  }
]
```

Patching in the missing machine CIDR allows the ACI to sync.

Updating the ACI to remove the second machine CIDR brings the ACI back to the not-synced state, so this flow did not regress.
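For reference, the patch that adds the missing machine CIDR is a JSON merge patch of roughly this shape; the IPv6 subnet is the one from the spec shown above, while the IPv4 subnet here is illustrative, not taken from the verification environment:

```json
{
  "spec": {
    "networking": {
      "machineNetwork": [
        { "cidr": "192.168.111.0/24" },
        { "cidr": "fd2e:6f44:5dd8:5::/64" }
      ]
    }
  }
}
```

This could be applied with something like `oc patch agentclusterinstalls.extensions.hive.openshift.io spoke-0 --type merge -p "$(cat patch.json)"` (the exact command used during verification is not recorded in this bug).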

