1999297 – [Assisted-4.8 ][SaaS] vip-dhcp-allocation mode broken cannot set networking for cluster

Bug 1999297 - [Assisted-4.8 ][SaaS] vip-dhcp-allocation mode broken cannot set networking for cluster


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	assisted-installer
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Mat Kowalski
QA Contact:	Yuri Obshansky
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-08-30 20:04 UTC by Yuri Obshansky
Modified:	2021-10-18 17:50 UTC
CC List:	7 users
Fixed In Version:	OCP-Metal-V1.0.25.3
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-10-18 17:49:59 UTC
Target Upstream Version:
Embargoed:

Attachments
scenario 1 screenshot (78.76 KB, image/png) 2021-08-30 20:04 UTC, Yuri Obshansky

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift-assisted assisted-ui-lib pull 767	None	None	None	2021-08-31 15:34:42 UTC
Github	openshift assisted-service pull 2512	None	None	None	2021-08-30 23:14:48 UTC
Github	openshift assisted-service pull 2527	None	None	None	2021-08-31 21:38:53 UTC
Github	openshift assisted-service pull 2545	None	None	None	2021-09-02 13:17:34 UTC
Red Hat Product Errata	RHSA-2021:3759	None	None	None	2021-10-18 17:50:11 UTC

Description Yuri Obshansky 2021-08-30 20:04:22 UTC

Created attachment 1819203
scenario 1 screenshot

Description of problem:
It is impossible to set cluster networking, regardless of whether
"Allocate virtual IPs via DHCP server" is True or False.
Both cases fail.

Version-Release number of selected component (if applicable):
v1.0.25.2 

How reproducible:
Scenario 1: Allocate virtual IPs via DHCP server - True
- Create a cluster, download the ISO, discover nodes, go to Networking
- Leave "Allocate virtual IPs via DHCP server" unchanged (checked)
Result ->
AI found the API Virtual IP and Ingress Virtual IP, but after several seconds it failed with:
1. The DHCP server failed to allocate the IP
The API virtual IP is undefined; IP allocation from the DHCP server timed out.
2. The DHCP server failed to allocate the IP
The Ingress virtual IP is undefined; IP allocation from the DHCP server timed out.
3. Cluster is not ready yet
The following requirements must be met:
The API virtual IP is undefined; IP allocation from the DHCP server timed out.
The Ingress virtual IP is undefined; IP allocation from the DHCP server timed out.

Scenario 2: Allocate virtual IPs via DHCP server - False
- Create a cluster, download the ISO, discover nodes, go to Networking
- Uncheck "Allocate virtual IPs via DHCP server"
- Enter valid VIPs:
API Virtual IP: 192.168.123.5
Ingress Virtual IP: 192.168.123.10

Result ->
Failed to update the cluster
Setting Machine network CIDR is forbidden when cluster is not in vip-dhcp-allocation mode


Comment 3 Mat Kowalski 2021-08-31 14:58:46 UTC

For scenario 2 I see the following payload in the UI

```
{"api_vip":"192.168.127.52","ingress_vip":"192.168.127.51","ssh_public_key":"ssh-rsa AAAAB[...]W+b6wp5c=","vip_dhcp_allocation":false,"network_type":"OVNKubernetes","user_managed_networking":false,"cluster_networks":[{"cidr":"10.128.0.0/14","host_prefix":23}],"service_networks":[{"cidr":"172.30.0.0/16"}],"machine_networks":[{"cidr":"192.168.127.0/24"}]}
```

Removing machine_networks from this payload and sending it manually via curl as a PATCH request succeeds:

```
'{"api_vip":"192.168.127.52","ingress_vip":"192.168.127.51","ssh_public_key":"ssh-rsa AAA[...]wp5c=","vip_dhcp_allocation":false,"network_type":"OVNKubernetes","user_managed_networking":false,"cluster_networks":[{"cidr":"10.128.0.0/14","host_prefix":23}],"service_networks":[{"cidr":"172.30.0.0/16"}]}'
```

Comment 4 Mat Kowalski 2021-08-31 15:05:24 UTC

The request also succeeds when the payload contains an *empty* "machine_networks" list, i.e.

```
'{"api_vip":"192.168.127.52","ingress_vip":"192.168.127.51","ssh_public_key":"ssh-rsa AAAA[...]W+b6wp5c=","vip_dhcp_allocation":false,"network_type":"OVNKubernetes","user_managed_networking":false,"cluster_networks":[{"cidr":"10.128.0.0/14","host_prefix":23}],"service_networks":[{"cidr":"172.30.0.0/16"}],"machine_networks":[]}'
```
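The workaround from the two comments above (dropping "machine_networks" when "vip_dhcp_allocation" is false) can be sketched roughly as below. This is only an illustration of the payload manipulation, not the actual UI or service fix; `stripMachineNetworks` is a hypothetical helper name, and the sketch assumes the field should be left alone when "vip_dhcp_allocation" is absent from the payload.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stripMachineNetworks removes "machine_networks" from a cluster-update
// payload when "vip_dhcp_allocation" is explicitly false.
// Hypothetical helper for illustration; it is not part of assisted-service
// or assisted-ui-lib.
func stripMachineNetworks(payload []byte) ([]byte, error) {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(payload, &fields); err != nil {
		return nil, err
	}

	raw, ok := fields["vip_dhcp_allocation"]
	if !ok {
		// Assumption: without an explicit mode, leave the payload untouched.
		return payload, nil
	}

	var dhcp bool
	if err := json.Unmarshal(raw, &dhcp); err != nil {
		return nil, err
	}
	if !dhcp {
		delete(fields, "machine_networks")
	}

	// json.Marshal writes map keys in sorted order.
	return json.Marshal(fields)
}

func main() {
	in := []byte(`{"api_vip":"192.168.127.52","ingress_vip":"192.168.127.51",` +
		`"vip_dhcp_allocation":false,"machine_networks":[{"cidr":"192.168.127.0/24"}]}`)

	out, err := stripMachineNetworks(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A payload with "vip_dhcp_allocation":true passes through with "machine_networks" intact, which matches the DHCP-mode payloads that are expected to carry a machine network.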

Comment 5 Mat Kowalski 2021-08-31 15:13:13 UTC

For scenario 1 the error message "IP allocation from the DHCP server timed out" is coming from the following validator function

```
func isDhcpLeaseAllocationTimedOut(c *clusterPreprocessContext) bool {
	return c.cluster.MachineNetworkCidrUpdatedAt.String() != "" && time.Since(c.cluster.MachineNetworkCidrUpdatedAt) > DhcpLeaseTimeoutMinutes*time.Minute
}
```

Those values are updated via

```
func UpdateMachineCidr(db *gorm.DB, cluster *common.Cluster, machineCidr string) error {

[...]

		return db.Model(&common.Cluster{}).Where("id = ?", cluster.ID.String()).Updates(map[string]interface{}{
			"machine_network_cidr":            machineCidr,
			"machine_network_cidr_updated_at": time.Now(),
		}).Error

```

Comment 6 Mat Kowalski 2021-08-31 20:54:02 UTC

++++++++++++++++++
+++ SCENARIO 1 +++
++++++++++++++++++

Those are payloads used to configure respective options

+++ Cluster-Managed Networking, disabling "Allocate virtual IPs via DHCP server", manually providing VIPs

`--data '{"api_vip":"192.168.127.201","ingress_vip":"192.168.127.202","vip_dhcp_allocation":false,"network_type":"OVNKubernetes","user_managed_networking":false,"cluster_networks":[{"cidr":"10.128.0.0/14","host_prefix":23}],"service_networks":[{"cidr":"172.30.0.0/16"}]}'`

+++ Cluster-Managed Networking, enabling "Allocate virtual IPs via DHCP server", machine network #1

`--data '{"vip_dhcp_allocation":true,"network_type":"OVNKubernetes","user_managed_networking":false,"cluster_networks":[{"cidr":"10.128.0.0/14","host_prefix":23}],"service_networks":[{"cidr":"172.30.0.0/16"}],"machine_networks":[{"cidr":"192.168.127.0/24"}]}'`

+++ Cluster-Managed Networking, enabling "Allocate virtual IPs via DHCP server", machine network #2

`--data '{"vip_dhcp_allocation":true,"network_type":"OVNKubernetes","user_managed_networking":false,"cluster_networks":[{"cidr":"10.128.0.0/14","host_prefix":23}],"service_networks":[{"cidr":"172.30.0.0/16"}],"machine_networks":[{"cidr":"192.168.145.0/24"}]}'`

------------------

Now the observations

- disabling auto-allocation just works, no further comments
- enabling auto-allocation causes `GET /api/.../clusters/UUID | jq '.api_vip'` to return null for about 15 seconds after sending the request; after that, the IP is returned consistently
- sending any PATCH request containing "machine_networks" (even if the value does not change) causes GET to return null again for about 15 seconds

From this I believe what happens is the following

- whenever params.ClusterUpdateParams.MachineNetworks is not empty, we are triggering a reallocation of the VIP
- it does not matter whether params.ClusterUpdateParams.MachineNetworks is equal to cluster.MachineNetworks or not

What should happen

- reallocation should be triggered only when params.ClusterUpdateParams.MachineNetworks != cluster.MachineNetworks
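The proposed condition could look roughly like the sketch below. This is only an illustration of the comparison, not the code from the actual fix: `MachineNetwork` is a stand-in type, `needsReallocation` is a hypothetical name, and treating the lists as order-insensitive sets of CIDR strings is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// MachineNetwork is a stand-in for the service's model type (assumption).
type MachineNetwork struct {
	Cidr string
}

// sortedCidrs returns the CIDR strings of nets in sorted order.
func sortedCidrs(nets []MachineNetwork) []string {
	out := make([]string, 0, len(nets))
	for _, n := range nets {
		out = append(out, n.Cidr)
	}
	sort.Strings(out)
	return out
}

// needsReallocation reports whether a VIP reallocation should be triggered:
// only when the request carries machine networks AND they differ from what
// the cluster already has. CIDRs are compared as plain strings, without
// normalization.
func needsReallocation(current, requested []MachineNetwork) bool {
	if len(requested) == 0 {
		return false // field not sent or empty: nothing to reallocate
	}
	a, b := sortedCidrs(current), sortedCidrs(requested)
	if len(a) != len(b) {
		return true
	}
	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}
	return false
}

func main() {
	current := []MachineNetwork{{Cidr: "192.168.127.0/24"}}

	// Same value re-sent by a PATCH: no reallocation.
	fmt.Println(needsReallocation(current, []MachineNetwork{{Cidr: "192.168.127.0/24"}}))

	// Machine network actually changed: reallocate.
	fmt.Println(needsReallocation(current, []MachineNetwork{{Cidr: "192.168.145.0/24"}}))
}
```

With a check like this, a PATCH that repeats the current machine network would no longer reset the allocation and restart the roughly 15-second window observed above.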

Comment 7 Mat Kowalski 2021-08-31 21:25:09 UTC

An additional observation, which does not change the fundamental flaw in the reallocation logic described above but does make the issue visible: after creating a cluster and opening the Networking tab in the UI, the browser keeps sending PATCH requests repeatedly even though I am not touching the UI at all.

I would have expected a PATCH to be sent only when I explicitly click something in the UI that changes the underlying value.

Comment 8 Mat Kowalski 2021-08-31 21:47:27 UTC

The current state is believed to be

- fix for scenario (1) in https://github.com/openshift/assisted-service/pull/2527
- fix for scenario (2) in https://github.com/openshift-assisted/assisted-ui-lib/pull/767

https://github.com/openshift/assisted-service/pull/2512 does not directly affect the issue and should be considered out of scope for this BZ.

Comment 11 Yuri Obshansky 2021-09-07 13:48:47 UTC

Verified both scenarios on Staging
UI 1.5.35
BE v1.0.25.3

Comment 14 errata-xmlrpc 2021-10-18 17:49:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
