Bug 1451023
Summary: | Changes to the default clusterNetworkCIDR & hostSubnetLength via the installer do not take into account the old default values when adding a new master. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ryan Howe <rhowe> |
Component: | Installer | Assignee: | Andrew Butcher <abutcher> |
Status: | CLOSED ERRATA | QA Contact: | Gan Huang <ghuang> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 3.4.0 | CC: | aos-bugs, erich, jiajliu, jokerman, mmccomas, mwoodson, rhowe, sdodson, smilner, weshi |
Target Milestone: | --- | Keywords: | NeedsTestCase |
Target Release: | 3.7.0 | Flags: | jiajliu: needinfo-
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openshift-ansible-3.7.0-0.126.1.git.0.0bb5b0c.el7.noarch | Doc Type: | Bug Fix |
Doc Text: |
Cause:
When upgrading between versions (specifically from 3.3/1.3 or earlier to 3.4 or later), the default values for clusterNetworkCIDR and hostSubnetLength changed. If the inventory file did not specify the corresponding inventory variables, the upgrade would fail.
Consequence:
Controller service fails to start back up.
Fix:
The following are now required inventory variables when upgrading or installing:
- osm_cluster_network_cidr
- osm_host_subnet_length
- openshift_portal_net
Result:
If the required variables are not set, the upgrade/install will stop early and let the admin know that the variables must be set and where to find the corresponding values.
Message:
osm_cluster_network_cidr, osm_host_subnet_length, and openshift_portal_net are required inventory
variables when upgrading. These variables should match what is currently used in the cluster. If
you don't remember what these values are you can find them in /etc/origin/master/master-config.yaml
on a master with the names clusterNetworkCIDR (osm_cluster_network_cidr),
hostSubnetLength (osm_host_subnet_length), and serviceNetworkCIDR (openshift_portal_net).
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2017-11-28 21:54:33 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Ryan Howe
2017-05-15 15:03:50 UTC
Change to the defaults: https://github.com/openshift/openshift-ansible/commit/b50b4ea0b03feb9431abd7294fe4fb6b549ddfc0

A workaround is of course to set osm_cluster_network_cidr and osm_host_subnet_length to the old values before running the scaleup playbook. While we should fix this, I'm lowering severity to medium based on the easy workaround.

Added an inventory check before upgrade which makes sure that these two variables are explicitly set. This is what it looks like when one or both are not set via inventory:

    Tuesday 29 August 2017 11:35:50 -0400 (0:00:00.013) 0:00:03.567 ********
    fatal: [192.168.124.234]: FAILED! => {
        "assertion": "osm_cluster_network_cidr is defined",
        "changed": false,
        "evaluated_to": false,
        "failed": true
    }

    MSG:

    osm_cluster_network_cidr and openshift_portal_net are required inventory variables when upgrading.
    These variables should match what is currently used in the cluster. If you don't remember what these
    values are you can find them in /etc/origin/master/master-config.yaml on a master with the names
    clusterNetworkCIDR(osm_cluster_network_cidr) and hostSubnetLength (openshift_portal_net).

PR: https://github.com/openshift/openshift-ansible/pull/5256

Merged.

Updated message: https://github.com/openshift/openshift-ansible/pull/5386

Added doc text.

From QE's perspective, the resolution described in comment 2 should be the best one, and the fix should land in the scaleup playbook, not the upgrade playbook.

> When performing a scaleup we need to read the CIDR values from an existing master then set that fact on the scaled up masters.

Even if that fix cannot be implemented in the short term, as a compromise, when performing a scaleup, once the installer finds that the CIDR values from an existing master do not match the new CIDR values for the new master, the installer should exit and prompt the user to set osm_cluster_network_cidr and osm_host_subnet_length to the old values before running the scaleup playbook.

From the customer's perspective, when running an upgrade it is not reasonable to force the user to set osm_cluster_network_cidr and osm_host_subnet_length in the inventory host file; that is relevant to scaleup, not upgrade. So assigning this bug back.

*** Bug 1493268 has been marked as a duplicate of this bug. ***

The final fix for the issue: https://github.com/openshift/openshift-ansible/pull/5473

Moving to MODIFIED as it's not yet built into an rpm package.

Tested with openshift-ansible-3.7.0-0.134.0.git.0.6f43fc3.el7.noarch.rpm

1. Trigger master HA installation with the original network parameters:
   # cat inventory_host
   <--snip-->
   osm_cluster_network_cidr=11.0.0.0/16
   osm_host_subnet_length=8
   openshift_master_portal_net=172.31.0.0/16
   <--snip-->
2. Remove the above network parameters from the inventory file.
3. Scale up one master against the env above.

## Result:
Installation succeeded, but master-controllers on the new master was reporting an error:

    "failed to start SDN plugin controller: cannot change the serviceNetworkCIDR of an already-deployed cluster"

Digging more found that the new master was still using the new default portal net:

    # grep -nri "serviceNetworkCIDR:" /etc/origin/master/master-config.yaml
    162: serviceNetworkCIDR: 172.30.0.0/16

As `initialize_facts.yml` was executed prior to `set_network_facts.yml`, the installer would still take the default `portal_net` for the subsequent tasks:

    ./playbooks/common/openshift-cluster/initialize_facts.yml:143: portal_net: "{{ openshift_portal_net | default(openshift_master_portal_net) | default(None) }}"
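As a minimal sketch of the workaround and of the now-required variables (the values below are examples taken from this report and must match what is already deployed in the cluster), the inventory would keep the original network settings set when running the scaleup or upgrade playbooks, assuming the usual [OSEv3:vars] group:

    [OSEv3:vars]
    # Example values only; copy the values from the existing cluster's
    # /etc/origin/master/master-config.yaml (clusterNetworkCIDR, hostSubnetLength,
    # serviceNetworkCIDR).
    osm_cluster_network_cidr=11.0.0.0/16
    osm_host_subnet_length=8
    openshift_portal_net=172.31.0.0/16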
Verified with openshift-ansible-3.7.0-0.147.0.git.0.2fb41ee.el7.noarch.rpm

Test steps:

1. Trigger master HA installation with the original network parameters:
   # cat inventory_host
   <--snip-->
   osm_cluster_network_cidr=10.1.0.0/16
   osm_host_subnet_length=8
   openshift_portal_net=172.31.0.0/16
   <--snip-->
2. Remove the above network parameters from the inventory file.
3. Scale up one master against the env above.
4. Check the network parameters on the new master:
   # grep -E "NetworkCIDR|Length" /etc/origin/master/master-config.yaml
   clusterNetworkCIDR: 10.1.0.0/16
   externalIPNetworkCIDRs:
   hostSubnetLength: 8
   serviceNetworkCIDR: 172.31.0.0/16
5. Trigger an S2I build against the new master; it works well.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
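For reference, a minimal sketch of how the existing values can be looked up on an existing master, assuming the default config path referenced in the message above:

    # Read the currently deployed network settings from an existing master
    grep -E "clusterNetworkCIDR|hostSubnetLength|serviceNetworkCIDR" /etc/origin/master/master-config.yaml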