Bug 1451023 - Changes to the default clusterNetworkCIDR & hostSubnetLength via installer do not take into account the old default value when adding a new master.
Summary: Changes to the default clusterNetworkCIDR & hostSubnetLength via installer do not take into account the old default value when adding a new master.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Andrew Butcher
QA Contact: Gan Huang
URL:
Whiteboard:
Duplicates: 1493268
Depends On:
Blocks:
 
Reported: 2017-05-15 15:03 UTC by Ryan Howe
Modified: 2017-11-28 21:54 UTC (History)
10 users

Fixed In Version: openshift-ansible-3.7.0-0.126.1.git.0.0bb5b0c.el7.noarch
Doc Type: Bug Fix
Doc Text:
Cause: When upgrading between versions (specifically from 3.3/1.3 or earlier to 3.4 or later), the default values for clusterNetworkCIDR and hostSubnetLength changed. If the inventory file did not specify the corresponding inventory variables, the upgrade would fail.
Consequence: The controller service fails to start back up.
Fix: The following inventory variables are now required when upgrading or installing: osm_cluster_network_cidr, osm_host_subnet_length, and openshift_portal_net.
Result: If the required variables are not set, the upgrade/install stops early and tells the admin that the variables must be set and where the corresponding values can be found.
Message: osm_cluster_network_cidr, osm_host_subnet_length, and openshift_portal_net are required inventory variables when upgrading. These variables should match what is currently used in the cluster. If you don't remember what these values are, you can find them in /etc/origin/master/master-config.yaml on a master under the names clusterNetworkCIDR (osm_cluster_network_cidr), hostSubnetLength (osm_host_subnet_length), and serviceNetworkCIDR (openshift_portal_net).
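For reference, a minimal inventory sketch matching the doc text above (the values shown are placeholders only; use the values read from /etc/origin/master/master-config.yaml on an existing master):

# example values only -- these must match what the existing cluster uses
osm_cluster_network_cidr=10.1.0.0/16
osm_host_subnet_length=8
openshift_portal_net=172.30.0.0/16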
Clone Of:
Environment:
Last Closed: 2017-11-28 21:54:33 UTC
Target Upstream Version:
Embargoed:
jiajliu: needinfo-


Attachments


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHSA-2017:3188 | 0 | normal | SHIPPED_LIVE | Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update | 2017-11-29 02:34:54 UTC

Description Ryan Howe 2017-05-15 15:03:50 UTC
Description of problem:

  If a cluster was installed at 3.3 or earlier with the default clusterNetworkCIDR & hostSubnetLength (10.1.0.0/16, subnet length 8) and is then upgraded to 3.4 or later, any new master subsequently added to the cluster will get the new defaults (10.128.0.0/14, subnet length 9).


Version-Release number of selected component (if applicable):
3.4 and later

How reproducible:
100%

Steps to Reproduce:
1. Install a cluster at 3.3 without setting osm_cluster_network_cidr or osm_host_subnet_length, allowing the installer to use its defaults.

2. Upgrade cluster to 3.4. 

3. Add a new master to the cluster
4. Stop controllers on all old masters


Actual results:

Controller service fails to start back up. 

F0512 22:57:22.078015       1 run_components.go:384] SDN initialization failed: [Error: Existing node subnet: 10.1.26.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.27.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.17.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.11.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.41.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.36.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.4.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.8.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.31.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.2.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.30.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.37.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.15.0/24 is not part of cluster network: 10.128.0.0/14, Error: Existing node subnet: 10.1.1.0/24 is not part of cluster network: 10.128.0.0/14, Error: 


Expected results:

The install should use the old default values.


Additional info:

1. We advise customers not to change the cluster network IP range once the cluster is installed.
2. If a cluster is using the installer's default values, those defaults should not change for parameters that are not supposed to be changed once the cluster is installed.
3. When adding a master, the installer should look at what the other masters have set for clusterNetworkCIDR & hostSubnetLength rather than configuring the new master with the new default values.

Comment 3 Scott Dodson 2017-06-01 20:27:36 UTC
A workaround is of course to set osm_cluster_network_cidr and osm_host_subnet_length to the old values before running the scaleup playbook. 

While we should fix this, I'm lowering severity to medium based on the easy workaround.
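For reference, the old values can be read off an existing master and copied into the inventory before scaling up, e.g.:

# run on an existing master; the keys map to osm_cluster_network_cidr,
# osm_host_subnet_length, and openshift_portal_net respectively
grep -E 'clusterNetworkCIDR|hostSubnetLength|serviceNetworkCIDR' /etc/origin/master/master-config.yaml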

Comment 4 Steve Milner 2017-08-29 15:50:21 UTC
Added an inventory check before upgrade which makes sure that these two variables are explicitly set. This is what it looks like when one or both are not set via inventory:


Tuesday 29 August 2017  11:35:50 -0400 (0:00:00.013)       0:00:03.567 ********                           
fatal: [192.168.124.234]: FAILED! => {                                                                    
    "assertion": "osm_cluster_network_cidr is defined",                                                   
    "changed": false,                                                                                     
    "evaluated_to": false,                                                                                
    "failed": true                                                                                                                                                                                                   
}                                                                                                         
                                                                                                                                                                                                                     
MSG:                                                                                                      
                                                                                                          
osm_cluster_network_cidr and openshift_portal_net are required inventory variables when upgrading. These variables should match what is currently used in the cluster. If you don't remember what these values are you can find them in /etc/origin/master/master-config.yaml on a master with the names clusterNetworkCIDR (osm_cluster_network_cidr) and hostSubnetLength (openshift_portal_net).



PR: https://github.com/openshift/openshift-ansible/pull/5256
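
For illustration only (a sketch of the kind of check described above, not the exact contents of the PR), such an inventory check can be expressed as an Ansible assert task:

# approximation of a pre-upgrade inventory check; the variable names are the
# ones required by this bug, the task itself is only a sketch
- name: Ensure cluster network variables are explicitly set in the inventory
  assert:
    that:
      - osm_cluster_network_cidr is defined
      - osm_host_subnet_length is defined
      - openshift_portal_net is defined
    msg: >
      osm_cluster_network_cidr, osm_host_subnet_length, and openshift_portal_net
      are required inventory variables when upgrading. They should match the values
      in /etc/origin/master/master-config.yaml on an existing master
      (clusterNetworkCIDR, hostSubnetLength, serviceNetworkCIDR).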

Comment 5 Steve Milner 2017-09-11 18:28:30 UTC
Merged

Comment 7 Steve Milner 2017-09-13 13:18:59 UTC
Updated message: https://github.com/openshift/openshift-ansible/pull/5386

Added doc text.

Comment 9 Johnny Liu 2017-09-14 10:29:36 UTC
From QE's perspective, the resolution described in comment 2 should be the best one, and the fix should land in the scaleup playbook, not the upgrade playbook.

> When performing a scaleup we need to read the CIDR values from an existing master then set that fact on the scaled up masters.

Even if that fix cannot be implemented in the short term, as a compromise: when performing a scaleup, once the installer finds that the CIDR values on an existing master do not match the new default CIDR values for the new master, the installer should exit and prompt the user to set osm_cluster_network_cidr and osm_host_subnet_length to the old values before running the scaleup playbook.

From the customer's perspective, when running an upgrade it is not reasonable to force the user to set osm_cluster_network_cidr and osm_host_subnet_length in the inventory host file; this is related to scaleup, not to upgrade.

So assigning this bug back.

Comment 20 Gan Huang 2017-09-22 05:53:39 UTC
*** Bug 1493268 has been marked as a duplicate of this bug. ***

Comment 21 Gan Huang 2017-09-22 05:58:12 UTC
The final fix for the issue: https://github.com/openshift/openshift-ansible/pull/5473

Moving to MODIFIED as it's not yet built into an RPM package.

Comment 22 Gan Huang 2017-09-30 06:12:53 UTC
Tested with openshift-ansible-3.7.0-0.134.0.git.0.6f43fc3.el7.noarch.rpm

1. Trigger master HA installation with original network parameters
# cat inventory_host
<--snip-->
osm_cluster_network_cidr=11.0.0.0/16
osm_host_subnet_length=8
openshift_master_portal_net=172.31.0.0/16
<--snip-->

2. Removed above network parameters from inventory file

3. Scale up one master against the env above

##Result:

Installation succeeded, but the master-controllers service on the new master was reporting errors:

" failed to start SDN plugin controller: cannot change the serviceNetworkCIDR of an already-deployed cluster"

Digging further revealed that the new master was still using the new default portal net:
# grep -nri "serviceNetworkCIDR:" /etc/origin/master/master-config.yaml
162:  serviceNetworkCIDR: 172.30.0.0/16

Because `initialize_facts.yml` was executed prior to `set_network_facts.yml`, the installer would still take the default `portal_net` for the subsequent tasks.

./playbooks/common/openshift-cluster/initialize_facts.yml:143:        portal_net: "{{ openshift_portal_net | default(openshift_master_portal_net) | default(None) }}"
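
For illustration (this is not the code from the final fix, only a sketch of the "read the values from an existing master" idea; the oo_masters_to_config group name and the networkConfig key layout are assumptions on my part):

# sketch: read master-config.yaml from the first existing master and reuse its
# serviceNetworkCIDR when setting facts for the scaled-up master
- name: Read master-config.yaml from an existing master
  slurp:
    src: /etc/origin/master/master-config.yaml
  register: existing_master_config
  delegate_to: "{{ groups.oo_masters_to_config.0 }}"

- name: Reuse the existing serviceNetworkCIDR as portal_net
  set_fact:
    portal_net: "{{ (existing_master_config.content | b64decode | from_yaml).networkConfig.serviceNetworkCIDR }}"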

Comment 26 Gan Huang 2017-10-11 08:54:52 UTC
Verified with openshift-ansible-3.7.0-0.147.0.git.0.2fb41ee.el7.noarch.rpm

Test steps:
1. Trigger master HA installation with original network parameters
# cat inventory_host
<--snip-->
osm_cluster_network_cidr=10.1.0.0/16
osm_host_subnet_length=8
openshift_portal_net=172.31.0.0/16
<--snip-->

2. Removed above network parameters from inventory file

3. Scale up one master against the env above

4. Check the network parameters on the new master:

# grep -E "NetworkCIDR|Length" /etc/origin/master/master-config.yaml
  clusterNetworkCIDR: 10.1.0.0/16
  externalIPNetworkCIDRs:
  hostSubnetLength: 8
  serviceNetworkCIDR: 172.31.0.0/16

5. Trigger an S2I build against the new master; it works well.
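
(The exact application used for the S2I check isn't recorded here; something along these lines would exercise it:)

oc new-project s2i-check
oc new-app https://github.com/openshift/ruby-hello-world
oc logs -f bc/ruby-hello-world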

Comment 30 errata-xmlrpc 2017-11-28 21:54:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

