Bug 1473858

Summary: Installer does not configure flannel correctly for OpenStack installs.
Product: OpenShift Container Platform
Reporter: Ryan Howe <rhowe>
Component: Installer
Assignee: Scott Dodson <sdodson>
Status: CLOSED ERRATA
QA Contact: Gan Huang <ghuang>
Severity: high
Docs Contact:
Priority: high
Version: 3.5.0
CC: aos-bugs, eminguez, erich, jack.ottofaro, jokerman, misalunk, mmccomas, sdodson
Target Milestone: ---
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openshift-ansible-3.7.0-0.126.1.git.0.0bb5b0c.el7.noarch
Doc Type: Bug Fix
Doc Text:
The flannel network was previously defined using the same subnet as the Kubernetes services subnet, which caused a conflict between the services and SDN networks. The flannel network is now correctly defined by the osm_cluster_network_cidr variable.
Story Points: ---
Clone Of:
Clones: 1491412 1491413 1594310
Environment:
Last Closed: 2017-11-28 22:05:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1490388, 1491412, 1491413, 1594310

Description Ryan Howe 2017-07-21 21:39:00 UTC
Description of problem:

For flannel installs we set the pod network equal to the service network. Instead, when configuring flannel, the pod network should be set to the value we pass for osm_cluster_network_cidr.

Version-Release number of the following components:
OCP 3.5 


Additional info:

The flannel configuration uses portal_net, which defaults to 172.30.0.0/16. We also hard-code the minimum network to 172.30.5.0. These values should instead be set from the host variables passed to the installer.

https://github.com/openshift/openshift-ansible/blob/master/roles/flannel/tasks/main.yml#L15

https://github.com/openshift/openshift-ansible/blob/master/roles/flannel_register/defaults/main.yaml
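The conflict described above can be seen by comparing the networks with Python's standard ipaddress module. This is an illustrative sketch, not installer code: 172.30.0.0/16 and 172.30.5.0 come from this report, 10.128.0.0/14 is the cluster network default visible in the etcd output in the comments, and the function name conflicts_with_services is hypothetical.

```python
import ipaddress

# Default service network (portal_net / openshift_portal_net) from the report.
SERVICE_NET = ipaddress.ip_network("172.30.0.0/16")


def conflicts_with_services(pod_cidr, service_net=SERVICE_NET):
    """Return True if the given pod network overlaps the service network."""
    return ipaddress.ip_network(pod_cidr).overlaps(service_net)


# Buggy behavior: flannel's network was taken from portal_net itself.
print(conflicts_with_services("172.30.0.0/16"))  # True
# The hard-coded minimum network also sits inside the service range.
print(ipaddress.ip_address("172.30.5.0") in SERVICE_NET)  # True
# Intended behavior: use osm_cluster_network_cidr (default 10.128.0.0/14).
print(conflicts_with_services("10.128.0.0/14"))  # False
```

Any pod network derived from portal_net overlaps the services by construction, which is why the fix has to read osm_cluster_network_cidr instead.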

Kubernetes                   OpenShift            Ansible installer
==========================   ==================   ========================
--cluster-cidr               clusterNetworkCIDR   osm_cluster_network_cidr
--service-cluster-ip-range   serviceNetworkCIDR   openshift_portal_net
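The mapping above implies that flannel's network config should come from osm_cluster_network_cidr and osm_host_subnet_length, with the per-node prefix computed as 32 minus the host subnet length (9 gives /23 and 8 gives /24, matching the etcd output in the comments). A minimal sketch, assuming the JSON shape shown in those etcd dumps; flannel_network_config is a hypothetical helper, not the openshift-ansible template.

```python
import ipaddress
import json


def flannel_network_config(osm_cluster_network_cidr, osm_host_subnet_length,
                           backend="host-gw"):
    """Build the JSON stored at /openshift.com/network/config (hypothetical helper).

    Key names mirror the etcd output in this bug; this is not the actual
    openshift-ansible template.
    """
    # Validate the CIDR; strict=False tolerates host bits (e.g. 10.130.0.0/14).
    network = ipaddress.ip_network(osm_cluster_network_cidr, strict=False)
    # Inventory values usually arrive as strings, so coerce before arithmetic.
    subnet_len = 32 - int(osm_host_subnet_length)  # 9 host bits -> /23 per node
    if subnet_len <= network.prefixlen:
        raise ValueError("per-node subnet must be smaller than the cluster network")
    config = {
        "Network": osm_cluster_network_cidr,  # passed through as given
        "SubnetLen": subnet_len,
        "Backend": {"Type": backend},
    }
    return json.dumps(config, indent=4)


print(flannel_network_config("10.128.0.0/14", "9"))
```

With the comment 12 inventory (osm_cluster_network_cidr=10.130.0.0/14, osm_host_subnet_length=8) the same call yields SubnetLen 24.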

Comment 2 Eduardo Minguez 2017-08-11 12:16:27 UTC
3.6 as well.

Comment 5 Gan Huang 2017-09-18 06:38:53 UTC
Tested with openshift-ansible-3.7.0-0.126.4.git.0.3fc2b9b.el7.noarch.rpm


Installation failed: 

TASK [flannel_register : Generate etcd configuration for etcd] *****************
Monday 18 September 2017  06:37:14 +0000 (0:00:00.130)       0:05:32.474 ****** 
fatal: [host-8-241-75.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "changed": false, 
    "failed": true
}

MSG:

AnsibleError: {{ 32 - openshift.master.sdn_host_subnet_length }}: Unexpected templating type error occurred on ({{ 32 - openshift.master.sdn_host_subnet_length }}): unsupported operand type(s) for -: 'int' and 'AnsibleUnsafeText'
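The failure is an ordinary Python type error raised while Jinja2 evaluates the template: the inventory value reaches the expression as text, and subtracting a string from an int is undefined. The sketch below reproduces it with a stand-in class (in Ansible, AnsibleUnsafeText subclasses the native text type); coercing to an integer first, for example with Jinja2's built-in int filter, is one common remedy. This illustrates the error only and is not the fix that was merged.

```python
class AnsibleUnsafeText(str):
    """Stand-in for Ansible's AnsibleUnsafeText, defined so this runs without Ansible."""


host_subnet_length = AnsibleUnsafeText("9")  # inventory values arrive as text

try:
    prefix = 32 - host_subnet_length  # what the unfixed template expression does
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for -: 'int' and 'AnsibleUnsafeText'

# Coercing to int first (in Jinja2: `sdn_host_subnet_length | int`) avoids the error.
prefix = 32 - int(host_subnet_length)
print(prefix)  # 23
```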

Comment 6 Miheer Salunke 2017-09-18 06:52:44 UTC
After setting:
osm_cluster_network_cidr=10.128.0.0/14
osm_host_subnet_length=9

I get the following error:
AnsibleError: {{ 32 - openshift.master.sdn_host_subnet_length }}: Unexpected templating type error occurred on ({{ 32 - openshift.master.sdn_host_subnet_length }}): unsupported operand type(s) for -: 'int' and 'AnsibleUnsafeText'

Setting the following instead did set the CIDR:

openshift.master.sdn_cluster_network_cidr=10.128.0.0/14
openshift.master.sdn_host_subnet_length=9

Comment 7 Eduardo Minguez 2017-09-19 09:08:27 UTC
I created a cluster from scratch using the latest 3.6 bits, manually patching the files changed in the PR and keeping the default values (I didn't touch osm_cluster_network_cidr or osm_host_subnet_length), and it worked:


# alias oetcdctl='etcdctl --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key --ca-file=/etc/etcd/ca.crt --peers="https://master-0.edu.flannel.com:2379,https://master-1.edu.flannel.com:2379,https://master-2.edu.flannel.com:2379"'
# oetcdctl get /openshift.com/network/config
{
    "Network": "10.128.0.0/14",
    "SubnetLen": 23,
    "Backend": {
        "Type": "host-gw"
     }
}
# oetcdctl ls /openshift.com/network/subnets
/openshift.com/network/subnets/10.128.10.0-23
/openshift.com/network/subnets/10.128.108.0-23
/openshift.com/network/subnets/10.128.118.0-23
/openshift.com/network/subnets/10.128.28.0-23
/openshift.com/network/subnets/10.128.40.0-23
/openshift.com/network/subnets/10.128.140.0-23
/openshift.com/network/subnets/10.128.98.0-23
/openshift.com/network/subnets/10.128.12.0-23
# oetcdctl get /openshift.com/network/subnets/10.128.98.0-23
{"PublicIP":"192.168.98.10","BackendType":"host-gw"}

Is there any other change since the current GA version (the one I used) that could affect this?

$ rpm -qf /usr/share/ansible/openshift-ansible/roles/flannel_register/templates/flannel-config.json 
openshift-ansible-roles-3.6.173.0.21-2.git.0.44a4038.el7.noarch
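The etcd listing above can be checked mechanically: each lease key encodes a node subnet with "-" in place of "/", and every lease should be a /SubnetLen block inside Network. A sketch using Python's ipaddress module; the key format is inferred from this output, and parse_subnet_key and check_leases are hypothetical names.

```python
import ipaddress


def parse_subnet_key(key):
    """Turn '/openshift.com/network/subnets/10.128.98.0-23' into 10.128.98.0/23."""
    leaf = key.rsplit("/", 1)[-1]       # '10.128.98.0-23'
    addr, prefix = leaf.rsplit("-", 1)  # '-' stands in for '/' in the key
    return ipaddress.ip_network(f"{addr}/{prefix}")


def check_leases(network, subnet_len, keys):
    """Return the keys whose subnet has the wrong size or lies outside `network`."""
    net = ipaddress.ip_network(network)
    bad = []
    for key in keys:
        sub = parse_subnet_key(key)
        if sub.prefixlen != subnet_len or not sub.subnet_of(net):
            bad.append(key)
    return bad


keys = [
    "/openshift.com/network/subnets/10.128.10.0-23",
    "/openshift.com/network/subnets/10.128.98.0-23",
    "/openshift.com/network/subnets/10.128.140.0-23",
]
print(check_leases("10.128.0.0/14", 23, keys))  # []
```

An empty result means every listed lease matches the stored config; `subnet_of` requires Python 3.7 or later.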

Comment 8 Scott Dodson 2017-09-19 13:35:43 UTC
Eduardo, your change is only on the master branch, so you'll want to test a 3.7 version of the installer, right?

Comment 9 Eduardo Minguez 2017-09-20 08:30:07 UTC
(In reply to Scott Dodson from comment #8)
> Eduardo, your change is only on the master branch you'll want to test a 3.7
> version of the installer, right?

I can do it if needed, but I think the fix should be backported to older releases as well. I tested 3.6 with those files patched manually because those are the GA bits I can use (to be honest, I don't know how to test 3.7).

Comment 10 Miheer Salunke 2017-09-21 06:37:04 UTC
(In reply to Eduardo Minguez from comment #7)
> So, I've created a cluster from scratch using the latest 3.6 bits +
> modifying manually the files provided in the PR, and using the default
> values (so I didn't touched osm_cluster_network_cidr nor
> osm_host_subnet_length) and it worked for me:

So were the following values we set ignored, with the defaults used instead?

openshift.master.sdn_cluster_network_cidr=10.128.0.0/14
openshift.master.sdn_host_subnet_length=9

Also, what if a customer wants to set a custom cluster CIDR and host subnet length?

Comment 11 Eduardo Minguez 2017-09-21 07:47:14 UTC
I deployed the OCP cluster without setting any values for the pod network or the services network, to check whether it worked with the default values (the most common scenario, AFAIK).

I haven't had a chance to test it with different CIDR or subnet values yet.

Comment 12 Eduardo Minguez 2017-09-21 16:04:41 UTC
I tested with custom values and it failed. I created a new PR [1] that fixes the issue in my tests.


[cloud-user@bastion ~]$ grep -E 'osm|portal' /etc/ansible/hosts
osm_default_node_selector="role=app"
osm_use_cockpit=true
osm_cluster_network_cidr=10.130.0.0/14
osm_host_subnet_length=8
openshift_portal_net=10.111.0.0/16

After the installation:


[root@master-0 ~]# alias oetcdctl='etcdctl --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key --ca-file=/etc/etcd/ca.crt --peers="https://master-0.edu.flannel.com:2379,https://master-1.edu.flannel.com:2379,https://master-2.edu.flannel.com:2379"'
[root@master-0 ~]# oetcdctl get /openshift.com/network/config
{
    "Network": "10.130.0.0/14",
    "SubnetLen": 24,
    "Backend": {
        "Type": "host-gw"
     }
}


But the subnets assigned to the nodes are in a different network:

[root@master-0 ~]# oetcdctl ls /openshift.com/network/subnets
/openshift.com/network/subnets/10.128.83.0-24
/openshift.com/network/subnets/10.128.18.0-24
/openshift.com/network/subnets/10.128.77.0-24
/openshift.com/network/subnets/10.128.101.0-24
/openshift.com/network/subnets/10.128.20.0-24
/openshift.com/network/subnets/10.128.92.0-24
/openshift.com/network/subnets/10.128.58.0-24
/openshift.com/network/subnets/10.128.48.0-24

I think I will need some help with that, as TBH I'm not an openshift-ansible expert.

[1] https://github.com/openshift/openshift-ansible/pull/5493

Comment 13 Eduardo Minguez 2017-09-29 09:02:23 UTC
The PR has been merged; the subnets were actually fine (it was just me not knowing how to subnet :D).
Is there anything I can do to push this forward? Will it be backported to releases older than 3.7?
Thx!
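The comment 12 leases only looked wrong because of prefix alignment: a /14 fixes just the first 14 bits, so 10.130.0.0/14 and 10.128.0.0/14 denote the same block (10.128.0.0 through 10.131.255.255). The sketch below illustrates that arithmetic with Python's ipaddress module; it does not reproduce flannel's own parsing, which presumably masks the host bits in a similar way.

```python
import ipaddress

requested = "10.130.0.0/14"  # osm_cluster_network_cidr from comment 12

# 10.130.0.0 has host bits set for a /14, so strict parsing rejects it.
try:
    ipaddress.ip_network(requested)
except ValueError as exc:
    print(exc)  # 10.130.0.0/14 has host bits set

# Masking off the host bits gives the network that actually contains it.
effective = ipaddress.ip_network(requested, strict=False)
print(effective)                    # 10.128.0.0/14
print(effective[0], effective[-1])  # 10.128.0.0 10.131.255.255

# A lease from the comment 12 listing falls inside that /14.
lease = ipaddress.ip_network("10.128.83.0/24")
print(lease.subnet_of(effective))   # True
```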

Comment 14 Scott Dodson 2017-09-29 13:21:46 UTC
PRs against the release-3.6, release-1.5, and release-1.4 branches would be helpful. I'll try to get to those today if you don't. We should include both of your fixes in each of those PRs.

Comment 15 Eduardo Minguez 2017-09-29 14:41:24 UTC
(In reply to Scott Dodson from comment #14)
> PRs against release-3.6, release-1.5, and release-1.4 branches would be
> helpful. I'll try to get to those today if you don't. We should include both
> of your fixes in each of those PRs.

I think I got it:

* PR against release-1.4 -> https://github.com/openshift/openshift-ansible/pull/5592
* PR against release-1.5 -> https://github.com/openshift/openshift-ansible/pull/5591
* PR against release-3.6 -> https://github.com/openshift/openshift-ansible/pull/5590

Thanks!

Comment 19 Gan Huang 2017-10-09 08:18:17 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1499651

Comment 20 Gan Huang 2017-10-11 08:48:27 UTC
Verified with openshift-ansible-3.7.0-0.147.0.git.0.2fb41ee.el7.noarch.rpm

Both default values and custom values work.

Comment 23 errata-xmlrpc 2017-11-28 22:05:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188