Bug 1506750 - [3.9] Check at install time if alleged openshift_ip value is actually openshift_public_ip
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Russell Teague
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-26 17:37 UTC by Dan Winship
Modified: 2018-03-28 14:09 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A variable defined in the inventory was being interpreted as a string, not a bool.
Consequence: Tasks were not being conditionally run as expected.
Fix: Cast the string to a bool for a proper conditional check.
Result: Tasks run as expected based on the inventory variable setting.
Clone Of:
Clones: 1538816
Environment:
Last Closed: 2018-03-28 14:08:55 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0489 None None None 2018-03-28 14:09:21 UTC

Description Dan Winship 2017-10-26 17:37:24 UTC
In situations with separate "public" and "private" node IPs (which I think means AWS and OpenStack), it's important for OpenShift to know both sets of IPs, and we allow configuring this at install time with openshift_ip vs openshift_public_ip (or openshift_hostname vs openshift_public_hostname). But we don't currently check that the user is making this distinction when it's needed.

In particular, if a node has separate public and private IPs, but you specify the public IP as "openshift_ip", then the SDN will not work. (To prevent spoofing, nodes only accept VXLAN packets from IPs that they recognize as being the IPs of other nodes, but that only works if the nodes registered themselves with their "real"/"private" IPs.)

In practice, this means that openshift_ip (or the IP that openshift_hostname resolves to) on each node must be the IP address of some interface on the node; if it's not, ansible should probably refuse to continue and refer the user to the docs (eg, https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-host-variables, although maybe we need to be a little clearer about things there?)
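The check being requested can be sketched as follows. This is a minimal Python illustration with hypothetical helper names, not the actual openshift-ansible implementation (which landed as Ansible tasks): resolve the configured hostname, enumerate the host's interface addresses, and refuse to continue when the resolved IP does not belong to the host. It assumes a Linux host with the iproute2 `ip` command.

```python
import socket
import subprocess

def parse_ipv4_addrs(ip_output):
    """Extract IPv4 addresses from `ip -4 -o addr` output lines."""
    addrs = set()
    for line in ip_output.splitlines():
        parts = line.split()
        if "inet" in parts:
            cidr = parts[parts.index("inet") + 1]  # e.g. "172.16.120.32/24"
            addrs.add(cidr.split("/")[0])
    return addrs

def local_ipv4_addresses():
    """IPv4 addresses assigned to this host's interfaces (iproute2 Linux)."""
    out = subprocess.check_output(["ip", "-4", "-o", "addr"], text=True)
    return parse_ipv4_addrs(out)

def validate_openshift_hostname(hostname):
    """Fail early, as ansible should, when the hostname does not resolve to
    a local interface address (otherwise inter-node VXLAN traffic is dropped).
    """
    resolved = socket.gethostbyname(hostname)
    if resolved not in local_ipv4_addresses():
        raise SystemExit(
            f"{hostname} resolves to {resolved}, which is not an IP address "
            "owned by this host. See the advanced install docs on "
            "configuring host variables.")
    return resolved
```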

Comment 1 Yan Du 2017-10-31 10:24:30 UTC
This will block OCP 3.7 installation on OpenStack.

Comment 2 Yan Du 2017-10-31 10:25:56 UTC
You could refer to https://bugzilla.redhat.com/show_bug.cgi?id=1505266 for more information.

Comment 3 Dan Winship 2017-10-31 13:50:49 UTC
This bug is not about reverting the behavior in 1505266, it's about moving that check *sooner*. If there is a problem with 1505266 (which I don't think there is but we can move the discussion there) then it needs to be fixed there and this bug would be WONTFIXed in that case.

Comment 4 Russell Teague 2017-10-31 22:29:55 UTC
Proposed: https://github.com/openshift/openshift-ansible/pull/5970

Comment 5 liujia 2017-11-01 09:50:13 UTC
Upgrading from v3.6 to v3.7 against a cluster deployed on OpenStack hits this issue too.

Comment 6 Johnny Liu 2017-11-01 11:55:09 UTC
QE has encountered a lot of issues in OpenStack testing due to this change; almost all environment setups on OpenStack failed.

I think we really need to reconsider this change seriously. For old versions (<=3.6), QE has always used a public hostname that resolves to the instance's floating IP, not to an IP address owned by the host. The main reason for using a floating IP is that OpenStack network configuration is a little weak compared with EC2/GCE: instance names are not resolvable between instances.

Once this change is introduced, QE's upgrade testing (from 3.6 to 3.7) will break, and the 3.7 fresh-install automation jobs will need to be refactored; otherwise everything will break.

Here I would mainly take a fresh install as an example:
Launch two instances named "qe-jialiu1-master-etcd-nfs-1" and "qe-jialiu1-node-registry-router-1".
Note that:
"qe-jialiu1-master-etcd-nfs-1" and "qe-jialiu1-node-registry-router-1" are not resolvable from each other.

According to this change, each hostname needs to resolve to an IP address owned by its host, so we set the following:

master:
instance name: qe-jialiu1-master-etcd-nfs-1
system hostname: host-172-16-120-113 (this is automatically assigned by openstack network)

node:
instance name: qe-jialiu1-node-registry-router-1
system hostname: host-172-16-120-32 (this is automatically assigned by openstack network)

Then the installation finished successfully, but nodeName is set to the IP address rather than the hostname in /etc/origin/node/node-config.yaml:
"nodeName: 172.16.120.32" 

# oc get nodes
NAME             STATUS                     AGE       VERSION
172.16.120.113   Ready,SchedulingDisabled   16m       v1.7.6+a08f5eeb62
172.16.120.32    Ready                      16m       v1.7.6+a08f5eeb62

This will cause a lot of trouble if the instance IP changes. I think this was mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1416703#c4.

Comment 7 Dan Winship 2017-11-01 15:24:40 UTC
(In reply to Johnny Liu from comment #6)
> QE has encountered a lot of issues in OpenStack testing due to this change;
> almost all environment setups on OpenStack failed.

This discussion ought to be happening on bug 1505266 since that's the actual change we're talking about. But anyway, that change is being reverted now, until after 3.7 ships, when we'll bring it back. So the ansible change here should probably also hold off until then.

> I think we really need to reconsider this change seriously. For old versions
> (<=3.6), QE has always used a public hostname that resolves to the instance's
> floating IP, not to an IP address owned by the host. The main reason for
> using a floating IP is that OpenStack network configuration is a little weak
> compared with EC2/GCE: instance names are not resolvable between instances.

The problem is that a cluster installed that way is *broken*. Most features still work, but some don't. (And in particular, all SDN traffic between nodes will get dropped.) QE can only get away with installing clusters this way because you don't need to test every feature on every cluster, so it's OK if some features are broken on some test clusters. But we assume customers are going to want their clusters installed in a way such that all of OpenShift's features will work, so we should prevent them from misconfiguring them.

> Once this change is introduced, QE's upgrade testing (from 3.6 to 3.7) will
> break, and the 3.7 fresh-install automation jobs will need to be refactored;
> otherwise everything will break.

I don't think that's true. If you change the test to set openshift_hostname/openshift_ip correctly, it should work fine on both old and new OpenShift.

> Then the installation finished successfully, but nodeName is set to the IP
> address rather than the hostname in /etc/origin/node/node-config.yaml:
> "nodeName: 172.16.120.32" 
> 
> # oc get nodes
> NAME             STATUS                     AGE       VERSION
> 172.16.120.113   Ready,SchedulingDisabled   16m       v1.7.6+a08f5eeb62
> 172.16.120.32    Ready                      16m       v1.7.6+a08f5eeb62
> 
> This will cause a lot of trouble if the instance IP changes. I think this
> was mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1416703#c4.

If the current public-vs-private ip/hostname system doesn't work well for OpenShift-on-OpenStack then maybe we need to look into making it better. (Eg, making nodes able to resolve each other's local hostnames somehow.) Because most customers aren't going to be able to just "cheat" and set openshift_hostname to the public hostname like the tests are doing since that would break things.

Comment 8 Russell Teague 2017-12-15 13:13:17 UTC
Merged: https://github.com/openshift/openshift-ansible/pull/5970

Comment 10 Johnny Liu 2018-01-03 06:33:34 UTC
Re-tested this bug with openshift-ansible-3.9.0-0.13.0.git.0.8119a5c.el7.noarch, and it FAILED.

1. Set "openshift_hostname=host-8-245-68.host.centralci.eng.rdu2.redhat.com" for the node and run "playbooks/prerequisites.yaml"; the following error appears, as expected.
TASK [Query DNS for IP address of host-8-245-68.host.centralci.eng.rdu2.redhat.com] ********************************************************************************************
ok: [host-8-245-68.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "cmd": "getent ahostsv4 host-8-245-68.host.centralci.eng.rdu2.redhat.com | head -n 1 | awk '{ print $1 }'", "delta": "0:00:00.105294", "end": "2018-01-03 01:13:40.367097", "failed": false, "failed_when_result": false, "rc": 0, "start": "2018-01-03 01:13:40.261803", "stderr": "", "stderr_lines": [], "stdout": "10.8.245.68", "stdout_lines": ["10.8.245.68"]}


TASK [Validate openshift_hostname when defined] ********************************************************************************************************************************
fatal: [host-8-245-68.host.centralci.eng.rdu2.redhat.com]: FAILED! => {"changed": false, "failed": true, "msg": "The hostname host-8-245-68.host.centralci.eng.rdu2.redhat.com for host-172-16-120-117 doesn't resolve to an IP address owned by this host. Please set openshift_hostname variable to a hostname that when resolved on the host in question resolves to an IP address matching an interface on this host. This will ensure proper functionality of OpenShift networking features. Inventory setting: openshift_hostname=host-8-245-68.host.centralci.eng.rdu2.redhat.com This check can be overridden by setting openshift_hostname_check=false in the inventory. See https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-host-variables\n"}


2. Set "openshift_hostname=host-8-245-68.host.centralci.eng.rdu2.redhat.com" for the node and "openshift_hostname_check=false" in the inventory host file, then run "playbooks/prerequisites.yaml"; still got the same error as in scenario 1.


3. Set "openshift_ip=10.8.245.68" for the node and run "playbooks/prerequisites.yaml"; the following error appears, as expected.

TASK [Query DNS for IP address of host-8-245-68.host.centralci.eng.rdu2.redhat.com] ********************************************************************************************
ok: [host-8-245-68.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "cmd": "getent ahostsv4 host-8-245-68.host.centralci.eng.rdu2.redhat.com | head -n 1 | awk '{ print $1 }'", "delta": "0:00:00.106362", "end": "2018-01-03 01:01:46.259505", "failed": false, "failed_when_result": false, "rc": 0, "start": "2018-01-03 01:01:46.153143", "stderr": "", "stderr_lines": [], "stdout": "10.8.245.68", "stdout_lines": ["10.8.245.68"]}

TASK [Validate openshift_hostname when defined] ********************************************************************************************************************************
skipping: [host-8-245-68.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

TASK [Validate openshift_ip exists on node when defined] ***********************************************************************************************************************
fatal: [host-8-245-68.host.centralci.eng.rdu2.redhat.com]: FAILED! => {"changed": false, "failed": true, "msg": "The IP address 10.8.245.68 does not exist on host-172-16-120-117. Please set the openshift_ip variable to an IP address of this node. This will ensure proper functionality of OpenShift networking features. Inventory setting: openshift_ip=10.8.245.68 This check can be overridden by setting openshift_ip_check=false in the inventory. See https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-host-variables\n"}

4. Set "openshift_ip=10.8.245.68" for the node and "openshift_ip_check=false" in the inventory host file, then run "playbooks/prerequisites.yaml"; still got the same error as in scenario 3.


Based on scenarios 2 and 4, it seems that "openshift_hostname_check=false" and "openshift_ip_check=false" are not respected, so I am assigning this bug back.

One other enhancement: in this PR, openshift_override_hostname_check was replaced by openshift_hostname_check, but several pieces of openshift-ansible code still refer to the old option name.

$ grep -r "openshift_override_hostname_check" *
inventory/hosts.example:#openshift_override_hostname_check=true
playbooks/openstack/sample-inventory/group_vars/OSEv3.yml:openshift_override_hostname_check: true
utils/src/ooinstall/openshift_ansible.py:    base_inventory.write('openshift_override_hostname_check=true\n')

Please update them as well.

Comment 11 Johnny Liu 2018-01-03 10:15:33 UTC
BTW, regarding "openshift_hostname_check=false" and "openshift_ip_check=false" not being respected, appending a 'bool' filter would fix them:

openshift_hostname_check | default(true) | bool
openshift_ip_check | default(true) | bool
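The underlying cause: Ansible reads inventory values as strings, and a non-empty string such as "false" is truthy in a bare `when:` check, so the validation tasks ran regardless of the setting. The following Python sketch mimics this with a simplified approximation of the `bool` filter's semantics (it is not Ansible's actual implementation):

```python
TRUTHY = {"yes", "on", "1", "true", True, 1}
FALSY = {"no", "off", "0", "false", False, 0}

def to_bool(value):
    """Simplified approximation of Ansible's `bool` filter semantics."""
    if isinstance(value, str):
        value = value.lower()
    if value in TRUTHY:
        return True
    if value in FALSY:
        return False
    raise ValueError(f"cannot coerce {value!r} to bool")

# The bug: the inventory string "false" is a non-empty string, hence truthy,
# so a bare `when: openshift_hostname_check` condition still ran the task.
assert bool("false") is True

# The fix: `when: openshift_hostname_check | default(true) | bool`
# coerces the string to a real boolean before the conditional check.
assert to_bool("false") is False
assert to_bool("true") is True
```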

Comment 12 Russell Teague 2018-01-22 18:51:32 UTC
Proposed: https://github.com/openshift/openshift-ansible/pull/6817

Comment 13 Russell Teague 2018-01-23 20:32:03 UTC
Merged

Comment 14 Scott Dodson 2018-01-25 15:41:05 UTC
in openshift-ansible-3.9.0-0.24.0

Comment 15 Johnny Liu 2018-01-29 09:52:42 UTC
Verified this bug with openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch, and it PASSED.


All four scenarios mentioned in comment 10 passed.

Comment 18 errata-xmlrpc 2018-03-28 14:08:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

