In environments where nodes have separate "public" and "private" IPs (which I think means AWS and OpenStack), OpenShift needs to know both sets of IPs, and we allow configuring this at install time with openshift_ip vs openshift_public_ip (or openshift_hostname vs openshift_public_hostname). But we don't currently check that the user is making this distinction when it's needed. In particular, if a node has separate public and private IPs but you specify the public IP as "openshift_ip", the SDN will not work. (To prevent spoofing, nodes only accept VXLAN packets from IPs that they recognize as belonging to other nodes, and that only works if the nodes registered themselves with their "real"/"private" IPs.)

In practice, this means that openshift_ip (or the IP that openshift_hostname resolves to) on each node must be the IP address of some interface on that node. If it's not, Ansible should probably refuse to continue and refer the user to the docs (e.g., https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-host-variables, although maybe we need to be a little clearer about things there).
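The check described above could look roughly like the following Python sketch. It only illustrates the idea and is not the openshift-ansible implementation: the function names, the `resolver` hook, and the `local_ips` argument (the set of addresses configured on the node's interfaces, however they are gathered) are assumptions made for this example, and the addresses are illustrative.

```python
import socket


def resolve_ipv4(hostname):
    """Return the first IPv4 address that hostname resolves to, or None."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    except socket.gaierror:
        return None
    # Each entry is (family, type, proto, canonname, (address, port)).
    return infos[0][4][0] if infos else None


def check_node_address(local_ips, openshift_ip=None, openshift_hostname=None,
                       resolver=resolve_ipv4):
    """Return a list of error strings; an empty list means nothing looks wrong.

    local_ips: addresses on this node's interfaces (hypothetical input).
    resolver:  injectable lookup function, so the check can be tested offline.
    """
    errors = []
    if openshift_ip is not None and openshift_ip not in local_ips:
        errors.append(
            "openshift_ip=%s is not an address on this node" % openshift_ip)
    if openshift_hostname is not None:
        resolved = resolver(openshift_hostname)
        if resolved not in local_ips:
            errors.append(
                "openshift_hostname=%s resolves to %s, which is not an address "
                "on this node" % (openshift_hostname, resolved))
    return errors


if __name__ == "__main__":
    # A node with a private address and a separate public (floating) address.
    local = {"127.0.0.1", "172.16.120.117"}
    print(check_node_address(local, openshift_ip="10.8.245.68"))
```

An installer built along these lines would stop as soon as the returned list is non-empty, which is the "refuse to continue" behavior requested here.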
This will block OCP 3.7 installation on OpenStack.
See https://bugzilla.redhat.com/show_bug.cgi?id=1505266 for more information.
This bug is not about reverting the behavior in 1505266, it's about moving that check *sooner*. If there is a problem with 1505266 (which I don't think there is but we can move the discussion there) then it needs to be fixed there and this bug would be WONTFIXed in that case.
Proposed: https://github.com/openshift/openshift-ansible/pull/5970
Upgrading from v3.6 to v3.7 on a cluster deployed on OpenStack hits this issue too.
QE has hit a lot of issues in OpenStack testing due to this change; almost all environment setups on OpenStack failed.

I think we really need to reconsider this change seriously. For old versions (<=3.6), QE has always used a public hostname that resolves to the instance's floating IP, not to an IP address owned by the host. The main reason for using a floating IP is that OpenStack network configuration is a little weak compared with EC2/GCE: instance names are not resolvable between instances.

Once this change is introduced, QE's upgrade testing (from 3.6 to 3.7) would break, and the 3.7 fresh install automation job would also need to be refactored, or else everything would break.

Here I'll take a fresh install as an example:

Launch 2 instances named "qe-jialiu1-master-etcd-nfs-1" and "qe-jialiu1-node-registry-router-1". Note that neither of these names can be resolved from the other instance.

According to this change, each hostname needs to resolve to an IP address owned by that host, so I set the following:

master:
  instance name: qe-jialiu1-master-etcd-nfs-1
  system hostname: host-172-16-120-113 (this is automatically assigned by the OpenStack network)
node:
  instance name: qe-jialiu1-node-registry-router-1
  system hostname: host-172-16-120-32 (this is automatically assigned by the OpenStack network)

The installation then finishes successfully, but nodeName is set to the IP rather than the hostname in /etc/origin/node/node-config.yaml:

"nodeName: 172.16.120.32"

# oc get nodes
NAME             STATUS                     AGE       VERSION
172.16.120.113   Ready,SchedulingDisabled   16m       v1.7.6+a08f5eeb62
172.16.120.32    Ready                      16m       v1.7.6+a08f5eeb62

This would cause a lot of trouble when an instance's IP changes. I think this was mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1416703#c4.
(In reply to Johnny Liu from comment #6)
> QE encounter a lot of issues on openstack testing due to this change, almost
> all the env setup on openstack failed.

This discussion ought to be happening on bug 1505266 since that's the actual change we're talking about. But anyway, that change is being reverted now, until after 3.7 ships, when we'll bring it back. So the ansible change here should probably also hold off until then.

> I think we really need re-consider this change seriously, for old version
> (<=3.6), QE always use a public hostname which will be resolved to its
> floating IP, but not an IP address owned by this host. The main reason of
> using a floating IP is OpenStack network configuration is a little weak
> compared with EC2/GCE, instances name is not resolved between instances.

The problem is that a cluster installed that way is *broken*. Most features still work, but some don't. (And in particular, all SDN traffic between nodes will get dropped.) QE can only get away with installing clusters this way because you don't need to test every feature on every cluster, so it's OK if some features are broken on some test clusters. But we assume customers are going to want their clusters installed in a way such that all of OpenShift's features will work, so we should prevent them from misconfiguring them.

> Once this change is introduced, QE's upgrade testing would be broke out
> (from 3.6 to 3.7), and also need re-factor 3.7 fresh install automation job,
> or else that would break everything.

I don't think that's true. If you change the test to set openshift_hostname/openshift_ip correctly, it should work fine on both old and new OpenShift.

> Then installation is finished successfully, but found nodename is set to IP,
> but not hostname in /etc/origin/node/node-config.yaml.
> "nodeName: 172.16.120.32"
>
> # oc get nodes
> NAME STATUS AGE VERSION
> 172.16.120.113 Ready,SchedulingDisabled 16m v1.7.6+a08f5eeb62
> 172.16.120.32 Ready 16m v1.7.6+a08f5eeb62
>
> This would bring a lot trouble when instance ip get changed. I think this
> was mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1416703#c4.

If the current public-vs-private ip/hostname system doesn't work well for OpenShift-on-OpenStack then maybe we need to look into making it better. (Eg, making nodes able to resolve each other's local hostnames somehow.) Because most customers aren't going to be able to just "cheat" and set openshift_hostname to the public hostname like the tests are doing since that would break things.
Merged: https://github.com/openshift/openshift-ansible/pull/5970
Re-tested this bug with openshift-ansible-3.9.0-0.13.0.git.0.8119a5c.el7.noarch: FAIL.

1. Set "openshift_hostname=host-8-245-68.host.centralci.eng.rdu2.redhat.com" for the node and run "playbooks/prerequisites.yaml". This gives the following error, as expected.

TASK [Query DNS for IP address of host-8-245-68.host.centralci.eng.rdu2.redhat.com] ********************************************************************************************
ok: [host-8-245-68.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "cmd": "getent ahostsv4 host-8-245-68.host.centralci.eng.rdu2.redhat.com | head -n 1 | awk '{ print $1 }'", "delta": "0:00:00.105294", "end": "2018-01-03 01:13:40.367097", "failed": false, "failed_when_result": false, "rc": 0, "start": "2018-01-03 01:13:40.261803", "stderr": "", "stderr_lines": [], "stdout": "10.8.245.68", "stdout_lines": ["10.8.245.68"]}

TASK [Validate openshift_hostname when defined] ********************************************************************************************************************************
fatal: [host-8-245-68.host.centralci.eng.rdu2.redhat.com]: FAILED! => {"changed": false, "failed": true, "msg": "The hostname host-8-245-68.host.centralci.eng.rdu2.redhat.com for host-172-16-120-117 doesn't resolve to an IP address owned by this host. Please set openshift_hostname variable to a hostname that when resolved on the host in question resolves to an IP address matching an interface on this host. This will ensure proper functionality of OpenShift networking features. Inventory setting: openshift_hostname=host-8-245-68.host.centralci.eng.rdu2.redhat.com This check can be overridden by setting openshift_hostname_check=false in the inventory. See https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-host-variables\n"}

2. Set "openshift_hostname=host-8-245-68.host.centralci.eng.rdu2.redhat.com" for the node and "openshift_hostname_check=false" in the inventory host file, then run "playbooks/prerequisites.yaml". This still gives the same error as #1.

3. Set "openshift_ip=10.8.245.68" for the node and run "playbooks/prerequisites.yaml". This gives the following error, as expected.

TASK [Query DNS for IP address of host-8-245-68.host.centralci.eng.rdu2.redhat.com] ********************************************************************************************
ok: [host-8-245-68.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "cmd": "getent ahostsv4 host-8-245-68.host.centralci.eng.rdu2.redhat.com | head -n 1 | awk '{ print $1 }'", "delta": "0:00:00.106362", "end": "2018-01-03 01:01:46.259505", "failed": false, "failed_when_result": false, "rc": 0, "start": "2018-01-03 01:01:46.153143", "stderr": "", "stderr_lines": [], "stdout": "10.8.245.68", "stdout_lines": ["10.8.245.68"]}

TASK [Validate openshift_hostname when defined] ********************************************************************************************************************************
skipping: [host-8-245-68.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

TASK [Validate openshift_ip exists on node when defined] ***********************************************************************************************************************
fatal: [host-8-245-68.host.centralci.eng.rdu2.redhat.com]: FAILED! => {"changed": false, "failed": true, "msg": "The IP address 10.8.245.68 does not exist on host-172-16-120-117. Please set the openshift_ip variable to an IP address of this node. This will ensure proper functionality of OpenShift networking features. Inventory setting: openshift_ip=10.8.245.68 This check can be overridden by setting openshift_ip_check=false in the inventory. See https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-host-variables\n"}

4. Set "openshift_ip=10.8.245.68" for the node and "openshift_ip_check=false" in the inventory host file, then run "playbooks/prerequisites.yaml". This still gives the same error as #3.

Based on #2 and #4, it seems that "openshift_hostname_check=false" and "openshift_ip_check=false" are not respected, so I am assigning this bug back.

One more enhancement: in this PR, openshift_override_hostname_check is replaced by openshift_hostname_check, but several places in openshift-ansible still refer to the old option name.

$ grep -r "openshift_override_hostname_check" *
inventory/hosts.example:#openshift_override_hostname_check=true
playbooks/openstack/sample-inventory/group_vars/OSEv3.yml:openshift_override_hostname_check: true
utils/src/ooinstall/openshift_ansible.py: base_inventory.write('openshift_override_hostname_check=true\n')

Please update these as well.
BTW, regarding "openshift_hostname_check=false" and "openshift_ip_check=false" not being respected: appending a 'bool' filter to each condition would fix them.

openshift_hostname_check | default(true) | bool
openshift_ip_check | default(true) | bool
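A likely reason the overrides were ignored is that an inventory setting such as openshift_hostname_check=false reaches the playbook as the string "false", and a non-empty string counts as true in a condition. The Python sketch below illustrates the difference. Note that `to_bool` is a simplified approximation of what Ansible's bool filter does, written for this example; it is not the filter's actual source, and the assumption that the raw value is a string is mine.

```python
def to_bool(value):
    """Approximate Ansible's bool filter: only a few spellings count as true."""
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, str):
        value = value.lower()
    return value in ("yes", "on", "1", "true", 1)


# What the inventory setting openshift_hostname_check=false is assumed to yield.
raw = "false"

# Plain truthiness: a non-empty string is true, so the check still runs.
print(bool(raw))     # True

# With an explicit conversion the override is honoured and the check is skipped.
print(to_bool(raw))  # False
```

This is why `default(true)` alone is not enough: it only supplies a value when the variable is undefined and does not convert a string that the user did set.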
Proposed: https://github.com/openshift/openshift-ansible/pull/6817
Merged
in openshift-ansible-3.9.0-0.24.0
Verified this bug with openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch: PASS. All 4 scenarios mentioned in comment 10 pass.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489