
Bug 1737097

Summary:	[OSP] openshift-installer creates multiple IP addresses on worker and master nodes that aren't allowed by OpenStack Security Groups
Product:	OpenShift Container Platform	Reporter:	Ken Holden <kholden>
Component:	Installer	Assignee:	Tomas Sedovic <tsedovic>
Installer sub component:	openshift-installer	QA Contact:	David Sanz <dsanzmor>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	low
Priority:	low	CC:	asimonel, eduen, juriarte
Version:	4.2.0
Target Milestone:	---
Target Release:	4.3.0
Hardware:	All
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-01-23 11:05:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ken Holden 2019-08-02 15:27:38 UTC

Description of problem:

Version-Release number of the following components:
* OSP13
* openshift-install version:
bin/openshift-install unreleased-master-1483-gade37f550cc2b4f26f23d1705b6a34e020a0bd5f-dirty
built from commit ade37f550cc2b4f26f23d1705b6a34e020a0bd5f
release image registry.svc.ci.openshift.org/origin/release:4.2


How reproducible:

Steps to Reproduce:
1. Run bin/openshift-install --log-level=debug create cluster --dir rhte
2. The install reaches 100% status but does not complete
3. Master and worker show "connection refused" errors between each other
4. Master and worker can't ping each other's sub-interfaces

Actual results:
If you log into the master and worker during the install and run journalctl -f, you can see that they are unable to reach each other's sub-interface IPs. If you run `ip a` on both the worker and the master, you can see that each has the IP address that OpenStack assigned during creation, plus additional IPs assigned to the node. The OpenStack Security Group is only aware of the IP address assigned during VM creation, so any additional IP addresses are blocked by the security groups regardless of which ports are allowed.

To address this, I obtained the neutron port UUIDs for the master and worker and set allowed-address to 0.0.0.0/0:

# master
(openshift) [stack@director13 ~]$ openstack port list |grep master
| 80328773-8293-47f1-a71b-34966958a8a8 | rhte-gkbvm-master-port-0 | fa:16:3e:f8:42:57 | ip_address='10.10.0.14', subnet_id='8cc857a8-d228-4b1b-a6a9-6dff67f45bf5' | ACTIVE |

(openshift) [stack@director13 ~]$ openstack port set --allowed-address ip-address=0.0.0.0/0 80328773-8293-47f1-a71b-34966958a8a8

# worker
(openshift) [stack@director13 ~]$ openstack port list |grep worker
| ed9da2c1-9735-4ba0-b717-0a6ecaf29eec | rhte-gkbvm-worker-zkql5 | fa:16:3e:b7:c8:c6 | ip_address='10.10.0.33', subnet_id='8cc857a8-d228-4b1b-a6a9-6dff67f45bf5' | ACTIVE |

(openshift) [stack@director13 ~]$ openstack port set --allowed-address ip-address=0.0.0.0/0 ed9da2c1-9735-4ba0-b717-0a6ecaf29eec

Once I did this, the deploy completed.  If you don't do this, the deploy will sit at 100% for some time and then fail.
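
The two `openstack port set` calls above can be sketched as a single loop. This is a minimal sketch and not part of the installer: the function name `set_allowed_address` and its arguments are hypothetical, and it assumes the `openstack` CLI is already configured for the project. Passing 0.0.0.0/0 reproduces the workaround as described; passing a single address keeps the change narrower.

```shell
# Sketch only: add one allowed-address entry to every port whose name
# contains a pattern. The function name and arguments are hypothetical.
set_allowed_address() {
    pattern=$1   # substring of the port name, e.g. "master" or "worker"
    cidr=$2      # address or CIDR to allow, e.g. 0.0.0.0/0

    # "-f value -c ID -c Name" prints one "ID Name" pair per line.
    openstack port list -f value -c ID -c Name |
    while read -r port_id port_name; do
        case $port_name in
            *"$pattern"*)
                echo "updating $port_name ($port_id)"
                # Keep the loop's stdin (the port list) away from the CLI.
                openstack port set --allowed-address "ip-address=$cidr" \
                    "$port_id" < /dev/null
                ;;
        esac
    done
}
```

For example, `set_allowed_address master 0.0.0.0/0` followed by `set_allowed_address worker 0.0.0.0/0` would cover the two ports listed above.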

Comment 1 Eric Duen 2019-08-02 16:00:52 UTC

@Tomas, I believe you were taking a look at ports and SGs.  Can you take a look at this?

Comment 2 Tomas Sedovic 2019-08-27 12:02:45 UTC

Thank you.

It is possible there's a default security group rule in your OpenStack environment that prevents this.

Can you share an example of the extra IP addresses that are assigned to the nodes?

If they are in the form `10.10.0.5` or `6` or `7`, those should be assigned to the `api-port`, `dns-port` and `ingress-port` respectively and be managed by Keepalived running on the nodes. It is true that these IPs are not associated with the servers in a normal manner, but we set allowed address pairs on all ports to whitelist them.

Are you talking about these addresses or some other ones? I'm not aware of any other extra IPs that we create that would need special handling.
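
Whether a given address is whitelisted on a port can be checked from the CLI. The following is a minimal sketch under stated assumptions: the helper name `port_allows_ip` is hypothetical, the `openstack` CLI is assumed to be configured, and the check is a plain text match on the `allowed_address_pairs` field, so a pair stored as a CIDR (rather than the exact address) would not be matched.

```shell
# Sketch only: succeed if the port's allowed_address_pairs mention the IP.
# Returns 0 when found, 1 when not found, 2 when the port lookup fails.
port_allows_ip() {
    port=$1   # port name or UUID
    ip=$2     # exact address to look for, e.g. 10.10.0.6

    pairs=$(openstack port show "$port" -f value -c allowed_address_pairs) || return 2

    # The value output quotes each address, so match the quoted form to
    # avoid 10.10.0.5 also matching 10.10.0.50.
    printf '%s\n' "$pairs" | grep -qF "'$ip'"
}
```

A possible use would be `port_allows_ip rhte-gkbvm-master-port-0 10.10.0.6 || echo "not whitelisted"`.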

Comment 3 Ken Holden 2019-08-27 13:04:17 UTC

I checked, and the only security group that was applied was the one created for the instance itself. I ensured PING and SSH were enabled in that security group, but was unable to ping or SSH to .6 or .7. However, 10.10.0.5 worked fine, which makes sense as it's the IP neutron assigned for the port.

I didn't have to change anything when testing the previous installation method that used the service VM for IPI install. Perhaps the allowed address pairs weren't set, or weren't applied correctly, when I tested the newer non-service-VM method of IPI install.

Comment 4 Tomas Sedovic 2019-08-27 13:49:57 UTC

Thanks!

The .5-.7 IP addresses are for the internal use of the cluster. A person deploying it is not expected to interact with them in any way. If you got to or near 100%, that means at least the .5 and .6 VIPs worked as expected. Otherwise you wouldn't even get past bootstrapping. The service VM did not need any of this VIP config. The VIPs are here to provide highly-available access to the load balancer and DNS services we're running on the master nodes now.

For what it's worth, I just ran a deployment on the latest checkout and it succeeded fine so this isn't something that was introduced recently (other than the service VM removal).

I wonder if this might actually have to do with a new port that's not open for some reason, unrelated to the IP addresses. But that will be tricky to figure out. At any rate, we will have to find which security group rules to add.

Please try to run the installation again and get it to fail (it should quit at most 30 minutes after the `Waiting up to 30m0s for the cluster at https://api.example.com:6443 to initialize` message)

And then, please provide the following:

1. Output of `openstack port show` (on a failing deployment before you do any manual fixes) for the master and worker ports
2. Output of `openstack port show <cluster>-<id>-api-port` and `openstack port show <cluster>-<id>-ingress-port`
3. How many masters and workers are you deploying with?
4. Outputs of both master and worker security group rules: `openstack security group rule list <cluster>-<id>-master` and `openstack security group rule list <cluster>-<id>-worker`
5. The `.openshift_install.log` file in your `--dir=<directory>` directory ("rhte/.openshift_install.log" by the looks of it)
6. Are you deploying via the interactive prompt or writing an install-config.yaml? If the latter, please share your install-config with the sensitive information (pull secret) omitted
7. If you have access to the underlying OpenStack (I saw Director so you might), could you report the value of `allow_same_net_traffic` in `/etc/nova/nova.conf` (on a controller node)?
8. Output of `oc get pod -A`. This will be quite long but it should help us figure out which pods are blocking the deployment success. You will need a kubeconfig for this. You can get this output by running the following:

$ export KUBECONFIG=$GOPATH/src/github.com/openshift/installer/<dir>/auth/kubeconfig
$ oc get pod -A > get-pod.txt
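
To narrow the long listing down to the pods most likely to be blocking the deployment, the saved output can be filtered. This is a minimal sketch: the helper name `unhealthy_pods` is hypothetical, and it assumes the default `oc get pod -A` column layout (NAMESPACE, NAME, READY, STATUS, ...). It only looks at the STATUS column, so a pod that is Running but not yet ready would not be reported.

```shell
# Sketch only: print "namespace/name: status" for every pod whose STATUS
# is neither Running nor Completed. Reads the named files, or stdin if
# no file is given.
unhealthy_pods() {
    awk 'NR == 1 { next }   # skip the header row
         $4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }' "$@"
}
```

For example: `unhealthy_pods get-pod.txt`.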

Thank you! I know this is a lot of info, but since I can't reproduce this, there's not much to go on at this point.
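
The port and security group outputs requested in items 1, 2 and 4 could be gathered in one pass. The following is a minimal sketch, not an official tool: the function name `collect_openstack_info` and the output file names are hypothetical, and it assumes every relevant port name starts with the `<cluster>-<id>` prefix (as with `rhte-gkbvm-master-port-0` in the description) and that the `openstack` CLI is configured.

```shell
# Sketch only: save port details and security group rules for one cluster.
collect_openstack_info() {
    infra_id=$1                  # the "<cluster>-<id>" prefix, e.g. rhte-gkbvm
    out_dir=${2:-diagnostics}    # where to write the output files
    mkdir -p "$out_dir" || return 1

    # Items 1 and 2: details of every port whose name starts with the prefix.
    openstack port list -f value -c ID -c Name |
    while read -r port_id port_name; do
        case $port_name in
            "$infra_id"-*)
                openstack port show "$port_id" \
                    > "$out_dir/port-$port_name.txt" < /dev/null
                ;;
        esac
    done

    # Item 4: rules of the master and worker security groups.
    for role in master worker; do
        openstack security group rule list "$infra_id-$role" \
            > "$out_dir/sg-rules-$role.txt"
    done
}
```

Running `collect_openstack_info rhte-gkbvm` on a failing deployment, before any manual fixes, would leave the files in `./diagnostics`.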

Comment 5 Tomas Sedovic 2019-08-27 15:02:47 UTC

Oh actually, also: according to the commit, you've applied an older version of the "remove service VM" pull request. It's possible something is wrong there.

Please try this again from the master branch or a nightly build.

Comment 7 David Sanz 2019-09-11 09:29:19 UTC

No "connection refused" errors between servers on the latest release.

Comment 9 errata-xmlrpc 2020-01-23 11:05:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 10 Red Hat Bugzilla 2023-09-14 05:41:00 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days