Bug 1411501

Summary: Only one ipfailover container can be run on a node
Product: OpenShift Container Platform
Component: Networking (sub component: router)
Version: 3.5.0
Hardware: All
OS: Linux
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Ben Bennett <bbennett>
Assignee: Phil Cameron <pcameron>
QA Contact: zhaozhanqi <zzhao>
CC: aloughla, aos-bugs, bmeng, eparis, ramr, tdawson
Type: Bug
Doc Type: Bug Fix
Doc Text: Cause: pods can start on the same node. Fix: force pods to start on different nodes.
Last Closed: 2017-04-12 19:08:58 UTC

Description Ben Bennett 2017-01-09 20:56:51 UTC
Description of problem:

If you create two ipfailover configurations and arrange for their pods to run on the same node, scheduling fails because both try to use hostPort 1985.


Version-Release number of selected component (if applicable):

Origin 1.5 and below


How reproducible:

Always


Steps to Reproduce:
1. Assuming two nodes
2. oadm ipfailover ipf-1 --virtual-ips=10.1.1.1 --replicas=2
3. oadm ipfailover ipf-2 --virtual-ips=10.1.1.2 --replicas=2 --vrrp-id-offset=1


Actual results:

Half of the ipfailover containers are not scheduled because they collide on hostPort 1985.
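
The stuck pods can be observed with the standard client commands (a hedged sketch; the exact event wording varies by version):

  oc get pods -o wide              # two of the four ipfailover pods stay Pending
  oc describe pod <pending-pod>    # events should point at the hostPort conflict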


Expected results:

All four containers should be scheduled and running.


Additional info:

Nothing actually uses port 1985.  We think it was set that way to provide a cheap form of anti-affinity.  But we should use proper pod anti-affinity to spread the pods across nodes (it is described for the router in https://docs.openshift.org/latest/admin_guide/manage_nodes.html#pod-anti-affinity, BUT I am not certain that it is still done with an annotation now; please research whether it is supported as a core capability).  A sketch of what that could look like follows.
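
For reference, a minimal sketch of pod anti-affinity in the pod template, assuming the later spec.affinity field form (in 1.4/1.5 the same intent was expressed through an alpha annotation); the label key and value here are made up for illustration:

  spec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              ipfailover: ipf-1                  # hypothetical label on the ipfailover pods
          topologyKey: kubernetes.io/hostname    # at most one matching pod per node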

Comment 1 Phil Cameron 2017-01-12 20:32:11 UTC
bbennett: digging around I found:
Pod anti-affinity does not work in OpenShift (Jan 2017)
https://access.redhat.com/solutions/2840171

Also, port 1985 is not true anti-affinity, since the pod is still created and just gets stuck waiting on the port; the pod is not moved to another node.

Comment 2 Phil Cameron 2017-01-12 20:33:53 UTC
ramr
Do you know the intended use for port 1985?

Comment 3 Ram Ranganathan 2017-01-12 22:49:31 UTC
@phil - there is nothing binding to port 1985, if that's what you are asking.

As Ben mentioned, it was a cheap way (and the only way when this was done) to ensure that two ipfailover pods are not placed on the same node/host when Kubernetes schedules the pods. You can't run two keepaliveds on the same node, as they would clash managing the same network interfaces and VRRP messages [src/dest would be the same for the 2 pods].
Also note the ipfailover pod _has_ to run in host networking mode.
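
For context, the mechanism being discussed is just a hostPort declaration in the pod template; the scheduler will not place two pods requesting the same hostPort on one node. A minimal illustrative fragment, not the literal generated deployment config:

  spec:
    hostNetwork: true            # ipfailover has to run in host networking mode
    containers:
    - name: ipfailover-keepalived
      ports:
      - containerPort: 1985
        hostPort: 1985           # two pods requesting this port cannot share a node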

Comment 4 Ben Bennett 2017-01-13 13:12:37 UTC
@ramr - We can run two _different_ configurations of keepalived (i.e. managing different addresses and with different virtual_router_ids) on the same node, right?  The problem only arises if you run the same config twice on one node; then they fight.  Phil and I tried the same config and different configs: with the same config keepalived detects a problem and logs vociferously, while with different configs all was good (and we are already setting the virtual_router_id differently).

http://serverfault.com/questions/473058/keepaliveds-virtual-router-id-should-it-be-unique-per-node seems to back up this assessment.
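
To illustrate why different configs coexist, here is a minimal sketch of two keepalived.conf fragments that can share a node; the instance names are made up and the interface is assumed to be eth0:

  # Config for ipf-1 (VIP 10.1.1.1):
  vrrp_instance ipf_1 {
      state BACKUP
      interface eth0
      virtual_router_id 1      # must differ between the two configs
      priority 100
      virtual_ipaddress {
          10.1.1.1
      }
  }

  # Config for ipf-2 (VIP 10.1.1.2):
  vrrp_instance ipf_2 {
      state BACKUP
      interface eth0
      virtual_router_id 2      # different id, so no VRRP clash
      priority 100
      virtual_ipaddress {
          10.1.1.2
      }
  }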

Comment 8 Phil Cameron 2017-01-23 18:05:33 UTC
Proposal: Continue to use a port number, with each ipf config having a different port. The port for a config could be 1985 (the current port) + the vrrp_id of the config. vrrp_id is in the range 0-255, so the actual port would be in the range 1985-2240 (assuming that range is available). There is one port per config, taken from that range.
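
A worked example against the reproduction steps above, assuming the offset maps directly onto the port as proposed:

  ipf-1: --vrrp-id-offset=0 (default)  ->  port 1985 + 0 = 1985
  ipf-2: --vrrp-id-offset=1            ->  port 1985 + 1 = 1986

Because the two configurations now claim different ports, their pods can share a node, while the replicas within one configuration still repel each other.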

When pod affinity/anti-affinity becomes GA we can switch to that. In 1.4 and 1.5, affinity is alpha and uses annotations; in 1.6 it becomes beta and uses a field. When the beta arrives, the alpha form is deprecated.

I think we need to figure out an upgrade path for customers that use this. Hopefully there are not very many. The port-based configurations would continue to work going forward. The affinity-based solution would require customer modifications to the dc as part of upgrades.

Comment 9 Ben Bennett 2017-01-23 21:27:09 UTC
pcameron: That seems reasonable.  If we do that in 1.5 (and perhaps also apply the change in 1.4) then we should not have an upgrade problem.  We do need to flip to anti-affinity at some point, but the port hack doesn't hurt.  So let's do that now, and make a card for the future anti-affinity change so we don't lose track of it.

Please consider the port range we are using... I'm not sure there's a good reason to start at 1985 (and we need to make sure there's nothing between 1985 and 1985 + 255 that we care about).  I suspect there is, so we need to work out whether there's a better range to use.  Since nothing actually binds to the ports, we could use a high range (in the dynamically assigned area, 49152-65535) to avoid conflicts.

Comment 10 Ben Bennett 2017-01-27 15:53:23 UTC
In merge queue.

Comment 11 openshift-github-bot 2017-01-27 16:08:55 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/959a4bfd2fb30b64400124a50d59219901dda012
Allow multiple ipfailover configs on same node

The ipfailover pods for a given configuration must run on
different nodes.  We are using the ServicePort as a mechanism
to prevent multiple pods for the same configuration from
starting on the same node. Since pods for different
configurations can run on the same node, a different
ServicePort is used for each configuration.

In the future, this may be changed to pod anti-affinity.

bug 1411501
https://bugzilla.redhat.com/show_bug.cgi?id=1411501

Signed-off-by: Phil Cameron <pcameron>

Comment 12 Troy Dawson 2017-02-03 22:49:31 UTC
This has been merged into ocp and is in OCP v3.5.0.16 or newer.

Comment 13 zhaozhanqi 2017-02-04 02:40:58 UTC
Verified this bug on:

  openshift version
  openshift v3.5.0.16+a26133a
  kubernetes v1.5.2+43a9be4
  etcd 3.1.0

When creating two ipfailover pods on the same node, both work well.
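
A hedged sketch of that kind of verification (the original comment does not record the exact commands, and the selector value here is made up):

  oadm ipfailover ipf-1 --virtual-ips=10.1.1.1 --replicas=1 --selector="kubernetes.io/hostname=node1"
  oadm ipfailover ipf-2 --virtual-ips=10.1.1.2 --replicas=1 --selector="kubernetes.io/hostname=node1" --vrrp-id-offset=1
  oc get pods -o wide    # both ipfailover pods should be Running on node1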

Comment 15 errata-xmlrpc 2017-04-12 19:08:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884