Bug 1477685

Summary: A/B deployment seems to round-robin across all pods in multiple services, instead of proportional routing to services
Product: OpenShift Container Platform
Reporter: Phil Cameron <pcameron>
Component: Networking
Sub Component: router
Assignee: Phil Cameron <pcameron>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: aos-bugs, atragler, bbennett, dneary, eparis, rkhan, sukulkar, xtian, yadu, zzhao
Version: 3.7.0
Keywords: Reopened
Target Milestone: ---
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text: Cause: See Docs PR 5816
Story Points: ---
Clone Of: 1470350
Environment:
Last Closed: 2017-11-28 22:06:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1470350
Bug Blocks: 1473736

Description Phil Cameron 2017-08-02 15:55:57 UTC
+++ This bug was initially created as a clone of Bug #1470350 +++

The algorithm for setting weight can result in an unexpected A/B balance when a service's weight is spread over too many pods. The minimum per-endpoint weight is 1, so every endpoint on the service gets at least 1. For example, when the service weight is 2 and there are 4 pods, each pod gets 1 and the effective service weight becomes 4. The service ends up with more weight than requested. At present the documentation suggests scaling the deployment to achieve the desired weight.

This change distributes the service weight to pods until the total is used. The remaining pods in the service, if any, will get weight 0. This will preserve the total desired service weight. In the above example, 2 pods will get 1 and 2 pods will get 0 for a service total of 2.
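
In code terms, the distribution described above amounts to something like the following Go sketch (illustrative only, not the actual origin router code; the function name is made up):

// distributeWeight hands the service weight out one unit at a time across the
// endpoints until it is used up; any endpoints left over keep weight 0, so the
// service total never exceeds the requested weight.
func distributeWeight(serviceWeight, numEndpoints int) []int {
	weights := make([]int, numEndpoints)
	if numEndpoints == 0 {
		return weights
	}
	for remaining := serviceWeight; remaining > 0; {
		for i := 0; i < numEndpoints && remaining > 0; i++ {
			weights[i]++
			remaining--
		}
	}
	return weights
}

// distributeWeight(2, 4) returns [1 1 0 0]: the service keeps its total of 2.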

----------------------  

While doing an OpenShift training session, I set up two services serving different static HTML. We set the route to split traffic across the 2 services 50%/50%. I changed the number of pod replicas for service A (dneary/v3simple-spatial) to 4, and set the number of pod replicas for service B (dneary/green) to 1.

I ran the following script:

for i in `seq 1 50`; do 
  curl "http://v3simple-spatial-dneary.apps.class.molw.io";
  echo;
done

What I expected: I expected to get a 50/50 split of old text ("Hello OpenShift Ninja without a DB?") and new text ("Hello there. Have you considered OpenShift?") - either alternating or randomly across the 2 services.

What I observed: I got multiple copies of old text, with 1 copy of new text. It appears that the application is round-robin distributing load across all of the pods (4 in service A, 1 in service B). This appears to be incompatible with a 50/50 split across 2 services.
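
(With every endpoint weighted equally, a plain round-robin over the 5 pods would send roughly 4 out of every 5 requests, about 80%, to service A and about 20% to service B, which matches the pattern in the output below.)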


Output of command:
for i in `seq 1 50`; do
  curl "http://v3simple-spatial-dneary.apps.class.molw.io";
  echo;
done
<h1>Hello there. Have you considered OpenShift?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello there. Have you considered OpenShift?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello there. Have you considered OpenShift?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello there. Have you considered OpenShift?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello there. Have you considered OpenShift?</h1>
...

--- Additional comment from Eric Paris on 2017-07-13 00:24:45 EDT ---

Wasn't there a requirement that you had to set an annotation on the route
haproxy.router.openshift.io/balance = leastconn
or something like that?

You can try that, and if it does not work I assume we are going to want to see the yaml/json for the route in question.

--- Additional comment from Yan Du on 2017-07-13 02:05:04 EDT ---

I could reproduce the issue on the latest OCP env:
openshift v3.6.143
kubernetes v1.6.1+5115d708d7

$ oc set route-backends route1
NAME           KIND     TO                  WEIGHT
routes/route1  Service  service-unsecure    50 (50%)
routes/route1  Service  service-unsecure-2  50 (50%)

$ oc scale rc test-rc-1 --replicas=4
replicationcontroller "test-rc-1" scaled
$ oc get pod -w
NAME              READY     STATUS    RESTARTS   AGE
test-rc-1-33rhr   1/1       Running   0          11s
test-rc-1-mfjw6   1/1       Running   0         12m
test-rc-1-tnn5g   1/1       Running   0         11s
test-rc-1-w5gh1   1/1       Running   0         11s
test-rc-2-mmf4r   1/1       Running   0         12m

$ for i in {1..50}; do curl route1-sess.0713-u9a.qe.rhcloud.com ; done
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080

The route balance is roundrobin by default when a route has multiple services, but even when I set haproxy.router.openshift.io/balance = leastconn it still does not work.
The route yaml file and the haproxy.config are attached.

--- Additional comment from Yan Du on 2017-07-13 02:05 EDT ---



--- Additional comment from Ben Bennett on 2017-07-13 11:02:17 EDT ---

This is actually functioning as intended... but the behavior is not properly documented (and is rather confusing anyway).

The weights apply to each backend (endpoint), so if you have a route with service A at weight 1 and service B at weight 2 and the number of back-ends for each is equal, then service A gets 33% of the traffic and service B gets 66%.  BUT if there are 2 endpoints for A and 1 for B, the numbers change: A's endpoints together answer 50% of the requests and B's single endpoint answers the other 50%.
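
To make that arithmetic concrete, a small Go sketch (illustrative only, under the assumption that each endpoint simply inherits its service's route weight; names are made up, this is not router code):

// serviceShare sketches the pre-fix behavior: every endpoint carries its
// service's route weight, so a service's effective share grows with its
// replica count.
func serviceShare(routeWeight, endpoints map[string]int) map[string]float64 {
	total := 0
	for svc, w := range routeWeight {
		total += w * endpoints[svc]
	}
	share := map[string]float64{}
	for svc, w := range routeWeight {
		if total > 0 {
			share[svc] = float64(w*endpoints[svc]) / float64(total)
		}
	}
	return share
}

// serviceShare(map[string]int{"A": 1, "B": 2}, map[string]int{"A": 2, "B": 1})
// yields A: 0.5, B: 0.5 -- the 50/50 outcome described above, even though the
// route weights are 1:2.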

Since weighting requires round-robin, you would expect to see round-robin behavior if all weights are equal.

If you set the balance type to leastconn, weighting has no effect.

We could take a feature request to change the behavior to set the weights proportionally based on the fraction of endpoints that service has, but that is not the way this was originally designed and approved.

We really need to update the docs to make this clearer. Trello card https://trello.com/c/MajuXbiV tracks the docs improvements, and I have asked our networking docs person and a networking developer to look at improving this ASAP.

--- Additional comment from Ben Bennett on 2017-07-13 14:51:50 EDT ---

Re-opened because the networking team decided that, while this behavior was a deliberate choice, it was not a great one and will surprise a lot of users.

--- Additional comment from Dave Neary on 2017-07-13 15:54:22 EDT ---

Thanks for the update and the confirmation, Ben and Yan - I know this surprised me (and the evangelist training us). Also, thanks for showing me up with the {1..50} shell built-in instead of running seq, Yan ;-)

--- Additional comment from Phil Cameron on 2017-07-25 16:53:46 EDT ---

Docs PR 4847
https://github.com/openshift/openshift-docs/pull/4847

--- Additional comment from openshift-github-bot on 2017-07-27 00:43:29 EDT ---

Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/0ccbe544c0db84a9436738fed7051ea1a959b3ae
Document a/b deployment

A route can front up to 4 services that handle the requests.
The load balancing strategy governs which endpoint gets each request.
When roundrobin is chosen, the portion of the requests that each
service handles is governed by the weight assigned to the service.
Each endpoint in the service gets a fraction of the service's requests.

bug 1470350
https://bugzilla.redhat.com/show_bug.cgi?id=1470350

Code change is in origin PR 15309
https://github.com/openshift/origin/pull/15309

--- Additional comment from openshift-github-bot on 2017-08-02 06:29:59 EDT ---

Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/b0db0b2e3bcf1db113ed61ed1083853c986b6ef9
Make A/B deployment proportional to service weight

Distribute requests among the route's services based on service weight.
The portion of the total weight that each service has is distributed
evenly among the service's endpoints.

bug 1470350
https://bugzilla.redhat.com/show_bug.cgi?id=1470350

https://github.com/openshift/origin/commit/b64c94eb239034b0a8df8abf290f5322d7600855
Merge pull request #15309 from pecameron/bz1470350

Automatic merge from submit-queue (batch tested with PRs 15533, 15414, 15584, 15529, 15309)

Make A/B deployment proportional to service weight

Distribute requests among the route's services based on service weight.
The portion of the total weight that each service has is distributed
evenly among the service's endpoints.

bug 1470350
https://bugzilla.redhat.com/show_bug.cgi?id=1470350
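
A rough sketch of the per-endpoint weight that commit message describes (hypothetical helper in Go; the actual rounding in PR 15309 may differ):

// evenEndpointWeight splits a service's route weight evenly over its endpoints.
func evenEndpointWeight(serviceWeight, numEndpoints int32) int32 {
	if numEndpoints == 0 {
		return 0
	}
	return serviceWeight / numEndpoints
}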

Comment 2 Phil Cameron 2017-08-29 18:41:23 UTC
See PR 15970 for details. 
We decided not to continue with these changes. Scaling the number of pods, as documented, works around the problem of having too many pods at the minimum weight of 1. If there is a customer need we will re-open this bug.

Comment 3 Dave Neary 2017-08-29 19:35:50 UTC
Thanks for the update - I read through https://github.com/openshift/origin/pull/15970 and that seems to be addressing a different issue (or at least, I don't see the relation to this bug report). No need to re-open.

Comment 4 openshift-github-bot 2017-10-20 00:51:22 UTC
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/00d2d715129ee3e6e09d0713f68181acf6ac89f3
Router - A/B weights distribution improvement

Service weight is roughly distributed among the endpoints. The
service can end up with more or less weight than desired.

This change finds the desired endpoint weight for each service and
then scales the maximum weight to 256 (the maximum).

The scaling is done before writing the haproxy.config file to
account for any endpoint changes.

bug: 1477685
https://bugzilla.redhat.com/show_bug.cgi?id=1477685

Trello: BhWCH3vu
https://trello.com/c/BhWCH3vu/543-3-fix-problem-in-a-b-weight-algorithm

https://github.com/openshift/origin/commit/296e98abb6a9938112db9e1f9d99b6cdbc5f7db7
Merge pull request #16090 from pecameron/bz1477685a

Automatic merge from submit-queue (batch tested with PRs 16896, 16908, 16935, 16898, 16090).

Router - A/B weights distribution improvement

Service weight is roughly distributed among the endpoints. The
service can end up with more or less weight than desired.

This change finds the desired endpoint weight for each service and
then scales the maximum weight to 256 (the maximum).

The scaling is done before writing the haproxy.config file to
account for any endpoint changes.

bug: 1477685
https://bugzilla.redhat.com/show_bug.cgi?id=1477685

Trello: BhWCH3vu
https://trello.com/c/BhWCH3vu/543-3-fix-problem-in-a-b-weight-algorithm
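
A sketch of what the scaling the commit describes might look like (Go, assumed shape only, not the origin code; it assumes the per-endpoint weight is serviceWeight/endpointCount and the scaling is linear):

// endpointWeights computes each service's per-endpoint weight, then scales
// everything so the largest per-endpoint weight becomes 256, haproxy's maximum.
func endpointWeights(serviceWeight, numEndpoints map[string]int32) map[string]int32 {
	perEndpoint := map[string]float64{}
	maxWeight := 0.0
	for svc, w := range serviceWeight {
		if n := numEndpoints[svc]; n > 0 {
			perEndpoint[svc] = float64(w) / float64(n)
			if perEndpoint[svc] > maxWeight {
				maxWeight = perEndpoint[svc]
			}
		}
	}
	scaled := map[string]int32{}
	for svc, w := range perEndpoint {
		if maxWeight > 0 {
			// Every endpoint of svc would get this weight in haproxy.config.
			scaled[svc] = int32(w * 256 / maxWeight)
		}
	}
	return scaled
}

// e.g. service weights 20, 10, 30, 40 with 2, 4, 3, 1 endpoints would scale to
// per-endpoint weights 64, 16, 64, 256 in this sketch.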

Comment 5 Phil Cameron 2017-10-20 12:14:36 UTC
As part of the new A/B load balancing algorithm change, PR 16090 also fixes a problem where an endpoint change is not reflected in the router configuration. This can lead to 503 responses even though there is an endpoint on the service.

Comment 6 Phil Cameron 2017-10-20 15:26:29 UTC
DOCS PR 5816
https://github.com/openshift/openshift-docs/pull/5816

Comment 7 Yan Du 2017-10-24 05:14:10 UTC
Tested on OCP 3.7; the issue has been fixed.
openshift v3.7.0-0.174.0
kubernetes v1.7.6+a08f5eeb62

Comment 9 Yan Du 2017-10-30 07:04:38 UTC
Tested on the latest OCP:

openshift v3.7.0-0.184.0
kubernetes v1.7.6+a08f5eeb62

Verification steps are in the bug description.
Actual result: each endpoint gets a weight/numberOfEndpoints share of the requests.
# oc get pod
NAME              READY     STATUS    RESTARTS   AGE
test-rc-1-5dthx   1/1       Running   0          30s
test-rc-1-pdghh   1/1       Running   0          1m
test-rc-2-6dfzv   1/1       Running   0          25s
test-rc-2-qgmgg   1/1       Running   0          25s
test-rc-2-tx7ws   1/1       Running   0          1m
test-rc-2-z4bwm   1/1       Running   0          25s
test-rc-3-b65xn   1/1       Running   0          22s
test-rc-3-k8jfs   1/1       Running   0          1m
test-rc-3-msnzc   1/1       Running   0          22s
test-rc-4-hf2xz   1/1       Running   0          48s
# oc get route
NAME      HOST/PORT                                PATH      SERVICES                                                                                        PORT      TERMINATION   WILDCARD
route1    route1-d1.apps.1030-0ou.qe.rhcloud.com             service-unsecure(20%),service-unsecure-2(10%),service-unsecure-3(30%),service-unsecure-4(40%)   http                    None
# for i in {1..20} ; do curl route1-d1.apps.1030-0ou.qe.rhcloud.com; done
Hello-OpenShift-4 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-4 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-3 http-8080
Hello-OpenShift-4 http-8080
Hello-OpenShift-3 http-8080
Hello-OpenShift-4 http-8080
Hello-OpenShift-3 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-4 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-4 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-3 http-8080
Hello-OpenShift-4 http-8080
Hello-OpenShift-3 http-8080
Hello-OpenShift-4 http-8080
Hello-OpenShift-3 http-8080
Hello-OpenShift-2 http-8080
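
For reference, with these weights and replica counts each endpoint's expected share is: service-unsecure 20%/2 pods = 10% per pod, service-unsecure-2 10%/4 = 2.5%, service-unsecure-3 30%/3 = 10%, and service-unsecure-4 40%/1 = 40%. Over the 20 requests above that works out to roughly 4, 2, 6, and 8 responses per service respectively, which is exactly what the output shows.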

Comment 12 errata-xmlrpc 2017-11-28 22:06:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188