Bug 1473736 - A/B deployment seems to round-robin across all pods in multiple services, instead of proportional routing to services
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.6.z
Assignee: Phil Cameron
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1470350 1477685
Blocks:
Reported: 2017-07-21 14:19 UTC by Ben Bennett
Modified: 2022-08-04 22:20 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: See docs PR 4847 Reason: Result:
Clone Of: 1470350
Environment:
Last Closed: 2017-09-08 03:15:23 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Origin (Github) | 15309 | 0 | None | None | None | 2017-07-21 14:19:14 UTC
Red Hat Product Errata | RHBA-2017:2642 | 0 | normal | SHIPPED_LIVE | OpenShift Container Platform 3.6.1 bug fix and enhancement update | 2017-09-08 07:14:52 UTC

Description Ben Bennett 2017-07-21 14:19:15 UTC
+++ This bug was initially created as a clone of Bug #1470350 +++

While doing an OpenShift training session, I set up two services with different static HTML served. We set traffic to route across 2 services 50%/50%. I changed the number of pod replicas for service A (dneary/v3simple-spatial) to 4, and set the number of pod replicas for service B (dneary/green) to 1.

I ran the following script:

for i in `seq 1 50`; do 
  curl "http://v3simple-spatial-dneary.apps.class.molw.io";
  echo;
done

What I expected: I expected to get a 50/50 split of old text ("Hello OpenShift Ninja without a DB?") and new text ("Hello there. Have you considered OpenShift?") - either alternating or randomly across the 2 services.

What I observed: I got multiple copies of old text, with 1 copy of new text. It appears that the application is round-robin distributing load across all of the pods (4 in service A, 1 in service B). This appears to be incompatible with a 50/50 split across 2 services.


Output of command:
for i in `seq 1 50`; do
  curl "http://v3simple-spatial-dneary.apps.class.molw.io";
  echo;
done
<h1>Hello there. Have you considered OpenShift?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello there. Have you considered OpenShift?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello there. Have you considered OpenShift?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello there. Have you considered OpenShift?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello OpenShift Ninja without a DB?</h1>
<h1>Hello there. Have you considered OpenShift?</h1>
...
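
A quick way to quantify a split like the one above is to tally the distinct response bodies instead of reading them by eye. The sketch below assumes each response is a single line of text; the helper name tally is illustrative and not part of the original report.

```shell
# Count how many times each distinct non-empty line appears on stdin,
# most frequent first. Blank lines (from the extra "echo") are skipped.
tally() {
  awk 'NF { count[$0]++ } END { for (body in count) print count[body], body }' | sort -rn
}

# Hypothetical usage against the route from this report (needs the live cluster):
# for i in $(seq 1 50); do
#   curl -s "http://v3simple-spatial-dneary.apps.class.molw.io"; echo
# done | tally
```

With a 4:1 pattern like the one shown above, the old text would be counted roughly four times as often as the new text.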

--- Additional comment from Eric Paris on 2017-07-13 00:24:45 EDT ---

Wasn't there a requirement that you had to set an annotation on the route
haproxy.router.openshift.io/balance = leastconn
or something like that?

You can try that, and if it does not work I assume we are going to want to see the yaml/json for the route in question.

--- Additional comment from Yan Du on 2017-07-13 02:05:04 EDT ---

I could reproduce the issue with the latest OCP environment:
openshift v3.6.143
kubernetes v1.6.1+5115d708d7

$ oc set route-backends route1
NAME           KIND     TO                  WEIGHT
routes/route1  Service  service-unsecure    50 (50%)
routes/route1  Service  service-unsecure-2  50 (50%)

$ oc scale rc test-rc-1 --replicas=4
replicationcontroller "test-rc-1" scaled
$ oc get pod -w
NAME              READY     STATUS    RESTARTS   AGE
test-rc-1-33rhr   1/1       Running   0          11s
test-rc-1-mfjw6   1/1       Running   0          12m
test-rc-1-tnn5g   1/1       Running   0          11s
test-rc-1-w5gh1   1/1       Running   0          11s
test-rc-2-mmf4r   1/1       Running   0          12m

$ for i in {1..50}; do curl route1-sess.0713-u9a.qe.rhcloud.com ; done
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-2 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080
Hello-OpenShift-1 http-8080

The route balance is roundrobin by default when a route has multiple services, but even when I set haproxy.router.openshift.io/balance = leastconn, it does not work either.
The route YAML file and the haproxy.config are attached.

--- Additional comment from Yan Du on 2017-07-13 02:05 EDT ---



--- Additional comment from Ben Bennett on 2017-07-13 11:02:17 EDT ---

This is actually functioning as intended... but the behavior is not properly documented (and is rather confusing anyway).

The weights apply to each backend (pod), not to the service as a whole. Take a route with service A at weight 1 and service B at weight 2. If the number of back-ends for each is equal, then the pods backing A will get 33% of the traffic and the pods backing B will get 66%. BUT if there are 2 endpoints for A and 1 for B, the numbers change: each A pod carries weight 1 and the B pod carries weight 2, so the A pods will respond 50% of the time and the B pod will respond 50% of the time.
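
The arithmetic behind those percentages can be sketched in a few lines of shell. This is only a model of the behavior described in this comment, not router code. The function name share_per_service and its argument order are made up for illustration, and it assumes two services with at least one pod each.

```shell
# share_per_service WEIGHT_A PODS_A WEIGHT_B PODS_B
# Models the per-pod weighting described above: every pod carries its
# service's full route weight, so a service's share grows with its pod count.
# Percentages use integer division and are therefore rounded down.
share_per_service() {
  local wa=$1 na=$2 wb=$3 nb=$4
  local total=$(( wa * na + wb * nb ))
  echo "A=$(( 100 * wa * na / total ))% B=$(( 100 * wb * nb / total ))%"
}

share_per_service 1 1 2 1    # equal pod counts:         A=33% B=66%
share_per_service 1 2 2 1    # two A pods, one B pod:    A=50% B=50%
share_per_service 50 4 50 1  # the setup in this report: A=80% B=20%
```

The last call matches the 4:1 pattern in the curl output earlier in this report, where the weights were 50/50 with four pods behind one service and one pod behind the other.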

Since weighting requires round-robin, you would expect to see round-robin behavior if all weights are equal.

If you set the balance type to leastconn, weighting has no effect.

We could take a feature request to change the behavior to set the weights proportionally based on the fraction of endpoints that service has, but that is not the way this was originally designed and approved.
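
For comparison, the proportional idea mentioned above could look roughly like the sketch below: each service's weight is spread across its pods, so the service-level split no longer depends on the replica count. This illustrates the idea only and is not the change that was later merged in origin PR 15309. The function name per_pod_weights is hypothetical, pod counts are assumed to be at least 1, and any limits HAProxy places on weight values are not modeled.

```shell
# per_pod_weights WEIGHT_A PODS_A WEIGHT_B PODS_B
# Prints integer per-pod weights chosen so that each service's pods together
# receive that service's share of the route weight, whatever the pod counts.
per_pod_weights() {
  local wa=$1 na=$2 wb=$3 nb=$4
  # weight / pods would give fractions, so scale both sides by na * nb:
  # wa/na : wb/nb  is the same ratio as  wa*nb : wb*na.
  local pa=$(( wa * nb ))
  local pb=$(( wb * na ))
  local total=$(( pa * na + pb * nb ))
  echo "podA=$pa podB=$pb A=$(( 100 * pa * na / total ))% B=$(( 100 * pb * nb / total ))%"
}

per_pod_weights 50 4 50 1   # podA=50 podB=200 A=50% B=50%
```

Under this scheme the reported setup (50/50 weights, four pods versus one) would yield the 50/50 split the reporter expected.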

We really need to update the docs to make this clearer. Trello card https://trello.com/c/MajuXbiV tracks the docs improvements, and I have asked our networking docs person and a networking developer to look at improving this ASAP.

--- Additional comment from Ben Bennett on 2017-07-13 14:51:50 EDT ---

Re-opened because the networking team decided that while we made a deliberate choice of this behavior... it was not a great one and the current behavior will surprise a lot of users.

--- Additional comment from Dave Neary on 2017-07-13 15:54:22 EDT ---

Thanks for the update and the confirmation Ben, Yan - I know this surprised me (and the evangelist training us). Also, thanks for showing me up with the {1..50} shell built-in over running seq, Yan ;-)

Comment 1 Phil Cameron 2017-07-26 14:39:48 UTC
Blocked by merge of PR 15309

Comment 2 Phil Cameron 2017-07-26 14:43:54 UTC
Docs PR 4847
https://github.com/openshift/openshift-docs/pull/4847

Comment 3 Phil Cameron 2017-08-02 13:23:25 UTC
origin PR 15309 Merged    -- bz1470350
https://github.com/openshift/origin/pull/15309

Backport to 3.6.1 will be done when 3.6 branch opens

Comment 5 Yan Du 2017-08-29 08:00:56 UTC
Verified the bug on OCP 3.6. The code has merged, the weights work well, and the docs now explain how the weight is calculated.
oc v3.6.173.0.21
kubernetes v1.6.1+5115d708d7

Comment 7 errata-xmlrpc 2017-09-08 03:15:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2642

