1538611 – [Deployment] Changing opendaylight load balancing algorithm to roundrobin


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	12.0 (Pike)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	beta
Target Release:	13.0 (Queens)
Assignee:	Tim Rozet
QA Contact:	Tomas Jamrisko
Docs Contact:
URL:
Whiteboard:	odl_deployment
Depends On:
Blocks:

Reported:	2018-01-25 12:39 UTC by Tomas Jamrisko
Modified:	2018-10-18 07:19 UTC
CC List:	16 users
Fixed In Version:	openstack-tripleo-heat-templates-8.0.2-2
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:	N/A
Last Closed:	2018-06-27 13:43:23 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1762518	None	None	None	2018-04-17 19:45:27 UTC
OpenStack gerrit	563463	None	stable/queens: MERGED	puppet-tripleo: Changing opendaylight loadbalancing alogirthm (Idb4fe3803f69ab7440aaa2997cc4de46c9ac5458)	2018-04-24 21:16:13 UTC
Red Hat Product Errata	RHEA-2018:2086	None	None	None	2018-06-27 13:44:09 UTC

Description Tomas Jamrisko 2018-01-25 12:39:05 UTC

Description of problem:
By default, haproxy is configured to use the source algorithm for load balancing between ODL controllers. This might result in an uneven distribution of load between the ODL controllers, as most requests originate from a small number of neutron nodes, which can fall into the same bucket.


Expected results:
A different load-balancing algorithm might give more consistent results.

Comment 2 Mike Kolesnik 2018-02-05 14:44:36 UTC

Please check how severe this bug is; perhaps the impact on ODL is very low.

Comment 7 Nir Yechiel 2018-03-05 12:04:15 UTC

Stephen,

Can you please provide a better analysis of this issue, how it impacts our clustering solution, and what a solution could look like? The bug description is a little vague for me.

Thanks,
Nir

Comment 8 Stephen Kitt 2018-03-07 13:24:45 UTC

(In reply to Nir Yechiel from comment #7)
> Can you please provide a better analysis of this issue, how it impacts our
> clustering solution, and what a solution could look like? The bug description
> is a little vague for me.

Source-based load-balancing in HAProxy means that connections are load-balanced among the available servers on first connection, but that any given client will thereafter always end up on the same server.

This would make sense in a setup like those we typically see in ODL, with a static set of clients and servers, as long as all clients generate approximately the same load. If one client generates more load than the others, that extra load won’t be balanced, and one server will bear the brunt of it. However, we don’t know whether that’s the case, so really we’d need monitoring during scale tests or in customer setups to get a better idea.
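To illustrate the point above, here is a minimal Python sketch comparing the two strategies. It is not HAProxy's implementation: CRC32 is used only as a stand-in hash, and the client addresses and request counts are hypothetical.

```python
import zlib
from collections import Counter
from itertools import cycle

SERVERS = ["odl-0", "odl-1", "odl-2"]


def pick_by_source(client_ip, servers):
    # Stand-in for source hashing: CRC32 of the client IP modulo the
    # number of servers. HAProxy uses its own hash function.
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]


def distribute_by_source(requests, servers):
    # Every request from a given client lands on the same server.
    return Counter(pick_by_source(ip, servers) for ip in requests)


def distribute_round_robin(requests, servers):
    # Each request goes to the next server in turn, whoever sent it.
    turn = cycle(servers)
    return Counter(next(turn) for _ in requests)


# Hypothetical workload: three neutron nodes, one much busier than the rest.
requests = ["172.17.1.11"] * 600 + ["172.17.1.12"] * 200 + ["172.17.1.13"] * 100

print("source:    ", dict(distribute_by_source(requests, SERVERS)))
print("roundrobin:", dict(distribute_round_robin(requests, SERVERS)))
```

Whatever the hash values turn out to be, the busiest client's 600 requests all reach one server under the source-style mapping, while the round-robin counts come out equal.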

Comment 10 Tim Rozet 2018-03-07 18:50:07 UTC

I think I used balance source a long time ago when we were initially trying to get ODL HA to work, because of some issues that I think are probably long gone. Looking at the HAProxy guide again:

source      The source IP address is hashed and divided by the total weight
            of the running servers to designate which server will receive
            the request. This ensures that the same client IP address will
            always reach the same server as long as no server goes down or
            up. If the hash result changes due to the number of running
            servers changing, many clients will be directed to a different
            server. This algorithm is generally used in TCP mode where no
            cookie may be inserted. It may also be used on the Internet to
            provide a best-effort stickiness to clients which refuse session
            cookies. This algorithm is static by default, which means that
            changing a server's weight on the fly will have no effect, but
            this can be changed using "hash-type".

I think we would see an imbalance if an ODL went down and came back up. Maybe we should look at switching to round-robin:

roundrobin  Each server is used in turns, according to their weights. This
            is the smoothest and fairest algorithm when the server's
            processing time remains equally distributed. This algorithm is
            dynamic, which means that server weights may be adjusted on the
            fly for slow starts for instance. It is limited by design to
            4095 active servers per backend. Note that in some large farms,
            when a server becomes up after having been down for a very
            short time, it may sometimes take a few hundreds requests for
            it to be re-integrated into the farm and start receiving
            traffic. This is normal, though very rare. It is indicated here
            in case you would have the chance to observe it, so that you
            don't worry.

Either way we would need to test it in a scale/perf environment to see the impact of each one to determine which one is better.
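As a rough way to reason about the "server goes down or up" case quoted above, the following Python sketch remaps a set of clients when one server leaves and then rejoins. It assumes a plain hash-modulo mapping (CRC32 as a stand-in, not HAProxy's actual hashing), hypothetical client addresses, and it does not model connections that are already established.

```python
import zlib


def assign(clients, running):
    """Map each client IP to one of the running servers by hash modulo."""
    return {ip: running[zlib.crc32(ip.encode()) % len(running)] for ip in clients}


# Hypothetical client addresses.
clients = [f"172.17.1.{n}" for n in range(10, 20)]
all_up = ["odl-0", "odl-1", "odl-2"]
one_down = ["odl-0", "odl-2"]  # odl-1 is out of the running set

before = assign(clients, all_up)
during = assign(clients, one_down)
after = assign(clients, all_up)

# Clients whose target server changed while odl-1 was down.
moved = [ip for ip in clients if before[ip] != during[ip]]
print(f"{len(moved)} of {len(clients)} clients changed server while odl-1 was down")
print("mapping restored after recovery:", before == after)
```

Under this simplified model the mapping returns to its original state once the server is back. Long-lived connections opened during the outage stay where they are and are not captured here, so the sketch does not settle whether a lasting imbalance occurs; that still needs the scale test.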

Comment 11 Stephen Kitt 2018-04-04 14:57:44 UTC

OK, so to determine what needs to be done here, we ideally need some scale testing; as described by Sai:

We want to send a bunch of neutron resource create requests and see how
they are load balanced via haproxy to the ODL cluster. Then, change the
algorithm via the haproxy config manually and repeat. So, AFAIU, we need to
blast the setup with several requests and check the haproxy stats page to
see how they were load balanced...
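For reading the stats page programmatically, a small Python sketch like the one below could help. It assumes the statistics were saved in HAProxy's CSV export format, whose header line starts with `# pxname,svname,...` and includes an `stot` (cumulative sessions) column. The sample text is made up and trimmed to four columns, and the function name is only illustrative.

```python
import csv
import io


def sessions_per_server(stats_csv, proxy):
    """Return {server name: cumulative sessions} for one proxy section."""
    # The export's header line begins with "# "; drop it so DictReader
    # sees plain column names.
    text = stats_csv.lstrip("# ")
    rows = csv.DictReader(io.StringIO(text))
    return {
        row["svname"]: int(row["stot"])
        for row in rows
        # Skip the aggregate FRONTEND/BACKEND rows, keep per-server rows.
        if row["pxname"] == proxy and row["svname"] not in ("FRONTEND", "BACKEND")
    }


# Made-up sample, trimmed to a few columns for readability.
SAMPLE = """# pxname,svname,scur,stot
opendaylight,FRONTEND,0,30
opendaylight,odl-0,0,10
opendaylight,odl-1,0,10
opendaylight,odl-2,0,10
opendaylight,BACKEND,0,30
"""

print(sessions_per_server(SAMPLE, "opendaylight"))
```

Running it before and after changing the algorithm gives two per-server dictionaries that can be compared directly.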

Comment 12 Mike Kolesnik 2018-04-08 11:48:13 UTC

Additionally, I suggest that ODL be taken down and brought back up as per comment #10 during this scale run, to check the effect of such behavior on the load balancing.

Comment 13 Sai Sindhur Malleni 2018-04-17 19:44:54 UTC


I ran scale tests with balance source and with roundrobin to see how each affects the backend connections to opendaylight and opendaylight_ws. With balance source, the connections were balanced across only 2 of the 3 ODLs (in a ratio of roughly 2:1), whereas roundrobin distributed them evenly across all 3 ODLs.

In case of balance source,

For opendaylight
1984 frontend connections
odl-0 11918
odl-1 6523
odl-2 0

For opendaylight_ws
3 frontend connections
odl-0 2
odl-1 1
odl-2 0

In the case of round robin,

For opendaylight
1995 frontend connections
odl-0 5563
odl-1 5563
odl-2 5563

For opendaylight_ws
3 frontend connections
odl-0 1
odl-1 1
odl-2 1


Based on this we should move the default haproxy configuration for both opendaylight and opendaylight_ws to use roundrobin.
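The per-server shares behind the figures above can be worked out with a short Python sketch. The counts are copied from this comment for the opendaylight listener; the helper name is only illustrative.

```python
# Backend counts for the opendaylight listener, as reported in this comment.
reported = {
    "balance source": {"odl-0": 11918, "odl-1": 6523, "odl-2": 0},
    "roundrobin": {"odl-0": 5563, "odl-1": 5563, "odl-2": 5563},
}


def shares(counts):
    """Convert absolute per-server counts into fractions of the total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for algorithm, counts in reported.items():
    line = ", ".join(f"{name} {share:.1%}" for name, share in shares(counts).items())
    print(f"{algorithm}: {line}")
```

With balance source, odl-0 carries a little under two thirds of the total and odl-2 none, which matches the approximate 2:1 ratio described above.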

Comment 19 Mike Kolesnik 2018-04-29 08:03:44 UTC

Tomas, can you please verify this?

Comment 20 Tomas Jamrisko 2018-05-03 08:22:38 UTC

From /etc/haproxy/haproxy.cfg on the controllers:

listen opendaylight
  bind 172.17.1.17:8081 transparent
  bind 192.168.24.9:8081 transparent
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  option httpchk
  option httplog
  server controller-0.internalapi.localdomain 172.17.1.18:8081 check fall 5 inter 2000 rise 2

listen opendaylight_ws
  bind 172.17.1.17:8185 transparent
  bind 192.168.24.9:8185 transparent
  mode http
  timeout connect 5s
  timeout client 25s
  timeout server 25s
  timeout tunnel 3600s
  server controller-0.internalapi.localdomain 172.17.1.18:8185 check fall 5 inter 2000 rise 2

Verified: there is no balance option in either section, and the default is roundrobin.
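The same check could be scripted. The Python sketch below is a simplified reading of haproxy.cfg, not a full parser: it only recognizes a handful of section keywords, ignores any balance setting inherited from a defaults section, and treats a missing balance line as roundrobin. The sample text is abbreviated from the excerpt above.

```python
SECTION_KEYWORDS = {"global", "defaults", "listen", "frontend", "backend"}


def balance_by_section(cfg_text):
    """Return {listen section name: balance algorithm}."""
    result = {}
    current = None
    for raw in cfg_text.splitlines():
        words = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if not words:
            continue
        if words[0] in SECTION_KEYWORDS:
            # A new section starts; only named listen sections are tracked.
            current = words[1] if words[0] == "listen" and len(words) > 1 else None
            if current is not None:
                result[current] = "roundrobin"  # assumed when no balance line
        elif current is not None and words[0] == "balance" and len(words) > 1:
            result[current] = words[1]
    return result


CFG = """\
listen opendaylight
  bind 172.17.1.17:8081 transparent
  mode http
  server controller-0.internalapi.localdomain 172.17.1.18:8081 check fall 5 inter 2000 rise 2

listen opendaylight_ws
  bind 172.17.1.17:8185 transparent
  mode http
  server controller-0.internalapi.localdomain 172.17.1.18:8185 check fall 5 inter 2000 rise 2
"""

print(balance_by_section(CFG))
```

A section that still carried `balance source` would be reported with that value instead.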

Comment 22 errata-xmlrpc 2018-06-27 13:43:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086
