Bug 1538611

Summary: [Deployment] Changing opendaylight load balancing algorithm to roundrobin
Product: Red Hat OpenStack
Reporter: Tomas Jamrisko <tjamrisk>
Component: openstack-tripleo-heat-templates
Assignee: Tim Rozet <trozet>
Status: CLOSED ERRATA
QA Contact: Tomas Jamrisko <tjamrisk>
Severity: medium
Docs Contact:
Priority: high
Version: 12.0 (Pike)
CC: aadam, apevec, jschluet, lhh, mburns, michele, mkolesni, nyechiel, oblaut, rhel-osp-director-maint, sgaddam, skitt, srevivo, therve, tjamrisk, trozet
Target Milestone: beta
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard: odl_deployment
Fixed In Version: openstack-tripleo-heat-templates-8.0.2-2
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment: N/A
Last Closed: 2018-06-27 13:43:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Tomas Jamrisko 2018-01-25 12:39:05 UTC
Description of problem:
HAProxy is configured by default to use the source algorithm for load balancing between ODL controllers. This can result in an uneven distribution of load between ODL controllers, because most requests originate from a small number of neutron nodes, which can fall into the same bucket.


Expected results:
A different load-balancing algorithm might give more consistent results.

Comment 2 Mike Kolesnik 2018-02-05 14:44:36 UTC
Please check how severe this bug is; perhaps the impact on ODL is very low.

Comment 7 Nir Yechiel 2018-03-05 12:04:15 UTC
Stephen,

Can you please provide a better analysis of this issue, how it impacts our clustering solution, and what a solution could look like? The bug description is a little vague for me.

Thanks,
Nir

Comment 8 Stephen Kitt 2018-03-07 13:24:45 UTC
(In reply to Nir Yechiel from comment #7)
> Can you please provide a better analysis of this issue, how it impacts our
> clustering solution, and what a solution could look like? The bug description
> is a little vague for me.

Source-based load-balancing in HAProxy means that connections are load-balanced among the available servers on first connection, but that any given client will thereafter always end up on the same server.

This would make sense in a setup like those we typically see in ODL, with a static set of clients and servers, as long as all clients generate approximately the same load. If one client generates more load than the others, that extra load won’t be balanced, and one server will bear the brunt of it. However, we don’t know whether that’s the case, so we’d really need monitoring during scale tests or in customer setups to get a better idea.

Comment 10 Tim Rozet 2018-03-07 18:50:07 UTC
I think I used balance source a long time ago, when we were initially trying to get ODL HA to work, because of some issues that I think are probably long gone. Looking at the HAProxy guide again:



      source      The source IP address is hashed and divided by the total
                  weight of the running servers to designate which server will
                  receive the request. This ensures that the same client IP
                  address will always reach the same server as long as no
                  server goes down or up. If the hash result changes due to the
                  number of running servers changing, many clients will be
                  directed to a different server. This algorithm is generally
                  used in TCP mode where no cookie may be inserted. It may also
                  be used on the Internet to provide a best-effort stickiness
                  to clients which refuse session cookies. This algorithm is
                  static by default, which means that changing a server's
                  weight on the fly will have no effect, but this can be
                  changed using "hash-type".
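
To illustrate the behaviour the guide describes, here is a minimal Python sketch of source-based selection. It rests on assumptions: zlib.crc32 is a stand-in hash (not HAProxy's actual hash function), every server has weight 1, and the client addresses are hypothetical neutron node IPs. The point is only that each client always lands on the same server, so a handful of clients cannot spread evenly.

```python
import ipaddress
import zlib
from collections import Counter


def pick_server_by_source(client_ip: str, servers: list[str]) -> str:
    """Map a client IP to a server by hashing the address.

    zlib.crc32 is a stand-in chosen for illustration; it is NOT the hash
    HAProxy uses. All servers are assumed to have weight 1.
    """
    packed = ipaddress.ip_address(client_ip).packed
    return servers[zlib.crc32(packed) % len(servers)]


def distribute(requests: list[str], servers: list[str]) -> Counter:
    """Count how many requests each server receives."""
    counts = Counter({server: 0 for server in servers})
    for client_ip in requests:
        counts[pick_server_by_source(client_ip, servers)] += 1
    return counts


if __name__ == "__main__":
    servers = ["odl-0", "odl-1", "odl-2"]
    # Hypothetical client addresses; the first one sends most of the requests
    requests = (["172.17.1.21"] * 600
                + ["172.17.1.22"] * 300
                + ["172.17.1.23"] * 100)
    print(distribute(requests, servers))
```

With only three clients, at most three buckets are ever used, and the busiest client's whole load stays on a single server.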

I think we would see an imbalance if an ODL went down and came back up.  Maybe we should look at switching to round-robin:

      roundrobin  Each server is used in turns, according to their weights.
                  This is the smoothest and fairest algorithm when the server's
                  processing time remains equally distributed. This algorithm
                  is dynamic, which means that server weights may be adjusted
                  on the fly for slow starts for instance. It is limited by
                  design to 4095 active servers per backend. Note that in some
                  large farms, when a server becomes up after having been down
                  for a very short time, it may sometimes take a few hundreds
                  requests for it to be re-integrated into the farm and start
                  receiving traffic. This is normal, though very rare. It is
                  indicated here in case you would have the chance to observe
                  it, so that you don't worry.
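
For comparison, a rough Python sketch of round-robin selection under simplifying assumptions: equal weights and all servers always up. HAProxy's real scheduler also honours weights and server state, which this does not model.

```python
from collections import Counter
from itertools import cycle


def round_robin(requests: list[str], servers: list[str]) -> Counter:
    """Assign each request to the next server in turn, ignoring its source.

    Simplification: equal weights, no health checks, no server state.
    """
    turn = cycle(servers)
    counts = Counter({server: 0 for server in servers})
    for _ in requests:
        counts[next(turn)] += 1
    return counts


if __name__ == "__main__":
    servers = ["odl-0", "odl-1", "odl-2"]
    # Hypothetical client addresses; the source of a request plays no role here
    requests = ["172.17.1.21"] * 600 + ["172.17.1.22"] * 300
    print(round_robin(requests, servers))
```

Because the client address is never consulted, even a single very busy client is spread across all servers.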

Either way we would need to test it in a scale/perf environment to see the impact of each one to determine which one is better.

Comment 11 Stephen Kitt 2018-04-04 14:57:44 UTC
OK, so to determine what needs to be done here, we ideally need some scale testing, as described by Sai:

We want to send a bunch of neutron resource create requests and see how they are load balanced via haproxy to the ODL cluster. Then, change the algorithm via the haproxy config manually and repeat. So, AFAIU, we need to blast the setup with several requests and check the haproxy stats page to see how they were load balanced...
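
One possible way to read the per-server numbers without eyeballing the stats page is to parse HAProxy's CSV statistics export. The Python sketch below assumes the export has already been saved as text. The sample is made up and heavily trimmed (real exports carry many more columns), and the server names are placeholders.

```python
import csv
import io


def sessions_per_server(stats_csv: str, proxy: str) -> dict[str, int]:
    """Return {server_name: cumulative sessions (stot)} for one proxy."""
    # The header row of the CSV export begins with "# "; drop that prefix
    text = stats_csv.lstrip("# ")
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["pxname"] != proxy:
            continue
        # FRONTEND and BACKEND rows are aggregates, not individual servers
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue
        totals[row["svname"]] = int(row["stot"] or 0)
    return totals


# Made-up, trimmed sample for illustration only; not measured data
SAMPLE = """# pxname,svname,scur,stot
opendaylight,FRONTEND,0,30
opendaylight,odl-0,0,10
opendaylight,odl-1,0,10
opendaylight,odl-2,0,10
opendaylight,BACKEND,0,30
"""

if __name__ == "__main__":
    print(sessions_per_server(SAMPLE, "opendaylight"))
```

Running this once per algorithm, before and after the manual config change, would give two dictionaries that can be compared directly.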

Comment 12 Mike Kolesnik 2018-04-08 11:48:13 UTC
Additionally, I suggest that ODL be taken down and brought back up, as per comment #10, during this scale run, to check the effect of such behavior on the load balancing.

Comment 13 Sai Sindhur Malleni 2018-04-17 19:44:54 UTC

I ran scale tests with balance source and also with roundrobin to see how each affects the backend connections to opendaylight and opendaylight_ws. With balance source, connections were load balanced to only 2 of the 3 ODLs (in the ratio 2:1). With round robin, load was balanced fairly across all 3 ODLs.

In the case of balance source,

For opendaylight
1984 frontend connections
odl-0 11918
odl-1 6523
odl-2 0

For opendaylight_ws
3 frontend connections
odl-0 2
odl-1 1
odl-2 0

In the case of round robin,

For opendaylight
1995 frontend connections
odl-0 5563
odl-1 5563
odl-2 5563

For opendaylight_ws
3 frontend connections
odl-0 1
odl-1 1
odl-2 1


Based on this we should move the default haproxy configuration for both opendaylight and opendaylight_ws to use roundrobin.
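
As a quick check of the opendaylight numbers reported in this comment, the per-controller shares can be computed with a few lines of Python. The helper name `shares` is just an illustration; the counts are copied from the results above.

```python
def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each server's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# opendaylight backend counts as reported in this comment
balance_source = {"odl-0": 11918, "odl-1": 6523, "odl-2": 0}
roundrobin = {"odl-0": 5563, "odl-1": 5563, "odl-2": 5563}

if __name__ == "__main__":
    print(shares(balance_source))  # {'odl-0': 64.6, 'odl-1': 35.4, 'odl-2': 0.0}
    print(shares(roundrobin))      # {'odl-0': 33.3, 'odl-1': 33.3, 'odl-2': 33.3}
```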

Comment 19 Mike Kolesnik 2018-04-29 08:03:44 UTC
Tomas, can you please verify this?

Comment 20 Tomas Jamrisko 2018-05-03 08:22:38 UTC
From /etc/haproxy/haproxy.cfg on the controllers:

listen opendaylight
  bind 172.17.1.17:8081 transparent
  bind 192.168.24.9:8081 transparent
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  option httpchk
  option httplog
  server controller-0.internalapi.localdomain 172.17.1.18:8081 check fall 5 inter 2000 rise 2

listen opendaylight_ws
  bind 172.17.1.17:8185 transparent
  bind 192.168.24.9:8185 transparent
  mode http
  timeout connect 5s
  timeout client 25s
  timeout server 25s
  timeout tunnel 3600s
  server controller-0.internalapi.localdomain 172.17.1.18:8185 check fall 5 inter 2000 rise 2

Verified, as neither section has a balance option and the HAProxy default is roundrobin.
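
The same check could be scripted. The Python sketch below reports any explicit balance line per listen section. It is a simplified reading of the config format, not a full parser: it recognises only a handful of section keywords and does not account for a balance line inherited from a defaults section. The embedded config is a shortened excerpt of the one quoted above.

```python
from typing import Optional

# Section keywords this sketch recognises; the list is deliberately incomplete
SECTION_KEYWORDS = {"global", "defaults", "frontend", "backend", "listen"}


def explicit_balance(cfg_text: str) -> dict[str, Optional[str]]:
    """Return {listen section name: balance algorithm, or None if not set}."""
    result: dict[str, Optional[str]] = {}
    current = None
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        words = line.split()
        if words[0] in SECTION_KEYWORDS:
            # Track only listen sections; any other section ends the current one
            current = words[1] if words[0] == "listen" and len(words) > 1 else None
            if current is not None:
                result[current] = None
        elif words[0] == "balance" and current is not None and len(words) > 1:
            result[current] = words[1]
    return result


# Shortened excerpt of the haproxy.cfg quoted in this comment
CFG = """
listen opendaylight
  bind 172.17.1.17:8081 transparent
  mode http
  option httpchk
  server controller-0.internalapi.localdomain 172.17.1.18:8081 check fall 5 inter 2000 rise 2

listen opendaylight_ws
  bind 172.17.1.17:8185 transparent
  mode http
  timeout tunnel 3600s
  server controller-0.internalapi.localdomain 172.17.1.18:8185 check fall 5 inter 2000 rise 2
"""

if __name__ == "__main__":
    print(explicit_balance(CFG))
```

A value of None for both sections means no balance line was found, which matches the manual verification.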

Comment 22 errata-xmlrpc 2018-06-27 13:43:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086