Description of problem:

haproxy is by default configured to use the "source" algorithm for load balancing between the ODL controllers. This can result in an uneven distribution of load across the controllers, because most requests originate from a small number of neutron nodes, which may all hash into the same bucket.

Expected results:

A different load-balancing algorithm should give more consistent results.
Please check how severe this bug is; perhaps the impact on ODL is very low.
Stephen, can you please provide a better analysis of this issue, how it impacts our clustering solution, and what a solution could look like? The bug description is a little vague for me. Thanks, Nir
(In reply to Nir Yechiel from comment #7)
> Can you please provide a better analysis of this issue, how it impacts our
> clustering solution, and what a solution could look like? The bug
> description is a little vague for me.

Source-based load balancing in HAProxy means that connections are balanced among the available servers on first connection, but any given client will thereafter always end up on the same server. This makes sense in setups like those we typically see with ODL, with a static set of clients and servers, as long as all clients generate approximately the same load. If one client generates more load than the others, that extra load won't be balanced, and one server will bear the brunt of it. However, we don't know whether that is the case here, so we would need monitoring during scale tests or in customer setups to get a better idea.
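For reference, a minimal sketch of what a source-balanced listener looks like (addresses and host names here are placeholders, not taken from an actual deployment):

    listen opendaylight
        bind <vip>:8081 transparent
        mode http
        balance source
        # each client IP hashes to one backend and sticks to it
        server controller-0.internalapi.localdomain 172.17.1.x:8081 check
        server controller-1.internalapi.localdomain 172.17.1.y:8081 check
        server controller-2.internalapi.localdomain 172.17.1.z:8081 check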
I think I used balance source a long time ago, when we were initially trying to get ODL HA to work, because of some issues that are probably long gone by now. Looking at the HAProxy guide again:

    source
        The source IP address is hashed and divided by the total weight of the
        running servers to designate which server will receive the request.
        This ensures that the same client IP address will always reach the same
        server as long as no server goes down or up. If the hash result changes
        due to the number of running servers changing, many clients will be
        directed to a different server. This algorithm is generally used in TCP
        mode where no cookie may be inserted. It may also be used on the
        Internet to provide a best-effort stickiness to clients which refuse
        session cookies. This algorithm is static by default, which means that
        changing a server's weight on the fly will have no effect, but this can
        be changed using "hash-type".

I think we would see an imbalance if an ODL went down and came back up. Maybe we should look at switching to round-robin:

    roundrobin
        Each server is used in turns, according to their weights. This is the
        smoothest and fairest algorithm when the server's processing time
        remains equally distributed. This algorithm is dynamic, which means
        that server weights may be adjusted on the fly for slow starts for
        instance. It is limited by design to 4095 active servers per backend.
        Note that in some large farms, when a server becomes up after having
        been down for a very short time, it may sometimes take a few hundred
        requests for it to be re-integrated into the farm and start receiving
        traffic. This is normal, though very rare. It is indicated here in case
        you would have the chance to observe it, so that you don't worry.

Either way, we would need to test both in a scale/perf environment to see the impact of each and determine which one is better.
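To make the two options concrete, here is a minimal sketch of the directives under discussion (the surrounding listen section is elided; hash-type is the knob mentioned in the guide excerpt above):

    listen opendaylight
        ...
        # keep per-client stickiness:
        balance source
        hash-type consistent    # limits reshuffling when a server goes down and comes back
        # or rotate requests across all servers:
        # balance roundrobin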
OK, so to determine what needs to be done here, we ideally need some scale testing, as described by Sai: we want to send a bunch of neutron resource create requests and see how they are load balanced via haproxy to the ODL cluster, then change the algorithm in the haproxy config manually and repeat. So, AFAIU, we need to blast the setup with a large number of requests and check the haproxy stats page to see how they were load balanced.
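For the "check the haproxy stats page" part, assuming the stats page is not already exposed in the deployed haproxy.cfg, a small stats listener like the following (bind address is a placeholder) makes the per-server cumulative session counts visible during the run:

    listen haproxy_stats
        bind 127.0.0.1:1993
        mode http
        stats enable
        stats uri /
        stats refresh 10s    # auto-refresh while the scale test is running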
Additionally, I suggest that an ODL be taken down and brought back up during this scale run, as per comment #10, to check the effect of such behavior on the load balancing.
I ran scale tests with balance source and also with roundrobin to see how each affects the backend connections for opendaylight and opendaylight_ws. With balance source, connections were load balanced to only 2 of the 3 ODLs (in a roughly 2:1 ratio). Using roundrobin ensured fair load balancing between the 3 ODLs.

With balance source:

    opendaylight (1984 frontend connections):
        odl-0: 11918
        odl-1: 6523
        odl-2: 0

    opendaylight_ws (3 frontend connections):
        odl-0: 2
        odl-1: 1
        odl-2: 0

With roundrobin:

    opendaylight (1995 frontend connections):
        odl-0: 5563
        odl-1: 5563
        odl-2: 5563

    opendaylight_ws (3 frontend connections):
        odl-0: 1
        odl-1: 1
        odl-2: 1

Based on this, we should move the default haproxy configuration for both opendaylight and opendaylight_ws to roundrobin.
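A sketch of what that default would look like (bind and server lines elided; the real config is generated by the deployment templates, so this is illustrative only):

    listen opendaylight
        ...
        balance roundrobin

    listen opendaylight_ws
        ...
        balance roundrobin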
Tomas, can you please verify this?
From /etc/haproxy/haproxy.cfg on the controllers:

    listen opendaylight
        bind 172.17.1.17:8081 transparent
        bind 192.168.24.9:8081 transparent
        mode http
        http-request set-header X-Forwarded-Proto https if { ssl_fc }
        http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
        option httpchk
        option httplog
        server controller-0.internalapi.localdomain 172.17.1.18:8081 check fall 5 inter 2000 rise 2

    listen opendaylight_ws
        bind 172.17.1.17:8185 transparent
        bind 192.168.24.9:8185 transparent
        mode http
        timeout connect 5s
        timeout client 25s
        timeout server 25s
        timeout tunnel 3600s
        server controller-0.internalapi.localdomain 172.17.1.18:8185 check fall 5 inter 2000 rise 2

Verified: there is no balance option, and the default is roundrobin.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086