Bug 1538611
| Summary: | [Deployment] Changing opendaylight load balancing algorithm to roundrobin | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Tomas Jamrisko <tjamrisk> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Tim Rozet <trozet> |
| Status: | CLOSED ERRATA | QA Contact: | Tomas Jamrisko <tjamrisk> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 12.0 (Pike) | CC: | aadam, apevec, jschluet, lhh, mburns, michele, mkolesni, nyechiel, oblaut, rhel-osp-director-maint, sgaddam, skitt, srevivo, therve, tjamrisk, trozet |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | 13.0 (Queens) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | odl_deployment | | |
| Fixed In Version: | openstack-tripleo-heat-templates-8.0.2-2 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | N/A |
| Last Closed: | 2018-06-27 13:43:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Tomas Jamrisko 2018-01-25 12:39:05 UTC
Please check how severe this bug is; perhaps the impact on ODL is very low.

Stephen, can you please provide a better analysis of this issue, how it impacts our clustering solution, and what a solution could look like? The bug description is a little vague for me.

Thanks,
Nir

(In reply to Nir Yechiel from comment #7)
> Can you please provide a better analysis of this issue, how it impacts our
> clustering solution, and what a solution could look like? The bug description
> is a little vague for me.

Source-based load balancing in HAProxy means that connections are load-balanced among the available servers on first connection, but that any given client will thereafter always end up on the same server. This makes sense in a setup like those we typically see with ODL, with a static set of clients and servers, as long as all clients generate approximately the same load. If one client generates more load than the others, that extra load won't be balanced, and one server will bear the brunt of it. However, we don't know whether that is the case here, so we would really need monitoring during scale tests or in customer setups to get a better idea.

I think I used "balance source" a long time ago, when we were initially trying to get ODL HA to work, because of some issues that are probably long gone. Looking at the HAProxy guide again:
    source      The source IP address is hashed and divided by the total
                weight of the running servers to designate which server
                will receive the request. This ensures that the same
                client IP address will always reach the same server as
                long as no server goes down or up. If the hash result
                changes due to the number of running servers changing,
                many clients will be directed to a different server. This
                algorithm is generally used in TCP mode where no cookie
                may be inserted. It may also be used on the Internet to
                provide a best-effort stickiness to clients which refuse
                session cookies. This algorithm is static by default,
                which means that changing a server's weight on the fly
                will have no effect, but this can be changed using
                "hash-type".
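To make the sticky behavior concrete, here is a minimal Python sketch of the idea (an illustration only, not HAProxy's actual implementation; the CRC32 hash, equal server weights, and the odl-* server names are assumptions):

```python
# Illustrative source hashing: each client IP is hashed, and the hash
# modulo the number of running servers picks the backend. The mapping
# is stable until the set of running servers changes.
import zlib

def pick_server(client_ip, servers):
    # Hash the source address; equal weights assumed, so we reduce
    # modulo the server count rather than the total weight.
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]

servers = ["odl-0", "odl-1", "odl-2"]
clients = ["172.17.1.20", "172.17.1.21", "172.17.1.22", "172.17.1.23"]

# Every client is pinned to one server across connections...
print({ip: pick_server(ip, servers) for ip in clients})

# ...but if one server drops out, the modulus changes and many clients
# are remapped at once; when it comes back, they are rehashed again,
# not rebalanced by load.
print({ip: pick_server(ip, servers[:2]) for ip in clients})
```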
I think we would see an imbalance if an ODL went down and came back up. Maybe we should look at switching to round-robin:
    roundrobin  Each server is used in turns, according to their weights.
                This is the smoothest and fairest algorithm when the
                server's processing time remains equally distributed.
                This algorithm is dynamic, which means that server
                weights may be adjusted on the fly for slow starts for
                instance. It is limited by design to 4095 active servers
                per backend. Note that in some large farms, when a server
                becomes up after having been down for a very short time,
                it may sometimes take a few hundred requests for it to be
                re-integrated into the farm and start receiving traffic.
                This is normal, though very rare. It is indicated here in
                case you would have the chance to observe it, so that you
                don't worry.
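Round-robin, by contrast, ignores the client entirely and rotates across the backends per connection; a minimal sketch with the same hypothetical server names:

```python
# Illustrative round-robin: connections are dealt out in turn, so the
# load evens out regardless of which clients they come from.
from itertools import cycle

backends = cycle(["odl-0", "odl-1", "odl-2"])

for conn in range(6):
    print(f"connection {conn} -> {next(backends)}")
# connection 0 -> odl-0, 1 -> odl-1, 2 -> odl-2, 3 -> odl-0, ...
```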
Either way, we would need to test both algorithms in a scale/perf environment to see the impact of each and determine which is better.
OK, so to determine what needs to be done here, we ideally need some scale testing, as described by Sai: we want to send a bunch of neutron resource create requests and see how they are load balanced via HAProxy to the ODL cluster, then change the algorithm manually in the HAProxy config and repeat. So, AFAIU, we need to blast the setup with several requests and check the HAProxy stats page to see how they were load balanced. Additionally, I suggest that an ODL be taken down and brought back up during this scale run, as per comment #10, to check the effect of such behavior on the load balancing.

I ran scale tests with both balance source and roundrobin to see how each affects the backend connections to opendaylight and opendaylight_ws. Using balance source, connections were balanced to only 2 of the 3 ODLs (in roughly a 2:1 ratio), whereas round-robin ensured fair load balancing across all 3 ODLs.

With balance source:

| Backend | Frontend connections | odl-0 | odl-1 | odl-2 |
|---|---|---|---|---|
| opendaylight | 1984 | 11918 | 6523 | 0 |
| opendaylight_ws | 3 | 2 | 1 | 0 |

With roundrobin:

| Backend | Frontend connections | odl-0 | odl-1 | odl-2 |
|---|---|---|---|---|
| opendaylight | 1995 | 5563 | 5563 | 5563 |
| opendaylight_ws | 3 | 1 | 1 | 1 |

Based on this, we should move the default HAProxy configuration for both opendaylight and opendaylight_ws to roundrobin.

Tomas, can you please verify this?

From /etc/haproxy/haproxy.cfg on the controllers:
    listen opendaylight
      bind 172.17.1.17:8081 transparent
      bind 192.168.24.9:8081 transparent
      mode http
      http-request set-header X-Forwarded-Proto https if { ssl_fc }
      http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
      option httpchk
      option httplog
      server controller-0.internalapi.localdomain 172.17.1.18:8081 check fall 5 inter 2000 rise 2

    listen opendaylight_ws
      bind 172.17.1.17:8185 transparent
      bind 192.168.24.9:8185 transparent
      mode http
      timeout connect 5s
      timeout client 25s
      timeout server 25s
      timeout tunnel 3600s
      server controller-0.internalapi.localdomain 172.17.1.18:8185 check fall 5 inter 2000 rise 2
Verified: there is no balance option in either section, and HAProxy's default algorithm is roundrobin.
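For reference, the per-backend distribution reported above can also be checked by reading the per-server session counters from HAProxy's admin socket rather than the stats page; a sketch, assuming the socket is exposed at /var/lib/haproxy/stats (the path, and whether the stats socket is enabled at all, vary by deployment):

```python
# Dump total session counts ("stot") per server for the two
# opendaylight backends from HAProxy's CSV "show stat" output.
import csv
import socket

STATS_SOCKET = "/var/lib/haproxy/stats"  # assumption: deployment-specific

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(STATS_SOCKET)
sock.sendall(b"show stat\n")

data = b""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    data += chunk
sock.close()

# The first line of the response is a CSV header prefixed with "# ".
lines = data.decode().lstrip("# ").splitlines()
for row in csv.DictReader(lines):
    if row["pxname"] in ("opendaylight", "opendaylight_ws") \
            and row["svname"] not in ("FRONTEND", "BACKEND"):
        print(row["pxname"], row["svname"], row["stot"])
```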
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086