Bug 1389413
| Summary: | MySQL / Galera HAProxy settings have incorrect check settings | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Michael Bayer <mbayer> |
| Component: | puppet-tripleo | Assignee: | Chris Jones <chjones> |
| Status: | CLOSED ERRATA | QA Contact: | Arik Chernetsky <achernet> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 10.0 (Newton) | CC: | bperkins, fdinitto, jjoyce, jschluet, mbayer, michele, mkrcmari, mlopes, rohara, royoung, slinaber, srevivo, tvignaud, ushkalim, vaggarwa, vcojot |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | puppet-tripleo-5.4.0-2.el7ost | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, HAProxy health checking of MySQL resulted in a long timeout (16 seconds) before a failed node was removed from service. Consequently, OpenStack services connected to a failed MySQL node could return API errors to users, operators, and tools. With this update, the check interval settings have been reduced so that failed MySQL nodes are dropped within 6 seconds of failure. As a result, OpenStack services should fail over to working MySQL nodes much faster and produce fewer API errors for their consumers. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-14 16:26:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
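To make the Doc Text timings concrete, here is a rough back-of-the-envelope model; the roughly one second of overhead per check is an assumption taken from the measurements discussed in the comments below:

$$t_{\text{detect}} \approx \text{fall} \times (\text{inter} + \text{overhead per check})$$

Old settings: $5 \times (2\,\text{s} + 1\,\text{s}) = 15\,\text{s}$, close to the ~16 s observed. New settings: $3 \times (1\,\text{s} + 1\,\text{s}) = 6\,\text{s}$, matching the observed detection time.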
Description
Michael Bayer 2016-10-27 14:26:16 UTC

*** Bug 1381485 has been marked as a duplicate of this bug. ***

Filed https://bugs.launchpad.net/tripleo/+bug/1639189 for upstream purposes, and pushed https://review.openstack.org/#/c/393673/ for review.

This has now merged upstream on master and stable/newton. The stable/newton backport is https://review.openstack.org/#/c/396092/2.

Just to make expectations clear: haproxy.cfg is generated correctly with the following parameters:

```
listen mysql
  bind 172.17.1.18:3306 transparent
  option tcpka
  option httpchk
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server controller-0.internalapi.localdomain 172.17.1.13:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-1.internalapi.localdomain 172.17.1.10:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-2.internalapi.localdomain 172.17.1.25:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
```

This configuration causes HAProxy to mark a failed Galera node as down in 6 seconds: the time for 3 consecutive unsuccessful health checks (3 being the default value of the `fall` parameter when it is omitted) at a 1s interval, plus some overhead. Previously a node was marked as down in ~16s: 5 unsuccessful health checks at a 2s interval, plus some overhead (I guess). I want to make sure this is expected behaviour, since the expected results mention a 1s maximum delay.

OK, rohara knows better than me. If it's in fact 6 seconds total, that still seems a little high: 1s interval + 1s overhead per check? Or is 6 seconds approximated?

(In reply to Michael Bayer from comment #10)
> OK, rohara knows better than me. if it's in fact 6 seconds total that still
> seems a little high - 1s interval + 1s overhead per check? or is 6 seconds
> approximated ?

With the new settings (backup check inter 1s on-marked-down shutdown-sessions port 9200) I consistently see: time(node marked as down by haproxy) - time(node down) = 6s. The overhead was just my assumption, perhaps because option httpchk and option tcpka are used? No idea. But it is about 1s per check, so for example 3 fall checks at a 1-second interval take 6s for HAProxy to mark the node as down, and 5 checks at a 2-second interval take about 15-16 seconds.

I am switching this to Verified. The haproxy.cfg is generated as described in the description of the bug. The actual time for a node to be marked as down is 6 seconds, which is 10 seconds less than in the previous release. It's not 1s, but based on the discussion we want to keep 3 consecutive unsuccessful checks at a 1s interval, which comes to 5-6s in practice.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
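For reference, a minimal sketch of the server-line tuning involved, in HAProxy config syntax. This is an illustration, not the exact puppet-tripleo output: the explicit `fall` keyword is spelled out here even though the generated configuration above relies on HAProxy's default of `fall 3`, and the "old" line is reconstructed from the timings reported in the comments (5 checks at a 2s interval):

```
# Illustrative excerpt only; detection time is roughly
# fall * (inter + per-check overhead), with ~1s overhead per check
# observed in this bug.

listen mysql
    bind 172.17.1.18:3306 transparent
    option httpchk

    # Old behaviour (reconstructed): 5 failed checks at a 2s interval
    # before the node is marked down -> ~16s observed
    #server controller-0.internalapi.localdomain 172.17.1.13:3306 backup check inter 2s fall 5 port 9200

    # New behaviour: 3 failed checks (HAProxy's default "fall 3") at a
    # 1s interval -> ~6s observed
    server controller-0.internalapi.localdomain 172.17.1.13:3306 backup check inter 1s fall 3 on-marked-down shutdown-sessions port 9200
```

Lowering `inter` further, or setting `fall` below 3, would trade faster detection for a higher risk of flapping on transient check failures; per the discussion above, 3 consecutive failed checks at a 1s interval was kept as the balance.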