| Summary: | "stick" setting in HAProxy fails for highly concurrent MySQL connections | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Michael Bayer <mbayer> |
| Component: | openstack-foreman-installer | Assignee: | Jason Guiditta <jguiditt> |
| Status: | CLOSED ERRATA | QA Contact: | Leonid Natapov <lnatapov> |
| Severity: | medium | Priority: | high |
| Version: | 6.0 (Juno) | CC: | bperkins, cwolfe, fdinitto, mburns, morazi, ohochman, rhos-maint, rohara, sasha, sputhenp, yeylon |
| Target Milestone: | z4 | Keywords: | ZStream |
| Target Release: | Installer | Hardware: | Unspecified |
| OS: | Unspecified | Type: | Bug |
| Fixed In Version: | openstack-foreman-installer-3.0.25-1.el7ost | Doc Type: | Bug Fix |
| Last Closed: | 2015-08-24 15:18:29 UTC | | |
Description
Michael Bayer
2015-04-14 21:07:51 UTC
Created attachment 1014506 [details]
show hosts script
Created attachment 1014507 [details]
haproxy config
Note that the script also has a "delay" setting, which makes it space out connects and reconnects by N seconds. When this setting is greater than or equal to 0.01 seconds, the issue generally goes away and the stick table seems to always take effect:

[mbayer@thinkpad hammer]$ .venv/bin/python show_hosts.py -u root -H rhel7-1 -P 3456 -p root -n 10 -d0.01
1429045809.14 Effective host <unknown> modulus 0 Selected new host rhel7-2
1429045809.15 Effective host <unknown> modulus 1 Selected new host rhel7-2
1429045809.16 Effective host <unknown> modulus 2 Selected new host rhel7-2
1429045809.17 Effective host <unknown> modulus 3 Selected new host rhel7-2
1429045809.18 Effective host <unknown> modulus 4 Selected new host rhel7-2
1429045809.19 Effective host <unknown> modulus 5 Selected new host rhel7-2
1429045809.2 Effective host <unknown> modulus 6 Selected new host rhel7-2
1429045809.21 Effective host <unknown> modulus 7 Selected new host rhel7-2
1429045809.22 Effective host <unknown> modulus 8 Selected new host rhel7-2
1429045809.23 Effective host <unknown> modulus 9 Selected new host rhel7-2

Are you absolutely sure that all traffic is going through the same haproxy node? I'm asking because we do not have stick tables synchronized across haproxy nodes, so if there was a moment when traffic hit a different haproxy node, it could get redirected to a different backend server. Make sense? Could you test this with a single haproxy node and/or verify that the haproxy logs on the other two nodes show no db traffic?

> Are you absolutely sure that all traffic is going through the same haproxy node?
Yes. For the series of outputs you see here, I disabled it in pacemaker and pointed the script directly at a single HAProxy node, just to make sure. On these runs you can see I'm pointing the script at the "rhel7-1" node directly with an alternate port.
Also, the difference between running the script with no delay, vs. with a delay, is like night and day. When you first start the script and ten connections pile on simultaneously, it sends a few to other nodes 99% of the time. Turn up the delay and this vanishes.
The only test I haven't done is to turn on SQL logging on all three MySQL instances and actually tail their logs to triple check that they are in fact all receiving SQL traffic, to confirm my SELECT of the hostname query on each server is not somehow being corrupted. I guess you'd see traffic at the Galera level in any case, but this script just does a SELECT anyway.
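For readers without access to the attached script, the test described above can be approximated roughly as follows. This is a minimal sketch and not the attached show_hosts.py: the names `probe_hosts`, `is_sticky`, and `query_host` are hypothetical, and `query_host` stands in for whatever opens a fresh MySQL connection through HAProxy and returns the backend's hostname (for example with `SELECT @@hostname`).

```python
import threading
import time
from collections import Counter


def probe_hosts(query_host, connections=10, delay=0.0):
    """Open `connections` concurrent sessions and tally which backend answered.

    query_host: hypothetical callable that opens a new connection through
        the proxy and returns the backend hostname as a string.
    delay: seconds to wait between starting consecutive connections
        (0 means all connections pile on at once).
    """
    results = []
    lock = threading.Lock()

    def worker():
        host = query_host()
        with lock:
            results.append(host)

    threads = []
    for _ in range(connections):
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)
        if delay > 0:
            time.sleep(delay)
    for t in threads:
        t.join()
    return Counter(results)


def is_sticky(tally):
    """True when every connection landed on the same backend."""
    return len(tally) == 1
```

Running `probe_hosts` once with `delay=0.0` and once with `delay=0.01`, then comparing `is_sticky` on the two tallies, would mirror the "no delay vs. delay" comparison reported here.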
Let me point out that one thing that has *not* been tested is this behavior in any environment other than my QEMU VMs running RHEL7 hosted on a Fedora 21 laptop. It seems plausible that networking issues within any of these elements could contribute to what I'm seeing. If someone wants to put me onto some other kind of hosted environment, I can try reproducing elsewhere.

OK, so as mentioned in the thread, there's no problem making the stick table size 1000. This is a table of IP addresses; we're talking less memory than it takes to store the text of a single large SQL statement, it's nothing. With this config the system stays on one host at all times; on failover, it fails over to a new host, and then there's no failback, so there's never any split situation:

stick-table type ip size 1000
stick on dst
server rhos-node1 rhel7-1:3306 check inter 1s port 9200 backup on-marked-down shutdown-sessions
server rhos-node2 rhel7-2:3306 check inter 1s port 9200 backup on-marked-down shutdown-sessions
server rhos-node3 rhel7-3:3306 check inter 1s port 9200 backup on-marked-down shutdown-sessions

We should get this config into our installer / HA setup documentation ASAP.

Moving this to openstack-foreman-installer since it is not an haproxy bug, but rather a configuration problem.

Merged into staypuft/ofi: https://github.com/redhat-openstack/astapor/pull/518

(In reply to Crag Wolfe from comment #21)
> Merged into staypuft/ofi:
> https://github.com/redhat-openstack/astapor/pull/518

Was this fixed in the RHOS6 A4 release?

(In reply to Ryan O'Hara from comment #22)
> (In reply to Crag Wolfe from comment #21)
> > Merged into staypuft/ofi:
> > https://github.com/redhat-openstack/astapor/pull/518
>
> Was this fixed in the RHOS6 A4 release?

I _think_ so, but will have to defer to Mike on this for OSP 6. The referenced change is merged and will be in the OSP 7 (ofi) release though (it is already in beta builds).
Backported to OSP 6

Verified:
Environment: openstack-foreman-installer-3.0.26-1.el7ost.noarch

Based on Comment #15, verified that /etc/haproxy/haproxy.cfg has the following:

listen galera
    bind 192.168.0.13:3306
    mode tcp
    option tcplog
    option httpchk
    option tcpka
    stick on dst
    stick-table type ip size 1000
    timeout client 90m
    timeout server 90m
    server pcmk-maca25400702876 192.168.0.7:3306 check inter 1s port 9200 backup on-marked-down shutdown-sessions
    server pcmk-maca25400702877 192.168.0.10:3306 check inter 1s port 9200 backup on-marked-down shutdown-sessions
    server pcmk-maca25400702875 192.168.0.9:3306 check inter 1s port 9200 backup on-marked-down shutdown-sessions

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1662.html
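
The manual verification of /etc/haproxy/haproxy.cfg described above could be automated along these lines. This is only a sketch under stated assumptions: `check_galera_stanza` is a hypothetical helper, it does simple line matching rather than full HAProxy config parsing, and it checks only the three properties this report relies on (`stick on dst`, a `stick-table` line, and every `server` marked `backup`).

```python
def check_galera_stanza(cfg_text, section="galera"):
    """Return a list of problems found in the `listen <section>` block.

    An empty list means the stanza has the settings discussed in this bug.
    """
    section_keywords = ("listen", "frontend", "backend", "defaults", "global")
    in_section = False
    has_stick_on_dst = False
    has_stick_table = False
    servers = []

    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        words = line.split()
        if words[0] in section_keywords:
            # A new section starts; track whether it is the one we want.
            in_section = words[:2] == ["listen", section]
            continue
        if not in_section:
            continue
        if words[:3] == ["stick", "on", "dst"]:
            has_stick_on_dst = True
        elif words[0] == "stick-table":
            has_stick_table = True
        elif words[0] == "server":
            servers.append(words)

    problems = []
    if not has_stick_on_dst:
        problems.append("missing 'stick on dst'")
    if not has_stick_table:
        problems.append("missing 'stick-table'")
    if not servers:
        problems.append("no server lines found")
    for words in servers:
        name = words[1] if len(words) > 1 else "<unnamed>"
        if "backup" not in words[2:]:
            problems.append("server %s is not marked 'backup'" % name)
    return problems
```

Passing the contents of the file to `check_galera_stanza` and printing the returned list would show which of the expected settings, if any, are absent.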