Bug 1389413 - MySQL / Galera HAProxy settings have incorrect check settings
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 10.0 (Newton)
Assignee: Chris Jones
QA Contact: Arik Chernetsky
Duplicates: 1381485
Reported: 2016-10-27 14:26 UTC by Michael Bayer
Modified: 2018-01-22 14:29 UTC (History)
CC: 16 users

Fixed In Version: puppet-tripleo-5.4.0-2.el7ost
Doc Type: Bug Fix
Doc Text:
Prior to this update, HAProxy checking of MySQL resulted in a long timeout (16 seconds) before a failed node would be removed from service. Consequently, OpenStack services connected to a failed MySQL node could return API errors to users/operators/tools. With this update, the check interval settings have been reduced to drop failed MySQL nodes within 6 seconds of failure. As a result, OpenStack services should failover to working MySQL nodes much faster and produce fewer API errors to their consumers.
Clone Of:
Environment:
Last Closed: 2016-12-14 16:26:13 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:2948 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 enhancement update 2016-12-14 19:55:27 UTC
OpenStack gerrit 393673 None None None 2016-11-14 13:50:07 UTC
OpenStack gerrit 396092 None None None 2016-11-14 13:50:32 UTC
Red Hat Knowledge Base (Solution) 1599813 None None None 2018-01-16 12:19:16 UTC
Launchpad 1639189 None None None 2016-11-14 13:49:32 UTC
Red Hat Knowledge Base (Solution) 2680981 None None None 2018-01-16 12:19:16 UTC

Description Michael Bayer 2016-10-27 14:26:16 UTC
Description of problem:

Per bz#1211781, we went through a lot of testing to come up with the best settings for HAProxy on Galera, which include an "inter" setting of one second, with no retry of any kind.  Because this is Galera and we are using a simple inetd-based health checker, if the check returns "down", failover needs to happen immediately; otherwise OpenStack services will begin to error out, rather than being able to reconnect and get onto a good node.

Those settings look like:

listen galera
  bind 192.168.0.13:3306
  mode  tcp
  option  tcplog
  option  httpchk
  option  tcpka
  stick  on dst
  stick-table  type ip size 1000
  timeout  client 90m
  timeout  server 90m
  server pcmk-maca25400702876 192.168.0.7:3306  check inter 1s port 9200 backup on-marked-down shutdown-sessions
  server pcmk-maca25400702877 192.168.0.10:3306  check inter 1s port 9200 backup on-marked-down shutdown-sessions
  server pcmk-maca25400702875 192.168.0.9:3306  check inter 1s port 9200 backup on-marked-down shutdown-sessions
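The "check ... port 9200" directives above poll a simple HTTP health endpoint on each node (in these deployments, the clustercheck script). As a rough illustration only, a minimal responder of that kind could be sketched in Python; here node_is_synced() is a hypothetical placeholder for the real wsrep-state query that clustercheck performs:

```python
# Sketch (not the real clustercheck) of an inetd-style Galera health
# endpoint as polled by HAProxy's "option httpchk" on port 9200.
# HAProxy treats HTTP 200 as "up" and 503 as "down".
import socketserver

def node_is_synced() -> bool:
    # Hypothetical placeholder: the real script queries the node's
    # wsrep state over MySQL; hardcoded here for illustration.
    return True

def build_health_response(synced: bool) -> bytes:
    """Build the HTTP response HAProxy interprets: 200 = up, 503 = down."""
    if synced:
        status, body = b"HTTP/1.1 200 OK\r\n", b"Galera node is synced.\r\n"
    else:
        status, body = (b"HTTP/1.1 503 Service Unavailable\r\n",
                        b"Galera node is not synced.\r\n")
    return (status
            + b"Content-Type: text/plain\r\n"
            + b"Content-Length: %d\r\n" % len(body)
            + b"Connection: close\r\n\r\n"
            + body)

class HealthHandler(socketserver.StreamRequestHandler):
    def handle(self):
        self.wfile.write(build_health_response(node_is_synced()))

# To serve, one would run:
#   socketserver.TCPServer(("0.0.0.0", 9200), HealthHandler).serve_forever()
```

Because the check is a plain per-poll HTTP probe with no state of its own, HAProxy's inter/fall settings alone decide how quickly a dead node leaves rotation.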


That BZ affected the foreman installer, and it appears that in the current tripleo installer these settings have been lost; we are now getting:

listen mysql
  bind ip:3306 
  option httpchk
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server overcloud-controller-0 ip:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-1 ip:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-2 ip:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2

Above, the health check runs every two seconds, not one, and has to fail five times before the node is determined to be down.  This means a Galera node that is down and unresponsive takes ten full seconds to be taken out of the proxy, during which time all OpenStack services are non-functioning and return errors to end users.  This led to bz#1381485, where OpenStack services were erroring out unnecessarily, which in turn caused lots of confusion on the customer's part as to how to resolve it.
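The arithmetic behind those numbers is simple: HAProxy marks a server down after "fall" consecutive failed checks spaced "inter" apart, so (ignoring per-check overhead) a back-of-the-envelope sketch looks like:

```python
# Back-of-the-envelope failover detection time under HAProxy's
# inter/fall semantics: the server is marked down after `fall`
# consecutive failed checks, one every `inter_s` seconds
# (per-check overhead ignored in this simple model).
def detection_time(fall: int, inter_s: float) -> float:
    """Seconds of downtime before HAProxy marks the server down."""
    return fall * inter_s

# tripleo settings shown above: fall 5, inter 2000 (ms) -> 10 seconds
print(detection_time(5, 2.0))
# desired settings: inter 1s, no retries -> 1 second
print(detection_time(1, 1.0))
```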

The source of this can be seen in the defaults at https://github.com/redhat-openstack/openstack-puppet-modules/blob/stable/mitaka/tripleo/manifests/loadbalancer.pp#L384, leading to https://github.com/redhat-openstack/openstack-puppet-modules/blob/stable/mitaka/tripleo/manifests/loadbalancer.pp#L1338, where "inter 1s" is not being set and "fall 5 inter 2000" is being kept.


Version-Release number of selected component (if applicable):

All tripleo

How reproducible:

always

Steps to Reproduce:
1. run tripleo installer
2. haproxy.cfg has wrong values

Actual results:

OpenStack services experience a ten-second delay while a Galera node is offline before they can recover.

Expected results:

There is at most a one-second delay while a Galera node is offline, allowing most OpenStack applications to recover without errors.

Comment 2 Michael Bayer 2016-10-27 14:29:23 UTC
*** Bug 1381485 has been marked as a duplicate of this bug. ***

Comment 5 Chris Jones 2016-11-04 10:06:11 UTC
Filed https://bugs.launchpad.net/tripleo/+bug/1639189 for upstream purposes, and pushed https://review.openstack.org/#/c/393673/ for review

Comment 6 Chris Jones 2016-11-14 13:30:27 UTC
This has now merged upstream on master and stable/newton.

Comment 7 Chris Jones 2016-11-14 13:33:10 UTC
stable/newton backport is https://review.openstack.org/#/c/396092/2

Comment 9 Marian Krcmarik 2016-11-18 21:46:26 UTC
Just to make expectations clear: haproxy.cfg is generated correctly with the following parameters:

listen mysql
  bind 172.17.1.18:3306 transparent
  option tcpka
  option httpchk
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server controller-0.internalapi.localdomain 172.17.1.13:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-1.internalapi.localdomain 172.17.1.10:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-2.internalapi.localdomain 172.17.1.25:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200

This configuration causes haproxy to mark a Galera node which is down as down in 6 seconds - the time of 3 consecutive unsuccessful health checks (3 being the default value of the "fall" parameter when omitted) at a 1s interval, plus some overhead.

Previously a node was marked as down in ~16s - 5 unsuccessful health checks at a 2s interval, plus some overhead (I guess).

I want to make sure this is the expected behaviour, since the expected results mention a 1s max delay.

Comment 10 Michael Bayer 2016-11-18 21:59:52 UTC
OK, rohara knows better than me.  If it's in fact 6 seconds total, that still seems a little high - 1s interval + 1s overhead per check?  Or is 6 seconds approximate?

Comment 11 Marian Krcmarik 2016-11-18 22:24:34 UTC
(In reply to Michael Bayer from comment #10)
> OK, rohara knows better than me.  if it's in fact 6 seconds total that still
> seems a little high - 1s interval + 1s overhead per check?  or is 6 seconds
> approximated ?

I can see it consistently as:
time(node marked as down by haproxy) - time(node down) = 6s
with the new settings (backup check inter 1s on-marked-down shutdown-sessions port 9200).

The overhead was just my assumption - maybe because option httpchk and tcpka are used? No idea. But it's about 1s, so for example 3 fall checks at a 1-second interval take 6s for haproxy to mark a node as down, and 5 checks at a 2-second interval take about 15/16 seconds.
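The observed timings fit a simple model in which each failed check costs the "inter" interval plus roughly one second of per-check overhead. Note the ~1s overhead is an assumption drawn from this comment thread, not a documented HAProxy constant:

```python
# Model matching the timings observed in this bug:
#   detection ~= fall * (inter + overhead)
# where overhead_s ~ 1s per check is an assumption from the discussion,
# not a documented HAProxy value.
def observed_detection_time(fall: int, inter_s: float,
                            overhead_s: float = 1.0) -> float:
    """Approximate seconds before HAProxy marks the node down."""
    return fall * (inter_s + overhead_s)

# new settings: fall 3 (default), inter 1s -> ~6s
print(observed_detection_time(3, 1.0))
# old settings: fall 5, inter 2s -> ~15-16s
print(observed_detection_time(5, 2.0))
```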

Comment 12 Marian Krcmarik 2016-11-21 15:25:36 UTC
I am switching to Verified.

The haproxy.cfg is generated as described in the description of the bug. The actual time for a node to be marked as down is 6 seconds, a drop of 10 seconds from the previous release. It's not 1s, but based on the discussion we want to keep 3 consecutive unsuccessful checks at a 1s interval, which makes for 5-6s in reality.

Comment 14 errata-xmlrpc 2016-12-14 16:26:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html

