Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1389413

Summary: MySQL / Galera HAProxy settings have incorrect check settings
Product: Red Hat OpenStack
Component: puppet-tripleo
Version: 10.0 (Newton)
Target Milestone: rc
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: Triaged
Reporter: Michael Bayer <mbayer>
Assignee: Chris Jones <chjones>
QA Contact: Arik Chernetsky <achernet>
CC: bperkins, fdinitto, jjoyce, jschluet, mbayer, michele, mkrcmari, mlopes, rohara, royoung, slinaber, srevivo, tvignaud, ushkalim, vaggarwa, vcojot
Fixed In Version: puppet-tripleo-5.4.0-2.el7ost
Doc Type: Bug Fix
Doc Text:
Prior to this update, HAProxy checking of MySQL resulted in a long timeout (16 seconds) before a failed node would be removed from service. Consequently, OpenStack services connected to a failed MySQL node could return API errors to users/operators/tools. With this update, the check interval settings have been reduced to drop failed MySQL nodes within 6 seconds of failure. As a result, OpenStack services should failover to working MySQL nodes much faster and produce fewer API errors to their consumers.
Last Closed: 2016-12-14 16:26:13 UTC
Type: Bug

Description Michael Bayer 2016-10-27 14:26:16 UTC
Description of problem:

Per bz#1211781, we went through a lot of testing to come up with the best settings for HAProxy on Galera, which include an "inter" setting of one second with no retries of any kind.  Because this is Galera and we are using a simple inetd-based health checker, if the check returns "down", failover needs to happen immediately; otherwise openstack services will begin to error out rather than reconnecting and getting onto a good node.

Those settings look like:

listen galera
  bind 192.168.0.13:3306
  mode  tcp
  option  tcplog
  option  httpchk
  option  tcpka
  stick  on dst
  stick-table  type ip size 1000
  timeout  client 90m
  timeout  server 90m
  server pcmk-maca25400702876 192.168.0.7:3306  check inter 1s port 9200 backup on-marked-down shutdown-sessions
  server pcmk-maca25400702877 192.168.0.10:3306  check inter 1s port 9200 backup on-marked-down shutdown-sessions
  server pcmk-maca25400702875 192.168.0.9:3306  check inter 1s port 9200 backup on-marked-down shutdown-sessions
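
For context, the "simple inetd-based health checker" referenced above is typically the standard clustercheck script exposed on port 9200 via xinetd. A minimal sketch of such a service definition follows, assuming a common clustercheck deployment; the service name and script path are illustrative, not necessarily the exact files shipped by the installer:

service galera-monitor
{
  # sketch of /etc/xinetd.d/galera-monitor (illustrative)
  port            = 9200
  disable         = no
  socket_type     = stream
  protocol        = tcp
  wait            = no
  user            = root
  group           = root
  groups          = yes
  # clustercheck returns HTTP 200 if the local galera node is synced,
  # HTTP 503 otherwise; haproxy's "option httpchk" consumes that response
  server          = /usr/bin/clustercheck
  type            = UNLISTED
  per_source      = UNLIMITED
  log_on_success  =
  log_on_failure  = HOST
  flags           = REUSE
}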


That BZ affected the foreman installer, and it appears these settings have been lost in the current tripleo installer; we now get:

listen mysql
  bind ip:3306 
  option httpchk
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server overcloud-controller-0 ip:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-1 ip:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-2 ip:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2

Above, the health check runs every two seconds, not every one, and has to fail five times before the node is determined to be down.  This means a Galera node that is down and unresponsive takes ten full seconds to be taken out of the proxy, during which time all openstack services will be non-functional and will return errors to end users.  This led to bz#1381485, where openstack services were erroring out unnecessarily, and has caused a lot of confusion on the customer's part as to how to resolve the situation.

The source of this can be seen via the defaults at https://github.com/redhat-openstack/openstack-puppet-modules/blob/stable/mitaka/tripleo/manifests/loadbalancer.pp#L384 leading to https://github.com/redhat-openstack/openstack-puppet-modules/blob/stable/mitaka/tripleo/manifests/loadbalancer.pp#L1338, where "inter 1s" is not being set and the "fall 5 inter 2000" defaults are being kept.
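
For illustration, the fix amounts to passing the tuned member options through to the mysql/galera balancer members in the manifest. A hedged sketch of that change follows; the variable names ($haproxy_member_options, $controller_hosts_real, the mysql_node_ips hiera key) are assumptions based on the loadbalancer.pp linked above, not the exact merged patch:

# Sketch only: member options for the mysql listen block, replacing the
# "check fall 5 inter 2000 rise 2" defaults with a 1s check interval and
# no retries. Names are illustrative, not the exact patch.
haproxy::balancermember { 'mysql':
  listening_service => 'mysql',
  ports             => '3306',
  ipaddresses       => hiera('mysql_node_ips'),
  server_names      => $controller_hosts_real,
  options           => union($haproxy_member_options,
    ['backup', 'port 9200', 'on-marked-down shutdown-sessions', 'check inter 1s']),
}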


Version-Release number of selected component (if applicable):

All tripleo

How reproducible:

always

Steps to Reproduce:
1. run tripleo installer
2. haproxy.cfg has wrong values

Actual results:

openstack services have a ten second delay after a galera node goes offline before they can recover

Expected results:

there is at most a one second delay while a galera node is offline, allowing most openstack applications to recover without errors.

Comment 2 Michael Bayer 2016-10-27 14:29:23 UTC
*** Bug 1381485 has been marked as a duplicate of this bug. ***

Comment 5 Chris Jones 2016-11-04 10:06:11 UTC
Filed https://bugs.launchpad.net/tripleo/+bug/1639189 for upstream purposes, and pushed https://review.openstack.org/#/c/393673/ for review

Comment 6 Chris Jones 2016-11-14 13:30:27 UTC
This has now merged upstream on master and stable/newton

Comment 7 Chris Jones 2016-11-14 13:33:10 UTC
stable/newton backport is https://review.openstack.org/#/c/396092/2

Comment 9 Marian Krcmarik 2016-11-18 21:46:26 UTC
Just to make expectations clear: haproxy.cfg is generated correctly with the following parameters:

listen mysql
  bind 172.17.1.18:3306 transparent
  option tcpka
  option httpchk
  stick on dst
  stick-table type ip size 1000
  timeout client 90m
  timeout server 90m
  server controller-0.internalapi.localdomain 172.17.1.13:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-1.internalapi.localdomain 172.17.1.10:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200
  server controller-2.internalapi.localdomain 172.17.1.25:3306 backup check inter 1s on-marked-down shutdown-sessions port 9200

This configuration causes haproxy to mark a galera node that is down as down in 6 seconds - the time for 3 consecutive unsuccessful health checks (3 being the default value of the "fall" parameter when omitted) at a 1s interval, plus some overhead.
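
For reference, since "fall" and "rise" are omitted above, the generated server lines are equivalent to the following with HAProxy's documented defaults (fall 3, rise 2) spelled out:

  server controller-0.internalapi.localdomain 172.17.1.13:3306 backup check inter 1s fall 3 rise 2 on-marked-down shutdown-sessions port 9200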

Previously a node was marked as down in ~16s - 5 unsuccessful health checks at a 2s interval, plus some overhead (I guess).

I want to make sure this is the expected behaviour, since the expected results mention a 1s max delay.

Comment 10 Michael Bayer 2016-11-18 21:59:52 UTC
OK, rohara knows better than me.  If it's in fact 6 seconds total, that still seems a little high - 1s interval + 1s overhead per check?  Or is the 6 seconds approximate?

Comment 11 Marian Krcmarik 2016-11-18 22:24:34 UTC
(In reply to Michael Bayer from comment #10)
> OK, rohara knows better than me.  If it's in fact 6 seconds total, that
> still seems a little high - 1s interval + 1s overhead per check?  Or is
> the 6 seconds approximate?

I can see it always as:
time(node marked as down by haproxy) - time(node down) = 6s 
with the new settings (backup check inter 1s on-marked-down shutdown-sessions port 9200).

The overhead was just my assumption - maybe because option httpchk and tcpka are used? No idea. But it's about 1s, so for example 3 fall checks at a 1 second interval take 6s for haproxy to mark the node as down, and 5 checks at a 2 second interval take about 15-16 seconds.

Comment 12 Marian Krcmarik 2016-11-21 15:25:36 UTC
I am switching to Verified.

The haproxy.cfg is generated as described in the description of the bug. The actual time for a node to be marked as down is 6 seconds, which is a drop of 10 seconds from the previous release. It's not 1s, but based on the discussion we want to keep 3 consecutive unsuccessful checks at a 1s interval, which comes to 5-6s in reality.

Comment 14 errata-xmlrpc 2016-12-14 16:26:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html