Bug 1320233 - haproxy is reloaded every 10 minutes N-times for N endpoints
Summary: haproxy is reloaded every 10 minutes N-times for N endpoints
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.2.1
Assignee: Maru Newby
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1333522 1336009
Depends On:
Blocks: 1267746 1339502
Reported: 2016-03-22 15:32 UTC by Marek Schmidt
Modified: 2022-08-04 22:20 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When the default HAProxy router reloaded its configuration too often during a resync (default interval: 10 minutes), it was possible to experience dropped connections to routes. This bug fix updates the ose-haproxy-router image to limit reloads to at most one per sync event to minimize the potential for dropped connections.
Clone Of:
Environment:
Last Closed: 2016-06-27 15:05:43 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Origin (Github) 7409 0 None None None 2016-03-30 10:12:36 UTC
Origin (Github) 8287 0 None None None 2016-03-30 10:12:13 UTC
Red Hat Product Errata RHBA-2016:1343 0 normal SHIPPED_LIVE Red Hat OpenShift Enterprise 3.2.1.1 bug fix and enhancement update 2016-06-27 19:04:05 UTC

Description Marek Schmidt 2016-03-22 15:32:30 UTC
Description of problem:

On a default OpenShift Enterprise installation (with the default haproxy router), haproxy appears to be reloaded multiple times in quick succession every 10 minutes (apparently once for each existing endpoint).

This, together with https://bugzilla.redhat.com/show_bug.cgi?id=1269488, decimates currently open connections: if the number of endpoints is large enough, the rapid restarts kill almost all open connections, including their potential reconnects.

Version-Release number of selected component (if applicable):

# oc version
oc v3.1.1.6-21-gcd70c35
kubernetes v1.1.0-origin-1107-g4c8e6f4

router image:

rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/ose-haproxy-router:v3.1.1.6

How reproducible:
Always

Steps to Reproduce:
1. Deploy the default haproxy router and some random pods and services (endpoints)
2. rsh to the haproxy pod  (oc rsh -n default router-1-xxxxx)
3. edit the reload-haproxy script so that it writes every reload event to a log (  vi /var/lib/haproxy/reload-haproxy  ), e.g.



#!/bin/bash -xu

config_file=/var/lib/haproxy/conf/haproxy.config
pid_file=/var/lib/haproxy/run/haproxy.pid
old_pid=""
haproxy_conf_dir=/var/lib/haproxy/conf

# sort the path based map files for the haproxy map_beg function
for mapfile in "$haproxy_conf_dir"/*.map; do
  sort -r "$mapfile" -o "$mapfile"
done

# pick up the PID of the running haproxy, if any
if [ -f "$pid_file" ]; then
  old_pid=$(<"$pid_file")
fi

if [ -n "$old_pid" ]; then
  # added for this reproduction: log every reload with a timestamp
  echo "fast reload $(date)" >> /tmp/reloadlog
  /usr/sbin/haproxy -f "$config_file" -p "$pid_file" -sf $old_pid
else
  # added for this reproduction: log the initial start
  echo "first run $(date)" >> /tmp/reloadlog
  /usr/sbin/haproxy -f "$config_file" -p "$pid_file"
fi



4. wait for 10 minutes
5. check the /tmp/reloadlog  to see the reloads
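
To make the bursts easier to spot in step 5, the log can be summarized per minute. The following is only a sketch: the function name count_reloads_per_minute is made up for illustration, and it assumes the log lines look like "fast reload Tue Mar 22 15:40:01 UTC 2016" (the default `date` output in the C locale), so that the month, day and time are fields 4, 5 and 6.

```shell
#!/bin/bash
# Hypothetical helper (not part of the router image): count reload log
# entries per minute, in order of first appearance.
# Assumes lines like "fast reload Tue Mar 22 15:40:01 UTC 2016".
count_reloads_per_minute() {
  awk '{
    split($6, t, ":")                     # $6 is HH:MM:SS
    key = $4 " " $5 " " t[1] ":" t[2]     # e.g. "Mar 22 15:40"
    if (!(key in n)) order[++k] = key     # remember first-seen order
    n[key]++
  }
  END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }' "$1"
}
```

Running `count_reloads_per_minute /tmp/reloadlog` inside the router pod should then print one line per minute that saw at least one reload, followed by the number of reloads in that minute.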

Actual results:

There appear to be N reloads in quick succession every 10 minutes (where N appears to be the number of endpoints in the k8s cluster).

Expected results:

The haproxy should only be reloaded when an endpoint changes.

Additional info:

This can also be reproduced in the CDK  (reproduced on Beta4)

Comment 1 Ram Ranganathan 2016-03-22 19:41:34 UTC
This is the default resync interval (10 minutes), at which the router's in-memory etcd cache is flushed and re-read from the store. As the router gets new endpoints, it reapplies those changes; the current mechanism forces a write and reload for every change we detect.

This is the same as the github issue: https://github.com/openshift/origin/issues/7409
Once we write the config in one shot when a new resource version is available, the reload would happen only once every 10 minutes.

There are a couple of workarounds here: 
1.  You can increase the resync interval by passing an option to the
    infra router: --resync-interval=10m  # 10m == 10 minutes (the default; use a larger value to lengthen the interval).

     $ oc edit dc/router -o json
     $ #  and in the editor add the command line args/entrypoint for the router container:
    "command": ["/usr/bin/openshift-router", "--resync-interval=10m" ],

    Example: 
       "spec": {
          "containers": [
            {
               "name": "router",
               "image": "openshift/origin-haproxy-router:latest",
               "command": ["/usr/bin/openshift-router", "--resync-interval=10m" ],
      ... 

2. You can additionally control how often the router reloads occur by specifying an
   environment variable:
     oc env dc router RELOAD_INTERVAL=10s

   That coalesces multiple reloads within 10 seconds of each other.
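
   As a rough illustration of what coalescing means here, the sketch below counts how many reloads would actually run for a series of reload requests. It is a simplified model, not the router's implementation: the function name coalesce_reloads is hypothetical, timestamps are plain seconds in ascending order, and a request is assumed to be absorbed when it arrives less than the interval after the request that opened the current window.

```shell
#!/bin/bash
# Simplified model of reload coalescing (an assumption, not router code).
# Usage: coalesce_reloads INTERVAL_SECONDS T1 T2 ...   (timestamps ascending)
# Prints the number of reloads that would actually be performed.
coalesce_reloads() {
  local interval=$1
  shift
  local t window_start="" reloads=0
  for t in "$@"; do
    # A request opens a new window (and costs one reload) only when no
    # window is open yet or the current one has already expired.
    if [ -z "$window_start" ] || [ $((t - window_start)) -ge "$interval" ]; then
      window_start=$t
      reloads=$((reloads + 1))
    fi
  done
  echo "$reloads"
}
```

   For example, `coalesce_reloads 10 0 1 2 3 600 601` models two resyncs ten minutes apart, with four and two endpoint changes respectively; under this model each burst costs one reload instead of one per change.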


You can use one or a combination of the options to alleviate this issue. The second option for reload interval is only available in the latest releases.

Comment 2 Ben Bennett 2016-03-23 12:33:18 UTC
Thanks for the diagnosis, Ram.  I'm reassigning to Maru to align it with the issue.

Comment 3 Ben Bennett 2016-05-10 17:11:08 UTC
You can have the router install an iptables rule to somewhat mitigate this. The instructions are at:
  https://github.com/openshift/openshift-docs/pull/1987

Comment 12 Ben Bennett 2016-06-01 19:24:57 UTC
*** Bug 1333522 has been marked as a duplicate of this bug. ***

Comment 17 Meng Bo 2016-06-06 11:00:13 UTC
Checked with latest haproxy router image v3.2.1.1. Issue has been fixed.

After adding the timestamp logging to the reload-haproxy script, the haproxy router no longer reloads periodically.
Connections to the route are also not interrupted during a haproxy-router reload triggered by scaling up.


Here are some ab results:

Concurrency Level:      10
Time taken for tests:   1711.198 seconds
Complete requests:      109071
Failed requests:        0
Total transferred:      39156489 bytes
HTML transferred:       3053988 bytes
Requests per second:    63.74 [#/sec] (mean)
Time per request:       156.888 [ms] (mean)
Time per request:       15.689 [ms] (mean, across all concurrent requests)
Transfer rate:          22.35 [Kbytes/sec] received
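
The derived figures in the ab output above follow from the concurrency level, the total time and the number of completed requests. A small sketch of that arithmetic, assuming ab's usual definitions (requests per second = requests / time; mean time per request = concurrency * time / requests); the function name ab_derived_metrics is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: recompute ab's derived metrics.
# Usage: ab_derived_metrics CONCURRENCY SECONDS COMPLETED_REQUESTS
ab_derived_metrics() {
  awk -v c="$1" -v s="$2" -v n="$3" 'BEGIN {
    printf "%.2f\n", n / s                # requests per second (mean)
    printf "%.3f\n", c * s * 1000 / n     # ms per request (mean)
    printf "%.3f\n", s * 1000 / n         # ms per request, across all concurrent requests
  }'
}
```

Calling `ab_derived_metrics 10 1711.198 109071` reproduces the three values reported above (63.74, 156.888 and 15.689).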

Moving the bug to verified.

Comment 18 Phil Cameron 2016-06-10 13:33:24 UTC
*** Bug 1329399 has been marked as a duplicate of this bug. ***

Comment 19 Ben Bennett 2016-06-14 14:28:49 UTC
*** Bug 1336009 has been marked as a duplicate of this bug. ***

Comment 21 errata-xmlrpc 2016-06-27 15:05:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1343

