Description of problem:
On a default OpenShift Enterprise installation (with the default haproxy router), haproxy appears to be reloaded multiple times in quick succession every 10 minutes (apparently once for each existing endpoint). Together with https://bugzilla.redhat.com/show_bug.cgi?id=1269488, this decimates currently open connections: if the number of endpoints is large enough, the rapid restarts kill almost all open connections, including their potential reconnects.

Version-Release number of selected component (if applicable):
# oc version
oc v3.1.1.6-21-gcd70c35
kubernetes v1.1.0-origin-1107-g4c8e6f4
router image: rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/ose-haproxy-router:v3.1.1.6

How reproducible:
Always

Steps to Reproduce:
1. Deploy the default haproxy router and some random pods and services (endpoints).
2. rsh to the haproxy pod (oc rsh -n default router-1-xxxxx).
3. Edit reload-haproxy so that it writes every reload event to a log (vi /var/lib/haproxy/reload-haproxy), e.g.

   #!/bin/bash -xu
   config_file=/var/lib/haproxy/conf/haproxy.config
   pid_file=/var/lib/haproxy/run/haproxy.pid
   old_pid=""
   haproxy_conf_dir=/var/lib/haproxy/conf

   # sort the path based map files for the haproxy map_beg function
   for mapfile in "$haproxy_conf_dir"/*.map; do
     sort -r "$mapfile" -o "$mapfile"
   done

   if [ -f $pid_file ]; then
     old_pid=$(<$pid_file)
   fi

   if [ -n "$old_pid" ]; then
     echo "fast reload "`date` >> /tmp/reloadlog
     /usr/sbin/haproxy -f $config_file -p $pid_file -sf $old_pid
   else
     echo "first run "`date` >> /tmp/reloadlog
     /usr/sbin/haproxy -f $config_file -p $pid_file
   fi

4. Wait for 10 minutes.
5. Check /tmp/reloadlog to see the reloads.

Actual results:
There appear to be N reloads in quick succession every 10 minutes (where N appears to be the number of endpoints in the k8s cluster).

Expected results:
haproxy should only be reloaded when an endpoint changes.

Additional info:
This can also be reproduced in the CDK (reproduced on Beta4).
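To make the bursts in /tmp/reloadlog easier to spot, the entries can be grouped by minute. The following is only a sketch: the function name count_reloads_per_minute is hypothetical, and it assumes each log line has the form produced by the modified script above with the default C-locale `date` output (for example "fast reload Tue Feb 9 12:00:01 UTC 2016"). Other locales or date formats would need different field positions.

```shell
#!/bin/bash
# Hypothetical helper: count reload events per minute in the log written
# by the modified reload-haproxy script.
# Assumed line format: "<word> <word> <weekday> <month> <day> <HH:MM:SS> <tz> <year>"
count_reloads_per_minute() {
  local logfile="${1:-/tmp/reloadlog}"
  awk '{
    split($6, t, ":")                     # t[1] = hour, t[2] = minute
    key = $4 " " $5 " " t[1] ":" t[2]     # e.g. "Feb 9 12:00"
    if (!(key in count)) order[++n] = key # remember first-seen order
    count[key]++
  }
  END {
    for (i = 1; i <= n; i++) print order[i], count[order[i]]
  }' "$logfile"
}
```

Running `count_reloads_per_minute /tmp/reloadlog` inside the router pod prints one line per minute that saw at least one reload, followed by the number of reloads in that minute. A cluster of high counts repeating every 10 minutes would match the behaviour described in this report.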
This is the default resync interval (10 minutes), at which the router's in-memory etcd cache is flushed and re-read from the store. As the router gets new endpoints, it reapplies those changes; the current mechanism forces a write and a reload for every change we detect. This is the same as the GitHub issue: https://github.com/openshift/origin/issues/7409

Once we write the config in one shot when a new resource version is available, we would do this once every 10 minutes.

There are a couple of workarounds here:

1. You can increase the resync interval by passing an option to the infra router:

   --resync-interval=10m   # 10m == 10 minutes (the default shown for syntax; use a larger value to resync less often)

   $ oc edit dc/router -o json
   $ # and in the editor add the command line args/entrypoint for the router container:
   "command": ["/usr/bin/openshift-router", "--resync-interval=10m" ],

   Example:
   "spec": {
     "containers": [
       {
         "name": "router",
         "image": "openshift/origin-haproxy-router:latest",
         "command": ["/usr/bin/openshift-router", "--resync-interval=10m" ],
         ...

2. You can additionally control how often router reloads occur by specifying an environment variable:

   oc env dc router RELOAD_INTERVAL=10s

   That coalesces multiple reloads that fall within 10 seconds of each other.

You can use either option or a combination of both to alleviate this issue. The second option (reload interval) is only available in the latest releases.
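To illustrate what coalescing reloads within an interval means, here is a simplified model. It is not the router's actual implementation; the function name coalesce_reloads and the input format (one event time per line, in seconds, in ascending order) are assumptions made for this sketch. An event triggers a reload only if at least the given number of seconds has passed since the last reload; events inside that window are absorbed.

```shell
#!/bin/bash
# Hypothetical model of reload coalescing (NOT the router's real code).
# Reads ascending event times in seconds from stdin, one per line, and
# prints only the times at which a reload would actually run.
coalesce_reloads() {
  local interval="$1"   # window length in seconds, e.g. 10
  awk -v interval="$interval" '
    # The first event always reloads; later events reload only when
    # the window since the last reload has elapsed.
    NR == 1 || $1 - last >= interval { print $1; last = $1 }
  '
}
```

For example, `printf '0\n1\n2\n3\n600\n601\n' | coalesce_reloads 10` models two resync bursts 10 minutes apart: under this model, six change events collapse into two reloads.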
Thanks for the diagnosis Ram. I'm reassigning to Maru to align it with the issue.
You can have the router install an iptables rule to somewhat mitigate this. The instructions are at: https://github.com/openshift/openshift-docs/pull/1987
*** Bug 1333522 has been marked as a duplicate of this bug. ***
Checked with the latest haproxy router image, v3.2.1.1. The issue has been fixed.

After adding the timestamp logging to the reload-haproxy script, the haproxy router no longer reloads periodically, and connections to the route are not interrupted while the haproxy router reloads in response to a scale-up.

Here are some ab results:

Concurrency Level:      10
Time taken for tests:   1711.198 seconds
Complete requests:      109071
Failed requests:        0
Total transferred:      39156489 bytes
HTML transferred:       3053988 bytes
Requests per second:    63.74 [#/sec] (mean)
Time per request:       156.888 [ms] (mean)
Time per request:       15.689 [ms] (mean, across all concurrent requests)
Transfer rate:          22.35 [Kbytes/sec] received

Moving the bug to verified.
*** Bug 1329399 has been marked as a duplicate of this bug. ***
*** Bug 1336009 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1343