Bug 1471899

Summary: Addition of new routes slows down considerably with high numbers of routes.
Product: OpenShift Container Platform
Reporter: Jiří Mencák <jmencak>
Component: Networking
Sub Component: router
Assignee: Ben Bennett <bbennett>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
CC: aos-bugs, bbennett, bmeng, ccoleman, eparis, hongli, jeder, jfiala, jupierce, pdwyer, rcwwilliams07, rkant, weliang, wgordon
Version: 3.6.0
Target Release: 3.7.0
Hardware: All
OS: Linux
Whiteboard: aos-scalability-36
Doc Type: Bug Fix
Doc Text:
Cause: The locking was overly broad, so events were not processed while an haproxy reload was happening.
Consequence: Route changes would take hours to process.
Fix: Made the locking more fine-grained, so that events could be processed in parallel.
Result: Changes were processed within the time of two reloads of the router.
Last Closed: 2017-11-28 22:04:10 UTC
Type: Bug

Description Jiří Mencák 2017-07-17 15:32:57 UTC
Description of problem:

When exposing large numbers of routes with the HAProxy router, new routes are registered reasonably fast up to a point.  I've found that point to be around a couple of thousand routes.  Beyond it, route registration slows down considerably (only a few routes per minute enter the HAProxy configuration file), multiple HAProxy processes are forked, and they start using a considerable amount of CPU.

Version-Release number of selected component (if applicable):
Reproduced on openshift v3.6.126.1, v3.6.131, and v3.6.151.

How reproducible:
I first hit this when testing in the scale lab (around 500 nodes, v3.6.126.1, 3 router replicas).  Then I reproduced it on a 3-node v3.6.131 cluster using 2 router replicas.  Finally, I checked the (nearly) latest v3.6.151 with only a single router on a single node (which lost all registered routes during a reboot).

Steps to Reproduce:
1. Create a simple project, pod and service.
$ oc new-project hello
$ curl -sL https://raw.githubusercontent.com/openshift/origin/master/examples/hello-openshift/hello-pod.json | oc create -f-
$ oc expose pod hello-openshift --name=hello-openshift --port=8080 --protocol="TCP" --generator="service/v2"

2. Create multiple routes to the service.
$ i=0
$ while test $i -lt 5000 ; do \
  i=$((i+1)); \
  echo $i; \
  oc expose service hello-openshift --name=hello-openshift-$i --hostname=hello-openshift-$i.router.default.svc.cluster.local -n hello ; \
done

3. Watch route registration slow down, e.g. by curling already-created routes once the count passes ~3300:
$ curl -I hello-openshift-3301.router.default.svc.cluster.local | grep ^H
HTTP/1.0 503 Service Unavailable

Actual results:
Even though the loop in step 2 may finish quite fast, route registration slows down at around i=3300, multiple HAProxy processes are spawned, and the CPU utilisation of one of them stays close to 100%.

Expected results:
$ curl -sI hello-openshift-3301.router.default.svc.cluster.local | grep ^H
HTTP/1.1 200 OK

Reasonably fast route registration and no HAProxy processes continuously at 100% CPU while not serving any requests; or, at least, a documented limit beyond which we report that no more routes can be created on a router (shard).


Additional info:

Comment 2 Ben Bennett 2017-07-19 19:57:38 UTC
I have reproduced this... I'm seeing ~9 second haproxy reload times with 3688 backends.  Our default reload interval allows a reload every 5 seconds.  With that set:
 - We will process a route
 - See if we can reload:
   - Check to see if 5 seconds have passed since the last reload
   - Since they have, reload
     [9 seconds pass]
 - Go back to the top

So basically, we reload continuously, and only process one change before each reload.  Not great: at one change per ~9-second reload, a backlog of a few thousand routes takes hours to drain (3688 changes x 9s is over 9 hours).
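
In Go-flavored pseudocode, the cycle looks roughly like this (a scaled-down sketch with milliseconds standing in for seconds; hypothetical, not the actual router code):

// Hedged sketch of the pre-fix cycle described above; milliseconds stand in
// for seconds so it runs quickly.  Not the actual openshift/origin code.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		reloadInterval = 5 * time.Millisecond // stands in for the 5s minimum interval
		reloadTime     = 9 * time.Millisecond // stands in for the ~9s haproxy reload
		pending        = 100                  // queued route events
	)
	var lastReload time.Time
	start := time.Now()
	for i := 0; i < pending; i++ {
		// 1. Process exactly one route event (assumed instantaneous here).
		// 2. Check whether reloadInterval has passed since the last reload.
		if time.Since(lastReload) >= reloadInterval {
			time.Sleep(reloadTime) // 3. Reload; nothing else runs meanwhile.
			lastReload = time.Now()
		}
		// 4. Go back to the top: net effect, one event per full reload.
	}
	fmt.Printf("drained %d events in %v (about 15 minutes at real scale)\n",
		pending, time.Since(start))
}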

An expedient fix is to set RELOAD_INTERVAL to something longer; I'd advise 15s for now.  You can tune that down once you work out what the steady-state reload time really is.  See https://docs.openshift.com/container-platform/3.5/architecture/core_concepts/routes.html#env-variables
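
For example, on a default installation (assuming the router deployment config is named "router"):

$ oc set env dc/router RELOAD_INTERVAL=15s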

Note that this is just a temporary fix, we need to work out how to better handle this case in the code.

Comment 4 Ben Bennett 2017-08-31 17:33:20 UTC
*** Bug 1471957 has been marked as a duplicate of this bug. ***

Comment 5 Ben Bennett 2017-08-31 17:33:41 UTC
*** Bug 1465543 has been marked as a duplicate of this bug. ***

Comment 6 Robin Williams 2017-09-09 03:01:16 UTC
Please correct me if I am wrong, but my understanding is that this bug means a route/service is unavailable for a while after a deployment - i.e. downtime for a site!

If this is the case, please could you consider escalating the severity / priority.

Meanwhile is there a workaround for OpenShift Online 3 users?

Thanks.

Comment 7 Rajat Chopra 2017-09-09 05:27:50 UTC
It does not mean downtime for an existing app. It means a long setup time for a new route. All existing routes will continue to be served.

Comment 8 Robin Williams 2017-09-09 08:36:07 UTC
In that case, maybe there is a separate issue.

I was finding that an existing route was returning 503s for a while after a build and a deployment - on west 2 starter (a week or two ago) this was consistently 3-4 minutes.

I just tried it again (since the update to west 2 starter a couple of days ago) and the 503s still seem to be happening, although perhaps for a slightly shorter time.

Comment 9 Ben Bennett 2017-09-11 14:31:57 UTC
@Robin: If you do a redeployment and it is slow, it's probably the same culprit.  The pods completely change, so the events need to be processed.

Comment 10 Ben Bennett 2017-09-11 14:33:23 UTC
This is probably fixed when https://trello.com/c/FZTU1FvG is completed.

Comment 11 Robin Williams 2017-09-11 14:44:24 UTC
Thanks Ben

Do you think the time until the fix is deployed to OpenShift Online is likely to be measured in days, weeks, or months?

Comment 12 Ben Bennett 2017-10-26 18:21:43 UTC
https://github.com/openshift/origin/pull/17049

Comment 13 openshift-github-bot 2017-10-28 22:18:49 UTC
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/dac5ce6d5136fa09af03c53b0cb93a0ef8b6bc13
Change the router reload suppression so that it doesn't block updates

This changes the locking so that a reload doesn't hold a lock on the
router object for the duration of the reload; updates that happen
while the router is reloading can then be processed immediately and
batched up, then included when the next reload occurs. Before this, if
a reload ran longer than the reload interval, only one event would be
processed before triggering a new reload, which would then lock out
other event processing.  This caused the router to not make any
meaningful progress consuming events.

A new module to do the rate limiting has been added.

The module has a top half and a bottom half.  The top half simply
calls the bottom half with a flag indicating that the user has made a
change.  The flag simply tells the bottom half to register the desire
to reload (so we can do it under a single lock acquisition).

The bottom half is in charge of determining if it can immediately
reload or if it has to wait.  If it must wait, then it works out the
earliest time it can reload and schedules a callback to itself for
that time.

If it determines it can reload, then it runs the reload code
immediately.  When the reload is complete, it calls itself again to
make sure there was no other pending reload that had come in while the
reload was running.

Whenever the bottom half calls itself, it does it without the flag
indicating the user made a change.

Fixes bug 1471899 -- https://bugzilla.redhat.com/show_bug.cgi?id=1471899
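
Sketched in Go, the top-half / bottom-half scheme described above looks roughly like this (hypothetical names; the real code is the new rate-limiting module in openshift/origin, so treat this only as an illustration of the idea):

// Hedged sketch of the coalescing reload rate limiter described in the
// commit message above.  All names are hypothetical.
package main

import (
	"fmt"
	"sync"
	"time"
)

type coalescingReloader struct {
	mu        sync.Mutex
	interval  time.Duration // minimum time between reload starts
	last      time.Time     // when the last reload started
	reloading bool          // a reload is running right now
	wanted    bool          // a change arrived; a reload is pending
	scheduled bool          // a delayed callback is already queued
	reload    func()        // the actual (slow) haproxy reload
}

// RequestReload is the "top half": it only registers the desire to reload,
// then calls the bottom half with the changed flag set.
func (c *coalescingReloader) RequestReload() { c.bottomHalf(true) }

// bottomHalf decides whether to reload immediately, or to schedule a
// callback to itself for the earliest time a reload is allowed.
func (c *coalescingReloader) bottomHalf(changed bool) {
	c.mu.Lock()
	if changed {
		c.wanted = true
	}
	if !c.wanted || c.reloading || c.scheduled {
		// Nothing pending, or a running reload / queued callback will
		// re-invoke us when it finishes.
		c.mu.Unlock()
		return
	}
	if wait := c.interval - time.Since(c.last); wait > 0 {
		// Too soon: call ourselves back later, without the changed flag.
		c.scheduled = true
		c.mu.Unlock()
		time.AfterFunc(wait, func() {
			c.mu.Lock()
			c.scheduled = false
			c.mu.Unlock()
			c.bottomHalf(false)
		})
		return
	}
	c.wanted = false
	c.last = time.Now()
	c.reloading = true
	c.mu.Unlock() // crucially, the lock is NOT held across the reload
	go func() {
		c.reload() // events arriving now just set wanted and return
		c.mu.Lock()
		c.reloading = false
		c.mu.Unlock()
		c.bottomHalf(false) // pick up anything that arrived while reloading
	}()
}

func main() {
	r := &coalescingReloader{
		interval: 50 * time.Millisecond, // stands in for the 5s minimum
		reload: func() { // stands in for a ~9s haproxy reload
			fmt.Println("reload started at", time.Now().Format("15:04:05.000"))
			time.Sleep(90 * time.Millisecond)
		},
	}
	for i := 0; i < 100; i++ { // a burst of route events
		r.RequestReload() // returns immediately; changes are batched
		time.Sleep(2 * time.Millisecond)
	}
	time.Sleep(300 * time.Millisecond) // let the final reload drain
}

The burst of 100 events produces only a handful of reloads instead of 100: an event arriving mid-reload just sets the pending flag and returns, so a batch of changes costs at most two reload cycles, matching the Doc Text above.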

https://github.com/openshift/origin/commit/6f125e4a79406915e4898a9fb19d1054ba9e60ee
Merge pull request #17049 from knobunc/bug/bz1471899-change-router-locking

Automatic merge from submit-queue.

Change the router reload suppression so that it doesn't block updates

@openshift/networking PTAL

Comment 18 Hongan Li 2017-11-02 07:23:46 UTC
Verified in v3.7.0-0.189.0.  The route configuration delay reaches ~15s once the number of routes exceeds 3000, and ~20s once it exceeds 4000.

Below is part of the CPU usage during testing:
02:57:11 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
02:57:16 AM 1000000000     91013   39.06    2.45    0.00   41.51     0  openshift-route
02:57:21 AM 1000000000     91013    5.48    0.38    0.00    5.86     0  openshift-route
02:57:26 AM 1000000000     91013   78.52    5.89    0.00   84.41     0  openshift-route
02:57:31 AM 1000000000     91013   51.42    4.17    0.00   55.60     0  openshift-route
02:57:36 AM 1000000000     91013    0.57    0.00    0.00    0.57     0  openshift-route
02:57:41 AM 1000000000     91013   71.59    4.92    0.00   76.52     0  openshift-route
02:57:46 AM 1000000000     91013   58.33    3.98    0.00   62.31     0  openshift-route
02:57:51 AM 1000000000     91013    0.19    0.00    0.00    0.19     0  openshift-route
02:57:56 AM 1000000000     91013   63.19    5.12    0.00   68.31     0  openshift-route
02:58:01 AM 1000000000     91013   69.01    5.89    0.00   74.90     0  openshift-route
02:58:06 AM 1000000000     91013    4.75    0.00    0.00    4.75     0  openshift-route
02:58:11 AM 1000000000     91013   42.61    3.98    0.00   46.59     0  openshift-route
02:58:16 AM 1000000000     91013   79.09    6.08    0.00   85.17     0  openshift-route

Comment 20 Ben Bennett 2017-11-02 13:49:46 UTC
@hongli: "Since the default reload interval is 5s, so this gap should not more than 5s theoretically."

That's a slight misconception.  As the number of routes increases, the time it takes to reload haproxy increases... and it can take longer than 5 seconds.  For instance, in Online it takes 30 seconds for haproxy to reload.

So the reload interval is really a minimum, i.e. "reload no more often than every 5 seconds".

The bug this addresses is that when the haproxy reload time was greater than the reload interval, a busy router would take hours to process route changes.  The flow was: we would get an event, process it, and trigger a reload.

That's where it got tricky.  If the last reload was not running and was triggered > 5s ago, we would start a new reload.  If < 5s, we would mark the need for a reload and return.  But if a reload was running, we would block until it completed.

So when a reload took more than 5s to complete, we blocked and could not process events.  As soon as the last reload completed, it kicked off the next reload, which let the thread waiting to schedule a reload return... at which point it would process the next event, try to schedule a reload... and block again.

The net effect: in Online, we would only process one event every 30 seconds.

This change doesn't make the reload faster, but it makes it non-blocking, so we process all events that come in while we reload.

As long as the longest you have to wait is until the next reload, this is working correctly.

Comment 21 Weibin Liang 2017-11-02 14:29:25 UTC
@hongli: I retested this on v3.7.0-0.189.0 today.

My test results reflect exactly what Ben explained above: the reload interval starts to increase as the number of routes increases.  When route hello-openshift-5000 is created, the router logs show the interval between the last reloads is about 22 seconds, and hello-openshift-5000 becomes reachable via curl after about 25 seconds.

[root@ip-172-18-9-129 ~]# 
5000
route "hello-openshift-5000" exposed

[root@ip-172-18-9-129 ~]# oc log -f router-1-s7pdb
I1102 14:15:16.758715       1 router.go:438] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I1102 14:15:38.589138       1 router.go:438] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I1102 14:16:00.333719       1 router.go:438] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).

[root@ip-172-18-10-178 ~]# date
Thu Nov  2 10:16:00 EDT 2017
[root@ip-172-18-8-207 ~]# curl -I hello-openshift-5000.router.default.svc.cluster.local | grep ^H
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/1.0 503 Service Unavailable

[root@ip-172-18-8-207 ~]# curl -I hello-openshift-5000.router.default.svc.cluster.local | grep ^H
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0    17    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/1.1 200 OK

[root@ip-172-18-10-178 ~]# date
Thu Nov  2 10:16:25 EDT 2017
[root@ip-172-18-10-178 ~]#

Comment 22 Hongan Li 2017-11-03 05:28:27 UTC
Verified in v3.7.0-0.189.0; the issue has been fixed.

Thanks to Ben and Weibin for their help and the detailed explanations.

Comment 26 errata-xmlrpc 2017-11-28 22:04:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188