Bug 1465543
Summary: | [starter][us-west-2] Route takes 25+ mins to become available (503 -> 200), and continues to have intermittent 503s | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Will Gordon <wgordon>
Component: | Networking | Assignee: | Ben Bennett <bbennett>
Networking sub component: | router | QA Contact: | zhaozhanqi <zzhao>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | aos-bugs, dakini, eparis, jmencak, rcwwilliams07, trankin
Version: | 3.5.1 | |
Target Milestone: | --- | |
Target Release: | 3.6.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-08-31 17:33:41 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Comment 1
Stefanie Forrester
2017-06-27 19:47:21 UTC
I believe this is the second bug fixed by https://github.com/openshift/origin/pull/14232 -- When handleEvent() is called to ADD a route, then called again to DELETE the route before Pop() processes the ADD, there is no need for the route to remain in store. handleEvent() can remove it from store. Deleted routes that are found in store during a Resync() are deleted by Pop() as they are encountered. --

What really happens is that the event queue has a producer side (the watch events) and a consumer side (the router pulling events) with a queue in the middle, because the router can't always handle events immediately, so we need a buffer between them. The event queue does a few extra things:

- It maintains a cache of all "live" objects, so that when a delete event happens, it can provide the deleted object along with the event.
- It coalesces events, so that when an ADD and a DELETE watch event are produced before the consumer (the router) gets around to handling them, the event is simply dropped and never delivered.
- When the event queue is resynchronized, all objects in the cache are added back into the queue in modified state.

These three behaviors are good... but they had a bug that led to a nasty interaction. When the ADD then DELETE happened before the event was consumed, the event was removed from the queue but left in the backing cache. When a resync happened, all of those deleted objects were added back into the queue and had to be processed. Over time, that jams up the event processing in the router and makes it more likely that an ADD and DELETE will happen before being processed, so it keeps adding more and more dead items to the queue. These junk entries would never be deleted, just keep building up in the cache, and would be re-added to the queue on every resync.

In this case, I suspect the endpoints queue is misbehaving. We can tell if this is the problem by restarting the routers and seeing if they are fast again. If they are, this is likely the culprit.

This is fixed in 3.6 by https://bugzilla.redhat.com/show_bug.cgi?id=1447928. The 3.5 backport is https://bugzilla.redhat.com/show_bug.cgi?id=1464563. The 3.4 backport is https://bugzilla.redhat.com/show_bug.cgi?id=1464567. Neither backport has been released yet.

Ok, so after a restart of the routers it is down to 3 minutes. However, with 8024 routes and 9733 endpoints, the openshift-router process takes about 30% CPU in steady state shortly after the restart. So, while I think the bug mentioned above is causing trouble, there's something else going on too. Debugging further.

The load on haproxy is 100%, so we need to work out how to spread that out... maybe smaller infra nodes, but more of them. We also need to look at how to make haproxy use multiple cores.

That does not explain why the openshift-router process itself is consistently taking 25% CPU (on one core), whereas a test router we stood up with essentially identical config takes about 1% of a core. Both are on the same cluster, etc. In order to debug that, I would like to enable the debugging endpoints on their routers with:

    oc env dc/router OPENSHIFT_PROFILE=web

But we aren't able to make changes before the holiday weekend.

Also, there's a ton of logspam due to https://bugzilla.redhat.com/show_bug.cgi?id=1401503, so upgrading to 3.6 will fix that problem. But the overall load still seems to be high and we continue to investigate.
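The following is a minimal sketch of the queue/cache interaction described above, assuming a much-simplified event queue; the names (eventQueue, HandleEvent, Resync) are illustrative, not the actual types in openshift/origin -- see PR 14232 for the real fix.

```go
// Minimal sketch of the coalescing bug: a DELETE that cancels a pending
// ADD removes the event from the queue but leaks the object in the cache,
// so every resync re-enqueues dead objects.
package main

import "fmt"

type eventType int

const (
	added eventType = iota
	modified
	deleted
)

type event struct {
	typ eventType
	key string
}

type eventQueue struct {
	queue []event         // pending events for the consumer (the router)
	cache map[string]bool // "live" objects, kept so deletes can carry the object
}

// HandleEvent is the producer side, fed by the watch.
func (q *eventQueue) HandleEvent(e event) {
	if e.typ == deleted {
		// Coalesce: an ADD followed by a DELETE before the consumer ran
		// means neither event should ever be delivered.
		for i, pending := range q.queue {
			if pending.key == e.key && pending.typ == added {
				q.queue = append(q.queue[:i], q.queue[i+1:]...)
				// BUG: the entry is removed from the queue but left in the
				// backing cache. The fix is to also drop it here:
				//     delete(q.cache, e.key)
				return
			}
		}
		delete(q.cache, e.key)
	} else {
		q.cache[e.key] = true
	}
	q.queue = append(q.queue, e)
}

// Resync re-enqueues every cached object in modified state. With the bug
// above, dead objects accumulate in the cache and get re-queued on every
// resync, so the queue grows and event processing slows over time.
func (q *eventQueue) Resync() {
	for key := range q.cache {
		q.queue = append(q.queue, event{typ: modified, key: key})
	}
}

func main() {
	q := &eventQueue{cache: map[string]bool{}}
	q.HandleEvent(event{added, "route-1"})
	q.HandleEvent(event{deleted, "route-1"}) // coalesced away, but cache leaks
	q.Resync()
	fmt.Printf("queue after resync: %d item(s), expected 0\n", len(q.queue))
}
```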
Please see the comment at https://bugzilla.redhat.com/show_bug.cgi?id=1471899#c2 for a way to tune things to work around this problem for the short term.

Thank you all for the amazing job you are doing creating this excellent platform, with elegance for the developer whilst hiding such mind-boggling complexity. I have been in the software development game long enough to avoid giving timing indications unless completely unavoidable, but would it be possible to give an estimated "probably not before" date (as opposed to an estimated fix date) for when this fix will be deployed to Starter (and Pro), as this would be extremely helpful? This issue is a show-stopper for us in terms of roll-out to production because of the effect it has on overall system availability, and the above information would be really helpful for our own development plans. Thanks very much. Robin

Hi, what is the status of this issue please? Thanks, Robin

Hi Robin, did you change the reload frequency in the router (as described in https://bugzilla.redhat.com/show_bug.cgi?id=1471899#c2)? The best way to get the number is to 'oc rsh' into the router pod and then run:

    time oc exec -n default -t $ROUTER_POD ../haproxy-reload

Run that a few times to get a good idea of how long a reload takes, add a few seconds (for future growth), and then set that on the router:

    oc env dc/router RELOAD_INTERVAL=30

This is absolutely _not_ the long-term fix, but it should get you working again and will cause the router to behave. The real fix will be to refactor the internals of the router so we don't lock the data structure during a reload. BUT that may be a little too scary to back-port; we'll have to see when we finish the refactor. Thanks, Ben

Is this only on starter us-west-2, or does it apply to all clusters? Thanks, Robin

It applies to any cluster with lots of routes that is seeing the slow route propagation issue.

*** This bug has been marked as a duplicate of bug 1471899 ***

Hi Ben, thanks - how do I apply this fix from the OpenShift Online GUI please? Robin
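To illustrate why raising RELOAD_INTERVAL helps, here is a hypothetical sketch (not the actual router code) under the assumption Ben states above: the router locks its state while haproxy reloads, so enforcing a minimum gap between reloads coalesces many route changes into a single reload instead of blocking event processing on each one.

```go
// Hypothetical sketch of rate-limited reloads. All names (router, commit,
// reload) are illustrative; the real implementation differs.
package main

import (
	"fmt"
	"sync"
	"time"
)

type router struct {
	mu             sync.Mutex
	reloadInterval time.Duration // e.g. 30s, per `oc env dc/router RELOAD_INTERVAL=30`
	lastReload     time.Time
	dirty          bool // config changed since the last reload
}

// commit is called after each route or endpoint change.
func (r *router) commit() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.dirty = true
	if time.Since(r.lastReload) < r.reloadInterval {
		// Too soon: skip the reload. (A real router also needs a timer to
		// flush the dirty state once the interval elapses; omitted here.)
		return
	}
	r.reload()
}

// reload stands in for writing the haproxy config and running
// haproxy-reload; the lock is held the whole time, which is why frequent
// reloads stall the router's event loop.
func (r *router) reload() {
	time.Sleep(100 * time.Millisecond)
	r.lastReload = time.Now()
	r.dirty = false
	fmt.Println("haproxy reloaded")
}

func main() {
	r := &router{reloadInterval: 30 * time.Second}
	for i := 0; i < 5; i++ {
		r.commit() // only the first call within the window actually reloads
	}
}
```

With a 30-second interval, the five back-to-back changes above trigger one reload rather than five, which is the behavior the workaround buys until the reload lock is refactored away.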