Bug 1465543
Summary: | [starter][us-west-2] Route takes 25+ mins to become available (503 -> 200), and continues to have intermittent 503s | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Will Gordon <wgordon>
Component: | Networking | Assignee: | Ben Bennett <bbennett>
Networking sub component: | router | QA Contact: | zhaozhanqi <zzhao>
Status: | CLOSED DUPLICATE | Docs Contact: |
Severity: | unspecified | |
Priority: | unspecified | CC: | aos-bugs, dakini, eparis, jmencak, rcwwilliams07, trankin
Version: | 3.5.1 | |
Target Milestone: | --- | |
Target Release: | 3.6.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-08-31 17:33:41 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Comment 1
Stefanie Forrester
2017-06-27 19:47:21 UTC
I believe this is the second bug fixed by https://github.com/openshift/origin/pull/14232 -- When handleEvent() is called to ADD a route, then called again to DELETE the route before Pop() processes the ADD, there is no need for the route to remain in store. handleEvent() can remove it from store. Deleted routes that are found in store during a Resync() are deleted by Pop() as they are encountered. --

What really happens is that the event queue has a producer side (the watch events) and a consumer side (the router pulling events) with a queue in the middle, because the router can't always handle events immediately, so we need a buffer between them. The event queue does a few extra things:

- It maintains a cache of all "live" objects, so that when a delete event happens, it can provide the deleted object along with the event.
- It coalesces events, so that when an ADD and a DELETE watch event are produced before the consumer (the router) gets around to handling them, the event is simply dropped and never delivered.
- When the event queue is resynchronized, all objects in the cache are added back into the queue in modified state.

These three behaviors are good... but they had a bug that led to a nasty interaction. When the ADD then DELETE happened before the event was consumed, the event was removed from the queue but left in the backing cache. When a resync happened, all of those deleted objects were added back into the queue and had to be processed. Over time, that jams up the event processing in the router and makes it more likely that an ADD and DELETE will happen before being processed, so it keeps adding more and more dead items to the queue. These junk entries would never be deleted, just keep building up in the cache, and would be re-added to the queue on every resync.

In this case, I suspect the endpoints queue is misbehaving. We can tell if this is the problem by restarting the routers and seeing if they are fast again. If they are, this is likely the culprit.

This is fixed in 3.6 by https://bugzilla.redhat.com/show_bug.cgi?id=1447928. The 3.5 backport is https://bugzilla.redhat.com/show_bug.cgi?id=1464563. The 3.4 backport is https://bugzilla.redhat.com/show_bug.cgi?id=1464567. Neither backport has been released yet.

Ok, so after a restart of the routers it is down to 3 minutes. However, with 8024 routes and 9733 endpoints, the openshift-router process takes about 30% CPU in steady state shortly after the restart. So, while I think the bug mentioned above is causing trouble, there's something else going on too. Debugging further.

The load on haproxy is 100%, so we need to work out how to spread that out... maybe smaller infra nodes, but more of them. We also need to look at how to make haproxy use multiple cores.

That does not explain why the openshift-router process itself is consistently taking 25% CPU (on one core), whereas a test router we stood up with essentially identical config takes about 1% of a core. Both are on the same cluster, etc. In order to debug that, I would like to enable the debugging endpoints on their routers with:

    oc env dc/router OPENSHIFT_PROFILE=web

But we aren't able to make changes before the holiday weekend.

Also, there's a ton of logspam due to https://bugzilla.redhat.com/show_bug.cgi?id=1401503, so upgrading to 3.6 will fix that problem. But the overall load still seems to be high and we continue to investigate.
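The following is a minimal sketch of the queue/cache interaction described above, assuming a much-simplified event queue; the names (eventQueue, HandleEvent, Resync) are illustrative, not the actual types in openshift/origin -- see PR 14232 for the real fix.

```go
// Minimal sketch of the coalescing bug: a DELETE that cancels a pending
// ADD removes the event from the queue but leaks the object in the cache,
// so every resync re-enqueues dead objects.
package main

import "fmt"

type eventType int

const (
	added eventType = iota
	modified
	deleted
)

type event struct {
	typ eventType
	key string
}

type eventQueue struct {
	queue []event         // pending events for the consumer (the router)
	cache map[string]bool // "live" objects, kept so deletes can carry the object
}

// HandleEvent is the producer side, fed by the watch.
func (q *eventQueue) HandleEvent(e event) {
	if e.typ == deleted {
		// Coalesce: an ADD followed by a DELETE before the consumer ran
		// means neither event should ever be delivered.
		for i, pending := range q.queue {
			if pending.key == e.key && pending.typ == added {
				q.queue = append(q.queue[:i], q.queue[i+1:]...)
				// BUG: the entry is removed from the queue but left in the
				// backing cache. The fix is to also drop it here:
				//     delete(q.cache, e.key)
				return
			}
		}
		delete(q.cache, e.key)
	} else {
		q.cache[e.key] = true
	}
	q.queue = append(q.queue, e)
}

// Resync re-enqueues every cached object in modified state. With the bug
// above, dead objects accumulate in the cache and get re-queued on every
// resync, so the queue grows and event processing slows over time.
func (q *eventQueue) Resync() {
	for key := range q.cache {
		q.queue = append(q.queue, event{typ: modified, key: key})
	}
}

func main() {
	q := &eventQueue{cache: map[string]bool{}}
	q.HandleEvent(event{added, "route-1"})
	q.HandleEvent(event{deleted, "route-1"}) // coalesced away, but cache leaks
	q.Resync()
	fmt.Printf("queue after resync: %d item(s), expected 0\n", len(q.queue))
}
```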
Please see the comment at https://bugzilla.redhat.com/show_bug.cgi?id=1471899#c2 for a way to tune things to work around this problem for the short term.

Thank you all for the amazing job you are doing creating this excellent platform, with elegance for the developer whilst hiding such mind-boggling complexity. I have been in the software development game long enough to avoid giving timing indications unless completely unavoidable, but would it be possible to give an estimated "probably not before" date (as opposed to an estimated fix date) for when this fix will be deployed to Starter (and Pro), as this would be extremely helpful? This issue is a show-stopper for us in terms of roll-out to production because of the effect it has on overall system availability, and the above information would be really helpful for our own development plans. Thanks very much. Robin

Hi, what is the status of this issue please? Thanks, Robin

Hi Robin, did you change the reload frequency in the router (as described in https://bugzilla.redhat.com/show_bug.cgi?id=1471899#c2)? The best way to get the number is to 'oc rsh' into the router pod and then run:

    time oc exec -n default -t $ROUTER_POD ../haproxy-reload

Run that a few times to get a good idea of how long a reload takes, add a few seconds (for future growth), and then set that on the router:

    oc env dc/router RELOAD_INTERVAL=30

This is absolutely _not_ the long-term fix, but it should get you working again and will cause the router to behave. The real fix will be to refactor the internals of the router so we don't lock the data structure during a reload. BUT that may be a little too scary to back-port; we'll have to see when we finish the refactor. Thanks, Ben

Is this only on starter us-west-2, or does it apply to all clusters? Thanks, Robin

It applies to any cluster with lots of routes that is seeing the slow route propagation issue.

*** This bug has been marked as a duplicate of bug 1471899 ***

Hi Ben, thanks - how do I apply this fix from the OpenShift Online GUI please? Robin
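To illustrate why raising RELOAD_INTERVAL helps, here is a hypothetical sketch (not the actual router code) under the assumption Ben states above: the router locks its state while haproxy reloads, so enforcing a minimum gap between reloads coalesces many route changes into a single reload instead of blocking event processing on each one.

```go
// Hypothetical sketch of rate-limited reloads. All names (router, commit,
// reload) are illustrative; the real implementation differs.
package main

import (
	"fmt"
	"sync"
	"time"
)

type router struct {
	mu             sync.Mutex
	reloadInterval time.Duration // e.g. 30s, per `oc env dc/router RELOAD_INTERVAL=30`
	lastReload     time.Time
	dirty          bool // config changed since the last reload
}

// commit is called after each route or endpoint change.
func (r *router) commit() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.dirty = true
	if time.Since(r.lastReload) < r.reloadInterval {
		// Too soon: skip the reload. (A real router also needs a timer to
		// flush the dirty state once the interval elapses; omitted here.)
		return
	}
	r.reload()
}

// reload stands in for writing the haproxy config and running
// haproxy-reload; the lock is held the whole time, which is why frequent
// reloads stall the router's event loop.
func (r *router) reload() {
	time.Sleep(100 * time.Millisecond)
	r.lastReload = time.Now()
	r.dirty = false
	fmt.Println("haproxy reloaded")
}

func main() {
	r := &router{reloadInterval: 30 * time.Second}
	for i := 0; i < 5; i++ {
		r.commit() // only the first call within the window actually reloads
	}
}
```

With a 30-second interval, the five back-to-back changes above trigger one reload rather than five, which is the behavior the workaround buys until the reload lock is refactored away.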