Bug 1624078
| Summary: | Intermittent error on OpenShift HAProxy Router reload | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mauricio Magnani <mmagnani> | |
| Component: | Networking | Assignee: | Ram Ranganathan <ramr> | |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | high | |||
| Priority: | high | CC: | acavalla, ajuricic, akaiser, aos-bugs, bbennett, bleanhar, bperkins, cstark, dmace, guilherme.camposo, hongli, jolee, mmagnani, openshift-bugs-escalate, ramr, rhowe, rpenta | |
| Version: | 3.9.0 | |||
| Target Milestone: | --- | |||
| Target Release: | 3.9.z | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: |
Cause: When wildcard routes are enabled and namespace ownership checks are disabled, non-wildcard routes are removed and immediately re-added at each resync interval. This causes a brief outage on the affected route.
Consequence: Non-wildcard routes return intermittent errors.
Fix: The router no longer removes and re-adds routes at the resync interval when wildcard routes are enabled and namespace ownership checks are disabled.
Result: Non-wildcard routes continue to serve traffic without intermittent errors.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1647176 (view as bug list) | Environment: | ||
| Last Closed: | 2018-12-13 19:27:05 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1647176 | |||
|
Description
Mauricio Magnani
2018-08-30 20:46:57 UTC
Hi @Ram, based on our conversation yesterday, we tried to monitor what was causing the monitored routes to be deleted and readmitted. What we noticed was that every route was readmitted, but the effect is not noticeable on all routes. Another thing we noticed was that this happens at 10-minute intervals. We believe this is related to: https://bugzilla.redhat.com/show_bug.cgi?id=1320233

We did a test and changed the resync interval to 3 minutes, and now the errors appear every 3 minutes.

Just to summarize: the customer is receiving 503 responses from HAProxy at 10-minute intervals. We noticed this might be related to the forced synchronization of the routes. There are two forms of forced synchronization. One we can change by setting --resync-interval in the container cmd; the other is hardcoded here: https://github.com/openshift/origin/blob/71543b2d15e53f4ae56272988a6604bf2f790dfd/pkg/cmd/infra/router/template.go#L418

By changing the first form to 3 minutes, we were able to change the behavior to 503s every 3 minutes. Another curious thing is that routes are being deleted and readmitted in HAProxy frequently, even though the routes supposedly weren't updated.

@Ram any updates from Ravi? We are in a very difficult situation with the customer; we need at least a workaround to avoid bigger problems.

Created a PR against master: https://github.com/openshift/origin/pull/21053

Setting priority to high since the customer is expecting to receive the patch asap.

Associated 3.9 backport PR: https://github.com/openshift/ose/pull/1422
Associated 3.10 backport PR: https://github.com/openshift/ose/pull/1423
Waiting on merges.

Apologies for the delay - the PRs were ready a while back and slipped through the cracks. @Brenton, can you please help? Ben doesn't have permissions to merge this into the OSE repo. Thanks a ton.

Hey Ram, just following up here since the PRs merged. Anything else left to do?

@Dan, the work's done on this one. Am not sure what the next step on the OSE front is ... basically what's the procedure to convert those PRs to actual router images for those 2 backported releases? And for QE to verify? Since Brenton's out, am not sure who would be the best person to ask/see what else is needed here.

@zhaozhanqi do you know what we need to do here? Thx

Setting this to POST and will create a separate bugz for OSE 3.10.

Cloned bugz for backporting to OSE 3.10 is: https://bugzilla.redhat.com/show_bug.cgi?id=1647176

Thank you Ram.

After increasing the log level to 4 on the router, I can see the logs below on OCP v3.9.41 while the router is reloading, but they do not appear on OCP v3.9.55. So the fix has been verified.

I1130 08:08:06.486294 1 unique_host.go:211] Deleting routes for hongli/service-unsecure
I1130 08:08:06.486298 1 plugin.go:187] Deleting route hongli/service-unsecure
I1130 08:08:06.486312 1 unique_host.go:195] Route hongli/service-unsecure claims service-unsecure-hongli.apps.1130-g2b.qe.rhcloud.com
I1130 08:08:06.486322 1 status.go:245] admit: route already admitted
I1130 08:08:06.486331 1 router.go:682] Adding route hongli/service-unsecure

BTW, I tried using curl/ab but unfortunately didn't get the 503 error during router reload.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748 |
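For anyone who wants to check their own router logs for the same delete/re-add pattern, here is a minimal sketch. It assumes the router runs at log level 4 and emits lines in the format quoted in the verification comment. The function name `find_readmissions`, the regular expressions, and the one-second window are illustrative assumptions and are not part of the router itself.

```python
import re
from datetime import datetime, timedelta

# Log header as seen in the verification comment:
#   I1130 08:08:06.486298 1 plugin.go:187] Deleting route hongli/service-unsecure
LINE_RE = re.compile(
    r"^[IWEF]\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+\s+"
    r"[\w.]+:\d+\] (?P<msg>.*)$"
)
# "Deleting route <ns>/<name>" (does not match "Deleting routes for ...").
DELETE_RE = re.compile(r"^Deleting route (?P<route>\S+)$")
ADD_RE = re.compile(r"^Adding route (?P<route>\S+)$")


def find_readmissions(lines, window=timedelta(seconds=1)):
    """Return (route, deleted_at, added_at) for every route that is
    deleted and then added again within `window`.

    Only the time of day is parsed, so a pair that straddles midnight
    is ignored rather than reported.
    """
    pending = {}  # route -> time of its most recent delete
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m["time"], "%H:%M:%S.%f")
        deleted = DELETE_RE.match(m["msg"])
        if deleted:
            pending[deleted["route"]] = ts
            continue
        added = ADD_RE.match(m["msg"])
        if added and added["route"] in pending:
            deleted_at = pending.pop(added["route"])
            if timedelta(0) <= ts - deleted_at <= window:
                hits.append((added["route"], deleted_at.time(), ts.time()))
    return hits


# The log excerpt from the verification comment, used as sample input.
SAMPLE = """\
I1130 08:08:06.486294 1 unique_host.go:211] Deleting routes for hongli/service-unsecure
I1130 08:08:06.486298 1 plugin.go:187] Deleting route hongli/service-unsecure
I1130 08:08:06.486312 1 unique_host.go:195] Route hongli/service-unsecure claims service-unsecure-hongli.apps.1130-g2b.qe.rhcloud.com
I1130 08:08:06.486322 1 status.go:245] admit: route already admitted
I1130 08:08:06.486331 1 router.go:682] Adding route hongli/service-unsecure
""".splitlines()

if __name__ == "__main__":
    for route, deleted_at, added_at in find_readmissions(SAMPLE):
        print(f"{route}: deleted {deleted_at}, re-added {added_at}")
```

Grouping the reported timestamps by route should show whether the pairs recur at the configured resync interval (10 minutes by default, or 3 minutes in the test described in this bug).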