Bug 1492189
| Summary: | [starter-us-east-1] Traffic passing through the router takes two orders of magnitude longer to serve than locally | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Networking sub component: | router | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | aos-bugs, bbreard, eparis, fweimer, jeder, mjenner, smunilla, yufchang, zzhao |
| Version: | 3.7.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-28 22:10:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2017-09-15 17:12:06 UTC

My expectation would be that the latency of a request via the router would be at least as fast as a TLS port-forward (<240ms). A few other clusters were reasonably fast (nowhere near this amount of delay).

We hopped on the cluster and observed that:

    strace -p <pid of an haproxy>

was basically continually connecting to IP addresses. The theory is that the health checks are causing the problem. Set ROUTER_BACKEND_CHECK_INTERVAL=90s and the problem resolved. Watching to see if the problem is simply that, or if a container restart really fixed it and the problem may return. (Sketches of these checks and of the workaround appear at the end of this report.)

So... no change after the weekend; performance was still 0.1s with:

    time curl -I prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com

Posted a PR to turn off health checks when there is only one endpoint: https://github.com/openshift/origin/pull/16643

Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/4833eb5d9f770fe8ab2b991f1c4d114cd09bf99c
Made the router skip health checks when there is one endpoint

If there is only one endpoint for a route, there is no point in doing health checks. If the endpoint is down, haproxy will fail to connect. Skipping the checks helps tremendously on servers with large numbers of routes, because reducing any checking means the router doesn't spend a lot of time doing health checks pointlessly.

Fixes bug 1492189 (https://bugzilla.redhat.com/show_bug.cgi?id=1492189)

https://github.com/openshift/origin/commit/f6a5067e021a534a5d2dd82a0a693f8f98805b0f
Merge pull request #16643 from knobunc/fix/router-skip-health-when-one-endpoint

Automatic merge from submit-queue (batch tested with PRs 16545, 16684, 16643, 16459, 16682).

Tested this issue on OCP v3.7.0-0.188.0; it has been fixed.

1. Create an rc with 1 pod.
2. Create a svc and route.
3. Check haproxy.config; with only one backend there is no health check:

       server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256

4. Scale the pods to 2.
5. Check haproxy.config; the health checks are now present:

       server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256 check inter 5000ms
       server pod:test-rc-n2g9r:test-service:10.130.0.141:8080 10.130.0.141:8080 cookie 85940957b6601c50718f24c1382119ba weight 256 check inter 5000ms

6. Scale the pods back to 1.
7. Check haproxy.config again; same as step 3. (An oc sketch of these steps appears below.)

Since this bug was reported on Online (starter-us-east-1), it will also be verified there once that cluster is upgraded.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
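
For reference, a minimal way to measure request latency through the router from a shell. The route hostname is the one from this report; the curl -w timing variables are standard curl, and the loop is just to average out noise, neither is from the report:

```
# Time three HEAD requests through the router; time_connect/time_total are
# curl's built-in timing variables (seconds). Body/headers go to /dev/null.
for i in 1 2 3; do
  curl -sS -o /dev/null -I \
    -w 'connect=%{time_connect}s total=%{time_total}s\n' \
    prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com
done
```

A healthy route should come in around the ~0.1s seen after the workaround; the two-orders-of-magnitude regression shows up directly in time_total.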
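The strace observation from the description can be reproduced along these lines (a sketch; pgrep -o just picks the oldest matching haproxy PID, and it assumes you are running where the router's haproxy processes are visible, e.g. inside the router pod or on its node):

```
# Follow forks and show only connect() syscalls; with aggressive backend
# health checks this prints a near-continuous stream of connects to pod IPs.
strace -f -e trace=connect -p "$(pgrep -o haproxy)"
```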
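The workaround from the description, as an oc command. ROUTER_BACKEND_CHECK_INTERVAL is the variable named in the report; dc/router in the default namespace is the usual 3.x router deployment config and is an assumption, so adjust the name and namespace to your cluster:

```
# Stretch the backend health-check interval to 90s; the router redeploys
# and regenerates haproxy.config with the new interval.
oc -n default set env dc/router ROUTER_BACKEND_CHECK_INTERVAL=90s
```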
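And the verification steps above, sketched as oc commands. The rc/service names match the ones visible in the haproxy.config lines in this report; the manifest file name and the haproxy.config path inside the router pod (/var/lib/haproxy/conf/haproxy.config on 3.x router images) are assumptions:

```
# 1-2. rc with one replica, plus a service and a route in front of it.
oc create -f test-rc.yaml                              # assumed manifest: rc "test-rc", port 8080
oc expose rc test-rc --name=test-service --port=8080   # service
oc expose svc test-service                             # route

# 3. With a single endpoint the server line should carry no "check inter ...".
oc -n default rsh dc/router grep test-service /var/lib/haproxy/conf/haproxy.config

# 4-5. At two replicas both server lines should show "check inter 5000ms".
#      (Allow a few seconds for the router to reload between steps.)
oc scale rc test-rc --replicas=2
oc -n default rsh dc/router grep test-service /var/lib/haproxy/conf/haproxy.config

# 6-7. Back to one replica; the check should disappear again.
oc scale rc test-rc --replicas=1
oc -n default rsh dc/router grep test-service /var/lib/haproxy/conf/haproxy.config
```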
My expectation would be that the latency of a request via the router would be as least as fast as TLS port forward (<240ms) A few other clusters were reasonably fast (no where near this amount of delay) We hopped on the cluster and observed that: strace -p <pid of an haproxy> Was basically continually connecting to IP addresses. Theory is that the health checks are causing the problem. Set ROUTER_BACKEND_CHECK_INTERVAL=90s and the problem resolved. Watching to see if the problem is simply that, or if a container restart really fixed it and the problem may return. So... no change after the weekend, performance was still 0.1s with: time curl -I prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com Posted a PR to turn off health checks when there is only one endpoint: https://github.com/openshift/origin/pull/16643 Commits pushed to master at https://github.com/openshift/origin https://github.com/openshift/origin/commit/4833eb5d9f770fe8ab2b991f1c4d114cd09bf99c Made the router skip health checks when there is one endpoint If there is only one endpoint for a route, there is no point to doing health checks. If the endpoint is down, haproxy will fail to connect. Skipping the checks helps tremendously on servers with large numbers of routes, because reducing any checking means the router doesn't spend a lot of time doing health checks pointlessly. Fixes bug 1492189 (https://bugzilla.redhat.com/show_bug.cgi?id=1492189) https://github.com/openshift/origin/commit/f6a5067e021a534a5d2dd82a0a693f8f98805b0f Merge pull request #16643 from knobunc/fix/router-skip-health-when-one-endpoint Automatic merge from submit-queue (batch tested with PRs 16545, 16684, 16643, 16459, 16682). Made the router skip health checks when there is one endpoint If there is only one endpoint for a route, there is no point to doing health checks. If the endpoint is down, haproxy will fail to connect. Skipping the checks helps tremendously on servers with large numbers of routes, because reducing any checking means the router doesn't spend a lot of time doing health checks pointlessly. Fixes bug 1492189 (https://bugzilla.redhat.com/show_bug.cgi?id=1492189) Tested this issue on OCP version (v3.7.0-0.188.0), it has been fixed 1. Create rc with 1 pod 2. Create svc/route 3. Check the haproxy.config, there is no health check for only 1 backend server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256 4. Scale the pod to 2 5. Check the haproxy.config. the healthy check will be existed server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256 check inter 5000ms server pod:test-rc-n2g9r:test-service:10.130.0.141:8080 10.130.0.141:8080 cookie 85940957b6601c50718f24c1382119ba weight 256 check inter 5000ms 6. scale the pod to 1 7. Check the haproxy.config again, same step 3. since this bug was reported in online(starter-us-east-1). so will verify this once it is upgraded. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188 |