On us-east-1 the router takes 3-10s to return a 280k HTTP response over TLS. From within the pod, or from another pod in the same namespace, the same request takes 20ms (no TLS, locally), 180ms (TLS, locally) and 210ms (TLS, from another pod). If I port-forward into the pod from my local machine, the response comes back in 240ms (TLS over the port forward). Hitting the public route URL over plain HTTP took 1s, and that request only talks to the router, so this may be a serious failure in the router. The router appears to be seriously delaying at least this one application (prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com), and potentially other apps, to a degree that makes the service completely unusable.
My expectation is that a request via the router would be at least as fast as the TLS port forward (<240ms).
A few other clusters were reasonably fast (nowhere near this amount of delay).
We hopped on the cluster and observed that strace -p <pid of an haproxy> showed haproxy almost continually connecting to IP addresses. The theory is that the health checks are causing the problem. We set ROUTER_BACKEND_CHECK_INTERVAL=90s and the problem resolved. We are watching to see whether the check interval is the whole story, or whether the container restart is what really fixed it and the problem may return.
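One way to put numbers on the "continually connecting" observation is to tally the connect() targets in the strace output. This is only a sketch: the helper name summarize_connects is made up here, and it assumes strace prints IPv4 targets in its usual inet_addr("...") form.

```shell
#!/bin/sh
# summarize_connects: read strace output on stdin and print how many
# connect() calls went to each IPv4 address, busiest first.
# Hypothetical helper, not part of the router or of strace.
summarize_connects() {
    grep 'connect(' |
        sed -n 's/.*inet_addr("\([0-9.]*\)").*/\1/p' |
        sort | uniq -c | sort -rn
}

# Example use on the node (strace writes its trace to stderr):
#   strace -f -e trace=connect -p <pid of an haproxy> 2>&1 | summarize_connects
```

The interval itself is an environment variable on the router deployment. Assuming the deployment config is named router, something like oc set env dc/router ROUTER_BACKEND_CHECK_INTERVAL=90s would apply the value used above.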
So... no change after the weekend. Performance was still 0.1s with:
    time curl -I prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com
Posted a PR to turn off health checks when there is only one endpoint: https://github.com/openshift/origin/pull/16643
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/4833eb5d9f770fe8ab2b991f1c4d114cd09bf99c
Made the router skip health checks when there is one endpoint

If there is only one endpoint for a route, there is no point to doing health checks. If the endpoint is down, haproxy will fail to connect. Skipping the checks helps tremendously on servers with large numbers of routes, because reducing any checking means the router doesn't spend a lot of time doing health checks pointlessly.

Fixes bug 1492189 (https://bugzilla.redhat.com/show_bug.cgi?id=1492189)

https://github.com/openshift/origin/commit/f6a5067e021a534a5d2dd82a0a693f8f98805b0f
Merge pull request #16643 from knobunc/fix/router-skip-health-when-one-endpoint

Automatic merge from submit-queue (batch tested with PRs 16545, 16684, 16643, 16459, 16682).

Made the router skip health checks when there is one endpoint

If there is only one endpoint for a route, there is no point to doing health checks. If the endpoint is down, haproxy will fail to connect. Skipping the checks helps tremendously on servers with large numbers of routes, because reducing any checking means the router doesn't spend a lot of time doing health checks pointlessly.

Fixes bug 1492189 (https://bugzilla.redhat.com/show_bug.cgi?id=1492189)
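The rule described in the commit message can be illustrated with a small shell sketch. This is not the router's actual code (that lives in the commits linked above). The function name render_servers is hypothetical, and the server line is simplified from the haproxy.config excerpts in this report (no cookie field).

```shell
#!/bin/sh
# render_servers INTERVAL ENDPOINT...
# Prints one HAProxy "server" line per endpoint (ip:port).
# Hypothetical helper for illustration only; simplified line layout.
render_servers() {
    interval=$1
    shift
    check=""
    # Only health-check when there is more than one endpoint. With a single
    # endpoint, a failed connect already tells haproxy the backend is down.
    if [ "$#" -gt 1 ]; then
        check=" check inter ${interval}"
    fi
    for ep in "$@"; do
        printf 'server pod:%s %s weight 256%s\n' "$ep" "$ep" "$check"
    done
}
```

Calling render_servers 5000ms 10.129.1.8:8080 prints a single server line with no check options, while passing two endpoints appends check inter 5000ms to both lines.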
Tested this issue on OCP version v3.7.0-0.188.0; it has been fixed.

1. Create an rc with 1 pod.
2. Create the svc/route.
3. Check haproxy.config; there is no health check when there is only 1 backend:
    server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256
4. Scale the rc to 2 pods.
5. Check haproxy.config; the health check is now present:
    server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256 check inter 5000ms
    server pod:test-rc-n2g9r:test-service:10.130.0.141:8080 10.130.0.141:8080 cookie 85940957b6601c50718f24c1382119ba weight 256 check inter 5000ms
6. Scale the rc back down to 1 pod.
7. Check haproxy.config again; the result is the same as in step 3.

Since this bug was reported against Online (starter-us-east-1), I will verify it there once that cluster is upgraded.
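The haproxy.config checks in steps 3, 5 and 7 can be scripted instead of read by eye. The helper below is a sketch under assumptions: check_summary is a made-up name, and it simply counts server lines with and without the check keyword in whatever file it is given.

```shell
#!/bin/sh
# check_summary FILE
# Prints "<checked> <unchecked>": the number of "server" lines that do and
# do not carry the "check" keyword. Hypothetical helper for illustration.
check_summary() {
    awk '
        /^[[:space:]]*server[[:space:]]/ {
            if ($0 ~ / check( |$)/) checked++
            else unchecked++
        }
        END { printf "%d %d\n", checked, unchecked }
    ' "$1"
}
```

With one pod the expected result is "0 1" for that backend, and after scaling to two pods it should be "2 0". On a real router the file contains every route's backend, so the counts would need to be narrowed to the backend under test first.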
Verified this bug according to comment 6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188