Bug 1492189 - [starter-us-east-1]Traffic passing through the router takes two orders of magnitude longer to serve than locally
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-15 17:12 UTC by Clayton Coleman
Modified: 2022-08-04 22:20 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 22:10:58 UTC
Target Upstream Version:
Embargoed:




Links
Origin (Github) 16643 (last updated 2017-10-02 19:26:01 UTC)
Red Hat Product Errata RHSA-2017:3188 (normal, SHIPPED_LIVE): Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update (last updated 2017-11-29 02:34:54 UTC)

Description Clayton Coleman 2017-09-15 17:12:06 UTC
On us-east-1 the router is taking 3-10s to return a 280k TLS HTTP response, while within the pod or from another pod in the same namespace the delay is 20ms (plain HTTP locally), 180ms (TLS locally), and 210ms (TLS from another pod).  If I port-forward into the pod from my local machine, the response is returned in 240ms (TLS over port-forward).

Hitting the public route URL over plain HTTP, which only involves the router, took 1s, so this may be a serious failure in the router.

It appears the router is seriously delaying at least this one application (prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com), and potentially other apps, to a degree that makes the service completely unusable.
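
As a rough sketch, the route-versus-port-forward comparison can be reproduced with curl timing; the pod name and the 9090 port below are hypothetical placeholders, not values taken from this report:

  # Time a request through the public route (passes through the router).
  curl -k -s -o /dev/null -w 'route: %{time_total}s\n' \
    https://prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com/

  # Time the same request over a port-forward that bypasses the router.
  # <prometheus-pod> and port 9090 are placeholders.
  oc port-forward <prometheus-pod> 9090:9090 &
  sleep 2
  curl -k -s -o /dev/null -w 'port-forward: %{time_total}s\n' https://localhost:9090/
  kill %1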

Comment 1 Clayton Coleman 2017-09-15 17:17:16 UTC
My expectation would be that the latency of a request via the router would be at least as fast as a TLS port-forward (<240ms).

Comment 2 Clayton Coleman 2017-09-15 22:38:44 UTC
A few other clusters were reasonably fast (nowhere near this amount of delay).

Comment 3 Ben Bennett 2017-09-29 20:04:21 UTC
We hopped on the cluster and observed that running:
  strace -p <pid of an haproxy>

showed the process basically continually connecting to IP addresses.

The theory is that the health checks are causing the problem.

Setting ROUTER_BACKEND_CHECK_INTERVAL=90s resolved the problem.  Watching to see whether that setting alone is the fix, or whether the container restart was the real fix and the problem may return.
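
For reference, a minimal sketch of applying that workaround; the deploymentconfig name "router" and the "default" namespace are assumptions about a stock installation, not details recorded here:

  # Stretch the backend health-check interval from the 5000ms default to 90s.
  # Changing the env var triggers a redeployment of the router pods.
  oc set env dc/router ROUTER_BACKEND_CHECK_INTERVAL=90s -n default

  # Confirm the value is set on the deploymentconfig.
  oc set env dc/router --list -n default | grep ROUTER_BACKEND_CHECK_INTERVAL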

Comment 4 Ben Bennett 2017-10-02 19:27:34 UTC
So... no change after the weekend; performance held at about 0.1s with:
  time curl -I prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com

Posted a PR to turn off health checks when there is only one endpoint:
  https://github.com/openshift/origin/pull/16643
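
As a sketch, one way to confirm from a running router whether backends still carry health checks; the router pod name is a placeholder and the config path is the usual location in the 3.x router image, which may vary:

  # Count backend server lines that carry an active health check.
  # router-1-abcde is a placeholder pod name.
  oc -n default exec router-1-abcde -- \
    grep -c 'check inter' /var/lib/haproxy/conf/haproxy.config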

Comment 5 openshift-github-bot 2017-10-05 21:07:49 UTC
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/4833eb5d9f770fe8ab2b991f1c4d114cd09bf99c
Made the router skip health checks when there is one endpoint

If there is only one endpoint for a route, there is no point in doing
health checks.  If the endpoint is down, haproxy will fail to connect
anyway.  Skipping the checks helps tremendously on servers with large
numbers of routes, because dropping the unnecessary checking means the
router doesn't spend a lot of time doing health checks pointlessly.

Fixes bug 1492189 (https://bugzilla.redhat.com/show_bug.cgi?id=1492189)

https://github.com/openshift/origin/commit/f6a5067e021a534a5d2dd82a0a693f8f98805b0f
Merge pull request #16643 from knobunc/fix/router-skip-health-when-one-endpoint

Automatic merge from submit-queue (batch tested with PRs 16545, 16684, 16643, 16459, 16682).

Made the router skip health checks when there is one endpoint

If there is only one endpoint for a route, there is no point in doing
health checks.  If the endpoint is down, haproxy will fail to connect
anyway.  Skipping the checks helps tremendously on servers with large
numbers of routes, because dropping the unnecessary checking means the
router doesn't spend a lot of time doing health checks pointlessly.

Fixes bug 1492189 (https://bugzilla.redhat.com/show_bug.cgi?id=1492189)

Comment 6 zhaozhanqi 2017-11-01 09:39:22 UTC
Tested this issue on OCP v3.7.0-0.188.0; it has been fixed.

1. Create an rc with 1 pod
2. Create a svc and route for it
3. Check the haproxy.config; with only 1 backend there is no health check:

server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256

4. Scale the pods to 2
5. Check the haproxy.config; the health check now exists:
   server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256 check inter 5000ms
   server pod:test-rc-n2g9r:test-service:10.130.0.141:8080 10.130.0.141:8080 cookie 85940957b6601c50718f24c1382119ba weight 256 check inter 5000ms
6. Scale the pods back down to 1
7. Check the haproxy.config again; same result as step 3 (a command-level sketch of these steps follows below)
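
A minimal sketch of those steps as oc commands (3.x client syntax); the openshift/hello-openshift image, the route name, and the <router-pod> placeholder are illustrative assumptions, and the config path may vary by image:

  # 1-2. Create an rc with one pod, then a service and a route in front of it.
  oc run test-rc --image=openshift/hello-openshift --generator=run/v1 --port=8080
  oc expose rc test-rc --name=test-service --port=8080
  oc expose svc test-service --name=test-route

  # 3. With a single endpoint the server line should carry no "check inter".
  oc -n default exec <router-pod> -- grep test-service /var/lib/haproxy/conf/haproxy.config

  # 4-5. Scale to two pods; both server lines should now show "check inter 5000ms".
  oc scale rc test-rc --replicas=2
  oc -n default exec <router-pod> -- grep test-service /var/lib/haproxy/conf/haproxy.config

  # 6-7. Scale back to one pod and confirm the check disappears again.
  oc scale rc test-rc --replicas=1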

Since this bug was reported against OpenShift Online (starter-us-east-1), it will be verified there once that cluster is upgraded.

Comment 9 zhaozhanqi 2017-11-10 01:29:27 UTC
Verified this bug according to comment 6

Comment 13 errata-xmlrpc 2017-11-28 22:10:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

