Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1492189

Summary:	[starter-us-east-1]Traffic passing through the router takes two orders of magnitude longer to serve than locally
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Networking	Assignee:	Ben Bennett <bbennett>
Networking sub component:	router	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	unspecified	CC:	aos-bugs, bbreard, eparis, fweimer, jeder, mjenner, smunilla, yufchang, zzhao
Version:	3.7.0
Target Milestone:	---
Target Release:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-11-28 22:10:58 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Clayton Coleman 2017-09-15 17:12:06 UTC

On us-east-1 the router is taking 3-10s to return a 280k TLS HTTP response, while within the pod or from another pod within the same namespace the delay is 20ms (no HTTP locally), 180ms (TLS locally) and 210ms (TLS from another pod).  If I port forward into the pod from my local machine, the response is returned in 240ms (TLS port forward).

Hitting the public route URL took 1s for HTTP, which only talks to the router, so this may be a serious failure in the router.

It appears the router is seriously delaying this one application at a minimum (prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com) and potentially other apps at a level that makes the service completely unusable.

Comment 1 Clayton Coleman 2017-09-15 17:17:16 UTC

My expectation would be that the latency of a request via the router would be as least as fast as TLS port forward (<240ms)

Comment 2 Clayton Coleman 2017-09-15 22:38:44 UTC

A few other clusters were reasonably fast (no where near this amount of delay)

Comment 3 Ben Bennett 2017-09-29 20:04:21 UTC

We hopped on the cluster and observed that:
  strace -p <pid of an haproxy>

Was basically continually connecting to IP addresses.

Theory is that the health checks are causing the problem.

Set ROUTER_BACKEND_CHECK_INTERVAL=90s and the problem resolved.  Watching to see if the problem is simply that, or if a container restart really fixed it and the problem may return.

Comment 4 Ben Bennett 2017-10-02 19:27:34 UTC

So... no change after the weekend, performance was still 0.1s with:
  time curl -I prometheus-openshift-devops-monitor.1d35.starter-us-east-1.openshiftapps.com

Posted a PR to turn off health checks when there is only one endpoint:
  https://github.com/openshift/origin/pull/16643

Comment 5 openshift-github-bot 2017-10-05 21:07:49 UTC

Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/4833eb5d9f770fe8ab2b991f1c4d114cd09bf99c
Made the router skip health checks when there is one endpoint

If there is only one endpoint for a route, there is no point to doing
health checks.  If the endpoint is down, haproxy will fail to connect.
Skipping the checks helps tremendously on servers with large numbers
of routes, because reducing any checking means the router doesn't
spend a lot of time doing health checks pointlessly.

Fixes bug 1492189 (https://bugzilla.redhat.com/show_bug.cgi?id=1492189)

https://github.com/openshift/origin/commit/f6a5067e021a534a5d2dd82a0a693f8f98805b0f
Merge pull request #16643 from knobunc/fix/router-skip-health-when-one-endpoint

Automatic merge from submit-queue (batch tested with PRs 16545, 16684, 16643, 16459, 16682).

Made the router skip health checks when there is one endpoint

If there is only one endpoint for a route, there is no point to doing
health checks.  If the endpoint is down, haproxy will fail to connect.
Skipping the checks helps tremendously on servers with large numbers
of routes, because reducing any checking means the router doesn't
spend a lot of time doing health checks pointlessly.

Fixes bug 1492189 (https://bugzilla.redhat.com/show_bug.cgi?id=1492189)

Comment 6 zhaozhanqi 2017-11-01 09:39:22 UTC

Tested this issue on OCP version (v3.7.0-0.188.0), it has been fixed

1. Create rc with 1 pod
2. Create svc/route
3. Check the haproxy.config, there is no health check for only 1 backend

server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256

4. Scale the pod to 2
5. Check the haproxy.config. the healthy check will be existed
   server pod:test-rc-xwvvx:test-service:10.129.1.8:8080 10.129.1.8:8080 cookie ec5f4b9bf03b15e580c958863dddb8eb weight 256 check inter 5000ms
  server pod:test-rc-n2g9r:test-service:10.130.0.141:8080 10.130.0.141:8080 cookie 85940957b6601c50718f24c1382119ba weight 256 check inter 5000ms
6. scale the pod to 1
7. Check the haproxy.config again, same step 3.

since this bug was reported in online(starter-us-east-1). so will verify this once it is upgraded.

Comment 9 zhaozhanqi 2017-11-10 01:29:27 UTC

Verified this bug according to comment 6

Comment 13 errata-xmlrpc 2017-11-28 22:10:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188