Bug 1416037 - 503 errors during auto wakeup of a pod
Summary: 503 errors during auto wakeup of a pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Ben Bennett
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-24 12:47 UTC by Jaspreet Kaur
Modified: 2017-07-24 14:11 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Unidling connections could time out if the pod takes longer than 30s to start.
Consequence: Clients had connections closed with no data.
Fix: Increased the timeout to 120s.
Result: Slow pods don't break clients.
Clone Of:
Environment:
Last Closed: 2017-04-12 19:10:41 UTC
Target Upstream Version:
Embargoed:




Links
Origin (GitHub) 12754 - last updated 2017-02-01 15:38:47 UTC
Red Hat Product Errata RHBA-2017:0884 (normal, SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.5 RPM Release Advisory - last updated 2017-04-12 22:50:07 UTC

Description Jaspreet Kaur 2017-01-24 12:47:44 UTC
Description of problem: A number of the 'quick-start' applications have a 30 second delay before the first Readiness check.  Additionally, the routers have a default 30 second timeout before returning an error page.

With the combination of these two behaviours, we've had reports that applications "return an error the first time they're accessed".

503 Service Unavailable


Version-Release number of selected component (if applicable): OCP 3.4


How reproducible:


Steps to Reproduce:
1. Create an application using the JBoss quickstart provided as a default application in the templates.
2. Idle the application:

oc idle <application-service>

Application is correctly idled

3. Now access the route. 

Actual results: It returns 503 errors.


Expected results: Accessing the route should wake up the pod and should not return an error.


Additional info: Timeout changes were tried, but the error was still returned. It's an annoyance rather than a major issue, but combined with the lack of an easy way to tell when a route was last accessed, it results in a reduced customer experience.
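
For reference, a rough way to check the two 30-second timers involved (assuming the quickstart's deployment config is named helloworld and the default router deployment config is named router; adjust the names for your environment):

# Readiness probe delay on the application pod
oc get dc helloworld -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.initialDelaySeconds}'

# Router timeout override, if any (the router falls back to a 30s default when unset)
oc set env dc/router --list | grep ROUTER_DEFAULT_SERVER_TIMEOUT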

Comment 3 openshift-github-bot 2017-02-03 07:18:44 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/845e285645adc5ddea1eb68d953c8045cf5da621
Increased the time the proxy will hold connections when unidling

Before we would wait 30 seconds for a pod to come live before dropping
the connections.  That time is too short, so we have increased it to
120 seconds.

Fixes bug 1416037 (https://bugzilla.redhat.com/show_bug.cgi?id=1416037)

Comment 4 Troy Dawson 2017-02-03 22:35:04 UTC
This has been merged into OCP and is in OCP v3.5.0.16 or newer.

Comment 5 Meng Bo 2017-02-04 06:35:07 UTC
Tested on OCP 3.5.0.16

It throws a 504 error before the pod has started.

[root@openshift-127 application-templates]# oc idle dc helloworld
warning: continuing on for valid scalable resources, but an error occurred while finding scalable resources to idle: endpoints "dc" not found
Marked service default/helloworld to unidle resource DeploymentConfig default/helloworld (unidle to 1 replicas)
Idled DeploymentConfig default/helloworld 
[root@openshift-127 application-templates]# time curl -v http://helloworld-default.0203-s5y.qe.rhcloud.com/
* About to connect() to helloworld-default.0203-s5y.qe.rhcloud.com port 80 (#0)
*   Trying 10.14.6.106...
* Connected to helloworld-default.0203-s5y.qe.rhcloud.com (10.14.6.106) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: helloworld-default.0203-s5y.qe.rhcloud.com
> Accept: */*
> 
* HTTP 1.0, assume close after body
< HTTP/1.0 504 Gateway Time-out
< Cache-Control: no-cache
< Connection: close
< Content-Type: text/html
< 
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
* Closing connection 0

real    0m30.138s
user    0m0.007s
sys     0m0.007s

Comment 6 Ben Bennett 2017-02-06 13:56:52 UTC
If you are going through a router then you will need to change the router's timeout.  The default is 30 seconds.

You can either set the router environment variable ROUTER_DEFAULT_SERVER_TIMEOUT, or change just the affected route by setting the annotation haproxy.router.openshift.io/timeout. Set the time to 120s to get the maximum wait the service proxy will allow.
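
A sketch of both options (assuming the router runs as deployment config "router" and the route is named "helloworld"):

# Option 1: raise the default timeout for every route served by this router
oc set env dc/router ROUTER_DEFAULT_SERVER_TIMEOUT=120s

# Option 2: raise the timeout only for the affected route
oc annotate route helloworld haproxy.router.openshift.io/timeout=120s --overwrite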

Comment 7 Meng Bo 2017-02-07 08:59:21 UTC
Tested with OCP 3.5.0.17; after setting the router environment variable, the unidling access waits for up to 120s.

Verified the bug.

Comment 9 errata-xmlrpc 2017-04-12 19:10:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

