Bug 1646903
| Summary: | [starter-ca-central-1] routing issues |
|---|---|
| Product: | OpenShift Online |
| Component: | Routing |
| Version: | 3.x |
| Reporter: | Jiří Fiala <jfiala> |
| Assignee: | Dan Mace <dmace> |
| QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| CC: | aos-bugs, bryanjm, jfiala, rcwwilliams07, scuppett, tom.jenkinson |
| Keywords: | Reopened |
| Target Milestone: | --- |
| Target Release: | 3.x |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| Type: | Bug |
| Regression: | --- |
| Last Closed: | 2019-01-29 21:03:42 UTC |
Description
Jiří Fiala
2018-11-06 09:32:58 UTC
Created attachment 1502352 [details]
Observed HTTP 200/503 rate using various routes to the same app
*** Bug 1646901 has been marked as a duplicate of this bug. ***

Can you please provide a more specific and self-contained reproducer? Ideally, every asset (e.g. YAML, JSON) that someone on the team could apply to the cluster in a new namespace to reproduce the problematic application setup. Providing a reliable, portable reproducer will be essential to our ability to diagnose this in a timely fashion. Thanks!

Created attachment 1502922 [details]
Django example app with additional routes

The issue still seems to be easily inducible on the starter-ca-central-1 cluster by deploying any application. I tried deploying the Django example application (https://github.com/openshift/django-ex.git) using the python 3.6 builder with additional routes and got 503s at times. It took a few minutes after the route was exposed on the router before I got the first HTTP 200 OK response, and then HTTP 503 and HTTP 200 responses appeared randomly when curling in a loop with a 2-second wait between requests.

We received a report of a mix of unexpected HTTP 503 responses, correct responses, and responses that could only come from a months-old instance of the application that is no longer running in either of the two pods on starter-ca-central-1. I'm pasting the message in its entirety below:

--------
I'm getting very odd responses from my application in OpenShift. Here's the URL: https://api-popcon.193b.starter-ca-central-1.openshiftapps.com

For instance, sometimes I get "Hello World." This response is from a hello-world app I deployed to my OpenShift project several months ago and then deleted. Is it cached somehow? Because there's nothing in my app that would send a hello-world message.

Other times I get "Application is not available."
However, my application has 2 pods, and whenever I get this message, each pod is working fine if I connect to it internally and run:

    curl http://localhost:8080

And finally, sometimes I get a proper response, e.g.:

    {"started":"2018-11-06T18:26:11.685Z","uptime":782.447}

To make things more interesting, sometimes the responses seem to be returned in a round-robin style. For example:

    $ date; curl -s -w "%{http_code}" https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ | tail -n 1
    Tue Nov 6 13:47:30 EST 2018
    503
    $ date; curl -s -w "%{http_code}" https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ | tail -n 1
    Tue Nov 6 13:47:45 EST 2018
    Hello World200
    $ date; curl -s -w "%{http_code}" https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ | tail -n 1
    Tue Nov 6 13:47:47 EST 2018
    {"started":"2018-11-06T18:44:53.757Z","uptime":173.718}200
--------

I can presently induce the three response types by visiting the URL mentioned above. It is actually four different responses: I'm seeing two distinct uptime values corresponding to the different ages of the two running pods, as well as 503s and the "Hello World" message. The route endpoints correctly point to the two pods running the same image.

I have been getting 503s every time for 3 days now from https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/ although the console shows all pods are running fine.

I do not get a 503 here:

https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/

Closing as INSUFFICIENT_DATA for a bug at this point. Will keep the current investigation open in ServiceNow.

I just got the 503:

    $ wget https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
    --2018-11-09 13:42:46--  https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
    Resolving webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)...
    35.182.40.127, 35.182.0.80
    Connecting to webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)|35.182.40.127|:443... connected.
    HTTP request sent, awaiting response... 503 Service Unavailable
    2018-11-09 13:42:57 ERROR 503: Service Unavailable.

Immediately preceding that I got a 200:

    $ wget https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
    --2018-11-09 13:42:38--  https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
    Resolving webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)... 35.182.40.127, 35.182.0.80
    Connecting to webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)|35.182.40.127|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 2545 (2.5K) [text/html]
    Saving to: 'index.html'

    index.html  100%[===================>]  2.49K  --.-KB/s  in 0s

    2018-11-09 13:42:39 (8.77 MB/s) - 'index.html' saved [2545/2545]

I just did it in a loop and got some 503s and some 200s.

I am also seeing the varying output from https://api-popcon.193b.starter-ca-central-1.openshiftapps.com (503, Hello World, JSON).

(In reply to Stephen Cuppett from comment #8)
> I do not get a 503 here:
>
> https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
>
> Closing as INSUFFICIENT_DATA for a bug at this point. Will keep the current
> investigation open in ServiceNow.
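To turn "some 503 and some 200" into a rate, the loop output can be summarized with a short pipeline. This is a minimal sketch, assuming log lines in the same `date --> HTTP/x.y CODE ...` format as the loops quoted in this report; the file name and the five sample lines are fabricated for illustration:

```shell
# Sketch: summarize the 200/503 mix from a curl-loop log.
# /tmp/route-probe.log and its contents are illustrative samples only;
# in practice, pipe the live curl loop through `tee /tmp/route-probe.log`.
cat > /tmp/route-probe.log <<'EOF'
Tue Nov 20 09:28:32 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:28:52 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:28:55 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:29:49 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:29:52 CET 2018 --> HTTP/1.0 503 Service Unavailable
EOF
# Strip everything up to the arrow, take the status code, and count each.
sed 's/.*--> //' /tmp/route-probe.log | awk '{print $2}' | sort | uniq -c
```

For the sample above this prints a count per status code (two 200s, three 503s), which makes it easy to compare the failure rate across routes or before/after a router redeploy.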
I am getting loads of 503s. Here's the message from one of them:

    app.min.js:1 Failed to load resource: the server responded with a status of 503 (Service Unavailable)

The investigation of the situation described in this bug, affecting the starter-ca-central-1 cluster, can be followed in this Status page incident report: https://status.starter.openshift.com/incidents/k7nfxr9b28h2

I'm reopening this one after checking with Stephen. The issue appears to still be easily inducible, even with newly deployed apps (I've again deployed the Django example mentioned in comment #5), getting:

----
    while true; do echo -n "`date` --> "; curl -s -I https://secure-jfiala-temp.193b.starter-ca-central-1.openshiftapps.com | head -1 ; sleep 2; done
    Tue Nov 20 09:28:32 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:28:49 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:28:52 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:28:55 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:29:10 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:29:23 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:29:36 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:29:49 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:29:52 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:30:07 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:30:21 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:24 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:28 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:30:43 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:46 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:30:49 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:51 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:55 CET 2018 --> HTTP/1.0 503 Service Unavailable
    (...)
----

The unexpected "Hello World" response from an old instance of the app (which does not appear among the active pods), mentioned in comment #6, is still there, but I got no 503s in this case:

----
    while true; do echo -n "`date` --> "; curl -s https://api-popcon.193b.starter-ca-central-1.openshiftapps.com ; echo; sleep 2; done
    Tue Nov 20 09:34:28 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555301.311}
    Tue Nov 20 09:34:32 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555304.109}
    Tue Nov 20 09:34:34 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744894.684}
    Tue Nov 20 09:34:38 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744897.882}
    Tue Nov 20 09:34:41 CET 2018 --> Hello World
    Tue Nov 20 09:34:44 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555317.03}
    Tue Nov 20 09:34:47 CET 2018 --> Hello World
    Tue Nov 20 09:34:50 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555322.954}
    Tue Nov 20 09:34:53 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555325.687}
    Tue Nov 20 09:34:56 CET 2018 --> Hello World
    Tue Nov 20 09:34:59 CET 2018 --> Hello World
    Tue Nov 20 09:35:01 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744921.452}
    Tue Nov 20 09:35:04 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555337.056}
    Tue Nov 20 09:35:07 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744927.587}
    Tue Nov 20 09:35:11 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555343.304}
    Tue Nov 20 09:35:14 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744934.177}
    Tue Nov 20 09:35:17 CET 2018 --> Hello World
----

Looking at the session cookie (name 5754cdf29e8202a84eec18dd02be3da1), it rotates between 3 different values, so it looks like there are indeed three different endpoints stuck in the HAProxy config while OpenShift shows only two pods running; searching for 959efe9f95b5b1f3632a8e1ec1ddf825 (the cookie value that is set for the unexpected
response) in all haproxy.config files should reveal where the "Hello World" response is coming from.

Only these two "api" pods are running in the namespace:

    oc get po -n popcon
    NAME                 READY     STATUS      RESTARTS   AGE
    api-40-build         0/1       Completed   0          13d
    api-50-2ngzz         1/1       Running     0          8d
    api-50-sx7qh         1/1       Running     0          6d
    postgresql-5-9b6px   1/1       Running     0          14d

With the recent cluster upgrade (https://status.starter.openshift.com/incidents/wbvc51zzg5td) and the router pods redeployed, I can no longer induce random 503s or the unexpected response type mentioned in comment #6.
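The search suggested above can be sketched as follows. The haproxy.config fragment is fabricated for illustration (the backend name, the hello-1-abcde pod, the IPs, and the per-server cookie hashes are all made up; only the two cookie values quoted in this report are real); on a live 3.x router the equivalent step would be grepping the actual config inside each router pod, e.g. with `oc exec <router-pod> -- grep 959efe9f95b5b1f3632a8e1ec1ddf825 haproxy.config` (the config path depends on the router image):

```shell
# Sketch: find which backend "server" line a stale session cookie belongs to.
# This sample config is fabricated; only the cookie name 5754cdf2... and the
# stale value 959efe9f... are taken from the report above.
cat > /tmp/haproxy-sample.config <<'EOF'
backend be_edge_http:popcon:api
  cookie 5754cdf29e8202a84eec18dd02be3da1 insert indirect nocache httponly
  server pod:api-50-2ngzz:api:10.128.2.31:8080 10.128.2.31:8080 cookie 1f0d2c3b
  server pod:api-50-sx7qh:api:10.129.4.17:8080 10.129.4.17:8080 cookie 8e7a6b5c
  server pod:hello-1-abcde:hello:10.130.9.9:8080 10.130.9.9:8080 cookie 959efe9f95b5b1f3632a8e1ec1ddf825
EOF
# The matching line names the pod (here the made-up hello-1-abcde) that is
# still wired into the backend even though it no longer exists in OpenShift.
grep -n '959efe9f95b5b1f3632a8e1ec1ddf825' /tmp/haproxy-sample.config
```

A hit on a `server` line whose pod no longer exists would confirm the "three endpoints stuck in HAProxy, two pods in OpenShift" hypothesis and point at the router whose config failed to reload.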