Description of problem:

Requests to exposed and admitted routes often return HTTP 503 even though the application is running properly and responds fine when checked locally (e.g. by opening a remote shell into the pod and running curl localhost:8080).

We received a report describing this behaviour as follows:

----
I went ahead and created three routes with different configurations:

Insecure: http://insecure-kros-converter.193b.starter-ca-central-1.openshiftapps.com/
With multiple routes defined, this one no longer does HTTPS redirects (previously, an insecure route with no TLS termination and Insecure Policy defined was unexpectedly redirected from http to https, which correctly reported HTTP 503), but it rarely works. It seems to start working after a few attempts following a period of inactivity, although some of the static resources still often fail to be fetched.

Secure, allow HTTP traffic: https://secure-kros-converter.193b.starter-ca-central-1.openshiftapps.com/
Works similarly to the above, perhaps a bit more often, but still very unstable. Curiously, http://secure-kros-converter.193b.starter-ca-central-1.openshiftapps.com/ (i.e. the secure route accessed over plain HTTP) is noticeably more stable, although still not perfect.

Secure, redirect HTTP to HTTPS: https://redirect-kros-converter.193b.starter-ca-central-1.openshiftapps.com/
Probably the most stable during periods of activity (around 50%), but it also stops working entirely after a period of inactivity, and curiously also after another route has been used for a while. Interestingly, the HTTPS redirect works only during those periods of stability, which may suggest the problem occurs before the HTTPS redirect decision, i.e. quite early in request handling.

Note that running curl localhost:8080 in the pod terminal is 100% stable, so I don't suspect a problem with the pod.
----

A projection of the currently observed success/fail rate for the route URLs above can be found in the attached 'ca-central-1-503-rate.log' file.

Version-Release number of selected component (if applicable):

Server https://api.starter-ca-central-1.openshift.com:443
openshift v3.11.16
kubernetes v1.11.0+d4cacc0

How reproducible:

Consistently reproducible over the past few days using various routes.

Steps to Reproduce:
1. Deploy a new application.
2. Create one or more routes to the application (an illustrative route setup is sketched below).
3. Hit the external URL repeatedly.

Actual results:

Random and relatively frequent HTTP 503 responses to those requests, sometimes only for a particular static resource.

Expected results:

Consistent responses as long as the application is running properly.

Additional info:

This is mentioned in INC0802435.
The observed behaviour is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1609751.
This could be different from https://bugzilla.redhat.com/show_bug.cgi?id=1645206; the router pods are not restarting frequently at all.
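Regarding step 2 above, here is a minimal sketch of how the three route variants could be created with the oc CLI. It assumes an existing service named kros-converter; the service name and the explicit hostnames are assumptions for illustration, not taken verbatim from the report:

----
# Insecure route: plain HTTP, no TLS termination
oc expose service kros-converter --name=insecure \
    --hostname=insecure-kros-converter.193b.starter-ca-central-1.openshiftapps.com

# Edge-terminated route that still allows plain HTTP traffic
oc create route edge secure --service=kros-converter --insecure-policy=Allow \
    --hostname=secure-kros-converter.193b.starter-ca-central-1.openshiftapps.com

# Edge-terminated route that redirects HTTP to HTTPS
oc create route edge redirect --service=kros-converter --insecure-policy=Redirect \
    --hostname=redirect-kros-converter.193b.starter-ca-central-1.openshiftapps.com
----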
Created attachment 1502352 [details] Observed HTTP 200/503 rate using various routes to the same app
*** Bug 1646901 has been marked as a duplicate of this bug. ***
Can you please provide a more specific and self-contained reproducer? Ideally, include every asset (e.g. YAML, JSON) that someone on the team could apply to the cluster in a new namespace to reproduce the problematic application setup. A reliable, portable reproducer will be essential to diagnosing this in a timely fashion. Thanks!
Created attachment 1502922 [details]
Django example app with additional routes

The issue still seems to be easily inducible on the starter-ca-central-1 cluster by deploying any application. I tried deploying the Django example application (https://github.com/openshift/django-ex.git) using the Python 3.6 builder with additional routes and got 503s at times. It took a few minutes after the route was exposed on the router before I got the first HTTP 200 OK response, and after that HTTP 503 and HTTP 200 responses appeared randomly when "curling" in a loop with a 2-second wait between requests.
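For reference, a minimal sketch of the deployment and polling steps, assuming a fresh project and an oc CLI already logged in to the cluster; the route names are illustrative:

----
# Build and deploy the Django example with the Python 3.6 S2I builder
oc new-app python:3.6~https://github.com/openshift/django-ex.git --name=django-ex

# Expose an insecure route and an additional edge-terminated route
oc expose service django-ex
oc create route edge secure-django-ex --service=django-ex

# Poll the exposed route every 2 seconds and print only the status code
HOST=$(oc get route django-ex -o jsonpath='{.spec.host}')
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" "http://$HOST/"
  sleep 2
done
----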
We received a report of a mix of unexpected HTTP 503 responses, correct responses, and responses that could only come from a months-old instance of the application that is no longer running in either of the two pods on starter-ca-central-1. I'm pasting the message in its entirety below:

--------
I'm getting very odd responses from my application in OpenShift. Here's the URL:

https://api-popcon.193b.starter-ca-central-1.openshiftapps.com

For instance, sometimes I get "Hello World." This response is from a hello world app I deployed to my OpenShift project several months ago and then deleted. Is it cached somehow? Because there's nothing in my app that would send a hello world message.

Other times I get "Application is not available." However, my application has 2 pods, and whenever I get this message, each pod is working fine if I connect to it internally and run curl http://localhost:8080

And finally, sometimes I get a proper response, e.g.:

{"started":"2018-11-06T18:26:11.685Z","uptime":782.447}

To make things more interesting, sometimes the responses seem to be returned in a round-robin style. For example:

$ date; curl -s -w "%{http_code}" https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ | tail -n 1
Tue Nov 6 13:47:30 EST 2018
503

$ date; curl -s -w "%{http_code}" https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ | tail -n 1
Tue Nov 6 13:47:45 EST 2018
Hello World200

$ date; curl -s -w "%{http_code}" https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ | tail -n 1
Tue Nov 6 13:47:47 EST 2018
{"started":"2018-11-06T18:44:53.757Z","uptime":173.718}200
--------

I can presently induce the three response types by visiting the URL above. It is actually four different responses: I'm seeing two distinct uptime values corresponding to the different ages of the two running pods, as well as 503s and the "Hello World" message. The route endpoints correctly point to the two pods running the same image.
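One way to confirm that the responses rotate across distinct router backends is to look at the sticky-session cookie the router sets, which encodes the chosen backend. A minimal sketch (the cookie name is generated per route and will differ from this example):

----
# Print the Set-Cookie header returned by the router for each request;
# a distinct cookie value per response type would indicate distinct backends
for i in $(seq 1 10); do
  curl -s -D - -o /dev/null https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ \
    | grep -i '^set-cookie'
  sleep 2
done
----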
I have been getting 503s every time for 3 days now from https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/, although the console shows all pods are running fine.
I do not get a 503 here:

https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/

Closing as INSUFFICIENT_DATA for a bug at this point. Will keep the current investigation open in ServiceNow.
I just got the 503:

$ wget https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
--2018-11-09 13:42:46--  https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
Resolving webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)... 35.182.40.127, 35.182.0.80
Connecting to webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)|35.182.40.127|:443... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2018-11-09 13:42:57 ERROR 503: Service Unavailable.

Immediately preceding that I got a 200:

$ wget https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
--2018-11-09 13:42:38--  https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
Resolving webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)... 35.182.40.127, 35.182.0.80
Connecting to webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)|35.182.40.127|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2545 (2.5K) [text/html]
Saving to: 'index.html'

index.html          100%[===================================>]   2.49K  --.-KB/s    in 0s

2018-11-09 13:42:39 (8.77 MB/s) - 'index.html' saved [2545/2545]

I just ran it in a loop and got some 503s and some 200s. I am also seeing the varying output from https://api-popcon.193b.starter-ca-central-1.openshiftapps.com (503, Hello World, JSON).
(In reply to Stephen Cuppett from comment #8)
> I do not get a 503 here:
>
> https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
>
> Closing as INSUFFICIENT_DATA for a bug at this point. Will keep the current
> investigation open in ServiceNow.

I am getting loads of 503s. Here's the message from one of them:

app.min.js:1 Failed to load resource: the server responded with a status of 503 (Service Unavailable)
The investigation of the situation described in this bug, affecting the starter-ca-central-1 cluster, can be followed in this Status page incident report: https://status.starter.openshift.com/incidents/k7nfxr9b28h2
I'm reopening this one after checking with Stephen. The issue still appears to be quite easily inducible, even with newly deployed apps (I've again deployed the Django example mentioned in comment #5), getting:

----
while true; do echo -n "`date` --> "; curl -s -I https://secure-jfiala-temp.193b.starter-ca-central-1.openshiftapps.com | head -1 ; sleep 2; done
Tue Nov 20 09:28:32 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:28:49 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:28:52 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:28:55 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:29:10 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:29:23 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:29:36 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:29:49 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:29:52 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:30:07 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:30:21 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:30:24 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:30:28 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:30:43 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:30:46 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:30:49 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:30:51 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:30:55 CET 2018 --> HTTP/1.0 503 Service Unavailable
(...)
----

The unexpected "Hello World" response from an old instance of the app (which does not appear among the active pods) mentioned in comment #6 is still there, but I got no 503s in this case:

----
while true; do echo -n "`date` --> "; curl -s https://api-popcon.193b.starter-ca-central-1.openshiftapps.com ; echo; sleep 2; done
Tue Nov 20 09:34:28 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555301.311}
Tue Nov 20 09:34:32 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555304.109}
Tue Nov 20 09:34:34 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744894.684}
Tue Nov 20 09:34:38 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744897.882}
Tue Nov 20 09:34:41 CET 2018 --> Hello World
Tue Nov 20 09:34:44 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555317.03}
Tue Nov 20 09:34:47 CET 2018 --> Hello World
Tue Nov 20 09:34:50 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555322.954}
Tue Nov 20 09:34:53 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555325.687}
Tue Nov 20 09:34:56 CET 2018 --> Hello World
Tue Nov 20 09:34:59 CET 2018 --> Hello World
Tue Nov 20 09:35:01 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744921.452}
Tue Nov 20 09:35:04 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555337.056}
Tue Nov 20 09:35:07 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744927.587}
Tue Nov 20 09:35:11 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555343.304}
Tue Nov 20 09:35:14 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744934.177}
Tue Nov 20 09:35:17 CET 2018 --> Hello World
----

Looking at the session cookie (name 5754cdf29e8202a84eec18dd02be3da1), it rotates through three different values, so it looks like there are indeed three different endpoints stuck in the HAProxy config while OpenShift shows only two pods running; searching for 959efe9f95b5b1f3632a8e1ec1ddf825 (the value set for the unexpected response) in all haproxy.config files should reveal where the "Hello World" response is coming from.
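A possible way to run that search, assuming access to the namespace hosting the router pods (the default namespace, the deploymentconfig=router label and the config path are assumptions based on a standard 3.11 router deployment):

----
# Grep every router pod's generated HAProxy config for the suspect cookie value
for pod in $(oc get pods -n default -l deploymentconfig=router -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $pod =="
  oc exec -n default "$pod" -- grep -n 959efe9f95b5b1f3632a8e1ec1ddf825 /var/lib/haproxy/conf/haproxy.config
done
----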
Only these two "api" pods are running in the namespace:

----
oc get po -n popcon
NAME                 READY     STATUS      RESTARTS   AGE
api-40-build         0/1       Completed   0          13d
api-50-2ngzz         1/1       Running     0          8d
api-50-sx7qh         1/1       Running     0          6d
postgresql-5-9b6px   1/1       Running     0          14d
----
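To cross-check that the service really selects only these two pods (and that the third HAProxy backend is therefore stale), one could compare the service endpoints with the pod IPs. A sketch, assuming the route's target service is named api and the pods carry the usual deploymentconfig=api label:

----
# Addresses the service currently selects
oc get endpoints api -n popcon -o wide

# IPs of the running pods created by the "api" deployment config
oc get pods -n popcon -l deploymentconfig=api -o wide
----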
With the recent cluster upgrade (https://status.starter.openshift.com/incidents/wbvc51zzg5td) and the router pods redeployed, I can no longer induce the random 503s or the unexpected response type mentioned in comment #6.