Bug 1646903
| Summary: | [starter-ca-central-1] routing issues |
|---|---|
| Product: | OpenShift Online |
| Component: | Routing |
| Version: | 3.x |
| Reporter: | Jiří Fiala <jfiala> |
| Assignee: | Dan Mace <dmace> |
| QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| CC: | aos-bugs, bryanjm, jfiala, rcwwilliams07, scuppett, tom.jenkinson |
| Keywords: | Reopened |
| Target Milestone: | --- |
| Target Release: | 3.x |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| Type: | Bug |
| Regression: | --- |
| Last Closed: | 2019-01-29 21:03:42 UTC |
Description
Jiří Fiala
2018-11-06 09:32:58 UTC
Created attachment 1502352 [details]
Observed HTTP 200/503 rate using various routes to the same app
*** Bug 1646901 has been marked as a duplicate of this bug. ***

Can you please provide a more specific and self-contained reproducer? Ideally, every asset (e.g. YAML, JSON) that someone on the team could apply to the cluster in a new namespace to reproduce the problematic application setup. Providing a reliable, portable reproducer will be essential to our ability to diagnose this in a timely fashion. Thanks!

Created attachment 1502922 [details]
Django example app with additional routes

The issue still seems to be easily inducible on the starter-ca-central-1 cluster by deploying any application. I tried deploying the Django example application (https://github.com/openshift/django-ex.git) using the python 3.6 builder with additional routes and got 503s at times. It took a few minutes after the route was exposed on the router before I got the first HTTP 200 OK response, and then HTTP 503 and HTTP 200 responses appeared randomly when curling in a loop with a 2-second wait between requests.

We received a report of a mix of unexpected HTTP 503 responses, correct responses, and responses that could only come from a months-old instance of the application that is no longer running in either of the two pods on starter-ca-central-1. I'm pasting the message in its entirety below:

--------
I'm getting very odd responses from my application in OpenShift. Here's the URL: https://api-popcon.193b.starter-ca-central-1.openshiftapps.com

For instance, sometimes I get "Hello World." This response is from a hello-world app I deployed to my OpenShift project several months ago and then deleted. Is it cached somehow? Because there's nothing in my app that would send a hello-world message.

Other times I get "Application is not available."
However, my application has 2 pods, and whenever I get this message, each pod is working fine if I connect to it internally and run:

    curl http://localhost:8080

And finally, sometimes I get a proper response, e.g.:

    {"started":"2018-11-06T18:26:11.685Z","uptime":782.447}

To make things more interesting, sometimes the responses seem to be returned in a round-robin style. For example:

    $ date; curl -s -w "%{http_code}" https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ | tail -n 1
    Tue Nov 6 13:47:30 EST 2018
    503
    $ date; curl -s -w "%{http_code}" https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ | tail -n 1
    Tue Nov 6 13:47:45 EST 2018
    Hello World200
    $ date; curl -s -w "%{http_code}" https://api-popcon.193b.starter-ca-central-1.openshiftapps.com/ | tail -n 1
    Tue Nov 6 13:47:47 EST 2018
    {"started":"2018-11-06T18:44:53.757Z","uptime":173.718}200
--------

I can presently induce the three response types by visiting the URL mentioned above. It is actually four different responses: I'm seeing two distinct uptime values corresponding to the different ages of the two running pods, as well as 503s and the "Hello World" message. The route endpoints correctly point to the two pods running the same image.

I have been getting 503s every time for 3 days now from https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/ although the console shows all pods are running fine.

I do not get a 503 here:

https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/

Closing as INSUFFICIENT_DATA for a bug at this point. Will keep the current investigation open in ServiceNow.

I just got the 503:

    $ wget https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
    --2018-11-09 13:42:46--  https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
    Resolving webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)...
    35.182.40.127, 35.182.0.80
    Connecting to webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)|35.182.40.127|:443... connected.
    HTTP request sent, awaiting response... 503 Service Unavailable
    2018-11-09 13:42:57 ERROR 503: Service Unavailable.

Immediately preceding that I got a 200:

    $ wget https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
    --2018-11-09 13:42:38--  https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
    Resolving webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)... 35.182.40.127, 35.182.0.80
    Connecting to webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com (webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com)|35.182.40.127|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 2545 (2.5K) [text/html]
    Saving to: 'index.html'

    index.html  100%[===================>]  2.49K  --.-KB/s  in 0s

    2018-11-09 13:42:39 (8.77 MB/s) - 'index.html' saved [2545/2545]

I just did it in a loop and got some 503s and some 200s.

I am also seeing the varying output from https://api-popcon.193b.starter-ca-central-1.openshiftapps.com (503, Hello World, JSON).

(In reply to Stephen Cuppett from comment #8)
> I do not get a 503 here:
>
> https://webserver-scrtest.193b.starter-ca-central-1.openshiftapps.com/
>
> Closing as INSUFFICIENT_DATA for a bug at this point. Will keep the current
> investigation open in ServiceNow.
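To turn "some 503 and some 200" into a rate, the loop output can be summarized with a short pipeline. This is a minimal sketch, assuming log lines in the same `date --> HTTP/x.y CODE ...` format as the loops quoted in this report; the file name and the five sample lines are fabricated for illustration:

```shell
# Sketch: summarize the 200/503 mix from a curl-loop log.
# /tmp/route-probe.log and its contents are illustrative samples only;
# in practice, pipe the live curl loop through `tee /tmp/route-probe.log`.
cat > /tmp/route-probe.log <<'EOF'
Tue Nov 20 09:28:32 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:28:52 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:28:55 CET 2018 --> HTTP/1.0 503 Service Unavailable
Tue Nov 20 09:29:49 CET 2018 --> HTTP/1.1 200 OK
Tue Nov 20 09:29:52 CET 2018 --> HTTP/1.0 503 Service Unavailable
EOF
# Strip everything up to the arrow, take the status code, and count each.
sed 's/.*--> //' /tmp/route-probe.log | awk '{print $2}' | sort | uniq -c
```

For the sample above this prints a count per status code (two 200s, three 503s), which makes it easy to compare the failure rate across routes or before/after a router redeploy.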
I am getting loads of 503s. Here's the message from one of them:

    app.min.js:1 Failed to load resource: the server responded with a status of 503 (Service Unavailable)

The investigation of the situation described in this bug, affecting the starter-ca-central-1 cluster, can be followed in this Status page incident report: https://status.starter.openshift.com/incidents/k7nfxr9b28h2

I'm reopening this one after checking with Stephen. The issue appears to still be easily inducible, even with newly deployed apps (I've again deployed the Django example mentioned in comment #5), getting:

----
    while true; do echo -n "`date` --> "; curl -s -I https://secure-jfiala-temp.193b.starter-ca-central-1.openshiftapps.com | head -1 ; sleep 2; done
    Tue Nov 20 09:28:32 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:28:49 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:28:52 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:28:55 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:29:10 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:29:23 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:29:36 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:29:49 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:29:52 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:30:07 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:30:21 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:24 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:28 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:30:43 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:46 CET 2018 --> HTTP/1.0 503 Service Unavailable
    Tue Nov 20 09:30:49 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:51 CET 2018 --> HTTP/1.1 200 OK
    Tue Nov 20 09:30:55 CET 2018 --> HTTP/1.0 503 Service Unavailable
    (...)
----

The unexpected "Hello World" response from an old instance of the app (which does not appear among the active pods), mentioned in comment #6, is still there, but I got no 503s in this case:

----
    while true; do echo -n "`date` --> "; curl -s https://api-popcon.193b.starter-ca-central-1.openshiftapps.com ; echo; sleep 2; done
    Tue Nov 20 09:34:28 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555301.311}
    Tue Nov 20 09:34:32 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555304.109}
    Tue Nov 20 09:34:34 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744894.684}
    Tue Nov 20 09:34:38 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744897.882}
    Tue Nov 20 09:34:41 CET 2018 --> Hello World
    Tue Nov 20 09:34:44 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555317.03}
    Tue Nov 20 09:34:47 CET 2018 --> Hello World
    Tue Nov 20 09:34:50 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555322.954}
    Tue Nov 20 09:34:53 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555325.687}
    Tue Nov 20 09:34:56 CET 2018 --> Hello World
    Tue Nov 20 09:34:59 CET 2018 --> Hello World
    Tue Nov 20 09:35:01 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744921.452}
    Tue Nov 20 09:35:04 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555337.056}
    Tue Nov 20 09:35:07 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744927.587}
    Tue Nov 20 09:35:11 CET 2018 --> {"started":"2018-11-13T22:19:28.557Z","uptime":555343.304}
    Tue Nov 20 09:35:14 CET 2018 --> {"started":"2018-11-11T17:39:41.120Z","uptime":744934.177}
    Tue Nov 20 09:35:17 CET 2018 --> Hello World
----

Looking at the session cookie (name 5754cdf29e8202a84eec18dd02be3da1), it rotates between 3 different values, so it looks like there are indeed three different endpoints stuck in the HAProxy config while OpenShift shows only two pods running; searching for 959efe9f95b5b1f3632a8e1ec1ddf825 (the cookie value that is set for the unexpected
response) in all haproxy.config files should reveal where the "Hello World" response is coming from.

Only these two "api" pods are running in the namespace:

    oc get po -n popcon
    NAME                 READY     STATUS      RESTARTS   AGE
    api-40-build         0/1       Completed   0          13d
    api-50-2ngzz         1/1       Running     0          8d
    api-50-sx7qh         1/1       Running     0          6d
    postgresql-5-9b6px   1/1       Running     0          14d

With the recent cluster upgrade (https://status.starter.openshift.com/incidents/wbvc51zzg5td) and the router pods redeployed, I can no longer induce random 503s or the unexpected response type mentioned in comment #6.
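The search suggested above can be sketched as follows. The haproxy.config fragment is fabricated for illustration (the backend name, the hello-1-abcde pod, the IPs, and the per-server cookie hashes are all made up; only the two cookie values quoted in this report are real); on a live 3.x router the equivalent step would be grepping the actual config inside each router pod, e.g. with `oc exec <router-pod> -- grep 959efe9f95b5b1f3632a8e1ec1ddf825 haproxy.config` (the config path depends on the router image):

```shell
# Sketch: find which backend "server" line a stale session cookie belongs to.
# This sample config is fabricated; only the cookie name 5754cdf2... and the
# stale value 959efe9f... are taken from the report above.
cat > /tmp/haproxy-sample.config <<'EOF'
backend be_edge_http:popcon:api
  cookie 5754cdf29e8202a84eec18dd02be3da1 insert indirect nocache httponly
  server pod:api-50-2ngzz:api:10.128.2.31:8080 10.128.2.31:8080 cookie 1f0d2c3b
  server pod:api-50-sx7qh:api:10.129.4.17:8080 10.129.4.17:8080 cookie 8e7a6b5c
  server pod:hello-1-abcde:hello:10.130.9.9:8080 10.130.9.9:8080 cookie 959efe9f95b5b1f3632a8e1ec1ddf825
EOF
# The matching line names the pod (here the made-up hello-1-abcde) that is
# still wired into the backend even though it no longer exists in OpenShift.
grep -n '959efe9f95b5b1f3632a8e1ec1ddf825' /tmp/haproxy-sample.config
```

A hit on a `server` line whose pod no longer exists would confirm the "three endpoints stuck in HAProxy, two pods in OpenShift" hypothesis and point at the router whose config failed to reload.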