Bug 1609751

Summary: [starter-ca-central-1] [starter-us-east-1] random 503s when accessing exposed services externally
Product: OpenShift Online Reporter: Jiří Fiala <jfiala>
Component: Routing Assignee: Ivan Chavero <ichavero>
Status: CLOSED NOTABUG QA Contact: zhaozhanqi <zzhao>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.x CC: aos-bugs, dmace, jfiala, rcwwilliams07, sysrage, wgordon, yufchang
Target Milestone: --- Keywords: OnlineStarter
Target Release: --- Flags: jfiala: needinfo-
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-08-15 14:39:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
example app 200/503 response rate none
In some cases happens total routing servicios. none

Description Jiří Fiala 2018-07-30 11:24:24 UTC
Created attachment 1471504 [details]
example app 200/503 response rate

Description of problem:
Users reported seemingly random but quite frequent 503s when accessing applications over routes on starter-ca-central-1 (v3.10.14). The issue may have started occurring after the recent upgrade to 3.10.14 on July 25th; Starter clusters on v3.10.9 do not seem to be affected. The application itself seems to be running properly in all cases, which suggests the issue could be caused by the router.
I reproduced this by deploying the node.js example app and requesting the default page every two seconds (see the attached 200/503 response rate).

Version-Release number of selected component (if applicable):
Server https://api.starter-ca-central-1.openshift.com:443
openshift v3.10.14
kubernetes v1.10.0+b81c8f8

How reproducible:
appears to be consistently reproducible

Steps to Reproduce:
1. Deploy a new app on starter-ca-central-1 (or use an already existing one)
2. Expose a service and wait for the route to be admitted by the router
3. Hit the route repeatedly
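
Step 3 could be scripted roughly as follows. This is only a minimal sketch, not the script used to produce the attachment: it requests a URL at a fixed interval (two seconds, matching the description) and tallies the HTTP status codes. The names fetch_status and tally_statuses and the route hostname in the usage comment are hypothetical; connection-level failures (DNS, timeouts) are not handled and would raise.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request to url (e.g. 200 or 503)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code


def tally_statuses(url, attempts, interval=2.0, fetch=fetch_status, sleep=time.sleep):
    """Request url `attempts` times, pausing `interval` seconds between requests.

    Returns a Counter mapping status code -> number of occurrences.
    `fetch` and `sleep` are injectable so the loop can be exercised offline.
    """
    counts = Counter()
    for i in range(attempts):
        counts[fetch(url)] += 1
        if i < attempts - 1:
            sleep(interval)
    return counts


# Example usage (hypothetical route hostname; substitute your own route):
#   counts = tally_statuses("http://nodejs-ex-myproject.example.com/", attempts=30)
#   print(dict(counts))
```

Running the same loop against the service address from inside the cluster gives a baseline to compare against the route.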

Actual results:
At least some 503s while the app is running properly

Expected results:
Consistent response, same as when hitting the service from within the cluster

Comment 1 Paco Boga 2018-07-31 08:24:58 UTC
Created attachment 1471717 [details]
In some cases happens total routing servicios.

Comment 5 Jiří Fiala 2018-08-03 06:04:59 UTC
This issue was reproduced on starter-us-east-1 too, so it does not seem to be specific to v3.10.14, contrary to what I suggested in the description.

Comment 6 Casey Callendrello 2018-08-03 12:19:15 UTC
Kicking over to the Router team, which is now separate from SDN.

Comment 9 Dan Mace 2018-08-15 14:39:56 UTC
Closing per https://bugzilla.redhat.com/show_bug.cgi?id=1609751#c8; if the issue recurs please feel free to re-open with new details.