Description of problem: There are situations in which the OpenShift control plane can crash completely. When such a catastrophic failure occurs, the routers can no longer pull updates from the API servers, so no route changes propagate to them. This is generally acceptable, because platform-level changes (in the control plane) should not be changing the environment. If changes do occur, the current routes should be able to handle outages seen in the data layer, such as pods that die: the routers health-check those endpoints and disable them, and on catastrophic failure of an app a 503 will eventually be seen for the route. However, there is a situation where, while the control plane is down, the OpenShift router itself can restart (due to a catastrophic failure of its own). The restart loses the current route configuration, forcing the router to communicate with the control plane to re-populate it. This combination of failures should be something the platform accounts for!
One suggestion for this problem: if the OpenShift router restarts and cannot communicate with the control plane at startup, it should not bind to the host ports (80/443, etc.). Note: if this occurs, logging should be provided to help operators know/understand that the issue is not with the router itself but with its inability to talk to the control plane. Not binding causes connections to the routers to be dropped/rejected at the TCP layer, allowing external load balancers to better handle error notification to users. It also better indicates to operators and admins that there is a "catastrophic" issue at hand that needs to be investigated.
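To make the suggestion concrete, below is a minimal Go sketch of the proposed startup gate, not the actual openshift-router code: the API server URL, the /healthz probe, and the timeout/TLS handling are all assumptions for illustration.

```go
// Sketch: probe the control plane before binding host ports; refuse to
// listen if it is unreachable, so connections fail at the TCP layer.
package main

import (
	"crypto/tls"
	"log"
	"net"
	"net/http"
	"time"
)

// apiServerReachable performs a cheap liveness probe against the API server.
func apiServerReachable(apiURL string) bool {
	client := &http.Client{
		Timeout: 5 * time.Second,
		// The API server normally presents a cluster CA cert; verification
		// is skipped here only to keep the sketch self-contained.
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(apiURL + "/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	const apiURL = "https://master.example.com:8443" // hypothetical master address

	if !apiServerReachable(apiURL) {
		// Do NOT bind 80/443: new connections are refused at the TCP
		// layer, so an external load balancer marks this router instance
		// down instead of it serving traffic from an empty route table.
		log.Fatalf("control plane %s unreachable at startup; refusing to bind host ports 80/443", apiURL)
	}

	// Control plane reachable: safe to re-populate routes and listen.
	ln, err := net.Listen("tcp", ":80")
	if err != nil {
		log.Fatalf("bind :80: %v", err)
	}
	log.Printf("routes synced from %s; serving on %s", apiURL, ln.Addr())
	// ... hand ln off to the HAProxy/template router as usual ...
	_ = ln
}
```

With this shape, an operator tailing the router logs sees a single explicit "refusing to bind" message pointing at the control plane rather than a router that appears up but serves stale or empty routes.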
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1383663
*** This bug has been marked as a duplicate of bug 1383663 ***