Bug 1388692 - Certificate re-deploy script causes an outage of the entire OpenShift cluster because of router restarts (causing broken connectivity to master).
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Andrew Butcher
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-10-25 22:59 UTC by Eric Rich
Modified: 2017-03-08 18:43 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-27 16:00:20 UTC
Target Upstream Version:



Description Eric Rich 2016-10-25 22:59:48 UTC
Description of problem:

https://docs.openshift.com/container-platform/3.3/install_config/redeploying_certificates.html

Documentation does not explain how to  after this process is run. 

Because the process updates the router certificates (and the registry certificates, although the registry is less of an issue), pods are restarted, mainly by: https://github.com/openshift/openshift-ansible/blob/master/playbooks/common/openshift-cluster/redeploy-certificates.yml#L207-L246

An outage of the "data" plane is possible because the routers are restarted as part of the "node evacuations", which results in the routers not being able to talk to the masters because they have old/bad certificates.

This causes you to see issues described by: https://bugzilla.redhat.com/1387714

Version-Release number of selected component (if applicable): 3.1.1 (however the code applies to 3.3)

How to Reproduce: Update the certificates of the cluster. Routes to existing applications should stop working because of router restarts (new deployments) caused by the certificate tooling.

Additional info:

We likely need to split out (https://github.com/openshift/openshift-ansible/blob/master/playbooks/common/openshift-cluster/redeploy-certificates.yml#L207-L246) into a different script. 

An alternative to this is to exclude infra nodes (or defer them until later) to allow the manual updates described in https://bugzilla.redhat.com/show_bug.cgi?id=1388691

Comment 1 Andrew Butcher 2016-10-27 16:00:20 UTC
Once the CA has been replaced, the running routers will be unable to create new routes until the router pods have been recreated as a result of the node evacuation. Existing routes will continue to be accessible, and should remain accessible during pod evacuation, assuming the router has been scaled.

Tested by installing a cluster, creating a pod and route, running the certificate redeploy with CA replacement but without node evacuation, and ensuring that the pod is still routable.
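
The "still routable" check could be automated by polling the route while the redeploy runs. The following is only a minimal sketch in Python, not the procedure used for the test above; the route URL, the probe count, and the function names (http_status, probe_route, summarize) are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except OSError:
        # Connection refused, timeout, DNS failure, TLS error, and similar.
        return 0


def probe_route(
    url: str,
    attempts: int,
    interval: float = 1.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Request url `attempts` times, pausing `interval` seconds between probes."""
    results = []
    for i in range(attempts):
        results.append(fetch(url))
        if i < attempts - 1:
            sleep(interval)
    return results


def summarize(results: List[int]) -> Tuple[int, int]:
    """Return (successful, failed) probe counts; 2xx and 3xx count as success."""
    ok = sum(1 for status in results if 200 <= status < 400)
    return ok, len(results) - ok


if __name__ == "__main__":
    # Hypothetical route hostname; replace with a route exposed by the cluster.
    statuses = probe_route("http://myapp.apps.example.com/", attempts=60)
    ok, failed = summarize(statuses)
    print(f"{ok} probes succeeded, {failed} failed")
```

Starting the probe before the redeploy and leaving it running until the playbook finishes would show whether any requests failed in between. The fetch and sleep parameters exist only so the loop can be exercised without a live cluster.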

