Bug 1679772

Summary: The installer needs to protect master resources from being deleted
Product: OpenShift Container Platform
Reporter: Eric Rich <erich>
Component: Installer
Assignee: Alex Crawford <crawford>
Installer sub component: openshift-installer
QA Contact: Johnny Liu <jialiu>
Status: CLOSED NOTABUG
Docs Contact:
Severity: medium
Priority: unspecified
CC: agarcial, decarr, ejacobs, erich, jialiu, scuppett
Version: 4.1.0
Keywords: NeedsTestCase
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-29 15:03:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1664187

Description Eric Rich 2019-02-21 20:04:50 UTC
Description of problem: Masters are protected resources so we should create them as such. 

This needs to be done to protect against https://bugzilla.redhat.com/show_bug.cgi?id=1679215 

Version-Release number of the following components: 4.0 

How reproducible: 100% 

Steps to Reproduce:
1. oc delete machine master-N

Actual results:

Expected results:

Additional info:

Comment 1 Alex Crawford 2019-02-21 20:14:06 UTC
In order to help determine scope, are we only worried about users destroying the control plane using oc, or are we also concerned about the EC2 API? Phrased differently, do we care about external forces trying to delete control plane machines?

Comment 2 Derek Carr 2019-02-21 20:18:45 UTC
If we are worried about users doing an `oc delete machine my-master -n openshift-machine-api`, I think it's a reasonable ask for us to support the following:

1. apply an annotation to machines that provides "terminationProtection"
2. author an admission controller that intercepts DELETE requests against machine resources and blocks any delete action against a machine with the specified annotation
3. admins or the installer (to be decided) could enable termination protection on their machines to guard against accidental deletion via the API
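
The three steps above could be sketched roughly as follows. This is only an illustration of the decision logic a validating admission webhook might apply, not an existing OpenShift component. The annotation key and the function name are hypothetical (the thread does not name either), and the sketch assumes the API server includes the existing object as `oldObject` in DELETE admission requests.

```python
# Hypothetical annotation key; the thread only calls the idea "terminationProtection".
PROTECTION_ANNOTATION = "example.openshift.io/termination-protection"


def review_machine_delete(admission_review: dict) -> dict:
    """Build an AdmissionReview response that denies DELETE on protected machines."""
    request = admission_review["request"]
    # Assumption: on DELETE the existing object is delivered as oldObject.
    machine = request.get("oldObject") or {}
    metadata = machine.get("metadata") or {}
    annotations = metadata.get("annotations") or {}

    protected = annotations.get(PROTECTION_ANNOTATION) == "true"
    denied = request.get("operation") == "DELETE" and protected

    response = {"uid": request["uid"], "allowed": not denied}
    if denied:
        name = metadata.get("name", "<unknown>")
        response["status"] = {
            "code": 403,
            "message": f"machine {name} has termination protection enabled",
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Under this sketch, removing the annotation (or setting it to anything other than "true") would be the explicit step an admin takes before a delete is allowed through.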

I would like to understand if this is a hard requirement for 4.0 or could be supported in 4.1.

Comment 3 Derek Carr 2019-02-21 20:28:18 UTC
To clarify, the underlying EC2 instance would not have termination protection enabled, but it would be protected from the API surface in OpenShift.

Comment 4 Erik M Jacobs 2019-02-21 20:48:12 UTC
I think it's entirely possible that someone could accidentally delete all masters at the EC2 instance level and not via the CLI. The question then becomes whether or not the cluster is at all recoverable given the nature of IPI and, in the future, how we handle UPI. As an example, there's no way to install a cluster into an existing VPC with IPI. So how would you create the control plane once the EC2 instances are deleted, even if you had an etcd backup?

At a minimum we may want to ensure instance termination protection on master instances if possible. But people have been known to go to great lengths to shoot off their foot, even accidentally, so at least understanding whether we intend to help recover from this disaster will inform docs.

Comment 5 Alberto 2019-02-26 13:53:02 UTC
From the machine API point of view, long term we plan to add support for dynamic webhook validation, where termination protection could be supported.
Short term, https://github.com/openshift/cluster-api-provider-aws/pull/166 could help to prevent users from shooting off their foot.
I'd like to get to the point where we are able to treat masters as cattle.

Comment 6 Alex Crawford 2019-02-27 01:35:56 UTC
I'm pushing this out to 4.2. There are too many corner cases to consider regarding future upgrades and it is too late in the game for me to feel comfortable with this change.

Comment 7 Erik M Jacobs 2019-02-27 13:50:09 UTC
For a short term fix, what about a docs change to suggest that admins enable EC2 termination protection on masters?
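
If docs were to suggest this, the step might look something like the sketch below. It assumes a boto3 EC2 client is passed in and uses the EC2 `ModifyInstanceAttribute` call with `DisableApiTermination`. The helper name and the instance IDs are placeholders, and how an admin identifies the master instances is left out.

```python
def enable_termination_protection(ec2_client, instance_ids):
    """Turn on EC2 API termination protection for each given instance ID."""
    ids = list(instance_ids)
    for instance_id in ids:
        # With DisableApiTermination set, TerminateInstances calls are rejected
        # until the attribute is switched off again.
        ec2_client.modify_instance_attribute(
            InstanceId=instance_id,
            DisableApiTermination={"Value": True},
        )
    return ids


# Example use (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   enable_termination_protection(boto3.client("ec2"), ["i-0123456789abcdef0"])
```

As Comment 3 notes, this protects the instance at the EC2 level only and is separate from any protection on the OpenShift API surface.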

Comment 8 Alex Crawford 2019-03-01 18:33:54 UTC
*** Bug 1684087 has been marked as a duplicate of this bug. ***

Comment 11 Scott Dodson 2019-03-29 15:03:14 UTC
The master team will be laying out plans for control plane disaster recovery, and if anything is needed from the installer they'll let us know.