1679772 – The installer needs to protect master resources from being deleted

Summary: The installer needs to protect master resources from being deleted

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Alex Crawford
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1664187

Reported:	2019-02-21 20:04 UTC by Eric Rich
Modified:	2019-03-29 15:03 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-03-29 15:03:14 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1679215	0	high	CLOSED	Deletion of master machine has no automatic recovery	2024-01-06 04:26:06 UTC

Description Eric Rich 2019-02-21 20:04:50 UTC

Description of problem: Masters are protected resources so we should create them as such. 

This needs to be done to protect against https://bugzilla.redhat.com/show_bug.cgi?id=1679215 

Version-Release number of the following components: 4.0 

How reproducible: 100% 

Steps to Reproduce:
1. oc delete machine master-N

Actual results:

Expected results:

Additional info:

Comment 1 Alex Crawford 2019-02-21 20:14:06 UTC

In order to help determine scope, are we only worried about users destroying the control plane using oc, or are we also concerned about the EC2 API? Phrased differently, do we care about external forces trying to delete control plane machines?

Comment 2 Derek Carr 2019-02-21 20:18:45 UTC

If we are worried about users doing an `oc delete machine my-master -n openshift-machine-api`, I think it's a reasonable ask for us to support the following:

1. apply an annotation to machines that provides "terminationProtection"
2. author an admission controller that intercepts DELETE requests against machine resources and blocks any delete action against a machine with the specified annotation
3. admins or the installer (to be decided) could enable termination protection on their machines to guard against accidental deletion via the API
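
The three steps above can be sketched as a validating admission check. This is only an illustration under stated assumptions, not the Machine API's actual implementation: the annotation key `example.com/termination-protection` and the function name are hypothetical, and the sketch assumes the API server populates `oldObject` on DELETE requests (Kubernetes 1.15 and later do).

```python
# Hypothetical annotation key; the bug report does not name one.
PROTECTION_ANNOTATION = "example.com/termination-protection"


def review_machine_delete(admission_review: dict) -> dict:
    """Build an AdmissionReview response that denies DELETE on protected machines."""
    request = admission_review["request"]

    # On DELETE the object being removed arrives as oldObject (assumption: k8s >= 1.15).
    machine = request.get("oldObject") or {}
    metadata = machine.get("metadata") or {}
    annotations = metadata.get("annotations") or {}

    is_machine_delete = (
        request.get("operation") == "DELETE"
        and request.get("kind", {}).get("kind") == "Machine"
    )
    protected = annotations.get(PROTECTION_ANNOTATION) == "true"

    response = {
        "uid": request["uid"],
        "allowed": not (is_machine_delete and protected),
    }
    if not response["allowed"]:
        name = metadata.get("name", "<unknown>")
        response["status"] = {
            "code": 403,
            "message": f"machine {name} has termination protection enabled",
        }

    return {
        "apiVersion": admission_review.get("apiVersion", "admission.k8s.io/v1"),
        "kind": "AdmissionReview",
        "response": response,
    }
```

Under this sketch an admin would have to remove the annotation before `oc delete machine` succeeds, which makes deleting a master a deliberate two-step action.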

I would like to understand if this is a hard requirement for 4.0 or could be supported in 4.1.

Comment 3 Derek Carr 2019-02-21 20:28:18 UTC

To clarify, the underlying EC2 instance would not have termination protection enabled, but it would be protected from the API surface in OpenShift.

Comment 4 Erik M Jacobs 2019-02-21 20:48:12 UTC

I think it's entirely possible that someone could accidentally delete all masters at the EC2 instance level and not via the CLI. The question then becomes whether or not the cluster is at all recoverable given the nature of IPI and, in the future, how we handle UPI. As an example, there's no way to install a cluster into an existing VPC with IPI. So how would you create the control plane once the EC2 instances are deleted even if you had an etcd backup?

At a minimum we may want to ensure instance termination protection on master instances if possible. But people have been known to go to great lengths to shoot off their foot, even accidentally, so at least understanding whether we intend to help recover from this disaster will inform docs.

Comment 5 Alberto 2019-02-26 13:53:02 UTC

From the machine API point of view, long term we plan to add support for dynamic validation webhooks, where termination protection could be supported.
Short term this https://github.com/openshift/cluster-api-provider-aws/pull/166 could help to prevent users from shooting off their foot.
I'd like to get to the point where we are able to treat masters as cattle.

Comment 6 Alex Crawford 2019-02-27 01:35:56 UTC

I'm pushing this out to 4.2. There are too many corner cases to consider regarding future upgrades and it is too late in the game for me to feel comfortable with this change.

Comment 7 Erik M Jacobs 2019-02-27 13:50:09 UTC

For a short term fix, what about a docs change to suggest that admins enable EC2 termination protection on masters?
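
As a rough illustration of what such a docs suggestion could point admins at, here is a minimal sketch that turns on EC2 termination protection for a list of instances. The helper name is hypothetical, and `ec2_client` is assumed to be a boto3 EC2 client (for example `boto3.client("ec2")`); `modify_instance_attribute` with `DisableApiTermination` is the boto3 call for this attribute. How the master instance IDs are discovered is left out.

```python
def enable_termination_protection(ec2_client, instance_ids):
    """Set DisableApiTermination on each instance and return the IDs processed.

    ec2_client is assumed to be a boto3 EC2 client; the caller supplies the
    instance IDs of the master machines.
    """
    protected = []
    for instance_id in instance_ids:
        # With this attribute set, TerminateInstances calls for the instance are rejected.
        ec2_client.modify_instance_attribute(
            InstanceId=instance_id,
            DisableApiTermination={"Value": True},
        )
        protected.append(instance_id)
    return protected
```

Note that this only guards the EC2 API path; it does not stop `oc delete machine` from being accepted by the cluster.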

Comment 8 Alex Crawford 2019-03-01 18:33:54 UTC

*** Bug 1684087 has been marked as a duplicate of this bug. ***

Comment 11 Scott Dodson 2019-03-29 15:03:14 UTC

The master team will be laying out plans for control plane disaster recovery, and if anything is needed from the installer they'll let us know.
