Description of problem:
Masters are protected resources, so we should create them as such. This needs to be done to protect against https://bugzilla.redhat.com/show_bug.cgi?id=1679215

Version-Release number of the following components:
4.0

How reproducible:
100%

Steps to Reproduce:
1. oc delete machine master-N

Actual results:

Expected results:

Additional info:
In order to help determine scope, are we only worried about users destroying the control plane using oc, or are we also concerned about the EC2 API? Phrased differently, do we care about external forces trying to delete control plane machines?
If we are worried about users running `oc delete machine my-master -n openshift-machine-api`, I think it's a reasonable ask for us to support the following:
1. Apply an annotation to machines that provides "terminationProtection".
2. Author an admission controller that intercepts DELETE requests against machine resources and blocks any delete action against a machine with the specified annotation.
3. Admins or the installer (to be decided) could then use the annotation to protect their machines from accidental deletion via the API.
I would like to understand whether this is a hard requirement for 4.0 or could be supported in 4.1.
To clarify, the underlying EC2 instance would not have termination protection enabled, but it would be protected from the API surface in OpenShift.
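To make the annotation-plus-admission-controller proposal concrete, here is a minimal sketch of the decision logic such a webhook might run. It is an illustration under stated assumptions, not an existing OpenShift component: the annotation key `example.com/termination-protection` and the function name `review_delete` are hypothetical, and the request and response follow the general Kubernetes AdmissionReview layout (the `admission.k8s.io/v1` version shown may differ from what a given cluster release serves).

```python
from typing import Any, Dict

# Hypothetical annotation key; the proposal only names the concept
# "terminationProtection", not an actual key.
PROTECTION_ANNOTATION = "example.com/termination-protection"


def review_delete(admission_review: Dict[str, Any]) -> Dict[str, Any]:
    """Build an AdmissionReview response that denies DELETE on protected machines."""
    request = admission_review["request"]
    # For DELETE requests the existing object is carried in oldObject.
    machine = request.get("oldObject") or {}
    metadata = machine.get("metadata") or {}
    annotations = metadata.get("annotations") or {}

    protected = annotations.get(PROTECTION_ANNOTATION) == "true"
    denied = request.get("operation") == "DELETE" and protected

    response: Dict[str, Any] = {"uid": request["uid"], "allowed": not denied}
    if denied:
        name = metadata.get("name", "<unknown>")
        response["status"] = {
            "code": 403,
            "message": f"machine {name} has termination protection enabled",
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

With this shape, removing the annotation is the explicit opt-out: an admin who really wants to delete a master first clears the annotation and then issues the delete. As noted above, this only guards the OpenShift API surface and does nothing for the underlying EC2 instance.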
I think it's entirely possible that someone could accidentally delete all masters at the EC2 instance level and not via the CLI. The question then becomes whether or not the cluster is at all recoverable given the nature of IPI and, in the future, how we handle UPI. As an example, there's no way to install a cluster into an existing VPC with IPI. So how would you recreate the control plane once the EC2 instances are deleted, even if you had an etcd backup? At a minimum we may want to ensure instance termination protection on master instances if possible. But people have been known to go to great lengths to shoot off their foot, even accidentally, so at least understanding whether we intend to help recover from this disaster will inform docs.
From the machine API point of view, the long-term plan is to add support for dynamic webhook validation, where termination protection could be implemented. In the short term, https://github.com/openshift/cluster-api-provider-aws/pull/166 could help to prevent users from shooting off their foot. I'd like to get to the point where we are able to treat masters as cattle.
I'm pushing this out to 4.2. There are too many corner cases to consider regarding future upgrades and it is too late in the game for me to feel comfortable with this change.
For a short term fix, what about a docs change to suggest that admins enable EC2 termination protection on masters?
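If the docs route is taken, the step an admin would perform could look roughly like the sketch below. It assumes a boto3 EC2 client is passed in; the helper name `enable_termination_protection` is hypothetical, and the admin is assumed to already know the instance IDs of the masters.

```python
from typing import Any, Iterable, List


def enable_termination_protection(ec2_client: Any, instance_ids: Iterable[str]) -> List[str]:
    """Set DisableApiTermination on each instance and return the IDs processed."""
    protected: List[str] = []
    for instance_id in instance_ids:
        # One attribute per call, so each instance gets its own request.
        ec2_client.modify_instance_attribute(
            InstanceId=instance_id,
            DisableApiTermination={"Value": True},
        )
        protected.append(instance_id)
    return protected
```

A caller would pass `boto3.client("ec2")` together with the master instance IDs. This covers termination requests made through the EC2 API or console; it does not stop `oc delete machine`, which is the case the bug was originally filed for.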
*** Bug 1684087 has been marked as a duplicate of this bug. ***
The master team will be laying out plans for control plane disaster recovery, and if anything is needed from the installer they'll let us know.