Bug 1684087

Summary: master/etcd replicas should be protected or limited to be modified.
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-installer
Reporter: Johnny Liu <jialiu>
Assignee: Abhinav Dahiya <adahiya>
QA Contact: Johnny Liu <jialiu>
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
CC: adahiya, crawford
Version: 4.1.0
Keywords: Reopened
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-04-03 17:48:54 UTC
Type: Bug

Description Johnny Liu 2019-02-28 11:37:35 UTC
Description of problem:
As far as I know, etcd runs co-located with the masters on the same machines. According to my understanding, an etcd cluster needs at least 3 members. If I am right, the installer should prevent users from setting controlPlane.replicas to a value below 3, and should require an odd number.

Version-Release number of the following components:
v4.0.5-1-dirty

How reproducible:
Always

Steps to Reproduce:
1. Create install-config.yaml via openshift-install tool.
2. Modify controlPlane.replicas to 1 in install-config.yaml (see the fragment sketched below).
3. Trigger install
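
For illustration, the modified install-config.yaml fragment would look roughly like this (a minimal sketch; the real file contains many more fields, which are omitted here):

controlPlane:
  name: master
  replicas: 1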

Actual results:
The installer gives no warning or error about the problematic controlPlane.replicas value.
After the installation completes, oc commands fail because the apiserver is not ready, which in turn is caused by the etcd cluster not being ready.
# oc get node
The connection to the server api.qe-jialiu1.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?


Expected results:
The installer should warn and abort the installation when controlPlane.replicas is set to a value below 3 or to an even number.

Additional info:
When controlPlane.replicas is set to 2, the installation completes and the cluster runs well. But I do not think this is reasonable, because an even number of members does not improve the etcd cluster's disaster recoverability.
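
For context, standard etcd quorum arithmetic (quorum = floor(n/2) + 1, so a cluster of n members tolerates n - quorum failures) backs this up:

members  quorum  tolerated failures
1        1       0
2        2       0
3        2       1
4        3       1
5        3       2

With 2 members, losing either one loses quorum, so a second member adds risk without adding fault tolerance; likewise a 4th member tolerates no more failures than a 3rd.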

Comment 1 Alex Crawford 2019-03-01 18:33:54 UTC

*** This bug has been marked as a duplicate of bug 1679772 ***

Comment 2 Johnny Liu 2019-03-07 10:52:03 UTC
I do not think this is a duplicate of bug 1679772.

This bug is about a user misconfiguring the master count in install-config.yaml on a fresh install, while bug 1679772 is about a user mistakenly deleting a master with oc, or deleting the master instance via the AWS API, as a day-2 operation.

In this bug, I am requesting that the installer validate the master count before triggering the install.

Comment 4 Abhinav Dahiya 2019-03-21 21:35:17 UTC
> If I am right, the installer should prevent users from setting controlPlane.replicas to a value below 3, and should require an odd number.
> When controlPlane.replicas is set to 2, the installation completes and the cluster runs well. But I do not think this is reasonable, because an even number of members does not improve the etcd cluster's disaster recoverability.

Actually, we require at least one master. Any configuration with >=1 master is a *valid* configuration.

For example, having 4 masters is not wrong; it's just that the 4th etcd member does not add to the high availability of the etcd cluster, i.e. it can still tolerate only one master being down.

> The installer gives no warning or error about the problematic controlPlane.replicas value.
> After the installation completes, oc commands fail because the apiserver is not ready, which in turn is caused by the etcd cluster not being ready.
> # oc get node
> The connection to the server api.qe-jialiu1.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?

The default install-config for AWS gives you 3 control plane machines. If you purposefully choose 1 control plane machine, you, *the user*, have decided that HA is not a requirement, so the installer accepts that decision.
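
For reference, the controlPlane stanza in a freshly generated AWS install-config.yaml looks roughly like this (a sketch, abbreviated; the generated file carries additional platform and machine-type settings):

controlPlane:
  name: master
  platform: {}
  replicas: 3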

Comment 5 Johnny Liu 2019-03-22 03:10:51 UTC
(In reply to Abhinav Dahiya from comment #4)
> Actually, we require at least one master. Any configuration with >=1 master
> is a *valid* configuration.
> [...]
> The default install-config for AWS gives you 3 control plane machines. If
> you purposefully choose 1 control plane machine, you, *the user*, have
> decided that HA is not a requirement, so the installer accepts that
> decision.

Per your statement above that "any configuration with >=1 master is a *valid* configuration": according to my test result, the cluster does not work at all. In short, the user purposefully chose 1 control plane machine (a supposedly valid configuration), but the cluster does not work.

Comment 6 Abhinav Dahiya 2019-03-25 21:37:10 UTC
(In reply to Johnny Liu from comment #5)
> Per your statement above that "any configuration with >=1 master is a
> *valid* configuration": according to my test result, the cluster does not
> work at all.

Can you provide details on what is not working? For example, all libvirt clusters are created with a single control plane host by default, and we have not seen bugs claiming such clusters do not work at all.

> In short, the user purposefully chose 1 control plane machine (a supposedly
> valid configuration), but the cluster does not work.

Comment 7 Alex Crawford 2019-04-03 17:48:54 UTC
Closing due to inactivity. As far as we know, single-node control planes work as intended.

Comment 8 Johnny Liu 2019-04-10 10:10:33 UTC
Just ran the same test using 4.0.0-0.nightly-2019-04-05-165550; a 1 master + 1 worker installation completed successfully.