Bug 1684087
| Summary: | master/etcd replicas should be protected or limited to be modified. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Johnny Liu <jialiu> |
| Component: | Installer | Assignee: | Abhinav Dahiya <adahiya> |
| Installer sub component: | openshift-installer | QA Contact: | Johnny Liu <jialiu> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | adahiya, crawford |
| Version: | 4.1.0 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-03 17:48:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Johnny Liu
2019-02-28 11:37:35 UTC
*** This bug has been marked as a duplicate of bug 1679772 ***

Johnny Liu: I do not think this is a duplicate of bug 1679772. This bug is about a user mis-configuring the master count in install-config.yaml on a fresh install, while bug 1679772 is about a user mistakenly deleting a master with `oc delete`, or deleting a master instance via the AWS API, as a day-2 operation. In this bug, I am requesting that the installer validate the master count before triggering the install.

Abhinav Dahiya (comment #4):

> If I am right, the installer should protect the user from modifying controlPlane.replicas to <3, and it must be an odd number.
> When modifying controlPlane.replicas to 2, installation completes and the cluster runs well. But I do not think this is reasonable, because it does not comply with etcd cluster disaster recoverability.

Actually, we require at least one master. Any configuration with >=1 master is a *valid* configuration.

For example, having 4 masters is not wrong; it's just that the 4th etcd member does not add to the high availability of the etcd cluster, i.e. it can still tolerate only one master being down.

> The installer gives no warning or error about the incorrect controlPlane.replicas number.
> After installation completes, the oc command fails because the apiserver is not ready, which is caused by the etcd cluster not being ready.
> # oc get node
> The connection to the server api.qe-jialiu1.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?

The default install-config for AWS gives you 3 control plane machines. If you purposefully choose 1 control plane machine, you, *the user*, have decided that HA is not a requirement, so the installer accepts the user's decision.
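To make the trade-off in comment #4 concrete: etcd needs floor(n/2) + 1 members for quorum, so fault tolerance improves only with every second member added. The figures below are standard etcd quorum arithmetic, not taken from this report:

| etcd members (masters) | quorum | tolerable failures |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |

This is why 2 replicas add nothing to disaster recoverability, and why a 4th master still leaves the cluster able to survive only one master being down.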
Johnny Liu (comment #5):

(In reply to Abhinav Dahiya from comment #4)
> Actually, we require at least one master. Any configuration with >=1 master is a *valid* configuration.
> [...]
> The default install-config for AWS gives you 3 control plane machines. If you purposefully choose 1 control plane machine, you, *the user*, have decided that HA is not a requirement, so the installer accepts the user's decision.

Just like your statement above - "Any configuration with >=1 master is a *valid* configuration" - according to my test result, the cluster does not work at all. In a word, the user purposefully chose 1 control plane machine (which is a valid configuration), but the cluster does not work.

(In reply to Johnny Liu from comment #5)
> Just like your statement above - "Any configuration with >=1 master is a *valid* configuration" - according to my test result, the cluster does not work at all.

Can you provide details about what is not working? For example, all libvirt clusters are created with a single control plane host by default, and we have not seen bugs claiming the cluster does not work at all.

> In a word, the user purposefully chose 1 control plane machine (which is a valid configuration), but the cluster does not work.

Closing due to inactivity. As far as we know, single-node control planes work as intended.

Just ran the same testing using 4.0.0-0.nightly-2019-04-05-165550; a 1 master + 1 worker installation completed successfully.
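For reference, a minimal sketch of the install-config.yaml stanza this bug is about, filled in for the 1 master + 1 worker scenario from the verification comment above. The layout follows the OpenShift 4.x install-config format; the domain, cluster name, and region values are illustrative placeholders, not taken from this report:

```yaml
apiVersion: v1
baseDomain: example.com        # illustrative placeholder
metadata:
  name: mycluster              # illustrative placeholder
controlPlane:
  name: master
  replicas: 1                  # the field this bug asks the installer to validate
compute:
- name: worker
  replicas: 1
platform:
  aws:
    region: us-east-1          # illustrative placeholder
pullSecret: '...'              # placeholder
sshKey: '...'                  # placeholder
```

The installer accepts any controlPlane.replicas value >= 1; per the quorum table above, values of 1 or 2 leave etcd unable to tolerate the loss of any master.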