A fix for https://bugzilla.redhat.com/show_bug.cgi?id=1909587 recently went into the installer, but this introduced another issue: regardless of the size specified by the install config, the control plane is always three nodes. This won't affect supported configurations since we only support three nodes, but this is still technically wrong and will eventually come back to bite us.
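For reference, the size in question comes from the `controlPlane.replicas` field of the install-config. A minimal illustrative fragment (values and the trailing comment are examples only, not taken from an actual reproducer):

```yaml
# Illustrative install-config fragment; only the relevant field is shown.
apiVersion: v1
controlPlane:
  name: master
  replicas: 5   # per this report, Terraform still provisions exactly three control-plane nodes
```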
> regardless of the size specified by the install config, the control plane is always three nodes

Can you please add steps to reproduce? We currently have QE coverage[1] of a cluster with more than three control plane nodes.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1909587#c8
I spent a while trying to get access to an OpenStack cluster today. That didn't happen. So I tried to reproduce the conditions on AWS instead, and now I see why this works.

The installer creates Machine objects to represent each of the control nodes, but then uses Terraform to provision them. These Machine objects are usually inert and only reflect the existing state of the node. In the case of a five-node cluster on OpenStack, however, only three nodes are provisioned by Terraform. The two remaining Machine objects eventually trigger the Machine API Operator to go and create the missing nodes.

I still want to add the proposed verification step to the code, because the flow I've described is unique to this platform and is not tested in CI. That's not a position we want to be in, especially after investing as much as we have to unify the installation flows between topologies and environments. I know this seems ridiculous, given that we're arguing over something that isn't even supported, but this system is complex enough that we can't afford to make many assumptions; where we do, we need to continually validate those assumptions with CI.
(In reply to Alex Crawford from comment #3)
> I still want to add the proposed verification step to the code. Reason
> being, the flow I've described is unique to this platform and is not tested
> in CI.

My understanding is that there is no reproducible issue. Moreover, I see no reason for a platform-specific constraint: the same rationale ("not tested") applies to all platforms. I suggest closing this as NOTABUG and opening a report against the Installer component to discuss enforcing the supported configuration with validation.
The reproducible issue is that non-three-node clusters deployed to OpenStack do not follow the same flow as the rest of our platforms. This isn't simply a matter of this particular topology not being tested. This is a case where we _know_ that this deviates from the rest of our platforms.

If we decide at some point that we do want to support five-node clusters, I worry that we will have forgotten about this particular edge case and will forget to update the Terraform to match. With the validation that I'm proposing, it will be immediately obvious to whoever is making the change that non-three-node OpenStack clusters need a second look. That's the intention. The validation isn't for the customer; it's for us.
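The check I have in mind amounts to something like the following. This is a minimal sketch; `validateControlPlaneReplicas` and the plain-string platform parameter are illustrative, not the installer's actual API:

```go
package main

import "fmt"

// validateControlPlaneReplicas rejects any OpenStack install-config whose
// control plane is not exactly three replicas. Other platforms pass through.
// (Hypothetical helper; the real installer validates structured asset types.)
func validateControlPlaneReplicas(platform string, replicas int64) error {
	if platform == "openstack" && replicas != 3 {
		return fmt.Errorf("controlPlane.replicas: Invalid value: %d: control plane must be exactly three nodes when provisioning on OpenStack", replicas)
	}
	return nil
}

func main() {
	// A five-node OpenStack control plane should be rejected.
	if err := validateControlPlaneReplicas("openstack", 5); err != nil {
		fmt.Println(err)
	}
}
```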
Bumping priority to match the dependent bug.
Note for the verifier: With this fix, if `controlPlane.replicas` is set to anything but 3 in the install-config, the validation should prevent the installation.
Verified with 4.7.0-0.nightly-2021-02-03-165316.

$ ./openshift-install-4.7 version
./openshift-install-4.7 4.7.0-0.nightly-2021-02-03-165316
built from commit c60d07ec35db85f6f8e66a2ad202d2be24fca5aa
release image registry.ci.openshift.org/ocp/release@sha256:6f51980bd9e3c338de81867e617685a1bcc358e79ce77294ed354985397d8e36

$ ./openshift-install-4.7 create cluster --dir bz1919407
FATAL failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Provisioning Check": controlPlane.replicas: Invalid value: 5: control plane must be exactly three nodes when provisioning on OpenStack
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633