Bug 1919407

Summary: OpenStack IPI has three-node control plane limitation, but InstallConfigs aren't verified
Product: OpenShift Container Platform
Reporter: Alex Crawford <crawford>
Component: Installer
Assignee: Alex Crawford <crawford>
Installer sub component: openshift-installer
QA Contact: weiwei jiang <wjiang>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: pprinett
Version: 4.7
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1924670 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:55:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1924670

Description Alex Crawford 2021-01-22 19:57:34 UTC
A fix for https://bugzilla.redhat.com/show_bug.cgi?id=1909587 recently went into the installer, but this introduced another issue: regardless of the size specified by the install config, the control plane is always three nodes. This won't affect supported configurations since we only support three nodes, but this is still technically wrong and will eventually come back to bite us.

Comment 2 Pierre Prinetti 2021-01-22 20:48:15 UTC
> regardless of the size specified by the install config, the control plane is always three nodes

Can you please add steps to reproduce? We currently have QE coverage[1] of a cluster with more than three control plane nodes.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1909587#c8

Comment 3 Alex Crawford 2021-01-26 01:08:46 UTC
I spent a while trying to get access to an OpenStack cluster today. That didn't happen. So I tried to reproduce the conditions on AWS instead and now I see why this works.

The installer creates Machine objects to represent each of the control plane nodes, but then uses Terraform to provision them. These Machine objects are usually inert and only reflect the existing state of the node. In the case of a five-node cluster on OpenStack, however, only three nodes are provisioned by Terraform. The two remaining Machine objects eventually trigger the Machine API Operator to go and create the missing nodes.
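
To make the arithmetic of that flow concrete, here is a minimal sketch in Go. It is not installer code: the constant and function names are hypothetical, and the fixed Terraform count of three is taken only from the behavior described above. Requests for fewer than three replicas are not modeled.

```go
package main

import "fmt"

// terraformControlPlaneCount is an assumption for illustration: the number of
// control plane nodes that Terraform provisions on OpenStack, as described in
// this comment. It is not read from any real installer source.
const terraformControlPlaneCount = 3

// machinesLeftForMachineAPI (hypothetical helper) returns how many control
// plane Machine objects have no Terraform-provisioned node behind them. Per
// the description above, those are the ones the Machine API Operator ends up
// creating after installation.
func machinesLeftForMachineAPI(requestedReplicas int) int {
	if requestedReplicas <= terraformControlPlaneCount {
		// Fewer-than-three requests are outside the scope of this sketch.
		return 0
	}
	return requestedReplicas - terraformControlPlaneCount
}

func main() {
	for _, replicas := range []int{3, 5} {
		fmt.Printf("requested=%d terraform=%d machine-api=%d\n",
			replicas, terraformControlPlaneCount, machinesLeftForMachineAPI(replicas))
	}
}
```

For the five-node case the helper returns 2, which matches the "two remaining Machine objects" mentioned above.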

I still want to add the proposed verification step to the code. The reason is that the flow I've described is unique to this platform and is not tested in CI. That's not a position we want to be in, especially after investing as much as we have to unify the installation flows between topologies and environments. I know this seems ridiculous, given that we're arguing over something that isn't even supported, but this system is complex enough that we can't afford to make many assumptions, and where we do, we need to continually validate those assumptions with CI.

Comment 4 Pierre Prinetti 2021-01-26 12:00:39 UTC
(In reply to Alex Crawford from comment #3)

My understanding is that there is no reproducible issue. Moreover, I see no reason for a platform-specific constraint: the same rationale ("not tested") applies to all platforms.

I suggest you close this as NOTABUG and open a report against the Installer component to discuss enforcing the supported configuration with validation.

Comment 5 Alex Crawford 2021-01-26 19:44:47 UTC
The reproducible issue is that non-three-node clusters deployed to OpenStack do not follow the same flow as the rest of our platforms. This isn't simply a matter of this particular topology not being tested. This is a case where we _know_ that this deviates from the rest of our platforms. If we decide at some point that we do want to support five-node clusters, I worry that we will have forgotten about this particular edge case and will neglect to update the Terraform to match. With the validation that I'm proposing, it will be immediately obvious to whoever is making the change that non-three-node OpenStack clusters need a second look. That's the intention. The validation isn't for the customer; it's for us.

Comment 6 Pierre Prinetti 2021-02-02 15:32:23 UTC
Bumping priority to match the dependent bug.

Comment 8 Pierre Prinetti 2021-02-03 14:26:34 UTC
Note for the verifier:

With this fix, if `controlPlane.replicas` is set to anything but 3 in the install-config, the validation should prevent the installation.
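
As a rough illustration of the check described above, the sketch below rejects any replica count other than three. It is an assumption-laden stand-in written with the standard library only, not the installer's actual implementation: the function name validateControlPlaneReplicas is hypothetical, and the error text is simply copied from the message shown in the verification output in this report.

```go
package main

import "fmt"

// validateControlPlaneReplicas is a hypothetical stand-in for the validation
// described in this comment. It returns an error unless exactly three control
// plane replicas are requested. The message format mirrors the error quoted
// in this bug report; how the real installer builds it is not shown here.
func validateControlPlaneReplicas(replicas int64) error {
	if replicas != 3 {
		return fmt.Errorf(
			"controlPlane.replicas: Invalid value: %d: control plane must be exactly three nodes when provisioning on OpenStack",
			replicas)
	}
	return nil
}

func main() {
	for _, replicas := range []int64{3, 5} {
		if err := validateControlPlaneReplicas(replicas); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("accepted: controlPlane.replicas=%d\n", replicas)
	}
}
```

Failing early like this keeps the check visible to anyone who later changes the supported control plane size.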

Comment 9 weiwei jiang 2021-02-04 01:39:27 UTC
Verified with 4.7.0-0.nightly-2021-02-03-165316.

$ ./openshift-install-4.7 version 
./openshift-install-4.7 4.7.0-0.nightly-2021-02-03-165316
built from commit c60d07ec35db85f6f8e66a2ad202d2be24fca5aa
release image registry.ci.openshift.org/ocp/release@sha256:6f51980bd9e3c338de81867e617685a1bcc358e79ce77294ed354985397d8e36

$ ./openshift-install-4.7 create cluster --dir bz1919407                                                                        
FATAL failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Provisioning Check": controlPlane.replicas: Invalid value: 5: control plane must be exactly three nodes when provisioning on OpenStack

Comment 12 errata-xmlrpc 2021-02-24 15:55:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633