Bug 1924670 - OpenStack IPI has three-node control plane limitation, but InstallConfigs aren't verified
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Pierre Prinetti
QA Contact: Pedro Amoedo
URL:
Whiteboard:
Depends On: 1919407
Blocks: 1916297
 
Reported: 2021-02-03 12:28 UTC by Pierre Prinetti
Modified: 2021-02-22 13:55 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1919407
Environment:
Last Closed: 2021-02-22 13:54:57 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub openshift/installer pull 4612 (closed): Bug 1924670: openstack/validation: enforce control plane size (last updated 2021-02-20 05:58:48 UTC)
- Red Hat Product Errata RHBA-2021:0510 (last updated 2021-02-22 13:55:20 UTC)

Description Pierre Prinetti 2021-02-03 12:28:19 UTC
+++ This bug was initially created as a clone of Bug #1919407 +++

A fix for https://bugzilla.redhat.com/show_bug.cgi?id=1909587 recently went into the installer, but this introduced another issue: regardless of the size specified by the install config, the control plane is always three nodes. This won't affect supported configurations since we only support three nodes, but this is still technically wrong and will eventually come back to bite us.
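For reference, the control-plane size in question comes from the install-config. A minimal, abridged fragment that would trigger this scenario (cluster name and cloud entry are hypothetical placeholders):

```yaml
apiVersion: v1
metadata:
  name: mycluster        # hypothetical cluster name
controlPlane:
  name: master
  replicas: 5            # requested size; more than the three nodes Terraform provisions
platform:
  openstack:
    cloud: mycloud       # hypothetical clouds.yaml entry
```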

--- Additional comment from Eric Paris on 2021-01-22 20:00:47 UTC ---

This bug has set a target release without specifying a severity. As part of triage when determining the importance of bugs a severity should be specified. Since these bugs have not been properly triaged we are removing the target release. Teams will need to add a severity before setting the target release again.

--- Additional comment from Pierre Prinetti on 2021-01-22 20:48:15 UTC ---

> regardless of the size specified by the install config, the control plane is always three nodes

Can you please add steps to reproduce? We currently have QE coverage[1] of a cluster with more than three control plane nodes.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1909587#c8

--- Additional comment from Alex Crawford on 2021-01-26 01:08:46 UTC ---

I spent a while trying to get access to an OpenStack cluster today. That didn't happen. So I tried to reproduce the conditions on AWS instead and now I see why this works.

The installer creates Machine objects to represent each of the control-plane nodes, but then uses Terraform to provision them. These Machine objects are usually inert and only reflect the existing state of the node. In the case of a five-node cluster on OpenStack, however, only three nodes are provisioned by Terraform. The two remaining Machine objects eventually trigger the Machine API Operator to go and create the missing nodes.

I still want to add the proposed verification step to the code, because the flow I've described is unique to this platform and is not tested in CI. That's not a position we want to be in, especially after investing as much as we have in unifying the installation flows between topologies and environments. I know this seems ridiculous, given that we're arguing over something that isn't even supported, but this system is complex enough that we can't afford to make many assumptions, and where we do, we need to continually validate those assumptions with CI.

--- Additional comment from Pierre Prinetti on 2021-01-26 12:00:39 UTC ---

(In reply to Alex Crawford from comment #3)

My understanding is that there is no reproducible issue. Moreover, I see no reason for a platform-specific constraint: the same rationale ("not tested") applies to all platforms.

I suggest closing this as NOTABUG and opening a report against the Installer component to discuss enforcing the supported configuration with validation.

--- Additional comment from Alex Crawford on 2021-01-26 19:44:47 UTC ---

The reproducible issue is that non-three-node clusters deployed to OpenStack do not follow the same flow as the rest of our platforms. This isn't simply a matter of this particular topology not being tested; this is a case where we _know_ that it deviates from the rest of our platforms. If we decide at some point that we do want to support five-node clusters, I worry that we will have forgotten about this particular edge case and will neglect to update the Terraform to match. With the validation I'm proposing, it will be immediately obvious to whoever is making the change that non-three-node OpenStack clusters need a second look. That's the intention. The validation isn't for the customer; it's for us.
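The validation that landed in openshift/installer#4612 can be sketched roughly as follows. This is a hypothetical, simplified standalone version (the function name and signature are illustrative, not the installer's actual API); it mirrors the error text the verification run below reports:

```go
package main

import "fmt"

// validateControlPlaneReplicas is a hypothetical sketch of the kind of
// install-config check added by openshift/installer#4612: on OpenStack,
// reject any control-plane size other than exactly three, so the Terraform
// flow and the generated Machine objects can never disagree.
func validateControlPlaneReplicas(platform string, replicas int64) error {
	if platform == "openstack" && replicas != 3 {
		return fmt.Errorf(
			"controlPlane.replicas: Invalid value: %d: control plane must be exactly three nodes when provisioning on OpenStack",
			replicas)
	}
	// Other platforms (and replicas == 3) pass through unchanged.
	return nil
}

func main() {
	// A five-node OpenStack control plane is rejected up front.
	fmt.Println(validateControlPlaneReplicas("openstack", 5))
	// The supported three-node topology is accepted.
	fmt.Println(validateControlPlaneReplicas("openstack", 3))
}
```

Failing fast at install-config validation time, before any Terraform runs, is what makes the edge case "immediately obvious" to a future maintainer.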

--- Additional comment from Pierre Prinetti on 2021-02-02 15:32:23 UTC ---

Bumping priority to match the dependent bug.

--- Additional comment from OpenShift Automated Release Tooling on 2021-02-02 22:38:07 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.

Comment 1 Pierre Prinetti 2021-02-03 14:24:03 UTC
https://github.com/openshift/installer/pull/4612 waiting for verification of Bug 1919407

Comment 2 Pierre Prinetti 2021-02-04 12:51:17 UTC
Waiting on the patch manager's review https://github.com/openshift/installer/pull/4612

Comment 5 Pedro Amoedo 2021-02-17 12:09:09 UTC
Verified with version "4.6.0-0.nightly-2021-02-13-034601"; the installer interrupts execution when the control-plane replica count is not 3, as expected:

~~~
02-17 12:56:18  ./openshift-install 4.6.0-0.nightly-2021-02-13-034601
02-17 12:56:18  built from commit 74b1d08f7f44c9c0d3f999c669a16fb2395e9551
02-17 12:56:18  release image registry.ci.openshift.org/ocp/release@sha256:aa0d57d5cb325f33b1679dc0fa21a375a972e0fcb8efa6da0edd202ea03a759a
...
02-17 12:57:50  level=fatal msg="failed to fetch Cluster: failed to fetch dependency of \"Cluster\": failed to generate asset \"Platform Provisioning Check\": controlPlane.replicas: Invalid value: 5: control plane must be exactly three nodes when provisioning on OpenStack"
~~~

Best Regards.

Comment 7 errata-xmlrpc 2021-02-22 13:54:57 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.18 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0510

