Bug 1892288 - assisted install workflow creates excessive control-plane disruption
Summary: assisted install workflow creates excessive control-plane disruption
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-28 12:09 UTC by Sam Batschelet
Modified: 2021-02-24 15:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The operator performs an action that could result in a new static pod revision while the etcd cluster has fewer than 3 members. Consequence: Temporary quorum loss, which can be quite disruptive, with etcd unavailable for up to 1 minute. Fix: Avoid static pod revisions until all masters are up. Result: Self-inflicted quorum losses are avoided.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:28:37 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-etcd-operator pull 485 | 0 | None | closed | Bug 1892288: pkg/etcdcli: add IsQuorumFaultTolerant | 2021-02-11 14:36:29 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:29:05 UTC

Description Sam Batschelet 2020-10-28 12:09:42 UTC
Description of problem: In a typical CI install of OCP 4, we can expect a reproducible and consistent number of revisions for the static pod operators. A revision is triggered by a change to one of the key resources that the operator watches. These resources include secrets, configmaps, and nodes, among others.

A static pod revision is essentially an on-disk representation of the operand's resources for its namespace at a given point in time. When all of the nodes are available at the same time, some processes, such as TLS certificate creation for all of the etcd members, can happen at once. The net result is a minimal number of revisions.

Each revision requires the operand to restart, which in the case of etcd is costly because a leader change is a required result.

## problem

Currently, etcd has been observed with 6 revisions as a result of scaling with assisted-installer. In some extreme cases, this has resulted in etcd with terms as high as 90.

Install should be a graceful and predictable process.
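
To illustrate why a restart is so disruptive while the cluster is still scaling up, here is a minimal Go sketch of the quorum arithmetic described in the Doc Text. It is not the operator's code: the function names `quorumSize` and `quorumFaultTolerant` and their integer parameters are hypothetical and chosen only for this example. It assumes the usual etcd majority rule, where quorum is more than half of the members.

```go
package main

import "fmt"

// quorumSize returns the number of members that must be available for a
// cluster of totalMembers to keep quorum (a strict majority).
// Hypothetical helper for illustration only.
func quorumSize(totalMembers int) int {
	return totalMembers/2 + 1
}

// quorumFaultTolerant reports whether one more healthy member could go
// down (for example, restart for a new static pod revision) without the
// cluster losing quorum. Hypothetical helper for illustration only.
func quorumFaultTolerant(totalMembers, healthyMembers int) bool {
	if totalMembers < 1 || healthyMembers > totalMembers {
		return false
	}
	return healthyMembers-quorumSize(totalMembers) >= 1
}

func main() {
	cases := []struct{ total, healthy int }{
		{1, 1}, // bootstrap-sized cluster
		{2, 2}, // still scaling up
		{3, 2}, // full size, one member down
		{3, 3}, // full size, all healthy
	}
	for _, c := range cases {
		fmt.Printf("members=%d healthy=%d quorum=%d safeToRestartOne=%t\n",
			c.total, c.healthy, quorumSize(c.total),
			quorumFaultTolerant(c.total, c.healthy))
	}
}
```

Under these assumptions, only the case with 3 healthy members out of 3 tolerates a restart. With 1 or 2 members, restarting any single member drops the cluster below quorum, which is the temporary quorum loss described above.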


Version-Release number of selected component (if applicable):


How reproducible: fairly


Steps to Reproduce:
1. Create an assisted install and observe the logs.

Actual results: The control plane is unstable on some installs, which can result in install failure.


Expected results: The install workflow should be consistent for the control plane and should minimize disruption.


Additional info:

Comment 3 Sam Batschelet 2020-11-15 17:35:57 UTC
This bug is awaiting verification

Comment 5 ge liu 2020-12-22 09:48:42 UTC
I tried the installation many times and did not hit this issue. I then contacted the install team QE, who have not hit it either, so I am changing the status to verified.

Comment 7 errata-xmlrpc 2021-02-24 15:28:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

