Bug 1892288

Summary: assisted install workflow creates excessive control-plane disruption
Product: OpenShift Container Platform Reporter: Sam Batschelet <sbatsche>
Component: Etcd Assignee: Sam Batschelet <sbatsche>
Status: CLOSED ERRATA QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: high    
Version: 4.6 CC: aos-bugs, skolicha
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The operator performs an action that could result in a new static pod revision while the etcd cluster has fewer than 3 members. Consequence: Temporary quorum loss, which can be quite disruptive, with etcd unavailable for up to 1 minute. Fix: Avoid static pod revisions when not all masters are up. Result: Self-inflicted quorum losses are avoided.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:28:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Sam Batschelet 2020-10-28 12:09:42 UTC
Description of problem: In a typical CI install of OCP 4 we can expect a reproducible and consistent number of revisions for the static pod operators. Revisions are caused by changes to key resources that the operator is watching for a change. These resources include secrets, configmaps, and nodes among others. 

A static pod revision is essentially an on-disk representation of the operand's resources for its namespace at a given point in time. When all of the nodes are available at the same time, some processes, such as TLS certificate creation for all of the etcd members, can happen at once. The net result is a minimization of revisions.
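As a rough illustration of why simultaneous node availability minimizes revisions, here is a toy model, not the operator's actual logic. It assumes that changes to watched resources landing within the same short window are folded into a single revision. The function name `count_revisions`, the `window` parameter, and the timestamps are all hypothetical.

```python
def count_revisions(change_times, window=1.0):
    """Toy model: count revisions produced by a series of resource changes.

    Assumption (not taken from the operator source): every change that
    arrives within `window` seconds of the start of the current batch is
    folded into the same revision; a later change starts a new revision.
    """
    revisions = 0
    batch_start = None
    for t in sorted(change_times):
        if batch_start is None or t - batch_start > window:
            revisions += 1
            batch_start = t
    return revisions


# All three masters are up together, so their TLS certificates
# are created at roughly the same time.
together = count_revisions([0.0, 0.1, 0.2])

# Masters join one at a time, minutes apart.
staggered = count_revisions([0.0, 120.0, 240.0])

print(together, staggered)
```

Under this assumption, the staggered case yields one revision per joining master, and each of those revisions restarts etcd.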

Each revision requires the operand to restart, which in the case of etcd is costly because a leader change is a required result.
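The cost of a restart depends on cluster size. etcd needs a majority of members to make progress, so a restart that takes one member down loses quorum whenever the cluster has fewer than 3 members. The sketch below shows that arithmetic together with a hypothetical guard in the spirit of the fix noted in the Doc Text. `should_roll_revision` and `expected_masters` are illustrative names, not the operator's API.

```python
def quorum(members: int) -> int:
    """Majority quorum size for a cluster of `members` voting members."""
    return members // 2 + 1


def survives_restart(members: int) -> bool:
    """True if the cluster keeps quorum while exactly one member restarts."""
    return members - 1 >= quorum(members)


def should_roll_revision(members: int, expected_masters: int = 3) -> bool:
    """Hypothetical guard: hold back a new static pod revision until all
    expected masters have joined and a single restart is safe."""
    return members >= expected_masters and survives_restart(members)


for n in (1, 2, 3):
    print(n, quorum(n), survives_restart(n), should_roll_revision(n))
```

With 1 or 2 members, a single restart drops the cluster below quorum. With 3 members, two remain available and the cluster keeps serving while one restarts.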

## problem

Currently, etcd has been observed with 6 revisions as a result of scaling with assisted-installer. In some extreme cases, this has resulted in etcd with terms as high as 90.

Install should be a graceful and predictable process.

Version-Release number of selected component (if applicable):

How reproducible: fairly

Steps to Reproduce:
1. create assisted install and observe logs

Actual results: unstable control-plane on some installs can result in failure.

Expected results: the install workflow should be consistent for the control-plane and should minimize disruption.

Additional info:

Comment 3 Sam Batschelet 2020-11-15 17:35:57 UTC
This bug is awaiting verification

Comment 5 ge liu 2020-12-22 09:48:42 UTC
I tried the installation many times and did not hit this issue. I then contacted the install team QE, and they have not hit it either, so I am changing the status to verified.

Comment 7 errata-xmlrpc 2021-02-24 15:28:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.