1892288 – assisted install workflow creates excessive control-plane disruption

Summary: assisted install workflow creates excessive control-plane disruption

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-10-28 12:09 UTC by Sam Batschelet
Modified:	2021-02-24 15:29 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The operator performs an action that can result in a new static pod revision while the etcd cluster has fewer than 3 members. Consequence: Temporary quorum loss, which can be quite disruptive, with etcd unavailable for up to 1 minute. Fix: Avoid static pod revisions while not all masters are up. Result: Self-inflicted quorum losses are avoided.
Clone Of:
Environment:
Last Closed:	2021-02-24 15:28:37 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 485	0	None	closed	Bug 1892288: pkg/etcdcli: add IsQuorumFaultTolerant	2021-02-11 14:36:29 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:29:05 UTC

Description Sam Batschelet 2020-10-28 12:09:42 UTC

Description of problem: In a typical CI install of OCP 4 we can expect a reproducible and consistent number of revisions for the static pod operators. Revisions are caused by changes to key resources that the operator watches. These resources include secrets, configmaps, and nodes, among others.

A static pod revision is essentially an on-disk representation of the operand's resources for its namespace at a given time. When all of the nodes are available at the same time, some processes, such as TLS certificate creation for all of the etcd members, can happen at once. The net result is fewer revisions.

Each revision requires the operand to restart, which in the case of etcd is costly because a leader change is a required result.

## problem

Currently, etcd has been observed with 6 revisions as a result of scaling with assisted-installer. In some extreme cases, this has resulted in etcd with terms as high as 90.

Install should be a graceful and predictable process.
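
The linked pull request is titled "pkg/etcdcli: add IsQuorumFaultTolerant", and the Doc Text describes avoiding revisions while the cluster has fewer than 3 members. The following is a minimal sketch of the quorum arithmetic behind such a guard, not the operator's actual code. The function names, signatures, and the plain integer inputs are assumptions made for illustration.

```go
package main

import "fmt"

// quorumSize returns how many voting members must agree for a cluster
// of the given size: floor(n/2) + 1.
func quorumSize(members int) int {
	return members/2 + 1
}

// isQuorumFaultTolerant is a hypothetical helper (not the operator's API).
// It reports whether the cluster could lose one healthy member, for example
// during a static pod restart, and still hold quorum.
func isQuorumFaultTolerant(totalMembers, healthyMembers int) bool {
	if totalMembers < 1 || healthyMembers < 0 || healthyMembers > totalMembers {
		return false
	}
	return healthyMembers-quorumSize(totalMembers) >= 1
}

func main() {
	cases := []struct{ total, healthy int }{{1, 1}, {2, 2}, {3, 2}, {3, 3}}
	for _, c := range cases {
		fmt.Printf("members=%d healthy=%d faultTolerant=%t\n",
			c.total, c.healthy, isQuorumFaultTolerant(c.total, c.healthy))
	}
}
```

Under this model, a cluster with 1 or 2 members, or 3 members with only 2 healthy, has no spare member, so restarting etcd for a new revision would cost quorum. Only the 3-of-3 case tolerates a restart.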


Version-Release number of selected component (if applicable):


How reproducible: fairly


Steps to Reproduce:
1. create assisted install and observe logs

Actual results: unstable control-plane on some installs can result in failure.


Expected results: the install workflow should be consistent for the control plane and should minimize disruption.


Additional info:

Comment 3 Sam Batschelet 2020-11-15 17:35:57 UTC

This bug is awaiting verification

Comment 5 ge liu 2020-12-22 09:48:42 UTC

I tried the installation many times and did not hit this issue. I then contacted the install team QE, who have not hit it either, so I am changing the status to verified.

Comment 7 errata-xmlrpc 2021-02-24 15:28:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
