Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2024309

Summary:	cluster-etcd-operator: defrag controller needs to provide proper observability
Product:	OpenShift Container Platform	Reporter:	Sam Batschelet <sbatsche>
Component:	Etcd	Assignee:	Nobody <nobody>
Status:	CLOSED ERRATA	QA Contact:	ge liu <geliu>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.9
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2024311		Environment:
Last Closed:	2022-03-10 16:28:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2024311

Description Sam Batschelet 2021-11-17 19:21:00 UTC

Description of problem: Today cluster-etcd-operator's defrag controller has a few flaws which this bug intends to cover.

- first, cluster-etcd-operator does not collect etcd client metrics and expose them to telemetry. Collecting and exposing them would let us consider alerts on failed client attempts to Defrag.

- the defrag controller currently defragments one member, checks cluster health, and on success defragments the next. In some circumstances, such as with smaller state files, the interval between defrag attempts can be very short. This can cause the etcd client balancer to choose an endpoint poorly, resulting in a timeout. The resolution would be to ensure a reasonable amount of time exists between defrag attempts.

- the current metrics scrape interval for etcd is 30s, so when we try to identify defragmentation attempts in the metrics, they appear in some circumstances to happen at the same time. To improve this, let's ensure we wait over 30s between defrag attempts.

- if defrag fails, we only log the error and do not emit an event.
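
For illustration only, here is a minimal sketch of the pacing and event behavior the points above ask for. It is not the operator's code (the operator is written in Go), and every name in it is hypothetical: `defrag_members`, the `defrag`, `cluster_healthy`, and `emit_event` callables, and the 5s margin added on top of the 30s scrape interval are all assumptions made for the example.

```python
import time
from typing import Callable, List

# From the report: etcd metrics are scraped every 30s.
SCRAPE_INTERVAL_SECONDS = 30.0
# Assumption: a small margin so two attempts never share one scrape window.
WAIT_MARGIN_SECONDS = 5.0


def defrag_members(
    members: List[str],
    defrag: Callable[[str], None],
    cluster_healthy: Callable[[], bool],
    emit_event: Callable[[str], None],
    min_wait: float = SCRAPE_INTERVAL_SECONDS + WAIT_MARGIN_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Defragment members one at a time and return those that succeeded.

    All callables are hypothetical stand-ins for the real etcd client,
    health check, and event recorder.
    """
    done: List[str] = []
    for index, member in enumerate(members):
        if index > 0:
            # Wait longer than one scrape interval between attempts so each
            # defrag shows up as a separate signal in the metrics.
            sleep(min_wait)
        if not cluster_healthy():
            # Surface the skip as an event instead of only logging it.
            emit_event(f"DefragSkipped: cluster unhealthy before {member}")
            break
        try:
            defrag(member)
        except Exception as err:
            # Surface the failure as an event instead of only logging it.
            emit_event(f"DefragFailed: {member}: {err}")
            break
        done.append(member)
    return done
```

Passing `sleep` as a parameter is only a convenience so the wait can be replaced with a fake in tests. The loop stops at the first failure or unhealthy check, which is one possible policy and not necessarily what the controller should do.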



Version-Release number of selected component (if applicable):


How reproducible: high


Steps to Reproduce:
1. install 4.9 cluster and observe metrics/logs/events

Actual results: defragmentation is not clearly observable


Expected results: clean signal and understanding of defrag success/failure.


Additional info:

Comment 6 errata-xmlrpc 2022-03-10 16:28:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056