Bug 2024309 - cluster-etcd-operator: defrag controller needs to provide proper observability
Summary: cluster-etcd-operator: defrag controller needs to provide proper observability
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Nobody
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 2024311
Reported: 2021-11-17 19:21 UTC by Sam Batschelet
Modified: 2022-03-10 16:29 UTC (History)
0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2024311 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:28:46 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:29:08 UTC

Description Sam Batschelet 2021-11-17 19:21:00 UTC
Description of problem: Today cluster-etcd-operator's defrag controller has a few flaws which this bug intends to cover.

- First, cluster-etcd-operator does not collect etcd client metrics or expose them to telemetry. If it did, we could consider alerting on failed client Defrag attempts.

- The defrag controller currently defragments one member, checks cluster health, and on success moves on to defragment the next member. In some circumstances, such as smaller state files, the time between defrag attempts can be very short. This can cause the etcd client balancer to choose an endpoint poorly, resulting in a timeout. The resolution would be to ensure that a reasonable amount of time exists between defrag attempts.

- The current metrics scrape interval for etcd is 30s, so when we try to identify defragmentation attempts, in some circumstances they appear to happen at the same time. To improve this, let's ensure we wait 30s or more between defrag attempts.

- If a defrag fails, we only log the error and do not emit an event.



Version-Release number of selected component (if applicable):


How reproducible: high


Steps to Reproduce:
1. install 4.9 cluster and observe metrics/logs/events
2.
3.

Actual results: defragmentation is not clearly observable


Expected results: clean signal and understanding of defrag success/failure.


Additional info:

Comment 6 errata-xmlrpc 2022-03-10 16:28:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

