Description of problem:

Today cluster-etcd-operator's defrag controller has a few flaws, which this bug intends to cover.

- First, cluster-etcd-operator does not collect etcd client metrics and expose them to telemetry. With those metrics we could consider alerts on failed client attempts to defrag.
- The defrag controller currently defragments one member, checks cluster health, and on success defrags the next member. In some circumstances, such as smaller state files, the time between defrag attempts can be very short. This may cause the etcd client balancer to choose an endpoint poorly and result in a timeout. The resolution is to ensure a reasonable time exists between defrag attempts.
- The current metrics scrape interval for etcd is 30s, so when we try to identify individual defragmentation attempts they appear, in some circumstances, to happen at the same time. To improve this, let's ensure we wait at least 30s between defrag attempts.
- If defrag fails, we only log the error and do not emit an event.

Version-Release number of selected component (if applicable):

How reproducible: high

Steps to Reproduce:
1. Install a 4.9 cluster and observe metrics/logs/events.

Actual results: Defragmentation is not clearly observable.

Expected results: A clean signal and a clear understanding of defrag success/failure.

Additional info:
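For illustration, the pacing and event behavior requested above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the operator's actual code: the `defragLoop` type, its fields, the `simulate` helper, the event reason strings, and the 5s margin on top of the scrape interval are all hypothetical names and values chosen for the example. The dependencies are injected as plain functions so the loop can run without a cluster.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// scrapeInterval is the 30s etcd metrics scrape rate mentioned in this report.
const scrapeInterval = 30 * time.Second

// defragLoop holds injected dependencies. All names here are hypothetical
// stand-ins, not the operator's real types.
type defragLoop struct {
	minWait   time.Duration                // pause between members, at least scrapeInterval
	defrag    func(member string) error    // defragments a single member
	healthy   func() bool                  // cluster health check after each defrag
	sleep     func(time.Duration)          // time.Sleep in production, a fake in tests
	emitEvent func(reason, message string) // stand-in for an event recorder
	failures  int                          // stand-in for a failure counter metric
}

// run defragments members one at a time, pausing between attempts so that
// each attempt falls into a different scrape window.
func (l *defragLoop) run(members []string) error {
	for i, m := range members {
		if i > 0 {
			l.sleep(l.minWait)
		}
		if err := l.defrag(m); err != nil {
			// Count the failure and emit an event instead of only logging.
			l.failures++
			l.emitEvent("DefragFailed", fmt.Sprintf("member %s: %v", m, err))
			return fmt.Errorf("defrag of %s failed: %w", m, err)
		}
		if !l.healthy() {
			l.emitEvent("ClusterUnhealthy", "stopping after member "+m)
			return fmt.Errorf("cluster unhealthy after defrag of %s", m)
		}
		l.emitEvent("DefragSucceeded", "member "+m)
	}
	return nil
}

// simulate runs the loop over three fake members; defrag fails for the
// member named failOn (pass "" for a fully successful run).
func simulate(failOn string) (events []string, pauses, failures int, err error) {
	l := &defragLoop{
		minWait: scrapeInterval + 5*time.Second, // margin is an assumption
		defrag: func(member string) error {
			if member == failOn {
				return errors.New("context deadline exceeded")
			}
			return nil
		},
		healthy: func() bool { return true },
		sleep:   func(time.Duration) { pauses++ }, // record the pause, do not block
		emitEvent: func(reason, message string) {
			events = append(events, reason+": "+message)
		},
	}
	err = l.run([]string{"etcd-0", "etcd-1", "etcd-2"})
	return events, pauses, l.failures, err
}

func main() {
	events, pauses, failures, err := simulate("etcd-1")
	fmt.Println(events, pauses, failures, err)
}
```

In this sketch a failed defrag stops the loop, increments the failure count, and produces an event, so the failure is visible in metrics and events rather than only in logs. Whether the real controller should stop or continue with the remaining members after a failure is a design choice this example does not settle.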
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056