2008175 – fsync controller will show false positive if gaps in metrics are observed.

Summary: fsync controller will show false positive if gaps in metrics are observed.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Sam Batschelet
QA Contact:	Sandeep
Docs Contact:
URL:
Whiteboard:
Depends On:	2013646
Blocks:

Reported:	2021-09-27 13:48 UTC by Sam Batschelet
Modified:	2021-11-22 21:47 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2013646
Environment:
Last Closed:	2021-11-22 21:47:05 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 686	0	None	open	Bug 2008175: pkg/operator/metriccontroller: Fix query	2021-10-14 01:01:15 UTC
Red Hat Product Errata	RHBA-2021:4712	0	None	None	None	2021-11-22 21:47:12 UTC

Description Sam Batschelet 2021-09-27 13:48:24 UTC

Description of problem: If the fsync controller observes gaps in metrics, those gaps can produce false positives for growth in leader elections over the time series. This can send an admin down an invalid path of etcd performance triage.

> Detected leader change increase of 1.25 over 5 minutes on "BareMetal"; disk metrics are: etcd-nfvpe-12.oot.lab.eng.bos.redhat.com=0.0025688888888888875,etcd-nfvpe-13.oot.lab.eng.bos.redhat.com=0.0018824444444444331,etcd-nfvpe-02.oot.lab.eng.bos.redhat.com=0.001912424242424239

The image below shows leader elections (green), with spikes in leader elections (blue) produced by the query.

https://user-images.githubusercontent.com/1249749/134918692-43ca6cf0-4fb0-4ee3-8735-f5d035d56170.png

This metric query clearly shows gaps in metrics collection, possibly caused by networking issues.

https://user-images.githubusercontent.com/1249749/134919312-f62591ee-57df-41de-a313-545e207d89ad.png
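
The following is a minimal sketch of how a gap in scraped samples could turn into a reported increase. It is an illustration only: the operator's actual query is not shown in this report, and the zero-fill behaviour, the function names, and the sample values are all assumptions made for the example.

```python
from typing import List, Optional


def increase_with_resets(values: List[float]) -> float:
    """Counter-style increase: a drop is treated as a counter reset."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # After a presumed reset the counter restarts from zero,
            # so the whole current value counts as new growth.
            total += cur
    return total


def naive_increase(samples: List[Optional[float]]) -> float:
    """Hypothetical: missing scrapes are filled with 0 before the increase."""
    return increase_with_resets([0.0 if s is None else s for s in samples])


def gap_aware_increase(samples: List[Optional[float]]) -> float:
    """Hypothetical: missing scrapes are skipped, only real samples compared."""
    return increase_with_resets([s for s in samples if s is not None])


# The counter stays at 7 the whole time; two scrapes are missing (None).
samples = [7.0, 7.0, None, None, 7.0, 7.0]
print("naive:", naive_increase(samples))
print("gap-aware:", gap_aware_increase(samples))
```

In this sketch the counter never changes, yet the zero-filled series reports growth when the samples return, while the gap-aware version reports none.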

Version-Release number of selected component (if applicable):


How reproducible: 100% given the condition


Steps to Reproduce:
1. Generate gaps in metrics collection by disrupting networking.
2. Review events from the cluster-etcd-operator namespace for "Detected leader change increase" events.
3. Verify that etcd is not actually experiencing leader elections by checking `etcd_server_leader_changes_seen_total`.
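
For step 3, one way to inspect the raw counter is a range query against the Prometheus HTTP API (`/api/v1/query_range`). The sketch below only builds the request URL; the base URL, the timestamps, the step, and the helper name are placeholders assumed for illustration, and authentication is left out.

```python
from urllib.parse import urlencode


def leader_changes_range_url(base_url: str, start: int, end: int, step: str = "30s") -> str:
    """Build a query_range URL for the raw leader-change counter.

    base_url is a hypothetical Prometheus endpoint; start and end are
    Unix timestamps in seconds.
    """
    params = {
        "query": "etcd_server_leader_changes_seen_total",
        "start": start,
        "end": end,
        "step": step,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Example window of five minutes; the host name is a placeholder.
url = leader_changes_range_url("https://prometheus.example.com/", 1632750000, 1632750300)
print(url)
```

If the returned series is flat on both sides of a gap, the counter did not actually increase during the reported window.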

Actual results: Leader elections are falsely reported by the fsync controller.


Expected results: The fsync controller should emit an event only when the issue is actually observed.


Additional info:

Comment 4 ge liu 2021-11-09 09:59:51 UTC

Verification of this bug is blocked by a new bug: https://bugzilla.redhat.com/show_bug.cgi?id=2021453

Comment 13 errata-xmlrpc 2021-11-22 21:47:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4712
