1877106 – Connectivity checker creates excessive events and possibly etcd db growth on Azure

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-apiserver
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Luis Sanchez
QA Contact:	Xingxing Xia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-08 20:18 UTC by Dan Mace
Modified:	2020-10-27 16:39 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:39:06 UTC
Target Upstream Version:
Embargoed:

Attachments
Multi platform trends (317.63 KB, image/png) 2020-09-08 20:18 UTC, Dan Mace
Azure example (348.56 KB, image/png) 2020-09-08 20:19 UTC, Dan Mace
DB size trends (402.13 KB, image/png) 2020-09-08 20:19 UTC, Dan Mace

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-kube-apiserver-operator pull 948	0	None	closed	Bug 1877106: Connectivity checker creates excessive events and possibly etcd db growth on Azure	2020-10-13 08:20:29 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:39:07 UTC

Description Dan Mace 2020-09-08 20:18:55 UTC

Created attachment 1714175 [details]
Multi platform trends

Description of problem:

Analyzing a week's worth of CI stats for the e2e-* periodic jobs across AWS/GCP/Azure reveals that the connectivity checker is causing excessive events and possibly etcd db growth on Azure. See attached screenshots.

Here's a representative example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303104261227286528

$ jq -r '.items[] | .metadata.namespace + "/" + .reason' < events.json | sort | uniq -c | sort -bnr | head -5
2372 openshift-kube-apiserver/ConnectivityRestored
2140 openshift-apiserver/ConnectivityRestored
 128 openshift-authentication-operator/OperatorStatusChanged
  81 openshift-apiserver/ConnectivityOutageDetected
  78 openshift-monitoring/Pulled

For comparison, from a similar GCP job:

$ jq -r '.items[] | .metadata.namespace + "/" + .reason' < events.json | sort | uniq -c | sort -bnr | head -5
 108 openshift-kube-controller-manager-operator/OperatorStatusChanged
 106 openshift-authentication-operator/OperatorStatusChanged
  89 openshift-kube-apiserver-operator/OperatorStatusChanged
  87 openshift-etcd-operator/OperatorStatusChanged
  81 openshift-kube-apiserver/Pulled
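
For anyone who wants to reproduce the tallies above without jq, here is a minimal Python sketch of the same count. It assumes events.json is a Kubernetes event list with an "items" array whose entries carry metadata.namespace and reason, as the jq filter implies. The function name top_event_reasons and the inline sample data are hypothetical, made up for illustration only.

```python
from collections import Counter


def top_event_reasons(event_list, n=5):
    """Return the n most common "namespace/reason" keys with their counts.

    Hypothetical helper mirroring:
      jq -r '.items[] | .metadata.namespace + "/" + .reason' | sort | uniq -c | sort -bnr | head -n
    """
    counts = Counter(
        # Missing or null fields become "" so the key is still a string.
        (item.get("metadata", {}).get("namespace") or "")
        + "/"
        + (item.get("reason") or "")
        for item in event_list.get("items", [])
    )
    return counts.most_common(n)


# Tiny made-up sample in the same shape as events.json (not real CI data).
sample = {
    "items": [
        {"metadata": {"namespace": "ns-a"}, "reason": "ConnectivityRestored"},
        {"metadata": {"namespace": "ns-a"}, "reason": "ConnectivityRestored"},
        {"metadata": {"namespace": "ns-b"}, "reason": "Pulled"},
    ]
}

for key, count in top_event_reasons(sample):
    # Right-align the count, similar to `uniq -c` output.
    print(f"{count:5d} {key}")
```

To run it against a real job artifact, load the file with json.load and pass the resulting dict to the function in place of the sample.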


Comment 1 Dan Mace 2020-09-08 20:19:21 UTC

Created attachment 1714176 [details]
Azure example

Comment 2 Dan Mace 2020-09-08 20:19:45 UTC

Created attachment 1714177 [details]
DB size trends

Comment 4 Xingxing Xia 2020-09-28 12:22:51 UTC

I spent a while researching how to verify this but haven't finished yet. BTW, the PR is a KAS-O PR, so I'm selecting the correct component. Should there be an OAS-O PR too?

Comment 8 errata-xmlrpc 2020-10-27 16:39:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
