Bug 1877106 - Connectivity checker creates excessive events and possibly etcd db growth on Azure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Luis Sanchez
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-08 20:18 UTC by Dan Mace
Modified: 2020-10-27 16:39 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:39:06 UTC
Target Upstream Version:
Embargoed:


Attachments
Multi platform trends (317.63 KB, image/png)
2020-09-08 20:18 UTC, Dan Mace
Azure example (348.56 KB, image/png)
2020-09-08 20:19 UTC, Dan Mace
DB size trends (402.13 KB, image/png)
2020-09-08 20:19 UTC, Dan Mace


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-kube-apiserver-operator pull 948 | 0 | None | closed | Bug 1877106: Connectivity checker creates excessive events and possibly etcd db growth on Azure | 2020-10-13 08:20:29 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:39:07 UTC

Description Dan Mace 2020-09-08 20:18:55 UTC
Created attachment 1714175
Multi platform trends

Description of problem:

Analyzing a week's worth of CI stats for the e2e-* periodic jobs across AWS/GCP/Azure reveals that the connectivity checker is causing excessive events and possibly etcd db growth on Azure. See attached screenshots.

Here's a representative example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303104261227286528

$ jq -r '.items[] | .metadata.namespace + "/" + .reason' < events.json | sort | uniq -c | sort -bnr | head -5
2372 openshift-kube-apiserver/ConnectivityRestored
2140 openshift-apiserver/ConnectivityRestored
 128 openshift-authentication-operator/OperatorStatusChanged
  81 openshift-apiserver/ConnectivityOutageDetected
  78 openshift-monitoring/Pulled

For comparison, from a similar GCP job:

$ jq -r '.items[] | .metadata.namespace + "/" + .reason' < events.json | sort | uniq -c | sort -bnr | head -5
 108 openshift-kube-controller-manager-operator/OperatorStatusChanged
 106 openshift-authentication-operator/OperatorStatusChanged
  89 openshift-kube-apiserver-operator/OperatorStatusChanged
  87 openshift-etcd-operator/OperatorStatusChanged
  81 openshift-kube-apiserver/Pulled
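
For anyone who prefers to script the comparison, the jq pipelines above can be approximated in Python. This is a minimal sketch, not part of the original analysis: it assumes events.json is a Kubernetes event list with an "items" array whose entries carry "metadata.namespace" and "reason", which is what the jq filter implies. The function names count_events and connectivity_share are hypothetical, and matching reasons on the "Connectivity" prefix is an assumption based on the two reason names seen in the Azure output.

```python
import json
from collections import Counter


def count_events(events, top=5):
    """Count events per "<namespace>/<reason>", most frequent first.

    Mirrors: jq ... | sort | uniq -c | sort -bnr | head -5
    """
    counts = Counter(
        item.get("metadata", {}).get("namespace", "") + "/" + item.get("reason", "")
        for item in events.get("items", [])
    )
    return counts.most_common(top)


def connectivity_share(events):
    """Fraction of all events whose reason starts with "Connectivity".

    The prefix match is an assumption; it covers ConnectivityRestored
    and ConnectivityOutageDetected from the Azure example.
    """
    items = events.get("items", [])
    if not items:
        return 0.0
    hits = sum(1 for item in items if item.get("reason", "").startswith("Connectivity"))
    return hits / len(items)


def load_events(path):
    """Load an events.json artifact from disk."""
    with open(path) as f:
        return json.load(f)
```

Calling count_events(load_events("events.json")) returns (key, count) pairs in the same order as the jq output, and connectivity_share gives a single number that is easier to compare across the Azure and GCP jobs.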


Comment 1 Dan Mace 2020-09-08 20:19:21 UTC
Created attachment 1714176
Azure example

Comment 2 Dan Mace 2020-09-08 20:19:45 UTC
Created attachment 1714177
DB size trends

Comment 4 Xingxing Xia 2020-09-28 12:22:51 UTC
I have spent some time on verification but have not finished yet. Since the PR is a KAS-O PR, I am setting the component to match. Should there be an OAS-O PR too?

Comment 8 errata-xmlrpc 2020-10-27 16:39:06 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

