Bug 1877106

Summary: Connectivity checker creates excessive events and possibly etcd db growth on Azure
Product: OpenShift Container Platform
Reporter: Dan Mace <dmace>
Component: kube-apiserver
Assignee: Luis Sanchez <sanchezl>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Priority: urgent
Version: 4.6
CC: aos-bugs, mfojtik, sttts, xxia
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-10-27 16:39:06 UTC
Type: Bug
Attachments:
Description | Flags
Multi platform trends | none
Azure example | none
DB size trends | none

Description Dan Mace 2020-09-08 20:18:55 UTC
Created attachment 1714175 [details]
Multi platform trends

Description of problem:

Analyzing a week's worth of CI stats for the e2e-* periodic jobs across AWS/GCP/Azure reveals that the connectivity checker is causing excessive events and possibly etcd db growth on Azure. See attached screenshots.

Here's a representative example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303104261227286528

$ jq -r '.items[] | .metadata.namespace + "/" + .reason' < events.json | sort | uniq -c | sort -bnr | head -5
2372 openshift-kube-apiserver/ConnectivityRestored
2140 openshift-apiserver/ConnectivityRestored
 128 openshift-authentication-operator/OperatorStatusChanged
  81 openshift-apiserver/ConnectivityOutageDetected
  78 openshift-monitoring/Pulled

For comparison, from a similar GCP job:

$ jq -r '.items[] | .metadata.namespace + "/" + .reason' < events.json | sort | uniq -c | sort -bnr | head -5
 108 openshift-kube-controller-manager-operator/OperatorStatusChanged
 106 openshift-authentication-operator/OperatorStatusChanged
  89 openshift-kube-apiserver-operator/OperatorStatusChanged
  87 openshift-etcd-operator/OperatorStatusChanged
  81 openshift-kube-apiserver/Pulled
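
For anyone who wants to repeat this comparison without jq, the pipeline above can be approximated in Python. This is only a sketch: it assumes events.json is a Kubernetes event list with a top-level "items" array, as the jq expression implies, and the function names top_event_reasons and load_events are made up for illustration. Ties may be ordered differently than in the sort/uniq output.

```python
import json
from collections import Counter


def top_event_reasons(events, n=5):
    """Count events by "namespace/reason" and return the n most frequent.

    `events` is the parsed content of an events.json dump (assumed to be a
    dict with an "items" list). Missing or null fields are treated as empty
    strings, which mirrors how jq concatenates null with a string.
    """
    counts = Counter()
    for item in events.get("items") or []:
        namespace = (item.get("metadata") or {}).get("namespace") or ""
        reason = item.get("reason") or ""
        counts[namespace + "/" + reason] += 1
    return counts.most_common(n)


def load_events(path):
    """Read an events.json file from disk and return the parsed JSON."""
    with open(path) as f:
        return json.load(f)
```

Calling top_event_reasons(load_events("events.json")) on the Azure and GCP dumps should yield the same kind of (key, count) pairs as the two listings above, which makes it easy to check how much of the total the ConnectivityRestored events account for.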


Comment 1 Dan Mace 2020-09-08 20:19:21 UTC
Created attachment 1714176 [details]
Azure example

Comment 2 Dan Mace 2020-09-08 20:19:45 UTC
Created attachment 1714177 [details]
DB size trends

Comment 4 Xingxing Xia 2020-09-28 12:22:51 UTC
I researched this for a while in order to verify it, but have not finished yet. The PR is a KAS-O PR, so I am selecting the correct component. Should there be an OAS-O PR too?

Comment 8 errata-xmlrpc 2020-10-27 16:39:06 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196