Bug 1828457

Summary: Some operator events are dropped because of upstream event correlator
Product: OpenShift Container Platform
Reporter: Michal Fojtik <mfojtik>
Component: kube-apiserver
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: medium
Priority: medium
Docs Contact:
Version: 4.5
CC: aos-bugs, kewang, mfojtik
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-13 17:31:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Michal Fojtik 2020-04-27 18:06:56 UTC
Description of problem:

All operators based on library-go in OpenShift use the "events.Recorder" wrapper, which provides a convenient way to send events to Kubernetes. This wrapper uses the default Kubernetes event recorder and broadcaster, which queue events and by default correlate similar events into a single event (so we don't create too many events by accident).

However, library-go based operators send many more events than Kubernetes itself, because events provide a nice timeline of what is going on in the system.

We need to tweak the correlator options to allow a higher QPS and BurstSize; tweaking these settings lets more events pass through the event correlator/aggregator. Additionally, upstream correlates events based on "reason", but we also need to correlate based on "message": we might have events with the same reason but different messages, and we don't want to lose those events.
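
For illustration, here is a minimal sketch of the kind of tuning this implies, using client-go's record.CorrelatorOptions. The helper names, the QPS/BurstSize numbers, and the exact set of key fields are assumptions for the example, not the actual library-go change:

// Sketch only: shows how the client-go event correlator could be tuned so that
// bursts of operator events are not collapsed. Values and names are illustrative.
package main

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// aggregateByReasonAndMessage builds the aggregation key from the message as
// well as the reason, so events that share a reason but carry different
// messages are not folded into a single aggregated event.
func aggregateByReasonAndMessage(event *corev1.Event) (string, string) {
	aggregateKey := strings.Join([]string{
		event.Source.Component,
		event.Source.Host,
		event.InvolvedObject.Kind,
		event.InvolvedObject.Namespace,
		event.InvolvedObject.Name,
		string(event.InvolvedObject.UID),
		event.InvolvedObject.APIVersion,
		event.Type,
		event.Reason,
		event.Message, // the upstream default keys on reason but not on message
	}, "")
	return aggregateKey, event.Message
}

// newOperatorEventBroadcaster returns a broadcaster whose correlator allows a
// higher sustained rate and a bigger burst before events start being dropped.
func newOperatorEventBroadcaster() record.EventBroadcaster {
	return record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{
		QPS:       10, // illustrative rate, higher than the upstream default
		BurstSize: 60, // illustrative burst size
		KeyFunc:   aggregateByReasonAndMessage,
	})
}

func main() {
	broadcaster := newOperatorEventBroadcaster()
	defer broadcaster.Shutdown()
	// ... wire the broadcaster into an event recorder / sink as usual ...
}

With a key function like this, two events that share a reason but have different messages get distinct aggregation keys and are recorded separately.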


Version-Release number of selected component (if applicable):

4.5

How reproducible:

Make the operator produce a lot of events; the events should not be correlated.
Additionally, this can be verified by looking at the "4.4" events.json available in CI artifacts and comparing it to the 4.5 events.json. The number of events should be 30-40% higher.
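
As a rough illustration, a small sketch that counts the events in two such artifacts (this assumes each events.json is a Kubernetes-style list with an "items" array, which may not match the real CI artifact layout; the file names are hypothetical):

// Sketch only: compares the number of events in two CI "events.json" files.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// countEvents returns the number of entries in the file's "items" array.
func countEvents(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	var list struct {
		Items []json.RawMessage `json:"items"`
	}
	if err := json.Unmarshal(data, &list); err != nil {
		return 0, err
	}
	return len(list.Items), nil
}

func main() {
	count44, err := countEvents("events-4.4.json") // hypothetical path to the 4.4 artifact
	if err != nil {
		panic(err)
	}
	count45, err := countEvents("events-4.5.json") // hypothetical path to the 4.5 artifact
	if err != nil {
		panic(err)
	}
	fmt.Printf("4.4: %d events, 4.5: %d events (%.0f%% more)\n",
		count44, count45, 100*float64(count45-count44)/float64(count44))
}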

Steps to Reproduce:
1.
2.
3.

Actual results:

Similar events are being correlated and lost.
Only 30 events are allowed per minute, per component.

Expected results:

Similar operator events should not be correlated based on reason alone.
More than 30 events should be allowed per minute, per component.


Additional info:

Comment 1 Michal Fojtik 2020-04-27 18:07:28 UTC
library-go change: https://github.com/openshift/library-go/pull/777

Comment 5 Ke Wang 2020-05-15 09:00:48 UTC
Verified with build OCP 4.5.0-0.nightly-2020-05-14-231228.

Force kube-apiserver pods to redeploy:
$ oc patch kubeapiservers/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "just a forced test01" } ]'

Wait for a while; after the related events have fired, count the openshift-kube-apiserver events:
$ oc get events -n openshift-kube-apiserver | awk '{ print $1}'| sort | uniq -c | sort -k 1
      1 17m
      1 2m
      1 2m12s
      1 2m16s
      1 2m2s
      1 3m54s
      1 4m18s
      1 4m28s
     16 28m
      1 91s
      1 93s
      1 LAST
      2 119s
      2 2m13s
     23 75m
      2 4m27s
     26 23m
     26 80m
      2 81m
     28 26m
     28 78m
      2 94s
     29 57m
      3 112s
     31 60m
      3 22m
      3 25m
      3 3m11s
      3 3m55s
      3 43s
      3 4m20s
      3 54m
      3 56m
      3 59m
      3 72m
      3 79m
     40 29m
      4 21m
      4 24m
      4 27m
      4 76m
     48 62m
      7 61m
      7 74m
      7 77m
      8 4m17s
      8 4m21s
      9 109s
      9 113s

From the above output, we can see 31 events at 60m, 40 events at 29m, and 48 events at 62m, so more than 30 events per minute are being allowed for the kube-apiserver component.
The results are as expected. Moving to verified.

Comment 6 errata-xmlrpc 2020-07-13 17:31:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409