Bug 1828457 - Some operator events are dropped because of upstream event correlator
Summary: Some operator events are dropped because of upstream event correlator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Michal Fojtik
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-27 18:06 UTC by Michal Fojtik
Modified: 2020-07-13 17:32 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:31:52 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-authentication-operator pull 282 | 0 | None | closed | Bug 1828457: bump(*): vendor update | 2020-07-22 15:34:10 UTC
Github | openshift cluster-config-operator pull 129 | 0 | None | closed | Bug 1828457: bump(*): vendor update | 2020-07-22 15:34:09 UTC
Github | openshift cluster-etcd-operator pull 323 | 0 | None | closed | Bug 1828457: bump(*): vendor update | 2020-07-22 15:34:09 UTC
Github | openshift cluster-kube-apiserver-operator pull 837 | 0 | None | closed | Bug 1828457: bump(*): vendor update | 2020-07-22 15:34:08 UTC
Github | openshift cluster-kube-controller-manager-operator pull 406 | 0 | None | closed | Bug 1828457: bump(*): vendor update | 2020-07-22 15:34:08 UTC
Github | openshift cluster-kube-scheduler-operator pull 243 | 0 | None | closed | Bug 1828457: bump(*): vendor update | 2020-07-22 15:34:08 UTC
Github | openshift cluster-openshift-apiserver-operator pull 358 | 0 | None | closed | Bug 1828457: bump(*): vendor update | 2020-07-22 15:34:08 UTC
Github | openshift service-ca-operator pull 118 | 0 | None | closed | Bug 1828457: bump(*): vendor update | 2020-07-22 15:34:08 UTC
Red Hat Product Errata | RHBA-2020:2409 | 0 | None | None | None | 2020-07-13 17:32:09 UTC

Description Michal Fojtik 2020-04-27 18:06:56 UTC
Description of problem:

All library-go based operators in OpenShift use the "events.Recorder" wrapper, which provides a convenient way to send events to Kubernetes. This wrapper uses the default Kubernetes event recorder and broadcaster, which queue events and, by default, correlate similar events into a single event (so that we don't create too many events by accident).

However, library-go based operators send many more events than Kubernetes does, because events provide a nice timeline of what is going on in the system.

We need to tweak the correlator options to allow a higher QPS and BurstSize, so that more events get through the event correlator/aggregator. Additionally, upstream correlates events based on "reason"; we also need to correlate based on "message", because we might have events with the same reason but different messages, and we don't want to lose these events.
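
To illustrate why the correlation key and the QPS/BurstSize settings matter, here is a small model of the behavior. It is a simplified Python sketch under stated assumptions, not the library-go or client-go code: the names TokenBucket, correlate, key_by_reason and key_by_reason_and_message, the event fields, and all parameter values are hypothetical. Upstream folds similar events into a combined event; this model simply drops them.

```python
from collections import defaultdict


class TokenBucket:
    """Holds up to `burst` tokens, refilled at `qps` tokens per second."""

    def __init__(self, qps, burst):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def key_by_reason(event):
    # Hypothetical key: events with the same reason count as "similar".
    return (event["component"], event["reason"])


def key_by_reason_and_message(event):
    # Hypothetical key: the message must match too.
    return (event["component"], event["reason"], event["message"])


def correlate(events, key_func, max_similar, qps, burst):
    """Return the events that survive aggregation and rate limiting.

    events: dicts with "time" (seconds), "component", "reason", "message".
    """
    similar = defaultdict(int)
    buckets = {}
    kept = []
    for event in events:
        key = key_func(event)
        similar[key] += 1
        if similar[key] > max_similar:
            continue  # treated as a duplicate of an earlier, similar event
        bucket = buckets.setdefault(event["component"], TokenBucket(qps, burst))
        if bucket.allow(event["time"]):
            kept.append(event)
    return kept


# Five events with the same reason but different messages.
events = [
    {"time": i, "component": "operator", "reason": "ObserveConfigChanged",
     "message": f"change {i}"}
    for i in range(5)
]
by_reason = correlate(events, key_by_reason, max_similar=2, qps=1.0, burst=10)
by_message = correlate(events, key_by_reason_and_message, max_similar=2,
                       qps=1.0, burst=10)
print(len(by_reason), len(by_message))
```

In this model, keying on reason alone keeps 2 of the 5 events, while keying on reason and message keeps all 5. Lowering qps and burst drops events regardless of the key.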


Version-Release number of selected component (if applicable):

4.5

How reproducible:

Make the operator produce a lot of events; the events should not be correlated.
Additionally, this can be verified by comparing the 4.4 "events.json" available in the CI artifacts with the 4.5 "events.json". The number of events in 4.5 should be 30-40% higher.
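
The comparison can be scripted. A minimal sketch in Python, assuming that each events.json is a Kubernetes event list with a top-level "items" array (not confirmed in this report); the helper names count_events and percent_increase are hypothetical, and the sample data is made up for illustration.

```python
import json


def count_events(event_list_json):
    """Count the events in a JSON document with a top-level "items" array."""
    return len(json.loads(event_list_json).get("items", []))


def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) * 100.0 / before


# Made-up sample data standing in for the 4.4 and 4.5 files.
old = json.dumps({"items": [{"reason": "A"}] * 10})
new = json.dumps({"items": [{"reason": "A"}] * 13})
print(percent_increase(count_events(old), count_events(new)))
```

For real artifacts, read each file's contents (for example with open(path).read()) and pass the strings to count_events.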

Steps to Reproduce:
1.
2.
3.

Actual results:

Similar events are correlated and lost.
Only 30 events are allowed per minute, per component.

Expected results:

For operators, similar events should not be correlated based on Reason alone.
More than 30 events should be allowed per minute, per component.


Additional info:

Comment 1 Michal Fojtik 2020-04-27 18:07:28 UTC
library-go change: https://github.com/openshift/library-go/pull/777

Comment 5 Ke Wang 2020-05-15 09:00:48 UTC
Verified with build OCP 4.5.0-0.nightly-2020-05-14-231228.

Force a redeployment of the kube-apiserver pods:
$ oc patch kubeapiservers/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "just a forced test01" } ]'

Wait a while; after the related events have fired, count the openshift-kube-apiserver events:
$ oc get events -n openshift-kube-apiserver | awk '{ print $1}'| sort | uniq -c | sort -k 1
      1 17m
      1 2m
      1 2m12s
      1 2m16s
      1 2m2s
      1 3m54s
      1 4m18s
      1 4m28s
     16 28m
      1 91s
      1 93s
      1 LAST
      2 119s
      2 2m13s
     23 75m
      2 4m27s
     26 23m
     26 80m
      2 81m
     28 26m
     28 78m
      2 94s
     29 57m
      3 112s
     31 60m
      3 22m
      3 25m
      3 3m11s
      3 3m55s
      3 43s
      3 4m20s
      3 54m
      3 56m
      3 59m
      3 72m
      3 79m
     40 29m
      4 21m
      4 24m
      4 27m
      4 76m
     48 62m
      7 61m
      7 74m
      7 77m
      8 4m17s
      8 4m21s
      9 109s
      9 113s

From the above output, we can see 31 events at 60 mins, 40 events at 29 mins, and 48 events at 62 mins. So more than 30 events per minute are now allowed for the kube-apiserver component.
The results were as expected. Moving to VERIFIED.
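
The counts above are grouped by the age string in the first column, so recent events fall into per-second groups while older ones are grouped per minute. A minimal Python sketch (not part of the original verification) that converts each age to whole minutes before counting. It assumes ages only use the s/m/h/d units seen in output like the above; the helper names are hypothetical.

```python
import re
from collections import Counter

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def age_to_seconds(age):
    """Convert an age such as "2m12s" or "91s" to seconds."""
    parts = re.findall(r"(\d+)([smhd])", age)
    # Reject anything that is not fully made of <number><unit> pairs,
    # such as the "LAST" header row.
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognized age: {age!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def events_per_minute(ages):
    """Count events per whole minute of age."""
    return Counter(age_to_seconds(a) // 60 for a in ages)


sample = ["91s", "93s", "94s", "119s", "2m", "2m12s"]
print(events_per_minute(sample))
```

Here the four ages between 60 and 119 seconds land in minute 1, and the remaining two in minute 2.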

Comment 6 errata-xmlrpc 2020-07-13 17:31:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

