1828457 – Some operator events are dropped because of upstream event correlator

Summary: Some operator events are dropped because of upstream event correlator

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-apiserver
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Michal Fojtik
QA Contact:	Ke Wang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-04-27 18:06 UTC by Michal Fojtik
Modified:	2020-07-13 17:32 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-13 17:31:52 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-authentication-operator pull 282	None	closed	Bug 1828457: bump(*): vendor update	2020-07-22 15:34:10 UTC
Github	openshift cluster-config-operator pull 129	None	closed	Bug 1828457: bump(*): vendor update	2020-07-22 15:34:09 UTC
Github	openshift cluster-etcd-operator pull 323	None	closed	Bug 1828457: bump(*): vendor update	2020-07-22 15:34:09 UTC
Github	openshift cluster-kube-apiserver-operator pull 837	None	closed	Bug 1828457: bump(*): vendor update	2020-07-22 15:34:08 UTC
Github	openshift cluster-kube-controller-manager-operator pull 406	None	closed	Bug 1828457: bump(*): vendor update	2020-07-22 15:34:08 UTC
Github	openshift cluster-kube-scheduler-operator pull 243	None	closed	Bug 1828457: bump(*): vendor update	2020-07-22 15:34:08 UTC
Github	openshift cluster-openshift-apiserver-operator pull 358	None	closed	Bug 1828457: bump(*): vendor update	2020-07-22 15:34:08 UTC
Github	openshift service-ca-operator pull 118	None	closed	Bug 1828457: bump(*): vendor update	2020-07-22 15:34:08 UTC
Red Hat Product Errata	RHBA-2020:2409	None	None	None	2020-07-13 17:32:09 UTC

Description Michal Fojtik 2020-04-27 18:06:56 UTC

Description of problem:

All library-go based operators in OpenShift use the "events.Recorder" wrapper, which provides a convenient way to send events to Kubernetes. This wrapper uses the default Kubernetes event recorder and broadcaster, which queue events and, by default, correlate similar events into a single event (so that we don't create too many events by accident).

However, library-go based operators send many more events than Kubernetes components do, because events provide a useful timeline of what is going on in the system.

We need to tweak the correlator options to allow a higher QPS and BurstSize, so that more events pass through the event correlator/aggregator. Additionally, upstream correlates events based on "reason"; we also need to correlate based on "message", because we might have events with the same reason but different messages, and we don't want to lose these events.
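To illustrate the second point, here is a minimal sketch of how the grouping key affects which events get merged. It uses only the Go standard library and does not call client-go; the Event type, the key functions, and the sample messages are hypothetical and are not the code from the library-go change.

```go
package main

import "fmt"

// Event is a hypothetical, simplified stand-in for a Kubernetes event.
type Event struct {
	Component string
	Reason    string
	Message   string
}

// keyByReason groups events by component and reason only, which is the
// behaviour this bug describes for the upstream correlator.
func keyByReason(e Event) string {
	return e.Component + "/" + e.Reason
}

// keyByReasonAndMessage also includes the message, so events that share a
// reason but carry different messages land in different groups.
func keyByReasonAndMessage(e Event) string {
	return e.Component + "/" + e.Reason + "/" + e.Message
}

// countGroups returns how many distinct groups the key function produces.
func countGroups(events []Event, key func(Event) string) int {
	seen := map[string]struct{}{}
	for _, e := range events {
		seen[key(e)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Sample data invented for illustration.
	events := []Event{
		{"kube-apiserver-operator", "OperatorStatusChanged", "Degraded changed from False to True"},
		{"kube-apiserver-operator", "OperatorStatusChanged", "Progressing changed from False to True"},
		{"kube-apiserver-operator", "OperatorStatusChanged", "Degraded changed from False to True"},
	}
	fmt.Println(countGroups(events, keyByReason))           // 1
	fmt.Println(countGroups(events, keyByReasonAndMessage)) // 2
}
```

With the reason-only key, all three sample events collapse into one group, so the "Progressing" message would be merged away. With the message included, it survives as its own group.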

Version-Release number of selected component (if applicable):

4.5

How reproducible:

Make the operator produce a lot of events; the events should not be correlated.
Additionally, this can be verified by comparing the 4.4 "events.json" available in CI artifacts with the 4.5 "events.json". The number of events in 4.5 should be 30-40% higher.
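The events.json comparison could be scripted roughly as follows. This is a sketch only: it assumes each events.json is a Kubernetes-style list with a top-level "items" array, which this report does not confirm, and the inline JSON stands in for the real CI artifacts. The helper names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// eventList assumes events.json has a top-level "items" array (an assumption).
type eventList struct {
	Items []json.RawMessage `json:"items"`
}

// countEvents returns the number of entries in the "items" array.
func countEvents(data []byte) (int, error) {
	var l eventList
	if err := json.Unmarshal(data, &l); err != nil {
		return 0, err
	}
	return len(l.Items), nil
}

// percentIncrease returns how much larger after is than before, in percent.
// It returns 0 when before is 0 to avoid dividing by zero.
func percentIncrease(before, after int) float64 {
	if before == 0 {
		return 0
	}
	return float64(after-before) * 100 / float64(before)
}

func main() {
	// Placeholder data; in practice, read the two files from the CI artifacts.
	old := []byte(`{"items":[{},{},{},{},{}]}`)
	cur := []byte(`{"items":[{},{},{},{},{},{},{}]}`)

	before, err := countEvents(old)
	if err != nil {
		panic(err)
	}
	after, err := countEvents(cur)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d -> %d events (%.0f%% increase)\n", before, after, percentIncrease(before, after))
}
```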

Steps to Reproduce:
1.
2.
3.

Actual results:

Similar events are being correlated and lost.
Only 30 events are allowed per minute, per component.

Expected results:

For operators, similar events should not be correlated based on reason alone.
More than 30 events should be allowed per minute, per component.

Additional info:

Comment 1 Michal Fojtik 2020-04-27 18:07:28 UTC

library-go change: https://github.com/openshift/library-go/pull/777

Comment 5 Ke Wang 2020-05-15 09:00:48 UTC

Verified with OCP build 4.5.0-0.nightly-2020-05-14-231228.

Force a redeployment of the kube-apiserver pods:
$ oc patch kubeapiservers/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "just a forced test01" } ]'

Wait a while; after the related events have fired, count the openshift-kube-apiserver events:
$ oc get events -n openshift-kube-apiserver | awk '{ print $1}'| sort | uniq -c | sort -k 1
      1 17m
      1 2m
      1 2m12s
      1 2m16s
      1 2m2s
      1 3m54s
      1 4m18s
      1 4m28s
     16 28m
      1 91s
      1 93s
      1 LAST
      2 119s
      2 2m13s
     23 75m
      2 4m27s
     26 23m
     26 80m
      2 81m
     28 26m
     28 78m
      2 94s
     29 57m
      3 112s
     31 60m
      3 22m
      3 25m
      3 3m11s
      3 3m55s
      3 43s
      3 4m20s
      3 54m
      3 56m
      3 59m
      3 72m
      3 79m
     40 29m
      4 21m
      4 24m
      4 27m
      4 76m
     48 62m
      7 61m
      7 74m
      7 77m
      8 4m17s
      8 4m21s
      9 109s
      9 113s

From the above output, we can see 31 events at 60m, 40 events at 29m, and 48 events at 62m, so more than 30 events per minute are now allowed for the kube-apiserver component.
The results were as expected. Moving to VERIFIED.
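The check above can also be done mechanically. The following sketch assumes input in the "<count> <age>" shape produced by the uniq -c pipeline in this comment, and reports the age buckets with more than a given number of events. The function name is hypothetical, and the sample lines are copied from the output above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// bucketsOver returns the age buckets whose event count exceeds limit.
// Lines that are not "<count> <age>" are skipped.
func bucketsOver(output string, limit int) []string {
	var over []string
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		n, err := strconv.Atoi(fields[0])
		if err != nil {
			continue
		}
		if n > limit {
			over = append(over, fields[1])
		}
	}
	return over
}

func main() {
	// A few lines taken from the output in this comment.
	sample := "     29 57m\n     31 60m\n     40 29m\n     48 62m\n      1 LAST\n"
	fmt.Println(bucketsOver(sample, 30)) // [60m 29m 62m]
}
```

Note that the age column is rounded to whole minutes for older events, so each bucket only approximates the number of events recorded in that minute.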

Comment 6 errata-xmlrpc 2020-07-13 17:31:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
