Bug 1466933

Summary:	Spam to API server is causing too many etcd writes
Product:	OpenShift Container Platform	Reporter:	Derek Carr <decarr>
Component:	Node	Assignee:	Derek Carr <decarr>
Status:	CLOSED ERRATA	QA Contact:	Mike Fiedler <mifiedle>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	3.6.0	CC:	aos-bugs, eparis, jokerman, mifiedle, mmccomas, sross, xtian
Target Milestone:	---
Target Release:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:	Feature: Control event spam to masters. Reason: Internal controllers can send large numbers of events when an object is unable to reach its desired state. In large clusters, this can cause excessive writes to etcd. Result: The event client has been updated to protect against spamming master components. Each object has an initial event budget of 25 events with a refill rate of 1 event every 5 minutes. This controls traffic to the masters and reduces writes to etcd.	Story Points:	---
Clone Of:
Clones:	1467022		Environment:
Last Closed:	2017-11-28 21:59:33 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1467022

Description Derek Carr 2017-06-30 19:34:03 UTC

Description of problem:

A large cluster with either of the following:

1. a large number of HPAs
2. a large number of pods that are always unhealthy

causes too many writes to the API server and induces excessive etcd snapshots.

Version-Release number of selected component (if applicable):
OCP 3.6.0

How reproducible:
Always

Steps to Reproduce:
See above

Actual results:
Masters are spammed with HPA status updates and with events about unhealthy pods; this spam should be reduced.

Expected results:
HPA should only send status updates if status changes
Pods in perpetual failure status should not spam masters with events

Additional info:

Comment 1 Derek Carr 2017-06-30 20:28:27 UTC

Two fixes were identified:

https://github.com/openshift/origin/pull/14747

This sets a budget of events about an object.
Each source+object pair gets an event budget with a burst of 25 and a refill of 1 event every 5 minutes.
It will reduce the long tail of events sent about objects in perpetual failure states (pods in crash loop backoff, controllers denied by quota, etc.).
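
The budget described above behaves like a token bucket keyed by source+object. The following is a minimal sketch of that idea, not the code from the pull request; the class name `EventBudget`, the `allow` method, and the key strings are hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass

BURST = 25                 # initial budget per source+object (from this report)
REFILL_INTERVAL_S = 300.0  # one event refilled every 5 minutes (from this report)


@dataclass
class Bucket:
    tokens: float  # events currently available
    last: float    # clock reading at the last refill


class EventBudget:
    """Hypothetical per source+object token bucket for event sends."""

    def __init__(self, burst=BURST, refill_interval_s=REFILL_INTERVAL_S,
                 clock=time.monotonic):
        self.burst = burst
        self.refill_interval_s = refill_interval_s
        self.clock = clock
        self.buckets = {}

    def allow(self, source, object_key):
        """Return True if an event about object_key from source may be sent."""
        now = self.clock()
        key = (source, object_key)
        bucket = self.buckets.get(key)
        if bucket is None:
            # A new source+object pair starts with the full burst.
            bucket = Bucket(tokens=float(self.burst), last=now)
            self.buckets[key] = bucket
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - bucket.last
        bucket.tokens = min(float(self.burst),
                            bucket.tokens + elapsed / self.refill_interval_s)
        bucket.last = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False  # budget exhausted: drop the event instead of sending it
```

With this sketch, a pod stuck in a failure loop would have its first 25 events sent and then be limited to roughly one event every 5 minutes, while events about other objects are unaffected because each pair has its own bucket.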

https://github.com/openshift/origin/pull/14529
HPA only sends status updates if status changes.
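
The "only update if changed" behavior can be sketched as a compare-before-write check. This is an illustrative sketch, not the controller's actual code; `HPAStatus`, its fields, and `update_status_if_changed` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HPAStatus:
    # Hypothetical subset of status fields, for illustration only.
    current_replicas: int
    desired_replicas: int
    cpu_utilization_percent: int


def update_status_if_changed(old, new, write):
    """Call write(new) only when the status differs from the last one sent.

    Returns True if a write was issued, False if it was skipped.
    """
    if old == new:
        return False  # nothing changed: no API call, no etcd write
    write(new)
    return True
```

Skipping the write when the status is identical means an HPA whose metrics are stable generates no API traffic on each sync.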

Comment 2 Eric Paris 2017-07-01 20:14:14 UTC

I'm using this bz to track only 
https://github.com/openshift/origin/pull/14747

The HPA fix is tracked in 1467022

Comment 4 DeShuai Ma 2017-07-05 08:06:22 UTC

Hi Mike, please help verify this bug. Thanks.

Comment 5 Mike Fiedler 2017-07-06 21:01:00 UTC

Verified on v3.6.126.1 during an event storm caused by misconfigured DNS. The API servers were not getting a unique PUT for each event recorded by the kubelet.

Comment 9 errata-xmlrpc 2017-11-28 21:59:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188