Bug 1466933

Summary: Spam to API server is causing too many etcd writes
Product: OpenShift Container Platform Reporter: Derek Carr <decarr>
Component: NodeAssignee: Derek Carr <decarr>
Status: CLOSED ERRATA QA Contact: Mike Fiedler <mifiedle>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.6.0CC: aos-bugs, eparis, jokerman, mifiedle, mmccomas, sross, xtian
Target Milestone: ---   
Target Release: 3.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Feature: control event spam to masters Reason: internal controllers can send large numbers of events when an object is unable to reach its desired state. in large clusters, this can cause excessive writes to etcd. Result: event client has been updated to protect against spamming master components. objects have an initial event budget of 25 events with a refill rate of 1 event every 5 minutes. this controls traffic to the masters and reduces writes to etcd.
Story Points: ---
Clone Of:
: 1467022 (view as bug list) Environment:
Last Closed: 2017-11-28 21:59:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1467022    

Description Derek Carr 2017-06-30 19:34:03 UTC
Description of problem:

A large cluster with either of the following:

1. a large amount of HPAs
2. a large amount of pods that are always unhealthy

is causing too many writes to the API server and inducing excessive snapshot of etcd.

Version-Release number of selected component (if applicable):
OCP 3.6.0

How reproducible:
Always

Steps to Reproduce:
See above

Actual results:
spam should be reduced to masters

Expected results:
HPA should only send status updates if status changes
Pods in perpetual failure status should not spam masters with events

Additional info:

Comment 1 Derek Carr 2017-06-30 20:28:27 UTC
Two fixes were identified:

https://github.com/openshift/origin/pull/14747

this sets a budget of events about an object.
per source+object event budget of 25 burst with refill of 1 every 5 minutes.
it will reduce the long tail of events sent about objects in perpetual failure states (pod in crash loop backoff, controllers denied by quota, etc.).

https://github.com/openshift/origin/pull/14529
HPA only sends status updates if status changes.

Comment 2 Eric Paris 2017-07-01 20:14:14 UTC
I'm using this bz to track only 
https://github.com/openshift/origin/pull/14747

The HPA fix is tracked in 1467022

Comment 4 DeShuai Ma 2017-07-05 08:06:22 UTC
Hi Mike, help verify the bug, thanks.

Comment 5 Mike Fiedler 2017-07-06 21:01:00 UTC
Verified during an event storm caused by misconfigured DNS.  api servers were not getting unique PUTs for each event recorded by the kubelet.   v3.6.126.1

Comment 9 errata-xmlrpc 2017-11-28 21:59:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188