Bug 1466933 - Spam to API server is causing too many etcd writes
Summary: Spam to API server is causing too many etcd writes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.7.0
Assignee: Derek Carr
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks: 1467022
Reported: 2017-06-30 19:34 UTC by Derek Carr
Modified: 2017-11-28 21:59 UTC
7 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Control event spam to masters.
Reason: Internal controllers can send large numbers of events when an object cannot reach its desired state. In large clusters, this can cause excessive writes to etcd.
Result: The event client has been updated to protect master components against event spam. Each object has an initial budget of 25 events, refilled at a rate of 1 event every 5 minutes. This limits traffic to the masters and reduces writes to etcd.
Clone Of:
: 1467022 (view as bug list)
Environment:
Last Closed: 2017-11-28 21:59:33 UTC
Target Upstream Version:




Links
System: Red Hat Product Errata
ID: RHSA-2017:3188
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
Last Updated: 2017-11-29 02:34:54 UTC

Description Derek Carr 2017-06-30 19:34:03 UTC
Description of problem:

A large cluster with either of the following:

1. a large number of HPAs
2. a large number of pods that are always unhealthy

causes too many writes to the API server and induces excessive etcd snapshots.

Version-Release number of selected component (if applicable):
OCP 3.6.0

How reproducible:
Always

Steps to Reproduce:
See above

Actual results:
Masters are spammed with HPA status updates and pod events; this spam should be reduced.

Expected results:
HPAs should send status updates only when the status changes.
Pods in a perpetual failure state should not spam masters with events.

Additional info:

Comment 1 Derek Carr 2017-06-30 20:28:27 UTC
Two fixes were identified:

https://github.com/openshift/origin/pull/14747

This sets a budget on the events sent about an object: each source+object pair gets a burst of 25 events, refilled at 1 event every 5 minutes.
It will reduce the long tail of events sent about objects in perpetual failure states (pods in crash loop backoff, controllers denied by quota, etc.).
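
The budget described above behaves like a token bucket kept per source+object pair. The following Python sketch only illustrates that idea and is not the code from the linked pull request; the names EventSpamFilter, TokenBucket, and allow, and the injectable clock, are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class TokenBucket:
    """Token bucket that starts full and refills at a fixed rate up to `burst`."""
    burst: int
    refill_seconds: float  # seconds needed to earn one token back
    tokens: float
    last: float            # timestamp of the previous check

    def allow(self, now: float) -> bool:
        # Credit tokens earned since the last check, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(float(self.burst),
                          self.tokens + elapsed / self.refill_seconds)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class EventSpamFilter:
    """Hypothetical filter: one budget per (source, object) pair."""

    def __init__(self, burst: int = 25, refill_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._burst = burst
        self._refill_seconds = refill_seconds
        self._clock = clock
        self._buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, source: str, obj: str) -> bool:
        """Return True if an event from `source` about `obj` may be sent."""
        now = self._clock()
        key = (source, obj)
        bucket = self._buckets.get(key)
        if bucket is None:
            # A pair seen for the first time starts with a full budget.
            bucket = TokenBucket(self._burst, self._refill_seconds,
                                 float(self._burst), now)
            self._buckets[key] = bucket
        return bucket.allow(now)
```

With the defaults, a pod stuck in crash loop backoff could emit 25 events immediately and then roughly one every 5 minutes, while events from other sources or about other objects draw on their own budgets. A real client would also need to bound the size of the bucket map, which this sketch leaves out.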

https://github.com/openshift/origin/pull/14529
With this change, the HPA sends status updates only if the status changes.
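
The idea behind the HPA fix can be sketched as comparing the newly computed status with the last one written and skipping the write when they are equal. This Python sketch is an illustration under assumptions, not the controller's actual code; HPAStatus, its fields, and sync_status are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class HPAStatus:
    """Hypothetical subset of an autoscaler's status fields."""
    current_replicas: int
    desired_replicas: int
    current_cpu_utilization: Optional[int] = None


def sync_status(old: HPAStatus, new: HPAStatus,
                write: Callable[[HPAStatus], None]) -> bool:
    """Call `write` only when the status changed; return True if it was called."""
    if new == old:
        # Nothing changed, so skip the API call (and the etcd write behind it).
        return False
    write(new)
    return True
```

Here write stands in for whatever call persists the status to the API server. With many HPAs in a steady state, most sync loops would then produce no write at all.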

Comment 2 Eric Paris 2017-07-01 20:14:14 UTC
I'm using this bz to track only 
https://github.com/openshift/origin/pull/14747

The HPA fix is tracked in 1467022

Comment 4 DeShuai Ma 2017-07-05 08:06:22 UTC
Hi Mike, please help verify this bug. Thanks.

Comment 5 Mike Fiedler 2017-07-06 21:01:00 UTC
Verified during an event storm caused by misconfigured DNS. The API servers were not getting a unique PUT for each event recorded by the kubelet. Verified on v3.6.126.1.

Comment 9 errata-xmlrpc 2017-11-28 21:59:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

