Bug 1464585

Summary: Pounding etcd for events?recursive=true
Product: OpenShift Container Platform    Reporter: Eric Paris <eparis>
Component: Master                        Assignee: Dan Mace <dmace>
Status: CLOSED WONTFIX                   QA Contact: DeShuai Ma <dma>
Severity: unspecified                    Docs Contact:
Priority: unspecified
Version: 3.6.0                           CC: aos-bugs, deads, decarr, jokerman, mmccomas, sjenning, trankin
Target Milestone: ---   
Target Release: 3.6.z   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2017-08-29 13:38:25 UTC Type: Bug
Attachments:
  journalctl -u atomic-openshift-master-api | grep '/events' | grep GET | grep -v namespaces

Description Eric Paris 2017-06-23 20:14:14 UTC
# oc version
oc v3.5.5.19

# etcdctl --version
etcdctl version: 3.1.9
API version: 2

etcd is running with debug logging. The logs are heavily rate limited (95% of all log lines are dropped). In the logs we do have, we can see that one apiserver called GET on `/v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=[0-9]*` 7 times per second. Another apiserver was seen requesting the same key 30 times per second.

Extrapolating based on the number of dropped log lines, that works out to roughly 117/sec from the one apiserver and 501/sec from the other.
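
For reference, this is roughly how I'm bucketing the surviving log lines per second. It's a sketch: it assumes the etcd debug logs are readable via journald under an `etcd` unit and that the request path appears verbatim in each line, so the unit name and grep pattern may need adjusting.

# per-second counts of the events-watch GETs that survived rate limiting
journalctl -u etcd --since "1 hour ago" -o short-iso \
  | grep 'kubernetes.io/events?recursive=true&wait=true' \
  | awk '{print $1}' \
  | uniq -c \
  | sort -rn \
  | head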

There is another oddity in these requests. As an example, we can see in the logs that one apiserver did a GET on:

/v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=658036126

four times, with the SAME waitIndex. Because these requests were so close together in time, it's doubtful the same extrapolation applies. That's just one example; there were hundreds of cases where the same apiserver issued a GET for the same events key at the same waitIndex. Sometimes 4 times, sometimes 3, sometimes 2, sometimes 1.
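
For context on why the repeats look wrong, this is the etcd v2 long-poll each of these GETs corresponds to (the endpoint here is an assumption; against a real master you'd use the https etcd URL plus the --cacert/--cert/--key files from master-config.yaml):

# a single well-behaved watcher blocks here until the change at or after index
# 658036126 arrives, then re-issues the GET with the next index, so each index
# should normally show up once per watcher; the same waitIndex repeating 2-4
# times could mean the watch is being retried or re-established before it advances
curl -s 'http://127.0.0.1:2379/v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=658036126'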

But something is calling GET a LOT. How can I figure out what is watching all events like this and find a way to calm it down?

Comment 1 Eric Paris 2017-06-23 22:01:31 UTC
Created attachment 1291284 [details]
journalctl -u atomic-openshift-master-api | grep '/events' | grep GET | grep -v namespaces

Logs collected to try to see who is hitting the apiserver asking for all events

Comment 2 Jordan Liggitt 2017-06-28 04:36:01 UTC
Transferring to Dan while I'm out. Adding Derek in case he has likely culprits in mind.

Comment 3 Dan Mace 2017-06-30 14:48:55 UTC
Just a quick update here. There seem to be two problems I'm chasing:

1. Reflector-driven controller watches are being reset every 30s rather than every 5-10m, possibly due to external factors (e.g. ELB) which I'm still investigating.
2. During the watch resets (1), the `events` watch established by the un-idling controller appears to be retried immediately many times before settling.

Understanding more about (2) also reveals that the un-idling controller is probably fetching and caching far more `event` resources than it needs in order to operate. That said, (1) still seems like behavior that needs to be explained ASAP. It's too soon to say what the best short-term path forward for the un-idling controller is.
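
To put a number on the ~30s resets, I'm watching the spacing of the events watch requests in the master-api journal, building on the grep from the attachment. This is a sketch: it assumes the watch GETs (with watch=true in the query string) are logged at the current log level, so the pattern may need tweaking.

# print one timestamp per events watch request; roughly one line every 30s
# would match problem (1) above
journalctl -u atomic-openshift-master-api -o short-iso \
  | grep '/events' | grep GET | grep 'watch=true' | grep -v namespaces \
  | awk '{print $1}'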

Comment 4 Dan Mace 2017-07-05 21:10:05 UTC
Upon further inspection, the massive number of GETs seems to simply be the result of the unidling controller's reflector watch on the events resource: events are generated constantly, so the reflector issues a new GET to fetch each new record in turn. There are likely some unidentified cascading effects here, or some reflector behavior, that might explain the duplicate resourceVersion issue. It's also possible that the etcd history limit interacts with bogged-down event processing, requiring re-lists to establish a new watch resourceVersion. However, I have no evidence of that yet.
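
Two quick checks that could support or rule out that theory, sketched below. The oc flags are standard, but the etcd endpoint is an assumption; on a real master you'd use the https etcd URL and client certs from master-config.yaml.

# 1) how fast are events being produced cluster-wide? every new event advances
#    the watch and triggers another long-poll GET against etcd
timeout 60 oc get events --all-namespaces --watch-only --no-headers | wc -l

# 2) how far ahead is etcd's current index? if the watch's waitIndex falls more
#    than ~1000 events behind it, etcd clears the history and the reflector has
#    to re-list to get a fresh index
curl -s -D - -o /dev/null \
  'http://127.0.0.1:2379/v2/keys/kubernetes.io/events?recursive=true' \
  | grep -i x-etcd-index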

In the meantime, it might be more fruitful to refactor the unidling controller to stop using event list/watches entirely.

Comment 5 Dan Mace 2017-07-06 14:52:17 UTC
https://github.com/openshift/origin/issues/15079 opened to track refactoring the unidling mechanism.

Comment 6 Eric Paris 2017-08-29 13:38:25 UTC
Closing this bug; we can pursue the refactor in the issue.