Red Hat Bugzilla – Bug 1464585
Pounding etcd for events?recursive=true
Last modified: 2017-08-29 09:38:25 EDT
# oc version
# etcdctl --version
etcdctl version: 3.1.9
API version: 2
etcd is running with debug logging. The logs are heavily rate limited (roughly 95% of all log lines are dropped). In the logs we do have, we can see that one apiserver called GET on `/v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=[0-9]*` 7 times per second. Another apiserver was seen requesting the same key 30 times per second.
Extrapolating from the fraction of dropped logs, that's roughly 117/sec from the first apiserver and 501/sec from the second.
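The extrapolation above is simple arithmetic: divide the observed rate by the fraction of logs that survived the rate limiter. A minimal sketch; the drop fraction used here (~0.94, i.e. roughly the "95% dropped" quoted above) is an assumption, since the exact fraction dropped per journald interval isn't known:

```shell
# true_rate ≈ observed_rate / (1 - drop_fraction), rounded to nearest integer.
# The drop fraction 0.94 is an assumed value chosen to illustrate the math.
extrapolate() {
  awk -v o="$1" -v d="$2" 'BEGIN { printf "%d\n", o / (1 - d) + 0.5 }'
}
extrapolate 7 0.94    # observed 7/sec extrapolates to ~117/sec
```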
There is another oddity in these requests. As an example, we see in the logs that one apiserver did a GET on:
four times, with the SAME waitIndex. Because these requests were so close together in time, it's doubtful the same extrapolation applies. That's just one example; there were hundreds of cases where the same apiserver issued GETs for the same events at the same waitIndex. Sometimes 4 times, sometimes 3, sometimes 2, sometimes 1.
But something is calling GET a LOT. How can I figure out what is watching all events like this and find a way to calm it down?
Created attachment 1291284 [details]
journalctl -u atomic-openshift-master-api | grep '/events' | grep GET | grep -v namespaces
Logs trying to see who is hitting the apiserver looking for all events
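To surface the repeated-waitIndex pattern from output like the attached `journalctl` grep, a pipeline along these lines works. The sample log lines below are fabricated stand-ins for the real journald output, used only to make the example self-contained:

```shell
# Count how many times each waitIndex was requested; duplicates float to the top.
# Sample data is illustrative, not actual apiserver log output.
cat > /tmp/events-sample.log <<'EOF'
GET /v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=4711
GET /v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=4711
GET /v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=4711
GET /v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=4712
EOF
grep -o 'waitIndex=[0-9]*' /tmp/events-sample.log | sort | uniq -c | sort -rn
```

On the real data, feed `journalctl -u atomic-openshift-master-api` through the same `grep -o | sort | uniq -c | sort -rn` tail instead of the sample file.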
Transferring to Dan while I'm out. Adding Derek in case he has likely culprits in mind.
Just a quick update here. There seem to be two problems I'm chasing:
1. Reflector-driven controller watches are being reset every 30s rather than every 5-10m, possibly due to external factors (e.g. ELB) which I'm still investigating.
2. During the watch resets (1), the `events` watch established by the un-idling controller appears to be retried immediately many times before settling.
Understanding more about (2) also reveals that the un-idling controller is probably fetching and caching far more `event` resources than it needs in order to operate. That said, (1) still seems like behavior that needs to be explained ASAP. It's too soon to say what the best short-term fix for the un-idling controller is.
Upon further inspection, the massive number of GETs seems to simply be the result of the unidling controller's reflector watch on the events resource: events are generated constantly, and the reflector issues a new GET to fetch each new record in turn. There may be some unidentified cascading effects, or some reflector behavior, that would explain the duplicate-resourceVersion issue. It's also possible that the etcd history limit interacts with bogged-down event processing, requiring re-lists to establish a new watch resourceVersion. However, I have no evidence of that yet.
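The one-GET-per-event behavior follows from how an etcd v2 watch works: each `GET ...?wait=true&waitIndex=N` returns a single event, and the client immediately re-issues the GET with the next index. A minimal simulation of that loop; `fake_etcd_watch` is a hypothetical stand-in for the real HTTP call, and here it pretends an event always exists at exactly the requested index:

```shell
# fake_etcd_watch: hypothetical stand-in for
#   GET /v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=$1
# It "delivers" an event whose modifiedIndex equals the requested waitIndex.
fake_etcd_watch() { echo "$1"; }

index=100; gets=0
while [ "$index" -le 104 ]; do        # 5 events arrive back to back
  ev=$(fake_etcd_watch "$index")      # each delivered event costs one GET
  gets=$((gets + 1))
  index=$((ev + 1))                   # next watch resumes just past it
done
echo "$gets GETs for 5 events"
```

Note also that etcd v2 keeps only a bounded event history (1000 events); if a waitIndex falls behind that window, the server answers with a 401 EventIndexCleared error and the client must re-list to get a fresh index, which would match the re-list theory above.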
In the meantime, it might be more fruitful to refactor the unidling controller to stop using event list/watches entirely.
https://github.com/openshift/origin/issues/15079 opened to track refactoring the unidling mechanism.
closing bug, we can push the refactor in the issue.