1464585 – Pounding etcd for events?recursive=true

Bug 1464585 - Pounding etcd for events?recursive=true

Summary: Pounding etcd for events?recursive=true

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	3.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	3.6.z
Assignee:	Dan Mace
QA Contact:	DeShuai Ma
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-06-23 20:14 UTC by Eric Paris
Modified:	2017-08-29 13:38 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-29 13:38:25 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
journalctl -u atomic-openshift-master-api \| grep '/events' \| grep GET \| grep -v namespaces (27.24 KB, text/plain) 2017-06-23 22:01 UTC, Eric Paris	no flags	Details
View All

Description Eric Paris 2017-06-23 20:14:14 UTC

# oc version
oc v3.5.5.19

# etcdctl --version
etcdctl version: 3.1.9
API version: 2

etcd is running with debug logging. The logs are highly rate limitted. (95% of all logs are dropped). In the logs we have we can see that 1 apiserver called GET on `/v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=[0-9]*` 7 times per second. Another apiserver was seen requesting the same key 30 times per second.

Extrapolate out based on the number of lost logs that's 117/sec from the one apiserver and 501/sec from the other.

There is another oddity on these requests. As an example we see in the logs that 1 apiserver did a GET on:

/v2/keys/kubernetes.io/events?recursive=true&wait=true&waitIndex=658036126

four times! With the SAME waitIndex. Because these requests were so temporally local it's doubtful the same extrapolation applies. That's just an example, but there were hundreds of times the same apiserver GET the same events at the same waitIndex. Sometimes 4 times, sometimes 3, sometimes 2, sometimes 1.

But something is calling GET a LOT. How can I figure out what is watching all events like this and find a way to calm it down?

Comment 1 Eric Paris 2017-06-23 22:01:31 UTC

Created attachment 1291284 [details]
journalctl -u atomic-openshift-master-api | grep '/events' | grep GET | grep -v namespaces

Logs trying to see who is hitting the apiserver looking for all events

Comment 2 Jordan Liggitt 2017-06-28 04:36:01 UTC

Tranferring to Dan while I'm out. Adding Derek in case he has likely culprits in mind.

Comment 3 Dan Mace 2017-06-30 14:48:55 UTC

Just a quick update here. There seem to be two problems I'm chasing:

1. Reflector-driven controller watches are being reset every 30s rather than every 5-10m, possibly due to external factors (e.g. ELB) which I'm still investigating.
2. During the watch resets (1), the `events` watch established by the un-idling controller appears to be retried immediately many times before settling.

Understanding more about (2) also reveals the un-idling controller is probably fetching and caching far more `event` resources than it needs to operate. That said, (1) still seems like a behavior that needs explained ASAP. Looking forward, it's too soon to say what the best way forward with the un-idling controller is in the short term.

Comment 4 Dan Mace 2017-07-05 21:10:05 UTC

Upon further inspection, the massive amount of GETs seems to simply be the result of the unidling controller's reflector watch on the events resource: events are generated constantly, resulting in the reflector issuing new GETs to fetch each new record in turn. There are likely some unidentified cascading effects here or some reflector behavior which might explain the duplicate resourceVersion issues. It's also possible that the etcd history limit interplays with bogged down event processing, requiring re-lists to establish a new watch resourceversion. However, I have no evidence of that yet.

In the meantime, it might be more fruitful to refactor the unidling controller to stop using event list/watches entirely.

Comment 5 Dan Mace 2017-07-06 14:52:17 UTC

https://github.com/openshift/origin/issues/15079 opened to track refactoring the unidling mechanism.

Comment 6 Eric Paris 2017-08-29 13:38:25 UTC

closing bug, we can push the refactor in the issue.

Note You need to log in before you can comment on or make changes to this bug.