Description of problem:
I've seen this several times, but I'm not sure what triggers it. oc get events -w starts replaying events from 2 hours prior. In this case, I was running 50 concurrent builds in 50 different projects with oc get events -w running. Current events (timestamp = now) come in just fine, but interspersed with the current events are events from exactly (to the minute/second) 2 hours prior. The events really are old ones -- things that happened 2 hours ago. I've seen cases where they were from DCs that had since been deleted, so these are definitely not just bad timestamps. See below for an example.

Version-Release number of selected component (if applicable):
3.2.0.15

How reproducible:
Unknown

Steps to Reproduce:
1. Create a 4-node cluster on AWS: master, infra, 2 builder nodes
2. Create 50 projects with the django-example app in each
3. Run 50 concurrent builds (all apps, all projects)
4. Run oc get events --all-namespaces -w
5. If the problem triggers, old events from 2 hours prior will be intermixed with current events.
Actual results:

Expected results:

Additional info:
jkl28 2016-04-15 10:13:31 -0400 EDT 2016-04-15 10:13:31 -0400 EDT 1 django-example-3-deploy Pod Warning FailedSync {kubelet ip-172-31-45-163.us-west-2.compute.internal} Error syncing pod, skipping: failed to "TeardownNetwork" for "django-example-3-deploy_jkl28" with TeardownNetworkError: "Failed to teardown network for pod \"875923b7-0313-11e6-8b99-02038bdb70bb\" using network plugins \"redhat/openshift-ovs-multitenant\": exit status 1"
jkl23 2016-04-15 10:13:32 -0400 EDT 2016-04-15 10:13:32 -0400 EDT 1 django-example-3-deploy Pod Warning FailedSync {kubelet ip-172-31-45-163.us-west-2.compute.internal} Error syncing pod, skipping: failed to "TeardownNetwork" for "django-example-3-deploy_jkl23" with TeardownNetworkError: "Failed to teardown network for pod \"8d143f6f-0313-11e6-8b99-02038bdb70bb\" using network plugins \"redhat/openshift-ovs-multitenant\": exit status 1"
jkl26 2016-04-15 08:13:34 -0400 EDT 2016-04-15 08:13:34 -0400 EDT 1 django-example-1-hrge5 Pod spec.containers{django-example} Normal Pulling {kubelet ip-172-31-45-164.us-west-2.compute.internal} pulling image "172.25.10.176:5000/jkl26/django-example@sha256:b5295d79ae45a8e28c797b235ee3785dbf6d4059eac72d61efcbd4b4e054ac14"
jkl34 2016-04-15 08:13:34 -0400 EDT 2016-04-15 08:13:34 -0400 EDT 1 django-example-1-build Pod spec.containers{sti-build} Normal Pulled {kubelet ip-172-31-45-164.us-west-2.compute.internal} Container image "registry.qe.openshift.com/openshift3/ose-sti-builder:v3.2.0.14" already present on machine
jkl14 2016-04-15 10:13:35 -0400 EDT 2016-04-15 10:13:35 -0400 EDT 1 django-example-3 ReplicationController Normal SuccessfulDelete {replication-controller } Deleted pod: django-example-3-1idhf
jkl14 2016-04-15 10:13:35 -0400 EDT 2016-04-15 10:13:35 -0400 EDT 1 django-example-3-1idhf Pod spec.containers{django-example} Normal Killing {kubelet ip-172-31-45-164.us-west-2.compute.internal} Killing container with docker id 3454ed728395: Need to kill pod.
jkl24 2016-04-15 08:13:36 -0400 EDT 2016-04-15 08:13:36 -0400 EDT 1 django-example-1-67j7b Pod spec.containers{django-example} Normal Pulling {kubelet ip-172-31-45-163.us-west-2.compute.internal} pulling image "172.25.10.176:5000/jkl24/django-example@sha256:1aba63aeab85314442eef7bd2b5befcfdd90d7c504f1a8520657dca19f796a62"
jkl42 2016-04-15 10:13:36 -0400 EDT 2016-04-15 10:13:36 -0400 EDT 1 django-example-2 ReplicationController Normal SuccessfulDelete {replication-controller } Deleted pod: django-example-2-760gm
jkl42 2016-04-15 10:13:37 -0400 EDT 2016-04-15 10:13:37 -0400 EDT 1 django-example-2-760gm Pod spec.containers{django-example} Normal Killing {kubelet ip-172-31-45-163.us-west-2.compute.internal} Killing container with docker id 6e892f273928: Need to kill pod.
I set kubernetesMasterConfig/apiServerArguments/event-ttl to 1h and the old events appearing in the watch were 1 hour old instead of 2. It appears that the watch is picking up expiring events as well as new ones.
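For reference, the experiment above corresponds to a master-config.yaml fragment along these lines (a sketch; apiServerArguments values are string lists passed through to the apiserver, so this maps to --event-ttl=1h):

```yaml
# Sketch of the master-config.yaml change used for the test above.
# Each apiServerArguments entry is a list of string values.
kubernetesMasterConfig:
  apiServerArguments:
    event-ttl:
    - "1h"
```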
I have a theory on this: the garbage collection of the event is actually triggering a watch event on the... event. Taking a look.
After thinking about this, I don't think this is a bug. Take watching pods (a resource type just like events) as an example. If you are watching pods and a pod is deleted, the watch will return a watch event with type DELETED: https://github.com/kubernetes/kubernetes/blob/master/pkg/watch/watch.go#L43 kubectl -w just prints the resource whenever a watch event of any type is returned. To see only new events, one would need to do a watch (via the API) and filter for watch events that have type ADDED, functionality that kubectl doesn't currently have. Adding a type filter for watches to kubectl would be a new feature. Mike, is that sufficient explanation to close?
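To illustrate, here is a minimal sketch of the filtering described above. The stream shape mirrors the Kubernetes watch API ({"type": ..., "object": ...}), but the helper function and the sample data are illustrative, not part of kubectl:

```python
# Sketch: keep only newly created events (type ADDED) from a watch stream,
# dropping the DELETED notifications emitted when event-ttl garbage-collects
# old events. filter_watch_events and the sample stream are hypothetical.

def filter_watch_events(stream, wanted_type="ADDED"):
    """Yield the objects of watch events whose type matches wanted_type."""
    for watch_event in stream:
        if watch_event["type"] == wanted_type:
            yield watch_event["object"]

stream = [
    {"type": "ADDED",
     "object": {"name": "cakephp-example-51-build", "reason": "Scheduled"}},
    {"type": "DELETED",  # TTL expiry of a 2-hour-old event
     "object": {"name": "cakephp-example-11-build", "reason": "Pulling"}},
    {"type": "ADDED",
     "object": {"name": "django-example-51-build", "reason": "Pulling"}},
]

# Only the two current events survive; the TTL-expiry DELETED one is dropped.
for obj in filter_watch_events(stream):
    print(obj["name"], obj["reason"])
```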
I understand that what I am seeing is an event representing the deletion of an event. The problem I still have is that it looks exactly like the original event; there is nothing to indicate that it is an event that is "going away". Plus, a straight oc/kubectl get events doesn't show them. Any chance of updating the oc/kubectl get events watch to filter them? Maybe have the watch use the same filters the initial get events has? My use case for get events -w is always debugging things that are happening now, and having old events interleaved with new ones is quite confusing. Example from today:
2016-06-16 14:52:47 -0400 EDT 2016-06-16 14:52:47 -0400 EDT 1 cakephp-example-51-build Pod Normal Scheduled {default-scheduler } Successfully assigned cakephp-example-51-build to ip-172-31-8-197.ec2.internal
2016-06-16 12:52:47 -0400 EDT 2016-06-16 12:52:47 -0400 EDT 1 cakephp-example-11-build Pod spec.containers{sti-build} Normal Pulling {kubelet ip-172-31-8-200.ec2.internal} pulling image "registry.qe.openshift.com/openshift3/ose-sti-builder:v3.2.1.1"
2016-06-16 12:52:47 -0400 EDT 2016-06-16 12:52:47 -0400 EDT 1 django-example-11-build Pod spec.containers{sti-build} Normal Pulling {kubelet ip-172-31-8-199.ec2.internal} pulling image "registry.qe.openshift.com/openshift3/ose-sti-builder:v3.2.1.1"
2016-06-16 14:52:48 -0400 EDT 2016-06-16 14:52:48 -0400 EDT 1 django-example-51-build Pod spec.containers{sti-build} Normal Pulling {kubelet ip-172-31-8-198.ec2.internal} pulling image "registry.qe.openshift.com/openshift3/ose-sti-builder:v3.2.1.1"
kubectl uses a per-resource-type printer function to output the object returned in a watch event. In the case of pods, the pod has a status field that the printer shows (Creating, Running, Terminating). However, an event resource has no "status" per se; it is just a record of what happened. The initial listing that kubectl get -w does is a separate API call from the watch. If you add --v=9 you'll see one API call for the list and a separate one for the watch. The only thing I can think of is to add a new flag to kubectl get, like --watch-type=[ADDED|MODIFIED|DELETED], that would print only watch events of a given type.
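A small sketch of why a deleted event row is indistinguishable from a new one: the printer function only receives the object, not the watch-event type, so ADDED and DELETED render identically. The helper and sample data here are illustrative, not kubectl's actual printer:

```python
# Sketch: a column printer sees only the event object, so the watch-event
# type (ADDED vs DELETED) is invisible in the output. Names are illustrative.

def print_event_row(event_obj):
    """Render an event object the way a column printer would."""
    return "{lastSeen}  {count}  {name}  {reason}".format(**event_obj)

event = {"lastSeen": "2016-06-16 12:52:47", "count": 1,
         "name": "cakephp-example-11-build", "reason": "Pulling"}

# Whether this object arrived via an ADDED watch event or a TTL-expiry
# DELETED one, the printed row is byte-for-byte the same:
row_from_added = print_event_row(event)
row_from_deleted = print_event_row(event)
assert row_from_added == row_from_deleted
```

This is why the reporter's proposed fix (filtering by watch-event type before printing) has to happen upstream of the printer.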
Great digging Seth! Somewhat related issue: https://github.com/kubernetes/kubernetes/issues/17612
This is working as designed. Feel free to open an RFE to request that you can filter by watch-type per Seth's prior comment. Closing this bug, and we can prioritize a future RFE (if opened).