Bug 1613722 - Eventrouter creates duplicated events every 30 min with verb UPDATE
Summary: Eventrouter creates duplicated events every 30 min with verb UPDATE
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 3.11.z
Assignee: Noriko Hosoi
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1641166
TreeView+ depends on / blocked
 
Reported: 2018-08-08 08:32 UTC by Takayoshi Kimura
Modified: 2018-11-20 03:11 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The eventrouter intentionally generates duplicate event logs rather not to lose them. Fix: Extended the elasticsearch_genid plugin to elasticsearch_genid_ext so that it takes the alt_key and alt_tag. If a log message has a tag matched the alt_tag value, it uses the alt_key value as the elasticsearch primary key. We could specify a field which is shared among the duplicate events to alt_key, which eliminates the duplicate events from the elasticsearch. Sample filter using elasticsearch_genid_ext @type elasticsearch_genid_ext hash_id_key viaq_msg_id alt_key kubernetes.event.metadata.uid alt_tags "#{ENV['GENID_ALT_TAG'] || 'kubernetes.var.log.containers.kube-eventrouter-*.** kubernetes.journal.container._default_.kubernetes.event'}" </filter> Result: There's no duplicate event logs indexed in the elasticsearch.
Clone Of:
: 1641166 (view as bug list)
Environment:
Last Closed: 2018-11-20 03:10:43 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3537 None None None 2018-11-20 03:11:48 UTC

Description Takayoshi Kimura 2018-08-08 08:32:45 UTC
Description of problem:

Eventrouter produces duplicated events every 30 min with verb UPDATE. Looking at the log, there is no diff between produced "event" and "old_event" contents. Same event appears multiple times with different timestamps in the ElasticSearch and it's confusing.

> 2018-08-06T08:08:02.641161894Z {"verb":"UPDATED","event":{"metadata":{"name":"nginx-host-2.15483c87846c7c08","namespace":"test-nginx","selfLink":"/api/v1/namespaces/test-nginx/events/nginx-host-2.15483c87846c7c08","uid":"4fb640d6-994e-11e8-a346-001a4a1601d0","resourceVersion":"49254008","creationTimestamp":"2018-08-06T07:57:05Z"},"involvedObject":{"kind":"ReplicationController","namespace":"test-nginx","name":"nginx-host-2","uid":"4c837dab-994e-11e8-a346-001a4a1601d0","apiVersion":"v1","resourceVersion":"49254003"},"reason":"SuccessfulCreate","message":"Created pod: nginx-host-2-hg8sl","source":{"component":"replication-controller"},"firstTimestamp":"2018-08-06T07:57:05Z","lastTimestamp":"2018-08-06T07:57:05Z","count":1,"type":"Normal"},"old_event":{"metadata":{"name":"nginx-host-2.15483c87846c7c08","namespace":"test-nginx","selfLink":"/api/v1/namespaces/test-nginx/events/nginx-host-2.15483c87846c7c08","uid":"4fb640d6-994e-11e8-a346-001a4a1601d0","resourceVersion":"49254008","creationTimestamp":"2018-08-06T07:57:05Z"},"involvedObject":{"kind":"ReplicationController","namespace":"test-nginx","name":"nginx-host-2","uid":"4c837dab-994e-11e8-a346-001a4a1601d0","apiVersion":"v1","resourceVersion":"49254003"},"reason":"SuccessfulCreate","message":"Created pod: nginx-host-2-hg8sl","source":{"component":"replication-controller"},"firstTimestamp":"2018-08-06T07:57:05Z","lastTimestamp":"2018-08-06T07:57:05Z","count":1,"type":"Normal"}}
> 2018-08-06T08:38:02.639198331Z {"verb":"UPDATED","event":{"metadata":{"name":"nginx-host-2.15483c87846c7c08","namespace":"test-nginx","selfLink":"/api/v1/namespaces/test-nginx/events/nginx-host-2.15483c87846c7c08","uid":"4fb640d6-994e-11e8-a346-001a4a1601d0","resourceVersion":"49254008","creationTimestamp":"2018-08-06T07:57:05Z"},"involvedObject":{"kind":"ReplicationController","namespace":"test-nginx","name":"nginx-host-2","uid":"4c837dab-994e-11e8-a346-001a4a1601d0","apiVersion":"v1","resourceVersion":"49254003"},"reason":"SuccessfulCreate","message":"Created pod: nginx-host-2-hg8sl","source":{"component":"replication-controller"},"firstTimestamp":"2018-08-06T07:57:05Z","lastTimestamp":"2018-08-06T07:57:05Z","count":1,"type":"Normal"},"old_event":{"metadata":{"name":"nginx-host-2.15483c87846c7c08","namespace":"test-nginx","selfLink":"/api/v1/namespaces/test-nginx/events/nginx-host-2.15483c87846c7c08","uid":"4fb640d6-994e-11e8-a346-001a4a1601d0","resourceVersion":"49254008","creationTimestamp":"2018-08-06T07:57:05Z"},"involvedObject":{"kind":"ReplicationController","namespace":"test-nginx","name":"nginx-host-2","uid":"4c837dab-994e-11e8-a346-001a4a1601d0","apiVersion":"v1","resourceVersion":"49254003"},"reason":"SuccessfulCreate","message":"Created pod: nginx-host-2-hg8sl","source":{"component":"replication-controller"},"firstTimestamp":"2018-08-06T07:57:05Z","lastTimestamp":"2018-08-06T07:57:05Z","count":1,"type":"Normal"}}
> 2018-08-06T09:08:02.640141857Z {"verb":"UPDATED","event":{"metadata":{"name":"nginx-host-2.15483c87846c7c08","namespace":"test-nginx","selfLink":"/api/v1/namespaces/test-nginx/events/nginx-host-2.15483c87846c7c08","uid":"4fb640d6-994e-11e8-a346-001a4a1601d0","resourceVersion":"49254008","creationTimestamp":"2018-08-06T07:57:05Z"},"involvedObject":{"kind":"ReplicationController","namespace":"test-nginx","name":"nginx-host-2","uid":"4c837dab-994e-11e8-a346-001a4a1601d0","apiVersion":"v1","resourceVersion":"49254003"},"reason":"SuccessfulCreate","message":"Created pod: nginx-host-2-hg8sl","source":{"component":"replication-controller"},"firstTimestamp":"2018-08-06T07:57:05Z","lastTimestamp":"2018-08-06T07:57:05Z","count":1,"type":"Normal"},"old_event":{"metadata":{"name":"nginx-host-2.15483c87846c7c08","namespace":"test-nginx","selfLink":"/api/v1/namespaces/test-nginx/events/nginx-host-2.15483c87846c7c08","uid":"4fb640d6-994e-11e8-a346-001a4a1601d0","resourceVersion":"49254008","creationTimestamp":"2018-08-06T07:57:05Z"},"involvedObject":{"kind":"ReplicationController","namespace":"test-nginx","name":"nginx-host-2","uid":"4c837dab-994e-11e8-a346-001a4a1601d0","apiVersion":"v1","resourceVersion":"49254003"},"reason":"SuccessfulCreate","message":"Created pod: nginx-host-2-hg8sl","source":{"component":"replication-controller"},"firstTimestamp":"2018-08-06T07:57:05Z","lastTimestamp":"2018-08-06T07:57:05Z","count":1,"type":"Normal"}}
> 2018-08-06T09:38:02.640618367Z {"verb":"UPDATED","event":{"metadata":{"name":"nginx-host-2.15483c87846c7c08","namespace":"test-nginx","selfLink":"/api/v1/namespaces/test-nginx/events/nginx-host-2.15483c87846c7c08","uid":"4fb640d6-994e-11e8-a346-001a4a1601d0","resourceVersion":"49254008","creationTimestamp":"2018-08-06T07:57:05Z"},"involvedObject":{"kind":"ReplicationController","namespace":"test-nginx","name":"nginx-host-2","uid":"4c837dab-994e-11e8-a346-001a4a1601d0","apiVersion":"v1","resourceVersion":"49254003"},"reason":"SuccessfulCreate","message":"Created pod: nginx-host-2-hg8sl","source":{"component":"replication-controller"},"firstTimestamp":"2018-08-06T07:57:05Z","lastTimestamp":"2018-08-06T07:57:05Z","count":1,"type":"Normal"},"old_event":{"metadata":{"name":"nginx-host-2.15483c87846c7c08","namespace":"test-nginx","selfLink":"/api/v1/namespaces/test-nginx/events/nginx-host-2.15483c87846c7c08","uid":"4fb640d6-994e-11e8-a346-001a4a1601d0","resourceVersion":"49254008","creationTimestamp":"2018-08-06T07:57:05Z"},"involvedObject":{"kind":"ReplicationController","namespace":"test-nginx","name":"nginx-host-2","uid":"4c837dab-994e-11e8-a346-001a4a1601d0","apiVersion":"v1","resourceVersion":"49254003"},"reason":"SuccessfulCreate","message":"Created pod: nginx-host-2-hg8sl","source":{"component":"replication-controller"},"firstTimestamp":"2018-08-06T07:57:05Z","lastTimestamp":"2018-08-06T07:57:05Z","count":1,"type":"Normal"}}

"oc get events -w" won't produce duplicated events while eventrouter is producing duplicated events, so it's likely some kind of filtering is missing in the eventrouter side.

To sum up:

- "oc get events" produces event A message 1 time
- The event A message appears multiple times every 30 min for a while, both in ElasticSearch and eventrouter logs

This is not a known duplicate messages issue when eventrouter is restarted. In this test, there is no restart of the eventrouter.

Version-Release number of selected component (if applicable):

v3.9.33

How reproducible:

Always

Steps to Reproduce:
1. Setup some deployments
2. Monitor "oc get events -o yaml --all-namespaces -w" and "oc get logs --timestapms logging-eventrouter"
3. find duplicate logs in logging-eventrouter log

Actual results:

- "oc get events" produces event A message 1 time
- The event A message appears multiple times every 30 min for a while, both in ElasticSearch and eventrouter logs

Expected results:

- Each event message appears once in "oc get event", eventrouter logs and ElasticSearch.

Additional info:

Comment 1 Takayoshi Kimura 2018-08-08 08:45:17 UTC
It could be related to "resync-interval" option by default 30m. Testing it with 2h now.

Comment 2 Takayoshi Kimura 2018-08-09 01:17:24 UTC
It seems to be the duplicated events are related to resync-interval 30m. With 2h I don't see duplicated events.

We can check if there's any duplicated events by this command:

> oc logs logging-eventrouter-XXXXX | sort | uniq -d

Now the question is, is this appropriate default value? Should it be synced with event lifetime, by default 2h? Do we need to set shorter value than 2h for watch timeout and/or API server load balancer timeout?

If the duplicated events are the current limitation, we should put it in the known issues section.

Comment 3 Takayoshi Kimura 2018-08-10 08:50:24 UTC
Also eventrouter currently produces duplicated events on "watch is too old" retries:

> 1 reflector.go:323] github.com/openshift/eventrouter/vendor/k8s.io/client-go/informers/factory.go:69: watch of *v1.Event ended with: The resourceVersion for the provided watch is too old

We need to fix eventrouter not to produce duplicated events on watch retries, or document the current behavior (produces a lot of duplicate events every watch retries, every 30m and/or "watch is too old").

Comment 6 Jeff Cantrill 2018-10-09 14:04:52 UTC
Noriko,

Do you know if the bump in event router version resolves the issues here?

Comment 9 Jeff Cantrill 2018-10-11 17:31:30 UTC
This [1] PR introduced a unique id to remove duplicates of all log messages.  We could resolve specifically with the eventrouter if we instead used the uid produced by kubernetes as the uid indexed into Elasticsearch.  Duplicates would then be ignored.  This would require changes to fluent to handle an 'event' pipeline differently.

[1] https://github.com/openshift/origin-aggregated-logging/pull/974/files

Comment 20 openshift-github-bot 2018-10-19 18:52:55 UTC
Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/a995992b85ce4fd53e1ea7354ebd95a7dd992174
Bug 1613722 - Eventrouter creates duplicated events every 30 min with verb UPDATE

Adding a plugin filter_elasticsearch_genid_ext.rb, which extends filter_
elasticsearch_genid.rb from fluent-plugin-elasticsearch.  The extended plugin
takes config parameter alt_key as shown in filter-post-genid.conf.
  <filter **>
    @type elasticsearch_genid_ext
    hash_id_key viaq_msg_id
    alt_key kubernetes.event.metadata.uid
    alt_tags "#{ENV['GENID_ALT_TAG'] || 'kubernetes.var.log.containers.kube-eventrouter-*.** kubernetes.journal.container._default_.kubernetes.event'}"
  </filter>
The plugin does -
  1. If the tag matches the alt_tags list or not,
  2. If it does, it checks the alt_key exists in the record hash as the key.
     As the key is kubernetes.event.metadata.uid, it checks
     record[kubernetes][event][metadata][uid] exists or not.
  3. If it exists, viaq_msg_id is set to the value.
  4. Otherwise, the generated hash value is set. This is the default behaviour.

Note: the value of kubernetes.event.metadata.uid is shared among the event logs
originated from the same one event.  By using the value for the elasticsearch
primary key, duplicated logs are not to be indexed in the elasticsearch.

For testing, code to check there is no duplicated value of kubernetes.event.
metadata.uid is added to test/eventrouter.sh.

Additionally, the tag of the event logs via journald used to be
  kubernetes.journal.container._kube-eventrouter_
which is changed to
  kubernetes.journal.container._default_.kubernetes.event

https://github.com/openshift/origin-aggregated-logging/commit/e3a7e7aef90f13a7ecad1ba1ab8caef5e4bb376a
Merge pull request #1424 from nhosoi/bz1613722

Bug 1613722 - Eventrouter creates duplicated events every 30 min with…

Comment 21 Noriko Hosoi 2018-10-22 05:50:45 UTC
Additional PR.
https://github.com/openshift/origin-aggregated-logging/pull/1434

Comment 22 openshift-github-bot 2018-10-23 19:17:07 UTC
Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/51b68c15eed1c6637d4ec998cf566411775212fc
Bug 1613722 - Eventrouter creates duplicated events every 30 min with verb UPDATE

Replacing the eventrouter journald tag
  kubernetes.journal.container._default_.kubernetes.event
with
  kubernetes.journal.container._default_.kubernetes-event

Plus bug fixes.

https://github.com/openshift/origin-aggregated-logging/commit/a36d32a554f695959e6f6ab47cf509922eec6935
Merge pull request #1434 from nhosoi/bz1613722

Bug 1613722 - Eventrouter creates duplicated events every 30 min with…

Comment 25 Qiaoling Tang 2018-11-02 09:25:57 UTC
Tested on ose-logging-fluentd-v3.11.36-1, waited for about 2 hours, no duplicated events created, so move this bug to verified.

Comment 27 errata-xmlrpc 2018-11-20 03:10:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537


Note You need to log in before you can comment on or make changes to this bug.