Bug 1548104 - Duplicate elasticsearch entries increase as namespaces increase (constant message rate)
Summary: Duplicate elasticsearch entries increase as namespaces increase (constant message rate)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Jeff Cantrill
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-39
Duplicates: 1491401 (view as bug list)
Depends On:
Blocks: 1556896 1556897
 
Reported: 2018-02-22 17:56 UTC by Mike Fiedler
Modified: 2018-06-20 15:06 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Fluentd inserts documents (logs) into Elasticsearch using the bulk insert API but relies upon Elasticsearch to generate UUIDs for each document. It does not remove successfully indexed documents from the bulk payload when the bulk operation fails.
Consequence: The initial payload is resubmitted, and documents that were successfully indexed are submitted again, resulting in duplicate documents with different UUIDs.
Fix: Generate document IDs before submitting bulk insert requests.
Result: Elasticsearch disregards inserts of documents that already exist in the data store and inserts documents that do not.
Clone Of:
: 1556896 1556897 (view as bug list)
Environment:
Last Closed: 2018-03-28 14:30:19 UTC
Target Upstream Version:
Embargoed:


Attachments
logging-fluentd logs with debug set in <system> config (292.71 KB, application/x-gzip)
2018-02-22 17:57 UTC, Mike Fiedler


Links
Github openshift origin-aggregated-logging pull 974 (closed): Generate uuid on ingestion to reduce duplicates. Last updated 2020-06-17 16:23:17 UTC
Red Hat Product Errata RHBA-2018:0489. Last updated 2018-03-28 14:30:51 UTC

Description Mike Fiedler 2018-02-22 17:56:01 UTC
Description of problem:

This is extracted from comments 20 - 25 of https://bugzilla.redhat.com/show_bug.cgi?id=1541429.   I don't want to muddy that bz further so that it can focus on the original issue.   

This is also a possible duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1491401, but I want to make sure that more than just higher message rates can cause the duplicates.

At a constant message rate of 100 1Kb messages/second per node (per fluentd), increasing the number of projects those messages are spread across results in an increasing number of BulkIndexQueueFull errors in the logging-fluentd logs and an increased number of duplicate entries in ES due to retries of bulk operations.

The number of pods does not seem to be a factor.  A large number of pods logging across a lower number of projects works OK.  See details below.

Details
--------------------------------
200 compute nodes each with a logging-fluentd pod.  Logging is 3.9.0-0.42.0

200 pods (1 pod/node) in 200 namespaces logging at an aggregate rate of 100 1K mps/node
No BulkIndexQueueFull messages, no missing or extra messages in ES indices

400 pods (2/node) in 400 namespaces logging at an aggregate rate of 100 1K mps/node
Minimal BulkIndexQueueFull, no missing messages, no more than 10 extra messages per ES index - majority have exactly the correct number

1000 pods (5/node) in 1000 namespaces logging at an aggregate rate of 100 1K mps/node
Many BulkIndexQueueFull messages and many bulk rejects in the ES bulk queue stats.  Unlike the originally reported issue, retries are successful and there are ~20% extra/duplicate entries in the ES indices.

1000 pods (5/node) in 200 namespaces logging at an aggregate rate of 100 1K mps/node
Minimal BulkIndexQueueFull, no missing messages, no more than 10 extra messages per ES index - majority have exactly the correct number.   This looked very similar to the 400 namespace case.

Retries always eventually successful.  In no tests were there missing messages.
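
For reference, one way to read the ES bulk queue stats mentioned above is the _cat/thread_pool API from inside an ES pod. This is a hedged sketch: the label selector, certificate paths and column names are assumptions based on a standard logging 3.9 deployment (ES 2.x _cat API):

# Find the ES pods (label selector assumed)
oc -n logging get pods -l component=es

# Show bulk thread pool activity and rejections per ES node
oc -n logging exec <es-pod> -- curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  'https://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'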

Version-Release number of selected component (if applicable):  Logging 3.9.0-0.42.0.  


How reproducible:  Always


Steps to Reproduce:
1.  Deploy logging (see inventory below).  All logging component settings are OOTB with the exception of configuring LD_BIND_NOW=1 to work around https://bugzilla.redhat.com/show_bug.cgi?id=1544815
2.  Create 1 pod per compute node logging at 100 1K messages/second (a rough generator sketch follows these steps).   No or few BulkIndexQueueFull messages should be seen in logging-fluentd logs.   No duplicate index entries in ES
3.  Increase the number of projects and pods, but keep the overall message rate constant - same message throughput at fluentd and ES.
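
The exact load generator is not attached to this bug; as a rough stand-in for step 2 (illustration only, not the actual tool), a loop like this run inside each test pod writes roughly 100 lines of ~1KB per second to stdout, which fluentd picks up from the container logs:

while true; do
  for i in $(seq 1 100); do
    # one ~1KB line: timestamp plus 1000 filler characters
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%S)" "$(head -c 1000 < /dev/zero | tr '\0' 'x')"
  done
  sleep 1
done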

Actual results:

As the number of projects increases at a constant message rate, the number of BulkIndexQueueFull and duplicate index entries increases.


Expected results:

At a constant message rate, increasing the number of projects has no effect on errors.  Unrealistic?


Additional info:

3 ES nodes.   Each node is a VM with 40 vCPU and 160GB memory
217 compute nodes.   Each node is a VM with 8 vCPU and 128GB memory

[OSEv3:children]                                                      
masters                                                               
etcd                                                                  

[masters]                                                             
172.21.0.113                       
172.21.0.115                       
172.21.0.124                       
[etcd]                                                                
172.21.0.113                       
172.21.0.115                       
172.21.0.124                       

[OSEv3:vars]                                                          
deployment_type=openshift-enterprise                                  

openshift_deployment_type=openshift-enterprise                                                                                               
openshift_release=v3.9                                                
openshift_docker_additional_registries=registry.reg-aws.openshift.com                                                                        


openshift_logging_install_logging=true                                
openshift_logging_master_url=https://console.scale-ci.example.com:8443                                                                       
openshift_logging_master_public_url=https://console.scale-ci.example.com:8443                                                                
openshift_logging_kibana_hostname=kibana.apps.scale-ci.example.com    
openshift_logging_namespace=logging                                   
openshift_logging_image_prefix=registry.reg-aws.openshift.com:443/openshift3/                                                                
openshift_logging_image_version=v3.9.0-0.42.0                                 
openshift_logging_es_cluster_size=3                                   
openshift_logging_es_pvc_dynamic=true                                 
openshift_logging_es_pvc_size=250Gi                                   
openshift_logging_es_pvc_storage_class_name=gp2                                                                                              
openshift_logging_fluentd_read_from_head=false                                                                                               
openshift_logging_use_mux=false

Comment 1 Mike Fiedler 2018-02-22 17:57:07 UTC
Created attachment 1399494 [details]
logging-fluentd logs with debug set in <system> config

Comment 3 Rich Megginson 2018-02-22 18:28:10 UTC
Lukas or Peter - is there something we can do on the Elasticsearch side (more CPU, more RAM, larger bulk import queue size, more threads, faster disks) to mitigate this?  That's really the only way to mitigate the retries.  That is, Elasticsearch is telling us that it is overloaded, and we need to help Elasticsearch process more bulk index requests faster.

At some point, the number of retries and the number of duplicates are going to be a problem, but how will we know?  How urgent is the current situation?  What is the severity?

As of 3.7 we now have a number of knobs exposed for fluentd - https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/openshift/output-es-config.conf#L21 - ES_FLUSH_INTERVAL, ES_RETRY_WAIT, BUFFER_QUEUE_LIMIT, BUFFER_SIZE_LIMIT.  I suggest trying to increase BUFFER_SIZE_LIMIT to 16m - perhaps Elasticsearch would be better able to process fewer bulk index requests of a larger size.
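
For anyone trying that suggestion, a hedged example of bumping the knob on the fluentd daemonset (resource and namespace names assume the default logging deployment):

# Larger buffer chunks -> fewer, bigger bulk requests
oc -n logging set env daemonset/logging-fluentd BUFFER_SIZE_LIMIT=16m

# ES_FLUSH_INTERVAL, ES_RETRY_WAIT and BUFFER_QUEUE_LIMIT can be set the same way.
# Depending on the daemonset update strategy, the fluentd pods may need to be
# deleted/restarted to pick up the new environment.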

From the fluentd side, the best approach would be to remove all of the successful documents from the bulk index request, and retry only the failures.  In practice, with the way fluentd queues buffer chunks, it is nearly impossible to change the chunk and requeue a different chunk value.  It might be possible to delete the original chunk and requeue a different chunk, but this will involve some pretty hairy fluentd/ruby coding.

The next best is to generate a uuid for each document in fluentd.  Then, we would get back a duplicate document error, and we could ignore duplicates.  But we're still quite a ways from implementing that.  And we would still get a lot of retries.
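
To illustrate the uuid idea, here is a sketch of the underlying Elasticsearch bulk behavior (not the eventual fluentd patch; index, type and ids are placeholders, and the TLS/auth flags shown earlier are omitted): if the client supplies the _id and uses the "create" action, re-sending an already-indexed line returns a 409 (document already exists) instead of indexing a second copy.

# First submission: both documents are created
curl -s -XPOST 'https://localhost:9200/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary '
{"create":{"_index":"myindex","_type":"log","_id":"11111111-1111-1111-1111-111111111111"}}
{"message":"line 1"}
{"create":{"_index":"myindex","_type":"log","_id":"22222222-2222-2222-2222-222222222222"}}
{"message":"line 2"}
'

# Retrying the same payload: each item now comes back with status 409 and no
# duplicate documents are indexed, so blind retries of a whole chunk become safe.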

Comment 4 Lukas Vlcek 2018-02-23 13:17:41 UTC
Rich, I would bet on the effect of the increasing number of shards.

> Expected results:
>
> At a constant message rate, increasing the number of projects has no effect on errors.  Unrealistic?

Yes, unrealistic. More projects = more indices and a larger total number of shards ES has to manage. An index shard is not cost free.

See chapter "Are indices and shards not free?" here https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

It says ES has to keep a copy of metadata at the segment level, yet each index shard can consist of many segments (really many, i.e. 100, 1000 or more, depending on indexing pressure and segment merge frequency). It suggests adding RAM could be a valid solution in this case.

Further it reads: A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards [...]

Given that each ES node has 160GB RAM in this case, we should target around 4000 shards max per ES node.

To further analyse what is going on we should check monitoring data for the number of indices/shards/segments and the heap use of each ES node. This should give us an idea of how much the situation can be improved just by throwing more RAM at the ES nodes.
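
A hedged example of pulling those numbers with the _cat APIs (same exec and certificate caveats as the thread_pool example earlier in this bug):

# Totals for indices, shards and segments
curl -s 'https://localhost:9200/_cat/indices' | wc -l
curl -s 'https://localhost:9200/_cat/shards' | wc -l
curl -s 'https://localhost:9200/_cat/segments' | wc -l

# Heap use per ES node
curl -s 'https://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'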

Comment 6 Jeff Cantrill 2018-02-23 18:37:31 UTC
(In reply to Lukas Vlcek from comment #4)
> 
> Given each ES node has 160GB RAM in this case we should target around 4000
> shards max per ES node.
> 

I assume this is a mistake and you mean '16GB' RAM.  Do you mean the amount we allot the container, which gets halved to 8, or do we specify 32 and get 16?

Comment 7 Peter Portante 2018-02-24 16:36:17 UTC
Each ES instance should use at most 31 GB of HEAP to stay below the large object Java behavior, which effectively doubles memory usage.  If you need more memory for ES, then we'll need to create more Pods at or below 62 GB of memory (half for the Java HEAP, half for buffer cache).
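
As a hedged illustration of that sizing via the installer (openshift_logging_es_memory_limit is the standard openshift-ansible variable; the value is an example only, assuming, per the discussion in comment 6, that the container memory is halved for the JVM heap):

# Keep the heap at or below ~31 GB by capping the ES container at ~62 GB
openshift_logging_es_memory_limit=62Gi
openshift_logging_es_cluster_size=3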

Using client nodes to handle the bulk requests would help, and then having more spindles or I/O paths for sharding will also help.

Further, having writes routed to the primary shards and reads routed to the replica shards might also help.

Finally, analyzing the write performance to the ES disks to determine if we are maxing out that subsystem would be helpful.
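
For the disk question, a hedged starting point on each ES node host (sysstat's iostat is assumed to be available; device names depend on the environment):

# Extended per-device stats in MB/s, sampled every 5 seconds; watch the device
# backing the ES persistent volume for %util near 100 and a growing await
iostat -xm 5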

Comment 8 Jeff Cantrill 2018-02-26 21:39:12 UTC
Additional PR to fix ansible changes where we try to keep the password so we don't redeploy the ES image:

https://github.com/openshift/openshift-ansible/pull/7294

Comment 9 Jeff Cantrill 2018-02-26 21:39:39 UTC
ignore comment 8

Comment 10 openshift-github-bot 2018-03-01 23:34:58 UTC
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/5e1e372658b57f916af9543efb3f096cc2936bfc
bug 1548104. Generate record uuids upon ingestion to minimize duplicates

Comment 12 Mike Fiedler 2018-03-12 22:41:53 UTC
Verified in logging 3.9.7.

Sent 250 1K messages/second/node and 500 1K messages/second/node across 100 pods per node, each in its own namespace.

Verified in Elasticsearch that bulk request rejections were occurring
Verified in the final index counts that the correct number of messages were in the index.  No duplicates.  None missing.
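
For reference, a hedged example of the kind of per-index document count check used here (same exec and certificate caveats as the earlier _cat examples; the project.* naming follows the standard OpenShift logging index pattern):

# Document count per project index
curl -s 'https://localhost:9200/_cat/indices/project.*?v&h=index,docs.count'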

Comment 15 errata-xmlrpc 2018-03-28 14:30:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

Comment 16 Jeff Cantrill 2018-04-11 20:43:03 UTC
*** Bug 1491401 has been marked as a duplicate of this bug. ***

Comment 17 openshift-github-bot 2018-06-20 15:06:03 UTC
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/3d9a9f05fdd8b0b18fade6e82bb5b66cc571f1d2
bug 1548104. Generate record uuids upon ingestion to minimize duplicates

(cherry picked from commit 5e1e372658b57f916af9543efb3f096cc2936bfc)

