Bug 1548104
Summary: | Duplicate elasticsearch entries increase as namespaces increase (constant message rate)
---|---
Product: | OpenShift Container Platform
Reporter: | Mike Fiedler <mifiedle>
Component: | Logging
Assignee: | Jeff Cantrill <jcantril>
Status: | CLOSED ERRATA
QA Contact: | Mike Fiedler <mifiedle>
Severity: | high
Docs Contact: |
Priority: | unspecified
Version: | 3.9.0
CC: | aos-bugs, juzhao, lvlcek, pportant, rmeggins
Target Milestone: | ---
Target Release: | 3.9.0
Hardware: | x86_64
OS: | Linux
Whiteboard: | aos-scalability-39
Fixed In Version: |
Doc Type: | Bug Fix
Doc Text: |
Cause: Fluentd inserts documents (logs) into Elasticsearch using the bulk insert API, but it relies on Elasticsearch to generate a UUID for each document, and it does not remove successfully indexed documents from the bulk payload when the bulk operation fails.
Consequence: The initial payload is resubmitted in full, so documents that were already indexed are submitted again, resulting in duplicate documents with different UUIDs.
Fix: Generate document IDs before submitting bulk insert requests (see the sketch after this table).
Result: Elasticsearch disregards inserts for documents that already exist in the data store and indexes only the documents that do not.
Story Points: | ---
Clone Of: |
: | 1556896 1556897 (view as bug list)
Environment: |
Last Closed: | 2018-03-28 14:30:19 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1556896, 1556897
Attachments: | logging-fluentd logs with debug set in <system> config (attachment 1399494, see description)
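As a rough illustration of the fix described in the Doc Text: if the client chooses each document's ID before the first submission and uses the bulk `create` action, a resubmitted payload collides with the copies that were already indexed instead of creating new documents with fresh UUIDs. This is only a minimal sketch against a generic Elasticsearch `_bulk` endpoint; the endpoint URL, index name, and ID-hashing scheme are placeholders, not what fluent-plugin-elasticsearch actually does.

```python
import hashlib
import json

import requests  # any HTTP client would do; assumed available

ES_URL = "https://logging-es:9200"  # placeholder endpoint


def doc_id(record: dict) -> str:
    """Derive a stable ID for a record before it is ever submitted.

    A random uuid generated once per record works just as well; the key
    point is that the ID is fixed before the first bulk request, so a
    retry re-uses the same IDs instead of letting ES mint new ones.
    """
    return hashlib.sha1(json.dumps(record, sort_keys=True).encode()).hexdigest()


def bulk_payload(index: str, records: list) -> str:
    """Build an NDJSON _bulk body of 'create' actions with explicit _id."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"create": {"_index": index, "_id": doc_id(rec)}}))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"


def submit(index: str, records: list) -> None:
    resp = requests.post(
        f"{ES_URL}/_bulk",
        data=bulk_payload(index, records),
        headers={"Content-Type": "application/x-ndjson"},
        verify=False,  # illustration only; a real client trusts the cluster CA
    )
    for item in resp.json().get("items", []):
        status = item["create"]["status"]
        # 201: newly indexed. 409: a document with this _id already exists
        # (e.g. from an earlier, partially successful attempt) and can be
        # ignored. Anything else (such as 429 rejections) still needs a
        # retry, but re-sending the same payload can no longer duplicate
        # documents that already made it in.
        if status not in (201, 409):
            print("needs retry:", item["create"])
```

Conceptually this is what the "Generate record uuids upon ingestion" change referenced at the bottom of this report does; the content hashing and retry handling here are simplified stand-ins.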
Description (Mike Fiedler, 2018-02-22 17:56:01 UTC)

Created attachment 1399494 [details]
logging-fluentd logs with debug set in <system> config
Lukas or Peter, is there something we can do on the Elasticsearch side (more CPU, more RAM, a larger bulk import queue size, more threads, faster disks) to mitigate this? That is really the only way to mitigate the retries: Elasticsearch is telling us that it is overloaded, and we need to help it process more bulk index requests faster. At some point the number of retries and the number of duplicates are going to be a problem, but how will we know? How urgent is the current situation? What is the severity?

We now have (in 3.7) a number of knobs exposed for fluentd, see https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/configs.d/openshift/output-es-config.conf#L21: ES_FLUSH_INTERVAL, ES_RETRY_WAIT, BUFFER_QUEUE_LIMIT, BUFFER_SIZE_LIMIT. I suggest trying to increase BUFFER_SIZE_LIMIT to 16m; perhaps Elasticsearch would be better able to process fewer bulk index requests of a larger size.

From the fluentd side, the best approach would be to remove all of the successful documents from the bulk index request and retry only the failures. In practice, with the way fluentd queues buffer chunks, it is nearly impossible to change a chunk and requeue a different chunk value. It might be possible to delete the original chunk and requeue a different one, but that would involve some pretty hairy fluentd/ruby coding. The next best option is to generate a uuid for each document in fluentd. Then we would get back a duplicate document error, and we could ignore duplicates. But we are still quite a ways from implementing that, and we would still get a lot of retries.

Rich, I would bet on the effect of the increasing number of shards.

> Expected results:
>
> At a constant message rate, increasing the number of projects has no effect on errors.

Unrealistic? Yes, unrealistic. More projects means more indices and a larger total number of shards that ES has to manage, and an index shard is not cost free. See the chapter "Are indices and shards not free?" in https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster. It explains that ES has to keep a copy of metadata at the segment level, and each index shard can consist of many segments (really many, e.g. 100 to 1000 or more, depending on indexing pressure and segment merge frequency). It suggests that adding RAM could be a valid solution in this case. Further, it reads: "A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards [...]" Given each ES node has 160GB RAM in this case, we should target around 4000 shards max per ES node. To further analyse what is going on, we should check monitoring data for the number of indices/shards/segments and the heap usage of each ES node. That should give us an idea of how much the situation can be improved just by throwing more RAM at the ES nodes.

(In reply to Lukas Vlcek from comment #4)
> Given each ES node has 160GB RAM in this case we should target around 4000
> shards max per ES node.

I assume this is a mistake and you mean '16GB' RAM. Do you mean the amount we allot to the container, which gets halved to 8, or do we specify 32 and get 16?

Each ES instance should use at most 31 GB of heap to stay below the large-object Java behavior, which effectively doubles memory usage. If you need more memory for ES, then we will need to create more pods at or below 62 GB of memory (half for the Java heap, half for the buffer cache).
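For reference, the shard and heap arithmetic discussed in these comments can be written out as a small sketch. The numbers are the ones already quoted above (20 to 25 shards per GB of heap, a 31 GB heap ceiling, and containers whose memory is split half heap, half buffer cache); the helper functions are purely illustrative, not anything the installer computes:

```python
def heap_gb(container_memory_gb: float) -> float:
    """Heap an ES pod ends up with under the sizing discussed above:
    half of the container memory (the other half is left for the buffer
    cache), capped at 31 GB to stay below the large-object Java
    behavior that effectively doubles memory usage."""
    return min(container_memory_gb / 2, 31)


def shard_budget(heap_in_gb: float, shards_per_gb: int = 25) -> int:
    """Rule-of-thumb ceiling on shards per data node: 20-25 per GB of
    configured heap (from the Elastic blog post linked above)."""
    return int(heap_in_gb * shards_per_gb)


# The figure quoted from the blog post: a node with a 30 GB heap should
# stay below roughly 600-750 shards.
print(shard_budget(30, 20), shard_budget(30, 25))        # 600 750

# A 32 GB container, halved as described in the comments, yields a
# 16 GB heap and therefore a budget of roughly 320-400 shards.
h = heap_gb(32)
print(h, shard_budget(h, 20), shard_budget(h, 25))       # 16.0 320 400
```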
Using client nodes to handle the bulk requests would help, and having more spindles or I/O paths for sharding would also help. Further, having writes routed to the primary shards and reads routed to the replica shards might also help. Finally, it would be useful to analyze write performance to the ES disks to determine whether we are maxing out that subsystem.

Additional PR to fix the ansible changes where we try to keep the password so we don't redeploy the ES image: https://github.com/openshift/openshift-ansible/pull/7294

Commit pushed to master at https://github.com/openshift/origin-aggregated-logging
https://github.com/openshift/origin-aggregated-logging/commit/5e1e372658b57f916af9543efb3f096cc2936bfc
bug 1548104. Generate record uuids upon ingestion to minimize duplicates

Verified in logging 3.9.7. Sent 250 1K messages/second/node and then 500 1K messages/second/node across 100 pods per node, each in its own namespace. Verified in Elasticsearch that bulk request rejections were occurring. Verified in the final index counts that the correct number of messages were in the index: no duplicates, none missing (see the count-check sketch at the end of this report).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

*** Bug 1491401 has been marked as a duplicate of this bug. ***

Commit pushed to master at https://github.com/openshift/origin-aggregated-logging
https://github.com/openshift/origin-aggregated-logging/commit/3d9a9f05fdd8b0b18fade6e82bb5b66cc571f1d2
bug 1548104. Generate record uuids upon ingestion to minimize duplicates
(cherry picked from commit 5e1e372658b57f916af9543efb3f096cc2936bfc)
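The verification above amounts to comparing each index's final document count against the number of messages sent to it. A minimal sketch of that kind of check, assuming the elasticsearch Python client; the connection details, index pattern, and the 300,000-message figure are placeholders, not the actual test parameters:

```python
from elasticsearch import Elasticsearch  # assumed client library

# Placeholder connection details; a real check would use the logging-es
# service with the cluster CA and proper credentials.
es = Elasticsearch(["https://logging-es:9200"], verify_certs=False)


def check_index(index_pattern: str, expected_docs: int) -> bool:
    """Compare the indexed document count with the messages sent.

    With client-generated document IDs, a retried bulk request cannot
    inflate this count, so equality implies no duplicates and nothing
    missing.
    """
    actual = es.count(index=index_pattern)["count"]
    print(f"{index_pattern}: expected {expected_docs}, got {actual}")
    return actual == expected_docs


# Hypothetical usage for one test namespace:
check_index("project.svt-namespace-0.*", 300_000)
```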