Bug 1627086
Summary: ElasticSearch pods flapping with "fatal error on the network layer" exception when logging from 1000+ nodes
Product: OpenShift Container Platform
Component: Logging
Version: 3.11.0
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: Mike Fiedler <mifiedle>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Mike Fiedler <mifiedle>
CC: aos-bugs, rmeggins
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-311
Doc Type: Bug Fix
Doc Text:
Cause: The Netty dependency does not make efficient use of the heap
Consequence: Elasticsearch begins to fail on the network layer at high logging volume
Fix: Disable Netty Recycler
Result: Elasticsearch is more efficient in processing connections
Story Points: ---
Last Closed: 2018-11-20 03:10:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Description
Mike Fiedler
2018-09-10 12:05:04 UTC
Created attachment 1482113 [details]
pod logs, logging-es log, deployment config, system log for 2 nodes
Attached are the pod logs, the logging-es log, the deployment config, and the system log for 2 nodes that hit this issue during a test with 1000 fluentd instances, each sending 100 messages per second.
A search shows a lot of hits where this is preceded by an OutOfMemory exception, which is not the case here. https://discuss.elastic.co/t/faced-fatal-error-on-the-network-layer-and-fatal-error-in-thread-error/105882 suggests a variant of this is fixed in 5.6.4, but we're running 5.6.10. https://github.com/elastic/elasticsearch/issues/22406 suggests trying to run with -Dio.netty.recycler.maxCapacityPerThread=0. I will do that when I get time on the gear again.

Additional settings from the PR: https://github.com/elastic/elasticsearch/pull/22452/files
That should also be included.

Since setting the java opts from comment 4 I have not seen this exception.

(In reply to Mike Fiedler from comment #5)
> Since setting the java opts from comment 4 I have not seen this exception.

So what do we need to do to fix this bug? Change the default ES configuration to include those settings? Will those settings negatively impact small/all-in-one deployments?

@richm Have a look at the comments in these two PRs:
https://github.com/elastic/elasticsearch/pull/22452
https://github.com/elastic/elasticsearch/pull/24793
It seems netty has caused ES a lot of grief and they've disabled it in 6.x. Setting these options in our ES config will accomplish the same thing on 5.x. There should be minimal or no impact.

Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/7912f2ff318d7169f9b78c60b82b76329b102c47
bug 1627086. Add configuration to disable Netty recycler

https://github.com/openshift/origin-aggregated-logging/commit/2aacf9be390c298de28c89802e14ab4fc6503844
Merge pull request #1390 from jcantrill/bz1627086
bug 1627086. Add configuration to disable Netty recycler

Verified in logging-es.log that all 4 options in https://github.com/elastic/elasticsearch/pull/22452/files are set.

Verified on 3.11.36

[2018-11-01T20:19:18,222][INFO ][o.e.n.Node ] [logging-es-data-master-a63oo25y] JVM arguments [-XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -XX:+UnlockExperimentalVMOptions, -XX:+UseCGroupMemoryLimitForHeap, -XX:MaxRAMFraction=2, -XX:InitialRAMFraction=2, -XX:MinRAMFraction=2, -Dmapper.allow_dots_in_name=true, -Xms8192m, -Xmx8192m, -XX:HeapDumpPath=/elasticsearch/persistent/heapdump.hprof, -Dsg.display_lic_none=false, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.type=unpooled, -Des.path.home=/usr/share/elasticsearch]

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537
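For anyone repeating the verification step, the check of the "JVM arguments [...]" line in logging-es.log could be scripted roughly as follows. This is only a sketch: the flag set is taken from the io.netty options visible in the log line quoted in this report (treating those as the expected set is an assumption), and the function names are hypothetical, not part of any OpenShift or Elasticsearch tooling.

```python
import re

# Assumption: the Netty-related flags visible in the JVM arguments line
# quoted in this bug report. Adjust to match your own expected settings.
REQUIRED_FLAGS = {
    "-Dio.netty.noUnsafe=true",
    "-Dio.netty.noKeySetOptimization=true",
    "-Dio.netty.recycler.maxCapacityPerThread=0",
    "-Dio.netty.allocator.type=unpooled",
}


def parse_jvm_arguments(log_line):
    """Extract the argument list from a 'JVM arguments [...]' log line.

    Returns an empty list if the line carries no such section.
    """
    match = re.search(r"JVM arguments \[(.*)\]", log_line)
    if match is None:
        return []
    # Arguments are comma-separated inside the brackets.
    return [arg.strip() for arg in match.group(1).split(",") if arg.strip()]


def missing_netty_flags(log_line):
    """Return the required flags that are absent from the log line."""
    return REQUIRED_FLAGS - set(parse_jvm_arguments(log_line))


if __name__ == "__main__":
    # Shortened, hypothetical sample in the same shape as the real log line.
    sample = (
        "[2018-11-01T20:19:18,222][INFO ][o.e.n.Node ] [node] JVM arguments "
        "[-Xss1m, -Dio.netty.noUnsafe=true, "
        "-Dio.netty.recycler.maxCapacityPerThread=0]"
    )
    print(sorted(missing_netty_flags(sample)))
```

Feeding it the startup line from logging-es.log and getting back an empty set would indicate that all of the assumed flags are present.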