Bug 1627086
| Summary: | ElasticSearch pods flapping with "fatal error on the network layer" exception when logging from 1000+ nodes |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Logging |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | unspecified |
| Version: | 3.11.0 |
| Target Milestone: | --- |
| Target Release: | 3.11.z |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | aos-scalability-311 |
| Reporter: | Mike Fiedler <mifiedle> |
| Assignee: | Jeff Cantrill <jcantril> |
| QA Contact: | Mike Fiedler <mifiedle> |
| CC: | aos-bugs, rmeggins |
| Type: | Bug |
| Doc Type: | Bug Fix |
| Doc Text: | Cause: The Netty dependency does not make efficient use of the heap. Consequence: Elasticsearch begins to fail on the network layer at high logging volume. Fix: Disable the Netty Recycler. Result: Elasticsearch is more efficient in processing connections. |
| Last Closed: | 2018-11-20 03:10:43 UTC |
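The fix described in the Doc Text boils down to passing a handful of Netty system properties to the Elasticsearch JVM. As a minimal sketch, these are the Netty-related options visible in the verified 3.11.36 JVM arguments, written as they would appear in a stock Elasticsearch 5.x `config/jvm.options` file (the file path and this delivery mechanism are assumptions for a generic install; the OpenShift logging image wires the options in through its own configuration):

```properties
# Netty options from the fix (cf. elastic/elasticsearch#22452 and #24793):
# never cache pooled objects in the Netty Recycler
-Dio.netty.recycler.maxCapacityPerThread=0
# avoid sun.misc.Unsafe-based optimizations
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
# use the unpooled byte-buffer allocator
-Dio.netty.allocator.type=unpooled
```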
Created attachment 1482113 [details]

Pod logs, logging-es log, deployment config, and system log for 2 nodes that hit this issue during a 1000-fluentd, 100 messages per second per fluentd test.
A search shows a lot of hits where this exception is preceded by an OutOfMemory exception, which is not the case here. https://discuss.elastic.co/t/faced-fatal-error-on-the-network-layer-and-fatal-error-in-thread-error/105882 suggests a variant of this is fixed in 5.6.4, but we're running 5.6.10.

https://github.com/elastic/elasticsearch/issues/22406 suggests trying to run with -Dio.netty.recycler.maxCapacityPerThread=0. I will do that when I get time on the gear again.

Additional settings from the PR: https://github.com/elastic/elasticsearch/pull/22452/files. Those should also be included.

Since setting the java opts from comment 4 I have not seen this exception.

(In reply to Mike Fiedler from comment #5)
> Since setting the java opts from comment 4 I have not seen this exception.

So what do we need to do to fix this bug? Change the default ES configuration to include those settings? Will those settings negatively impact small/all-in-one deployments?

@richm Have a look at the comments in these two PRs:

https://github.com/elastic/elasticsearch/pull/22452
https://github.com/elastic/elasticsearch/pull/24793

It seems Netty has caused ES a lot of grief, and they've disabled the recycler in 6.x. Setting these options in our ES config will accomplish the same thing on 5.x. There should be minimal or no impact.

Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/7912f2ff318d7169f9b78c60b82b76329b102c47
bug 1627086. Add configuration to disable Netty recycler

https://github.com/openshift/origin-aggregated-logging/commit/2aacf9be390c298de28c89802e14ab4fc6503844
Merge pull request #1390 from jcantrill/bz1627086
bug 1627086. Add configuration to disable Netty recycler

Verified on 3.11.36. Verified in logging-es.log that all 4 options in https://github.com/elastic/elasticsearch/pull/22452/files are set:

```
[2018-11-01T20:19:18,222][INFO ][o.e.n.Node ] [logging-es-data-master-a63oo25y] JVM arguments [-XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -XX:+UnlockExperimentalVMOptions, -XX:+UseCGroupMemoryLimitForHeap, -XX:MaxRAMFraction=2, -XX:InitialRAMFraction=2, -XX:MinRAMFraction=2, -Dmapper.allow_dots_in_name=true, -Xms8192m, -Xmx8192m, -XX:HeapDumpPath=/elasticsearch/persistent/heapdump.hprof, -Dsg.display_lic_none=false, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.type=unpooled, -Des.path.home=/usr/share/elasticsearch]
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537
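The verification above amounts to checking that the Netty options appear in the `JVM arguments` line of the Elasticsearch log. A rough sketch of that check as a shell loop, run here against an abbreviated sample of the log line quoted above (in a live cluster you would substitute the output of `oc logs` on a logging-es pod for the sample variable):

```shell
# Sample of the logging-es JVM-arguments line (abbreviated from the log above)
sample='[2018-11-01T20:19:18,222][INFO ][o.e.n.Node] JVM arguments [-Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.type=unpooled, -Xms8192m, -Xmx8192m]'

# Check each of the four Netty options from elastic/elasticsearch#22452/#24793
for opt in io.netty.noUnsafe=true \
           io.netty.noKeySetOptimization=true \
           io.netty.recycler.maxCapacityPerThread=0 \
           io.netty.allocator.type=unpooled; do
  if printf '%s' "$sample" | grep -q -- "-D$opt"; then
    echo "ok: $opt"
  else
    echo "MISSING: $opt"
  fi
done
```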
Description of problem:

In 3.11, when trying to log from 1000 or more nodes, the elasticsearch pods crash and restart regularly with a "fatal error on the network layer" exception. Stack trace in Additional info and full logs attached.

The cluster has 3 elasticsearch nodes. Each node has 40 vCPU and 164GB RAM. The memory limit for each ES dc is 61GB. At the time the crash occurs, there are no oom_kill reports, no OutOfMemory exceptions, and each pod is using 45-55GB of memory. I'll link internal pbench data in a private comment.

At 2000 nodes, the elasticsearch pods flap even with no load other than operations logging; just starting 2000 fluentd pods causes the issue. With 1000 fluentd started, elasticsearch is stable until a pod logging load is started. At 100 messages per second per fluentd, 2 out of the 3 elasticsearch pods crashed and restarted during a 30 minute run.

Version-Release number of selected component (if applicable): 3.11.0-0.28.0

How reproducible: Always, when starting 2000 fluentd or when logging at a moderate rate with 1000 fluentd.

Additional info:

```
[2018-09-09T22:46:42,555][ERROR][o.e.t.n.Netty4Utils ] fatal error on the network layer
    at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:184)
    at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.exceptionCaught(Netty4HttpRequestHandler.java:89)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
    at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:68)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1273)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1084)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at java.lang.Thread.run(Thread.java:748)
```