Description of problem:

In 3.11, when trying to log from 1000 or more nodes, the elasticsearch pods crash and restart regularly with a "fatal error on the network layer" exception. The stack trace is in Additional info and full logs are attached.

The cluster has 3 elasticsearch nodes. Each node has 40 vCPU and 164GB RAM. The memory limit for each ES dc is 61GB. At the time the crash occurs there are no oom_kill reports, no OutOfMemory exceptions, and each pod is using 45-55GB of memory. I'll link internal pbench data in a private comment.

At 2000 nodes, the elasticsearch pods flap even with no load other than operations logging; just starting 2000 fluentd pods causes the issue. With 1000 fluentd started, elasticsearch is stable until a pod logging load is started. At 100 messages per second per fluentd, 2 out of the 3 elasticsearch pods crashed and restarted during a 30 minute run.

Version-Release number of selected component (if applicable):
3.11.0-0.28.0

How reproducible:
Always, when starting 2000 fluentd or when logging at a moderate rate with 1000 fluentd

Additional info:

[2018-09-09T22:46:42,555][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
    at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:184)
    at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.exceptionCaught(Netty4HttpRequestHandler.java:89)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
    at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:68)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1273)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1084)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at java.lang.Thread.run(Thread.java:748)
Created attachment 1482113 [details]
pod logs, logging-es log, deployment config, system log for 2 nodes

Pod logs, logging-es log, deployment config, and system log for the 2 nodes that hit this issue during a 1000 fluentd, 100 messages per second per fluentd test.
A search turns up a lot of hits where this error is preceded by an OutOfMemory exception, which is not the case here. https://discuss.elastic.co/t/faced-fatal-error-on-the-network-layer-and-fatal-error-in-thread-error/105882 suggests a variant of this was fixed in 5.6.4, but we're running 5.6.10. https://github.com/elastic/elasticsearch/issues/22406 suggests running with -Dio.netty.recycler.maxCapacityPerThread=0. I will try that when I get time on the gear again.
Additional settings from this PR should also be included: https://github.com/elastic/elasticsearch/pull/22452/files
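For reference, here are the Netty-related JVM system properties being discussed, in jvm.options style. This is only a sketch pulled from the JVM arguments shown in the 3.11.36 verification comment later in this bug; see the linked PR for the authoritative upstream set.

  # Netty-related system properties discussed above. Sketch only: these mirror
  # the JVM arguments verified later in this bug, not necessarily the PR diff.
  -Dio.netty.noUnsafe=true
  -Dio.netty.noKeySetOptimization=true
  -Dio.netty.recycler.maxCapacityPerThread=0
  -Dio.netty.allocator.type=unpooled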
Since setting the java opts from comment 4 I have not seen this exception.
(In reply to Mike Fiedler from comment #5)
> Since setting the java opts from comment 4 I have not seen this exception.

So what do we need to do to fix this bug? Change the default ES configuration to include those settings? Will those settings negatively impact small/all-in-one deployments?
@richm Have a look at the comments in these two PRs:

https://github.com/elastic/elasticsearch/pull/22452
https://github.com/elastic/elasticsearch/pull/24793

It seems the Netty recycler has caused ES a lot of grief, and they've disabled it in 6.x. Setting these options in our ES config will accomplish the same thing on 5.x. There should be minimal or no impact.
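For anyone who wants to try this on an existing 3.11 cluster before a fix ships, a rough sketch follows. It assumes the logging-es deployment configs pass the standard ES_JAVA_OPTS environment variable through to the JVM and live in the openshift-logging namespace; both are assumptions, not something verified in this bug.

  # Sketch only: assumes ES_JAVA_OPTS is honored by the logging-es image and
  # that the namespace/dc names below match your cluster.
  for dc in $(oc get dc -n openshift-logging -o name | grep logging-es); do
    oc set env -n openshift-logging "$dc" \
      ES_JAVA_OPTS="-Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.type=unpooled"
  done

The merged change below instead adds this configuration to the shipped ES setup, so a manual override like this should only matter on builds that predate it.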
PR for master https://github.com/openshift/origin-aggregated-logging/pull/1390
Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/7912f2ff318d7169f9b78c60b82b76329b102c47
bug 1627086. Add configuration to disable Netty recycler

https://github.com/openshift/origin-aggregated-logging/commit/2aacf9be390c298de28c89802e14ab4fc6503844
Merge pull request #1390 from jcantrill/bz1627086
bug 1627086. Add configuration to disable Netty recycler
3.11 backport https://github.com/openshift/origin-aggregated-logging/pull/1416
Verified in logging-es.log that all 4 options in https://github.com/elastic/elasticsearch/pull/22452/files are set.

Verified on 3.11.36:

[2018-11-01T20:19:18,222][INFO ][o.e.n.Node               ] [logging-es-data-master-a63oo25y] JVM arguments [-XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -XX:+UnlockExperimentalVMOptions, -XX:+UseCGroupMemoryLimitForHeap, -XX:MaxRAMFraction=2, -XX:InitialRAMFraction=2, -XX:MinRAMFraction=2, -Dmapper.allow_dots_in_name=true, -Xms8192m, -Xmx8192m, -XX:HeapDumpPath=/elasticsearch/persistent/heapdump.hprof, -Dsg.display_lic_none=false, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.type=unpooled, -Des.path.home=/usr/share/elasticsearch]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537