Bug 1627086

Summary: ElasticSearch pods flapping with "fatal error on the network layer" exception when logging from 1000+ nodes
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: Logging Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA QA Contact: Mike Fiedler <mifiedle>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.11.0 CC: aos-bugs, rmeggins
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: aos-scalability-311
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The Netty dependency does not make efficient use of the heap.
Consequence: Elasticsearch begins to fail on the network layer at high logging volume.
Fix: Disable the Netty recycler.
Result: Elasticsearch is more efficient in processing connections.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-11-20 03:10:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
pod logs, logging-es log, deployment config, system log for 2 nodes none

Description Mike Fiedler 2018-09-10 12:05:04 UTC
Description of problem:

In 3.11, when trying to log from 1000 or more nodes, the elasticsearch pods crash and restart regularly with a "fatal error on the network layer" exception.  Stack trace in Additional info and full logs attached.

The cluster has 3 elasticsearch nodes.  Each node has 40 vCPU and 164GB RAM.  The memory limit for each ES dc is 61GB.  At the time the crash occurs, there are no oom_kill reports and no OutOfMemory exceptions, and each pod is using 45-55GB of memory.  I'll link internal pbench data in a private comment.

At 2000 nodes, the elasticsearch pods flap even with no load other than operations logging.  Just starting 2000 fluentd pods causes the issue.

With 1000 fluentd pods started, elasticsearch is stable until a pod logging load is started.  At 100 messages per second per fluentd, 2 out of the 3 elasticsearch pods crashed and restarted during a 30-minute run.


Version-Release number of selected component (if applicable): 3.11.0-0.28.0


How reproducible: Always, when starting 2000 fluentd pods or when logging at a moderate rate with 1000 fluentd pods




Additional info:

[2018-09-09T22:46:42,555][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:184)
        at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.exceptionCaught(Netty4HttpRequestHandler.java:89)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:68)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1273)
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1084)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)

Comment 1 Mike Fiedler 2018-09-10 12:07:49 UTC
Created attachment 1482113 [details]
pod logs, logging-es log, deployment config, system log for 2 nodes

pod logs, logging-es log, deployment config, system log for 2 nodes that hit this issue during a 1000 fluentd, 100 messages per second per fluentd test.

Comment 2 Mike Fiedler 2018-09-10 12:11:42 UTC
A search shows a lot of hits where this error is preceded by an OutOfMemory exception, which is not the case here.

https://discuss.elastic.co/t/faced-fatal-error-on-the-network-layer-and-fatal-error-in-thread-error/105882 suggests a variant of this is fixed in 5.6.4, but we're running 5.6.10.

https://github.com/elastic/elasticsearch/issues/22406 suggests trying to run with -Dio.netty.recycler.maxCapacityPerThread=0.   I will do that when I get time on the gear again.
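
For reference, a minimal sketch of applying that single property without rebuilding the image, assuming the image honors the standard ES_JAVA_OPTS variable (the dc name below is a placeholder):

    # Hypothetical workaround: append the Netty recycler flag to the
    # Elasticsearch JVM options via the deployment config environment.
    # Assumes the image passes ES_JAVA_OPTS through to the JVM; the dc name
    # is a placeholder.
    oc set env dc/logging-es-data-master-xxxxxxxx \
        ES_JAVA_OPTS="-Dio.netty.recycler.maxCapacityPerThread=0"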

Comment 4 Jeff Cantrill 2018-09-10 17:38:51 UTC
Additional settings from the PR:

https://github.com/elastic/elasticsearch/pull/22452/files

Those should also be included.
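
For reference, the Netty-related JVM properties in question (the authoritative list is in the linked PR; these are the flags that later show up in the verified JVM arguments in comment 12), in jvm.options-style form:

    # disable Netty unsafe access and key-set optimization
    -Dio.netty.noUnsafe=true
    -Dio.netty.noKeySetOptimization=true
    # disable the Netty recycler (object pooling) and the pooled allocator
    -Dio.netty.recycler.maxCapacityPerThread=0
    -Dio.netty.allocator.type=unpooled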

Comment 5 Mike Fiedler 2018-09-12 16:01:50 UTC
Since setting the java opts from comment 4 I have not seen this exception.

Comment 6 Rich Megginson 2018-09-12 16:13:25 UTC
(In reply to Mike Fiedler from comment #5)
> Since setting the java opts from comment 4 I have not seen this exception.

So what do we need to do to fix this bug?  Change the default ES configuration to include those settings?  Will those settings negatively impact small/all-in-one deployments?

Comment 7 Mike Fiedler 2018-09-12 17:08:11 UTC
@richm  Have a look at the comments in these two PRs:

https://github.com/elastic/elasticsearch/pull/22452
https://github.com/elastic/elasticsearch/pull/24793

It seems the Netty recycler has caused ES a lot of grief and they've disabled it in 6.x.  Setting these options in our ES config will accomplish the same thing on 5.x.  There should be minimal or no impact.

Comment 8 Jeff Cantrill 2018-10-09 14:46:01 UTC
PR for master https://github.com/openshift/origin-aggregated-logging/pull/1390

Comment 9 openshift-github-bot 2018-10-15 21:00:02 UTC
Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/7912f2ff318d7169f9b78c60b82b76329b102c47
bug 1627086. Add configuration to disable Netty recycler

https://github.com/openshift/origin-aggregated-logging/commit/2aacf9be390c298de28c89802e14ab4fc6503844
Merge pull request #1390 from jcantrill/bz1627086

bug 1627086. Add configuration to disable Netty recycler

Comment 10 Jeff Cantrill 2018-10-15 21:05:59 UTC
3.11 backport https://github.com/openshift/origin-aggregated-logging/pull/1416

Comment 12 Mike Fiedler 2018-11-01 20:40:10 UTC
Verified in logging-es.log that all 4 options in https://github.com/elastic/elasticsearch/pull/22452/files are set.

Verified on 3.11.36


[2018-11-01T20:19:18,222][INFO ][o.e.n.Node               ] [logging-es-data-master-a63oo25y] JVM arguments [-XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -XX:+UnlockExperimentalVMOptions, -XX:+UseCGroupMemoryLimitForHeap, -XX:MaxRAMFraction=2, -XX:InitialRAMFraction=2, -XX:MinRAMFraction=2, -Dmapper.allow_dots_in_name=true, -Xms8192m, -Xmx8192m, -XX:HeapDumpPath=/elasticsearch/persistent/heapdump.hprof, -Dsg.display_lic_none=false, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.type=unpooled, -Des.path.home=/usr/share/elasticsearch]
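
For reference, one way to spot-check those flags on a running pod, assuming the Elasticsearch container in the pod is named "elasticsearch" (the pod name below is a placeholder):

    # Hypothetical check: pull the startup "JVM arguments" line from the pod
    # log and show only the io.netty flags. Pod and container names are
    # assumptions, not taken from this bug.
    oc logs logging-es-data-master-a63oo25y-1-xxxxx -c elasticsearch \
        | grep 'JVM arguments' | grep -o -- '-Dio\.netty\.[^,]*'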

Comment 14 errata-xmlrpc 2018-11-20 03:10:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537