1627086 – ElasticSearch pods flapping with "fatal error on the network layer" exception when logging from 1000+ nodes

Bug 1627086 - ElasticSearch pods flapping with "fatal error on the network layer" exception when logging from 1000+ nodes

Summary: ElasticSearch pods flapping with "fatal error on the network layer" exception...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Jeff Cantrill
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:	aos-scalability-311
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2018-09-10 12:05 UTC by Mike Fiedler
Modified:	2018-11-20 03:11 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The Netty dependency does not make efficient use of the heap Consequence: Elasticsearch begins to fail on the network layer at high logging volume Fix: Disable Netty Recycler Result: Elasticsearch is more efficient in processing connections
Clone Of:
Environment:
Last Closed:	2018-11-20 03:10:43 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
pod logs, logging-es log, deployment config, system log for 2 nodes (5.86 MB, application/x-gzip) 2018-09-10 12:07 UTC, Mike Fiedler	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift origin-aggregated-logging pull 1390	'None'	closed	bug 1627086. Add configuration to disable Netty recycler	2020-12-17 06:23:06 UTC
Github	openshift origin-aggregated-logging pull 1416	'None'	closed	[release-3.11] bug 1627086. Add configuration to disable Netty recycler	2020-12-17 06:23:06 UTC
Red Hat Product Errata	RHBA-2018:3537	None	None	None	2018-11-20 03:11:48 UTC

Description Mike Fiedler 2018-09-10 12:05:04 UTC

Description of problem:

In 3.11, when trying to log from 1000 or more nodes, the elasticsearch pods crash and restart regularly with a "fatal error on the network layer" exception.  Stack trace in Additional info and full logs attached.

The cluster has 3 elasticsearch nodes.  Each node has 40 vCPU and 164GB RAM.   The cpu limit for each ES dc is 61GB.   At the time the crash occurs, there are no oom_kill reports, no OutOfMemory exceptions and each pod is using 45-55GB of memory.  I'll link internal pbench data in a private comment.

At 2000 nodes, the elasticsearch pods flap even with no load other than operations logging.   Just starting 2000 fluentd pods causes the issue

With 1000 fluentd started, elasticsearch is stable until a pod logging load is started.   At 100 messages per second per fluentd 2 out of the 3 elasticsearch pods crashed and restarted during a 30 minute run.


Version-Release number of selected component (if applicable): 3.11.0-0.28.0


How reproducible: Always, when starting 2000 fluentd or when logging at a moderate rate with 1000 fluentd




Additional info:

[2018-09-09T22:46:42,555][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:184)
        at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.exceptionCaught(Netty4HttpRequestHandler.java:89)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:68)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1273)
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1084)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)

Comment 1 Mike Fiedler 2018-09-10 12:07:49 UTC

Created attachment 1482113 [details]
pod logs, logging-es log, deployment config, system log for 2 nodes

pod logs, logging-es log, deployment config, system log for 2 nodes that hit this issue during a 1000 fluentd, 100 messages per second per fluentd test.

Comment 2 Mike Fiedler 2018-09-10 12:11:42 UTC

A search shows a lot of hits where this is preceded by an OutOfMemory exception which is not the case here. 

https://discuss.elastic.co/t/faced-fatal-error-on-the-network-layer-and-fatal-error-in-thread-error/105882 suggests a variant of this is fixed in 5.6.4, but we're running 5.6.10.

https://github.com/elastic/elasticsearch/issues/22406 suggests trying to run with -Dio.netty.recycler.maxCapacityPerThread=0.   I will do that when I get time on the gear again.

Comment 4 Jeff Cantrill 2018-09-10 17:38:51 UTC

Additional settings from the PR:

https://github.com/elastic/elasticsearch/pull/22452/files

That should also be included

Comment 5 Mike Fiedler 2018-09-12 16:01:50 UTC

Since setting the java opts from comment 4 I have not seen this exception.

Comment 6 Rich Megginson 2018-09-12 16:13:25 UTC

(In reply to Mike Fiedler from comment #5)
> Since setting the java opts from comment 4 I have not seen this exception.

So what do we need to do to fix this bug?  Change the default ES configuration to include those settings?  Will those settings negatively impact small/all-in-one deployments?

Comment 7 Mike Fiedler 2018-09-12 17:08:11 UTC

@richm  Have a look at the comments in these two PRs:

https://github.com/elastic/elasticsearch/pull/22452
https://github.com/elastic/elasticsearch/pull/24793

It seems netty has caused ES a lot of grief and they've disabled it in 6.x.   Setting these options in our ES config will accomplish the same thing on 5.x.   There should be minimal or no impact.

Comment 8 Jeff Cantrill 2018-10-09 14:46:01 UTC

PR for master https://github.com/openshift/origin-aggregated-logging/pull/1390

Comment 9 openshift-github-bot 2018-10-15 21:00:02 UTC

Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/7912f2ff318d7169f9b78c60b82b76329b102c47
bug 1627086. Add configuration to disable Netty recycler

https://github.com/openshift/origin-aggregated-logging/commit/2aacf9be390c298de28c89802e14ab4fc6503844
Merge pull request #1390 from jcantrill/bz1627086

bug 1627086. Add configuration to disable Netty recycler

Comment 10 Jeff Cantrill 2018-10-15 21:05:59 UTC

3.11 backport https://github.com/openshift/origin-aggregated-logging/pull/1416

Comment 12 Mike Fiedler 2018-11-01 20:40:10 UTC

Verified in logging-es.log that all 4 options in https://github.com/elastic/elasticsearch/pull/22452/files are set

Verified on 3.11.36


[2018-11-01T20:19:18,222][INFO ][o.e.n.Node               ] [logging-es-data-master-a63oo25y] JVM arguments [-XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -XX:+UnlockExperimentalVMOptions, -XX:+UseCGroupMemoryLimitForHeap, -XX:MaxRAMFraction=2, -XX:InitialRAMFraction=2, -XX:MinRAMFraction=2, -Dmapper.allow_dots_in_name=true, -Xms8192m, -Xmx8192m, -XX:HeapDumpPath=/elasticsearch/persistent/heapdump.hprof, -Dsg.display_lic_none=false, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.type=unpooled, -Des.path.home=/usr/share/elasticsearch]

Comment 14 errata-xmlrpc 2018-11-20 03:10:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537

Note You need to log in before you can comment on or make changes to this bug.