Bug 1616169 - Elasticsearch logging missing rollover and max size params, which caused an out-of-disk error.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.0
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-15 07:30 UTC by Junqi Zhao
Modified: 2019-03-06 01:33 UTC
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Elasticsearch 5's log4j2.properties file did not contain a size-based rollover or maximum-file-count configuration. Consequence: ES logs would continue to roll over and be kept, causing the pod to run out of local storage. Fix: Add a rollover policy based on file size and define a maximum file count. Result: Files are correctly rolled over based on size and date, and removed once the maximum count has been reached.
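For reference, the parameters quoted in comment 15 (from the openshift-ansible fix) extend the rolling appender in Elasticsearch 5's log4j2.properties. A minimal sketch of the added block, assuming the stock appender name `rolling` that appears in the stack trace below:

```properties
# Trigger a rollover when the current log file reaches 100MB,
# in addition to the existing time-based (daily) policy.
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 100MB

# Keep at most 5 rolled-over files; the oldest is deleted on the next rollover.
appender.rolling.strategy.type = DefaultRolloverStrategy
appender.rolling.strategy.max = 5
```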
Clone Of:
Environment:
Last Closed: 2018-10-11 07:24:57 UTC
Target Upstream Version:
Embargoed:


Attachments
IOException: No space left on device (43.47 KB, text/plain)
2018-08-15 07:32 UTC, Junqi Zhao


Links
Red Hat Product Errata RHBA-2018:2652 (last updated 2018-10-11 07:25:28 UTC)

Description Junqi Zhao 2018-08-15 07:30:57 UTC
Description of problem:
On the free-int cluster, one ES pod is in CrashLoopBackOff status; its logs show the exception "Caused by: java.io.IOException: No space left on device".

logging-es-data-master-f26an8nv-17-v24hw   2/2       Running            2          11d
logging-es-data-master-t7rrl3te-10-5z8cx   2/2       Running            2          11d
logging-es-data-master-w6l9n07t-10-lvvc4   1/2       CrashLoopBackOff   151        11d

[2018-08-15 03:24:41,596][INFO ][container.run            ] Checking if Elasticsearch is ready on https://localhost:9200
2018-08-15 03:24:45,111 main ERROR Unable to write to stream /elasticsearch/persistent/logging-es/logs/logging-es.log for appender rolling: org.apache.logging.log4j.core.appender.AppenderLoggingException: Error writing to stream /elasticsearch/persistent/logging-es/logs/logging-es.log
2018-08-15 03:24:45,113 main ERROR An exception occurred processing Appender rolling org.apache.logging.log4j.core.appender.AppenderLoggingException: Error writing to stream /elasticsearch/persistent/logging-es/logs/logging-es.log
	at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:264)
	at org.apache.logging.log4j.core.appender.FileManager.writeToDestination(FileManager.java:261)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.writeToDestination(RollingFileManager.java:219)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:294)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:303)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:179)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
	at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:308)
	at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
	at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
	at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
	at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
	at org.apache.logging.log4j.spi.ExtendedLoggerWrapper.logMessage(ExtendedLoggerWrapper.java:217)
	at org.elasticsearch.common.logging.PrefixLogger.logMessage(PrefixLogger.java:102)
	at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2116)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2100)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1994)
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1966)
	at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1303)
	at org.elasticsearch.node.Node.<init>(Node.java:254)
	at org.elasticsearch.node.Node.<init>(Node.java:245)
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233)
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233)
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342)
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132)
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:123)
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:70)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:134)
	at org.elasticsearch.cli.Command.main(Command.java:90)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84)
Caused by: java.io.IOException: No space left on device
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:262)
	... 37 more
Version-Release number of selected component (if applicable):
logging images version: v3.11.0-0.10.0

How reproducible:
Always

Steps to Reproduce:
1. Check ES pod logs.

Actual results:
"java.io.IOException: No space left on device" for one ES pod

Expected results:
ES log files are rolled over based on size and pruned once the maximum count is reached; the pod does not run out of disk.

Additional info:

Comment 1 Junqi Zhao 2018-08-15 07:32:14 UTC
Created attachment 1476062 [details]
IOException: No space left on device

Comment 2 Jeff Cantrill 2018-08-16 20:02:33 UTC
There is an issue with the available node and the ES pod in question:

1 node(s) were not ready, 1 node(s) were out of disk space, 15 Insufficient memory, 16 node(s) didn't match node selector, 2 node(s) were unschedulable, 8 Insufficient cpu.

Comment 3 Justin Pierce 2018-08-16 21:26:58 UTC
I managed to get access to the ES PV and wiped it clean. The pod is now running:

[root@free-int-master-3c664 ~]# oc get pods
NAME                                       READY     STATUS             RESTARTS   AGE
logging-es-data-master-f26an8nv-17-v24hw   2/2       Running            6          13d
logging-es-data-master-t7rrl3te-10-5z8cx   2/2       Running            6          12d
logging-es-data-master-w6l9n07t-10-wqnxj   2/2       Running            0          37m

Comment 9 Jeff Cantrill 2018-08-21 16:28:10 UTC
Moving component and updating title to reflect missing rollover params in the deployment.

Comment 11 Anping Li 2018-08-29 10:56:11 UTC
Move to verified.

1) The logs are rolling as expected when log4j2.properties is configured.


-rw-r--r--. 1 1000120000 1000120000 8.4M Aug 29 10:53 anlitest.log
-rw-r--r--. 1 1000120000 1000120000  11M Aug 29 10:53 logging-es-2018-08-29.log
-rw-r--r--. 1 1000120000 1000120000 2.9K Aug 29 10:48 logging-es_deprecation.log
-rw-r--r--. 1 1000120000 1000120000    0 Aug 29 10:47 logging-es_index_indexing_slowlog.log
-rw-r--r--. 1 1000120000 1000120000    0 Aug 29 10:47 logging-es_index_search_slowlog.log
[anli@upg_slave_qeos10 311rsyslog]$ oc rsh logging-es-data-master-4y5wzs8t-1-4pvvp ls -lh /elasticsearch/persistent/logging-es/logs
Defaulting container name to elasticsearch.
Use 'oc describe pod/logging-es-data-master-4y5wzs8t-1-4pvvp -n openshift-logging' to see all of the containers in this pod.
total 11M
-rw-r--r--. 1 1000120000 1000120000 336K Aug 29 10:53 anlitest.log

Comment 13 errata-xmlrpc 2018-10-11 07:24:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

Comment 15 Jeff Cantrill 2019-02-06 14:12:42 UTC
(In reply to Aditya Deshpande from comment #14)
> Hello,
> 
> IHAC using OCP 3.9 and facing a similar issue.
> Do we have to backport of this fix?

This was introduced into 3.9 as part of:  https://bugzilla.redhat.com/show_bug.cgi?id=1568361

> 
> Also, the customer is asking that as per solution described here
> https://github.com/openshift/openshift-ansible/pull/9663/files#diff-
> 60a291fe55d2965aefe7aa6e5018f658
> Do they need to append the following params on ES config-map, section
> logging.yml and re-deploy the POD?
> appender.rolling.policies.size.type=SizeBasedTriggeringPolicy
> appender.rolling.policies.size.size=100MB
> appender.rolling.strategy.type=DefaultRolloverStrategy
> appender.rolling.strategy.max=5
> 
> Or is there any workaround possible as the customer does not want to upgrade
> the cluster?

You would have to either manually modify the logging-elasticsearch configmap to include the proper configuration block [1] and restart, or set environment variables and restart per [2].

[1] https://github.com/jcantrill/openshift-log4jextras
[2] https://github.com/openshift/origin-aggregated-logging/pull/1127/files#diff-05261c9e7776c31e9b2fed5a68db6a3aR43

