Bug 1616169

Summary:

Elasticsearch logging is missing rollover and max size params, which caused an out-of-disk error.

Product:

OpenShift Container Platform

Reporter:

Junqi Zhao <juzhao>

Component:

Logging

Assignee:

ewolinet

Status:

CLOSED ERRATA

QA Contact:

Anping Li <anli>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

3.11.0

CC:

adeshpan, aos-bugs, ewolinet, jcantril, jupierce, rmeggins

Target Milestone:

---

Target Release:

3.11.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Cause: Elasticsearch 5's log4j2.properties file did not contain a size-based rollover policy or a maximum file count.
Consequence: ES logs continued to roll over and were never removed, causing the pod to run out of local storage.
Fix: Add a rollover policy based on file size and define a maximum file count.
Result: Files are rolled over based on size and date and removed once the maximum count has been reached.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-10-11 07:24:57 UTC

Type:

Bug

Attachments:

Description	Flags
IOException: No space left on device	none

Description Junqi Zhao 2018-08-15 07:30:57 UTC

Description of problem:
On the free-int cluster, one ES pod is in CrashLoopBackOff status. Its logs show the exception "Caused by: java.io.IOException: No space left on device".

logging-es-data-master-f26an8nv-17-v24hw   2/2       Running            2          11d
logging-es-data-master-t7rrl3te-10-5z8cx   2/2       Running            2          11d
logging-es-data-master-w6l9n07t-10-lvvc4   1/2       CrashLoopBackOff   151        11d

[2018-08-15 03:24:41,596][INFO ][container.run            ] Checking if Elasticsearch is ready on https://localhost:9200
2018-08-15 03:24:45,111 main ERROR Unable to write to stream /elasticsearch/persistent/logging-es/logs/logging-es.log for appender rolling: org.apache.logging.log4j.core.appender.AppenderLoggingException: Error writing to stream /elasticsearch/persistent/logging-es/logs/logging-es.log
2018-08-15 03:24:45,113 main ERROR An exception occurred processing Appender rolling org.apache.logging.log4j.core.appender.AppenderLoggingException: Error writing to stream /elasticsearch/persistent/logging-es/logs/logging-es.log
	at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:264)
	at org.apache.logging.log4j.core.appender.FileManager.writeToDestination(FileManager.java:261)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.writeToDestination(RollingFileManager.java:219)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:294)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:303)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:179)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
	at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:308)
	at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
	at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
	at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
	at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
	at org.apache.logging.log4j.spi.ExtendedLoggerWrapper.logMessage(ExtendedLoggerWrapper.java:217)
	at org.elasticsearch.common.logging.PrefixLogger.logMessage(PrefixLogger.java:102)
	at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2116)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2100)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1994)
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1966)
	at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1303)
	at org.elasticsearch.node.Node.<init>(Node.java:254)
	at org.elasticsearch.node.Node.<init>(Node.java:245)
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233)
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233)
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342)
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132)
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:123)
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:70)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:134)
	at org.elasticsearch.cli.Command.main(Command.java:90)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84)
Caused by: java.io.IOException: No space left on device
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:262)
	... 37 more

Version-Release number of selected component (if applicable):
logging images version: v3.11.0-0.10.0

How reproducible:
Always

Steps to Reproduce:
1. Check ES pod logs.

Actual results:
"java.io.IOException: No space left on device" for one es pod

Expected results:


Additional info:

Comment 1 Junqi Zhao 2018-08-15 07:32:14 UTC

Created attachment 1476062
IOException: No space left on device

Comment 2 Jeff Cantrill 2018-08-16 20:02:33 UTC

There is an issue with the available node and the ES pod in question:

1 node(s) were not ready, 1 node(s) were out of disk space, 15 Insufficient memory, 16 node(s) didn't match node selector, 2 node(s) were unschedulable, 8 Insufficient cpu.

Comment 3 Justin Pierce 2018-08-16 21:26:58 UTC

I managed to get access to the ES PV and wiped it clean. The pod is now running:

[root@free-int-master-3c664 ~]# oc get pods
NAME                                       READY     STATUS             RESTARTS   AGE
logging-es-data-master-f26an8nv-17-v24hw   2/2       Running            6          13d
logging-es-data-master-t7rrl3te-10-5z8cx   2/2       Running            6          12d
logging-es-data-master-w6l9n07t-10-wqnxj   2/2       Running            0          37m

Comment 8 ewolinet 2018-08-17 23:36:08 UTC

https://github.com/openshift/openshift-ansible/pull/9663

Comment 9 Jeff Cantrill 2018-08-21 16:28:10 UTC

Moving component and updating title to reflect missing rollover params in the deployment.

Comment 11 Anping Li 2018-08-29 10:56:11 UTC

Move to verified.

1) The logs roll over as expected when log4j2.properties is configured.


-rw-r--r--. 1 1000120000 1000120000 8.4M Aug 29 10:53 anlitest.log
-rw-r--r--. 1 1000120000 1000120000  11M Aug 29 10:53 logging-es-2018-08-29.log
-rw-r--r--. 1 1000120000 1000120000 2.9K Aug 29 10:48 logging-es_deprecation.log
-rw-r--r--. 1 1000120000 1000120000    0 Aug 29 10:47 logging-es_index_indexing_slowlog.log
-rw-r--r--. 1 1000120000 1000120000    0 Aug 29 10:47 logging-es_index_search_slowlog.log
[anli@upg_slave_qeos10 311rsyslog]$ oc rsh logging-es-data-master-4y5wzs8t-1-4pvvp ls -lh /elasticsearch/persistent/logging-es/logs
Defaulting container name to elasticsearch.
Use 'oc describe pod/logging-es-data-master-4y5wzs8t-1-4pvvp -n openshift-logging' to see all of the containers in this pod.
total 11M
-rw-r--r--. 1 1000120000 1000120000 336K Aug 29 10:53 anlitest.log

Comment 13 errata-xmlrpc 2018-10-11 07:24:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

Comment 15 Jeff Cantrill 2019-02-06 14:12:42 UTC

(In reply to Aditya Deshpande from comment #14)
> Hello,
> 
> IHAC using OCP 3.9 and facing a similar issue.
> Do we have to backport of this fix?

This fix was introduced into 3.9 as part of https://bugzilla.redhat.com/show_bug.cgi?id=1568361

> 
> Also, the customer is asking that as per solution described here
> https://github.com/openshift/openshift-ansible/pull/9663/files#diff-60a291fe55d2965aefe7aa6e5018f658
> Do they need to append the following params on ES config-map, section
> logging.yml and re-deploy the POD?
> appender.rolling.policies.size.type=SizeBasedTriggeringPolicy
> appender.rolling.policies.size.size=100MB
> appender.rolling.strategy.type=DefaultRolloverStrategy
> appender.rolling.strategy.max=5
> 
> Or is there any workaround possible as the customer does not want to upgrade
> the cluster?

Either manually modify the logging-elasticsearch configmap to include the proper configuration block [1] and restart, or set environment variables and restart per [2].

[1] https://github.com/jcantrill/openshift-log4jextras
[2] https://github.com/openshift/origin-aggregated-logging/pull/1127/files#diff-05261c9e7776c31e9b2fed5a68db6a3aR43
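
For illustration only, a minimal sketch of what a rolling appender with both time-based and size-based rollover could look like in log4j2.properties. The four size and strategy lines are the values quoted from comment #14. Everything else (appender name, ${sys:es.logs} file name, layout, the %i counter in filePattern, the daily time policy) is an assumption modeled on Elasticsearch 5 defaults and may differ from what the linked pull request or the shipped configmap contains. This is a fragment, not a complete logging configuration.

```properties
# Sketch only: appender name, file name, layout, and filePattern are assumed,
# not copied from the shipped logging-elasticsearch configmap.
appender.rolling.type = RollingFile
appender.rolling.name = rolling
appender.rolling.fileName = ${sys:es.logs}.log
appender.rolling.layout.type = PatternLayout
appender.rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c] %.10000m%n
# %i is the counter incremented by size-based rollovers (assumed pattern).
appender.rolling.filePattern = ${sys:es.logs}-%d{yyyy-MM-dd}-%i.log
appender.rolling.policies.type = Policies
# Daily rollover (assumed).
appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval = 1
appender.rolling.policies.time.modulate = true
# Values quoted in comment #14: roll over at 100MB, cap the counter at 5.
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 100MB
appender.rolling.strategy.type = DefaultRolloverStrategy
appender.rolling.strategy.max = 5
```

How `max` interacts with the date portion of the file pattern depends on the log4j2 version in the image, so check the retention behavior on a test pod (for example by listing /elasticsearch/persistent/logging-es/logs as in comment 11) before relying on it.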