Bug 1432607

Summary: Elasticsearch no longer logging to a local file in a pod
Product: OpenShift Container Platform Reporter: Peter Portante <pportant>
Component: Logging Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA QA Contact: Xia Zhao <xiazhao>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.4.1 CC: anli, aos-bugs, kurktchiev, pportant, rmeggins, xiazhao
Target Milestone: ---   
Target Release: 3.7.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: ES logs to the console.
Consequence: ES could end up in a feedback loop, ingesting its own logs.
Fix: Turn off console logging in favor of file logging.
Result: The feedback loop is broken, but users will need to set up an ES log volume with file rotation to get ES logs. Additionally, 'oc logs' against an ES pod will no longer be sufficient to retrieve ES pod logs.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-11-28 21:53:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments (flags: none for all):
* file logs when openshift_logging_es_log_appenders is set to "console"
* console logs when openshift_logging_es_log_appenders is set to "console"
* console logs when openshift_logging_es_log_appenders is set to "file"
* file logs when openshift_logging_es_log_appenders is left at the default "file"

Description Peter Portante 2017-03-15 18:57:08 UTC
At UNC, it appears that with openshift 3.4.1-7 installed we do not have a local-to-the-pod log file for Elasticsearch. Only "console" logging is configured.

We should actually disable console logging for Elasticsearch and log locally to a file:

Like:

# you can override this using by setting a system property, for example -Des.logger.level=DEBUG
es.logger.level: INFO
rootLogger: ${es.logger.level}, file
logger:
  # log action execution errors for easier debugging
  action: DEBUG

  # deprecation logging, turn to DEBUG to see them
  deprecation: INFO, deprecation_log_file

  # reduce the logging for aws, too much is logged under the default INFO
  com.amazonaws: WARN
  # aws will try to do some sketchy JMX stuff, but it's not needed.
  com.amazonaws.jmx.SdkMBeanRegistrySupport: ERROR
  com.amazonaws.metrics.AwsSdkMetrics: ERROR

  org.apache.http: INFO

  index.search.slowlog: TRACE, file
  index.indexing.slowlog: TRACE, file

additivity:
  index.search.slowlog: false
  index.indexing.slowlog: false
  deprecation: false

appender:
  # Use the following log4j-extras RollingFileAppender to enable gzip compression of log files. 
  # For more information see https://logging.apache.org/log4j/extras/apidocs/org/apache/log4j/rolling/RollingFileAppender.html
  file:
    type: extrasRollingFile
    file: ${path.logs}/${cluster.name}.log
    rollingPolicy: timeBased
    rollingPolicy.FileNamePattern: ${path.logs}/${cluster.name}.log.%d{yyyy-MM-dd}.gz
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  deprecation_log_file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}_deprecation.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

Comment 1 Jeff Cantrill 2017-03-15 19:37:46 UTC
We need to weigh the pros and cons of doing this.  This will essentially disable using 'oc logs' for the ES pods.  If we do this, we should:

* Add docs to say "get es logs like: oc exec COMMAND HERE" (see the illustrative sketch below)
* Modify fluentd to ignore ES logs
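
(For illustration only, a hedged sketch of what such a docs snippet could look like; the pod name is a placeholder, and the log path is taken from later comments in this bug, not a confirmed command:)

  # read the ES log file directly from the pod (pod name is a placeholder)
  oc exec <es-pod-name> -- cat /elasticsearch/logging-es/logs/logging-es.log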

Comment 2 Boris Kurktchiev 2017-03-15 19:58:02 UTC
Shouldn't it at least be an optional thing that can be toggled with a configmap option?

Comment 4 Rich Megginson 2017-03-15 20:27:05 UTC
We can at least provide some documentation for when to do this (advanced debugging of Elasticsearch, problems with feedback loops, too many ES log messages, too many ES Java stack traces, etc.), and how to do it (oc edit configmap logging-elasticsearch, change the logging.yml, oc rollout latest dc/logging-es-xxx).
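
(A minimal sketch of that procedure, using the object names mentioned above; the dc suffix is a placeholder, check the actual name with 'oc get dc':)

  # adjust the appenders in logging.yml inside the configmap
  oc edit configmap logging-elasticsearch

  # redeploy ES so the updated logging config is picked up
  oc rollout latest dc/logging-es-<suffix>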

Comment 5 Peter Portante 2017-03-16 01:53:35 UTC
So the problem is that logs from Elasticsearch itself are fed back into Elasticsearch, leaving open a nasty feedback loop.  If Elasticsearch starts behaving badly and we need to debug it, are we going to rely on logs flowing through fluentd, where we have the potential to miss logs that might get dropped during that bad-behavior period?  Won't that hamper debugging efforts?  It has already been shown that not having the local log file prevented us from seeing current bad behaviors.

Isn't it the case that when everything is working well, the logs for Elasticsearch are not needed?  Aren't they needed when things are bad?

So if we don't have a local log file for Elasticsearch logs, only the console output, aren't we opening up the feedback loop and depriving ourselves of additional information to debug its behavior?

If we make it a configmap option, when we need the local log file, won't that potentially be too late, since we'll have to restart that pod and miss capturing the behavior in that state?

Let's drop console logging altogether.  Let's log locally only for Elasticsearch pods.  Arguably we should prevent stdout/stderr from flowing through to the journal as well from the Elasticsearch instances.

The only other consideration, it would seem, would be to ensure we have those pod logs persisted and rotated appropriately.

Further, if Elasticsearch pods are not logging to stdout/stderr and only logging to a local file, then no modification would be needed for fluentd to ignore Elasticsearch logs, right?

Then the documentation just notes that there will be no logs from Elasticsearch pods in "oc logs", and the admin should pick the log rollover period for the pods.

And again, this is for the single Elasticsearch cluster.

When an ops Elasticsearch cluster is used, you CAN have logs from the app Elasticsearch cluster flow to the ops Elasticsearch cluster, but then the ops Elasticsearch cluster would only be logging locally.

Does that make sense?

Comment 6 Jeff Cantrill 2017-03-16 13:07:15 UTC
Doing some research; we need to level-set on what we want from this issue.  We already put the ES logging config in a configmap.  This is 3.4 and is done via the deployer:

https://github.com/openshift/origin-aggregated-logging/blob/release-1.4/deployer/scripts/install.sh#L272-L275

And is mounted into the pod:

https://github.com/openshift/origin-aggregated-logging/blob/release-1.4/deployer/scripts/install.sh#L272-L275

So if we just need to modify what's on and what's off, then that will be the defaults for 3.6 in the ansible code.

Comment 7 Peter Portante 2017-03-16 15:35:44 UTC
I think we just need to modify the base logging file, see PR https://github.com/openshift/origin-aggregated-logging/pull/349

Comment 8 Rich Megginson 2017-03-20 21:30:19 UTC
My only concern is whether openshift admins expect `oc logs` and the openshift console to display the Elasticsearch logs.  If this is a requirement, then we will have to change fluentd to "throw away" Elasticsearch pod logs, or send them to some alternate sink.

Comment 9 Jeff Cantrill 2017-03-27 19:22:03 UTC
The installer fix:

https://github.com/openshift/openshift-ansible/pull/3777

Pending the 3.5 docs fix being merged so we can make a PR against the ansible docs.  It would be nice to move this into 3.5, but it's not a blocker bug for 3.5 since there is a workaround: manually edit the configmap.

Comment 10 openshift-github-bot 2017-03-31 14:01:49 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/a222702ed4316f99eac497ac755c5349804c66d2
bug 1432607.  Allow configuration of ES log destination

Comment 11 Xia Zhao 2017-05-12 10:34:07 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1439451

Comment 12 Xia Zhao 2017-06-07 09:48:45 UTC
Tested with both values of the parameter `openshift_logging_es_log_appenders` (the default 'file' and the non-default 'console'), and didn't find any difference here:

Got the same test result in both of the above situations (see the command sketch after the list):
1) full ES logs were printed to the CLI console when cluster-admin ran the command "$ oc logs ${es-master-pod-name}" on the openshift master

2) ES logs were visible at /elasticsearch/logging-es/logs/logging-es.log after a remote shell into the ES master container: "# oc rsh logging-es-data-master"

3) the logs seen in the "file" (as described in 2)) are identical to those seen from the "console" (as described in 1))
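
(A sketch of the two checks above; the pod name is a placeholder for this cluster:)

  # console: ES logs visible through the API server
  oc logs <es-pod-name>

  # file: ES logs written inside the pod
  oc rsh <es-pod-name>
  # then, inside the pod:
  cat /elasticsearch/logging-es/logs/logging-es.log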

Could you please let us know what the difference between these 2 settings is? Also, I have to reopen the bz since "ES is still logging to a local file in a pod", which is the opposite of this bug's title. Please correct me if I'm wrong, thanks!

Test env:
# openshift version
openshift v3.6.96
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

Images tested with: brew registry with tag v3.6 -- currently can't get the exact image IDs due to https://bugzilla.redhat.com/show_bug.cgi?id=1455472

Comment 13 Xia Zhao 2017-06-07 09:55:03 UTC
Created attachment 1285729 [details]
file logs when openshift_logging_es_log_appenders is set to "console"

Comment 14 Xia Zhao 2017-06-07 09:55:33 UTC
Created attachment 1285730 [details]
console logs when openshift_logging_es_log_appenders is set to "console"

Comment 15 Xia Zhao 2017-06-07 09:56:07 UTC
Created attachment 1285731 [details]
console logs when openshift_logging_es_log_appenders is set to "file"

Comment 16 Xia Zhao 2017-06-07 09:56:44 UTC
Created attachment 1285732 [details]
file logs when openshift_logging_es_log_appenders is default to "file"

Comment 17 Jeff Cantrill 2017-06-07 21:07:58 UTC
These values are not mutually exclusive.  The ansible variable is a list and can be set in the inventory file like:

openshift_logging_es_log_appenders=['console','file']

This means 'send logs to both the console and file appenders'.

This default was set to 'file' because @peter realized while working with customers that ES can get into a very bad state, since it is possible for it to index its own logs.  This setting allows 'oc logs' to still function as expected if console logging is specified.  Based on your comment #12, it seems to describe how I would expect the ES logs to function.
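
(For example, a possible inventory snippet; the [OSEv3:vars] placement is an assumption about a typical openshift-ansible inventory, not taken from this bug:)

  [OSEv3:vars]
  # send ES logs to both appenders; drop 'console' to avoid the feedback loop
  openshift_logging_es_log_appenders=['console','file']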

Comment 18 Xia Zhao 2017-06-08 00:55:22 UTC
@Jeff,

Thanks for the explanations. Now I realize I should have set

openshift_logging_es_log_appenders=['console']

in my inventory file, instead of 

openshift_logging_es_log_appenders=console

as I did in comment #12. It seems that way I was actually always using the default value ['file'] there. Will give this another try today, to see if ES is still logging to the local file /elasticsearch/logging-es/logs/logging-es.log in the ['console'] case.

Comment 19 Xia Zhao 2017-06-08 09:43:31 UTC
Verified on 3.6.99 with ansible version openshift-ansible-playbooks-3.6.89.3-1.git.0.6e50d9a.el7.noarch: no matter whether the console or file log appender was set in the inventory like:
openshift_logging_es_log_appenders=['console']
or
openshift_logging_es_log_appenders=['file']
or 
the default value,

the following ES-specific logs are only printed to the console when running the "oc logs" CLI; they no longer exist in the local file inside the ES pod, /elasticsearch/logging-es/logs/logging-es.log

* About to connect() to localhost port 9200 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 9200 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/elasticsearch/secret/admin-ca
  CApath: none
* NSS: client certificate from file
* 	subject: CN=system.admin,OU=OpenShift,O=Logging
* 	start date: Jun 07 17:01:42 2017 GMT
* 	expire date: Jun 07 17:01:42 2019 GMT
* 	common name: system.admin
* 	issuer: CN=logging-signer-test
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=logging-es,OU=OpenShift,O=Logging
* 	start date: Jun 07 17:02:04 2017 GMT
* 	expire date: Jun 07 17:02:04 2019 GMT
* 	common name: logging-es
* 	issuer: CN=logging-signer-test
> PUT /_template/com.redhat.viaq-openshift-operations.template.json HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:9200
> Accept: */*
> Content-Length: 16125
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
> 
< HTTP/1.1 100 Continue
} [data not shown]
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 21
< 
{ [data not shown]
* Connection #0 to host localhost left intact
{"acknowledged":true}Create index template 'com.redhat.viaq-openshift-project.template.json'
* About to connect() to localhost port 9200 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 9200 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/elasticsearch/secret/admin-ca
  CApath: none
* NSS: client certificate from file
* 	subject: CN=system.admin,OU=OpenShift,O=Logging
* 	start date: Jun 07 17:01:42 2017 GMT
* 	expire date: Jun 07 17:01:42 2019 GMT
* 	common name: system.admin
* 	issuer: CN=logging-signer-test
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=logging-es,OU=OpenShift,O=Logging
* 	start date: Jun 07 17:02:04 2017 GMT
* 	expire date: Jun 07 17:02:04 2019 GMT
* 	common name: logging-es
* 	issuer: CN=logging-signer-test
> PUT /_template/com.redhat.viaq-openshift-project.template.json HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:9200
> Accept: */*
> Content-Length: 16121
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
> 
< HTTP/1.1 100 Continue
} [data not shown]
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 21
< 
{ [data not shown]
* Connection #0 to host localhost left intact

Comment 24 errata-xmlrpc 2017-11-28 21:53:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188