Bug 1515448 - [3.7] Aggregated Logging replacing all log levels with '3' and '6' after upgrade to 3.5 from 3.4
Summary: [3.7] Aggregated Logging replacing all log levels with '3' and '6' after upgr...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.7.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
: 3.7.z
Assignee: Rich Megginson
QA Contact: Anping Li
URL:
Whiteboard:
Depends On: 1514110 1515447
Blocks:
 
Reported: 2017-11-20 19:40 UTC by Rich Megginson
Modified: 2018-04-05 09:32 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Fluentd was adding the `level` field with a value of 3 or 6, overwriting any existing `level` field.
Consequence: The `level` field set by the application was being removed, and the '3' or '6' value was not useful.
Fix: If there is already a `level` field in the record, check whether it is a "close" match to one of the canonical `level` field values at https://github.com/ViaQ/elasticsearch-templates/blob/master/namespaces/_default_.yml#L63 - e.g. if `level` is "CRITICAL", convert it to "crit"; if `level` is "WARN", convert it to "warning"; etc. Otherwise, if the value cannot be used directly or normalized, convert it to its string representation (Ruby `to_s` method) and store the string value in the `level` field.
Result: If the record already has a `level` field, the value is normalized or preserved; otherwise, a value like `info` or `err` is used.
Clone Of: 1515447
Environment:
Last Closed: 2018-04-05 09:32:08 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github openshift origin-aggregated-logging pull 791 None None None 2017-11-20 19:52:30 UTC
Github openshift origin-aggregated-logging pull 810 None None None 2017-12-04 18:48:07 UTC
Red Hat Product Errata RHBA-2018:0636 None None None 2018-04-05 09:32:45 UTC

Description Rich Megginson 2017-11-20 19:40:16 UTC
+++ This bug was initially created as a clone of Bug #1515447 +++

+++ This bug was initially created as a clone of Bug #1514110 +++

Description of problem:
After upgrading the cluster from 3.4 to 3.5, all message log levels (as seen in Kibana/Elasticsearch) are either '3' or '6'. This makes solving issues harder, as filtering on errors etc. is now not possible in Kibana.

Version-Release number of selected component (if applicable):
openshift v3.5.5.31.36
kubernetes v1.5.2+43a9be4
openshift3/logging-elasticsearch:v3.5
openshift3/logging-fluentd:v3.5
openshift3/logging-kibana:v3.5

How reproducible:
Unconfirmed

Steps to Reproduce:
Unconfirmed

Actual results:
Messages contain 3 or 6 (numerals) as log levels.

Expected results:
Messages should contain (at least) the usual INFO, WARN, ERROR levels.

Additional info:

--- Additional comment from Rich Megginson on 2017-11-16 16:43:31 EST ---

In 3.6 and later we are strictly enforcing the use of the `level` field as defined in our data model: https://github.com/ViaQ/elasticsearch-templates/blob/master/namespaces/_default_.yml#L63

"Logging level as provided by: rsyslog (severitytext property), python's logging module, etc.
Possible values are as listed here: http://sourceware.org/git/?p=glibc.git;a=blob;f=misc/sys/syslog.h;h=ee01478c4b19a954426a96448577c5a76e6647c0;hb=HEAD#l74 plus `trace` and `unknown`.
That is: alert crit debug emerg err info notice trace unknown warning"

systemd PRIORITY levels are mapped here: https://github.com/ViaQ/fluent-plugin-viaq_data_model/blob/master/lib/fluent/plugin/filter_viaq_data_model_systemd.rb#L107

basically, map 1, 2, 3, ... to alert, crit, err, ...

container json-file levels are mapped here: https://github.com/ViaQ/fluent-plugin-viaq_data_model/blob/master/lib/fluent/plugin/filter_viaq_data_model.rb#L247

This just maps stream=stdout to level=info and stream=stderr to level=err

It will overwrite any existing level field.
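The two mappings described above can be sketched in Ruby as follows. This is an illustrative sketch only; the helper names `level_from_priority` and `level_from_stream` are hypothetical, not the plugin's actual method names.

```ruby
# syslog severity names, indexed by systemd PRIORITY (0..7).
SYSLOG_LEVELS = %w[emerg alert crit err warning notice info debug].freeze

# systemd journal records carry a numeric PRIORITY field.
def level_from_priority(priority)
  i = priority.to_i
  (0..7).include?(i) ? SYSLOG_LEVELS[i] : 'unknown'
end

# Container json-file records only carry a stream name.
def level_from_stream(stream)
  stream == 'stdout' ? 'info' : 'err'
end
```

So a journal record with PRIORITY=3 maps to level=err, and a container record with stream=stdout maps to level=info.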

The problem is that we have to strictly enforce the data model, which means we must force the `level` field to be one of the values listed above.  If we don't strictly enforce the `level` field, then it just becomes a dumping ground for any value, and correlation of the `level` field across different applications becomes impossible.

We could do this: If the level field is set, convert to lower case.  If it matches one of the canonical values, use it.  We can also convert "error" to "err", "panic" to "emerg", "warn" to "warning" - that is, do some simple normalization.  If the value does not match, then we just use it as provided by the application, and the user is responsible for either doing some sort of post processing, or otherwise determining what is the meaning of the `level` field across different applications.

--- Additional comment from  on 2017-11-16 18:35:22 EST ---

Interesting. I also found this line in our fluentd configuration (/etc/fluent/configs.d/openshift/filter-k8s-record-transform.conf):

    level ${record['stream'] == 'stdout' ? 6 : 3}

which seems to relate to the problem at hand.

At the moment we have no plans for the use of custom values in the 'level' field. Getting the usual container logging levels working would be nice, and the matching/conversion sound good. What is the proposed method and timeline for the fix?

--- Additional comment from Rich Megginson on 2017-11-16 20:22:39 EST ---

(In reply to antti.niskanen from comment #2)
> Interesting. I also found this line in our fluentd configuration
> (/etc/fluent/configs.d/openshift/filter-k8s-record-transform.conf):
> 
>     level ${record['stream'] == 'stdout' ? 6 : 3}
> 
> which seems to relate to the problem at hand.

Yes.  That is the problematic code in 3.5 which was replaced with https://bugzilla.redhat.com/show_bug.cgi?id=1514110#c1 in 3.6 and later.
 
> At the moment we have no plans for the use of custom values in the 'level'
> field. Getting the usual container logging levels working would be nice, and
> the matching/conversion sound good. What is the proposed method and timeline
> for the fix?

The only level data we get for container logs is the "stream" field - "stdout" means "info" and "stderr" means "err".  There is no other log level information.  Would you be OK if we did something like this?:

     level ${record['stream'] == 'stdout' ? 'info' : 'err'}

--- Additional comment from  on 2017-11-17 05:30:55 EST ---

(In reply to Rich Megginson from comment #3)
> The only level data we get for container logs is the "stream" field -
> "stdout" means "info" and "stderr" means "err".  There is no other log level
> information.

Now I kind of understand what you mean. For example, our users' applications write logs like this:

{"log":"{\"@timestamp\":\"2017-11-08T21:43:16.802+00:00\",\"@version\":1,\"message\":\"Checking mq connection failed: java.util.concurrent.TimeoutException\",\"logger_name\":\"checkActiveMQ\",\"thread_name\":\"qtp1681838063-40\",\"level\":\"WARN\",\"level_value\":30000}\n","stream":"stdout","time":"2017-11-08T21:43:16.803202093Z"}

Previously, the 'level' inside the message has been shown in Kibana (how...?), but now the 'stream' is the only defining factor.

Using 'stream' as you suggested would still not work as the warnings/errors seem to be mostly stdout.

Would it be possible to check for the message contents (JSON) for 'level' and use that as-is or convert/normalize if certain conditions are met?

--- Additional comment from Rich Megginson on 2017-11-17 09:58:57 EST ---

(In reply to antti.niskanen from comment #4)
> (In reply to Rich Megginson from comment #3)
> > The only level data we get for container logs is the "stream" field -
> > "stdout" means "info" and "stderr" means "err".  There is no other log level
> > information.
> 
> Now I kind of understand what you mean. For example, our users' applications
> write logs like this:
> 
> {"log":"{\"@timestamp\":\"2017-11-08T21:43:16.802+00:00\",\"@version\":1,
> \"message\":\"Checking mq connection failed:
> java.util.concurrent.TimeoutException\",\"logger_name\":\"checkActiveMQ\",
> \"thread_name\":\"qtp1681838063-40\",\"level\":\"WARN\",\"level_value\":
> 30000}\n","stream":"stdout","time":"2017-11-08T21:43:16.803202093Z"}
> 
> Previously, the 'level' inside the message has been shown in Kibana
> (how...?),

Because the kubernetes metadata filter has a side effect - if the value of the "log" field looks like a JSON string, it will parse it and put the fields into the record:

https://github.com/openshift/origin-aggregated-logging/blob/release-1.5/fluentd/configs.d/openshift/filter-k8s-meta.conf

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter#configuration

merge_json_log - merge logs in JSON format as top level keys (default: true)

preserve_json_log - preserve JSON logs in raw form in the log key, only used if the previous option is true (default: true)

which means, in the above case, the fields @timestamp, @version, message, logger_name, thread_name, level and level_value are added to the record.
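The side effect can be illustrated with a minimal Ruby sketch. This is a simplification under stated assumptions: the real kubernetes_metadata filter is more involved, and `merge_json_log` here is a stand-in name for its internal logic, with existing record fields assumed to win over parsed ones on conflict.

```ruby
require 'json'

# Simplified illustration of the merge_json_log side effect: if the 'log'
# field looks like a JSON object, parse it and merge its keys into the
# record at the top level, optionally preserving the raw 'log' string.
def merge_json_log(record, preserve_json_log: true)
  log = record['log']
  return record unless log.is_a?(String) && log.strip.start_with?('{')
  begin
    parsed = JSON.parse(log)
    record = parsed.merge(record) if parsed.is_a?(Hash)
    record.delete('log') unless preserve_json_log
  rescue JSON::ParserError
    # Not valid JSON after all; leave the record untouched.
  end
  record
end
```

Applied to the record quoted above, this is what lifts `level: WARN` (and `@timestamp`, `message`, etc.) out of the embedded JSON and into the top-level record that Kibana displays.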

> but now the 'stream' is the only defining factor.
> 
> Using 'stream' as you suggested would still not work as the warnings/errors
> seem to be mostly stdout.
> 
> Would it be possible to check for the message contents (JSON) for 'level'
> and use that as-is or convert/normalize if certain conditions are met?

Yes, I think that is the correct approach:

if level field exists
  if level.lower looks like one of our canonical error levels
    record.level = canonical error level
  else
    preserve existing value
  fi
else
  record.level = info or err from stream
fi

In this case, that means level: INFO would be converted to level: info.
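The pseudocode above can be sketched in Ruby roughly as follows. The canonical list follows the data model quoted earlier; the alias table and the method name `normalize_level` are illustrative assumptions, not the exact fluent-plugin-viaq_data_model code.

```ruby
# Canonical level values from the ViaQ data model.
CANONICAL = %w[alert crit debug emerg err info notice trace unknown warning].freeze

# A few "close match" conversions (assumed set, per the discussion above).
ALIASES = {
  'error' => 'err', 'warn' => 'warning',
  'panic' => 'emerg', 'critical' => 'crit'
}.freeze

def normalize_level(record)
  if record.key?('level') && !record['level'].nil?
    lvl = record['level'].to_s.downcase
    record['level'] =
      if CANONICAL.include?(lvl) then lvl          # use canonical value
      elsif ALIASES.key?(lvl)    then ALIASES[lvl] # normalize close match
      else record['level'].to_s                    # preserve, as a string
      end
  else
    # No level field: fall back to the stream-based mapping.
    record['level'] = record['stream'] == 'stdout' ? 'info' : 'err'
  end
  record
end
```

With this sketch, `level: INFO` becomes `level: info`, `level: WARN` becomes `level: warning`, and an unrecognized value like `verbose` is preserved as-is.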

Comment 1 openshift-github-bot 2017-12-02 05:12:45 UTC
Commits pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/ae9dcff602d92285a322674d18d4c42e990ad3e5
Bug 1515448 - [3.7] Aggregated Logging replacing all log levels with '3' and '6' after upgrade to 3.5 from 3.4

https://bugzilla.redhat.com/show_bug.cgi?id=1515448
Use fluent-plugin-viaq_data_model-0.0.12

https://github.com/openshift/origin-aggregated-logging/commit/c7235234d8fcc1eeedfff81725dbf4a15fb81b02
Merge pull request #791 from richm/bug-1515448

Automatic merge from submit-queue.

Bug 1515448 - [3.7] Aggregated Logging replacing all log levels with '3' and '6' after upgrade to 3.5 from 3.4

https://bugzilla.redhat.com/show_bug.cgi?id=1515448
Use fluent-plugin-viaq_data_model-0.0.12
/test

Comment 2 Rich Megginson 2017-12-11 22:22:41 UTC
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=631910
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.7-rhel-7-docker-candidate-10416-20171211201451
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.15
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7.15-2
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.7

Comment 3 Rich Megginson 2017-12-11 22:32:14 UTC
This is fixed in https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=631552 logging-fluentd-docker-v3.7.14-2

which is included in https://errata.devel.redhat.com/advisory/31700

RHBA-2017:31700-01 OpenShift Container Platform 3.7.z images update

Comment 4 Anping Li 2017-12-19 10:48:13 UTC
Verified with openshift3/logging-fluentd/images/v3.7.14-5
When level is set in the JSON log, it uses that value instead of the stream.

Comment 8 errata-xmlrpc 2018-04-05 09:32:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0636

