Bug 1722380
| Summary: | Logging data from all projects are stored to .orphaned indexes with Elasticsearch | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Radomir Ludva <rludva> |
| Component: | Logging | Assignee: | Rich Megginson <rmeggins> |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | anli, aos-bugs, gabriela, jcantril, rmeggins |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openshift3/ose-logging-fluentd:v3.11.130-1 | Doc Type: | Bug Fix |
Doc Text:

Cause: Fluentd is unable to correctly determine the docker log driver: it decides the log driver is journald when it is actually json-file. Fluentd then looks for the `CONTAINER_NAME` field in the record to hold the kubernetes metadata, but that field is not present.

Consequence: Fluentd is not able to add kubernetes metadata to records. Records go to the .orphaned index, and fluentd emits many errors like this:

[error]: record cannot use elasticsearch index name type project_full: record is missing kubernetes field

Fix: Fluentd should not rely on reading the docker configuration file to determine whether the record contains kubernetes metadata. It should look at both the record tag and the record data and use whatever kubernetes metadata it finds there.

Result: Fluentd can correctly add kubernetes metadata and assign records to the correct indices no matter which log driver docker is using.

Records read from files under /var/log/containers/*.log will have a fluentd tag like kubernetes.var.log.containers.**; this applies to both CRI-O and docker file logs. Kubernetes records read from journald with CONTAINER_NAME will have a tag like journal.kubernetes.** (see the illustrative filter sketch after this table). There is no CRI-O journald log driver yet, and it is not clear how those records will be represented, but hopefully they will follow the same CONTAINER_NAME convention, in which case they will Just Work.
| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | | |
| : | 1722898, 1724263 | Environment: | |
| Last Closed: | 2019-08-13 14:09:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1722898, 1724263 | | |
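
As a minimal sketch of the tag conventions named in the Doc Text above, assuming the upstream fluent-plugin-kubernetes_metadata_filter plugin type name (`kubernetes_metadata`): this is not the fluent.conf shipped in the logging image, and the surrounding source and output sections are omitted.

```
# Illustrative fragment only; not the configuration shipped in
# openshift3/ose-logging-fluentd. File-based container logs (docker
# json-file or CRI-O) read from /var/log/containers/*.log arrive with a
# tag like kubernetes.var.log.containers.**, while journald records that
# carry CONTAINER_NAME arrive with a tag like journal.kubernetes.**.
# Matching both tag patterns lets one filter enrich every record with
# kubernetes metadata, so the docker log driver never has to be guessed
# from /etc/docker/daemon.json.
<filter kubernetes.var.log.containers.** journal.kubernetes.**>
  @type kubernetes_metadata
</filter>
```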
Description
Radomir Ludva
2019-06-20 08:14:17 UTC
Can you please describe when this occurred? Was this after an upgrade? Reviewing the logs, I see there is a point where fluentd is starting and is unable to contact Elasticsearch. This is indicative of an upgrade or logging-start scenario. If fluentd is unable to contact the kube API server in order to fetch metadata, it will push the logs to the 'orphaned' index. Many times these records come from pods and/or namespaces which no longer exist, so it cannot retrieve metadata for them at all. If you start a new pod now that the logging stack is running, are you still experiencing this issue?

This could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1711596#c8. The cause there is that fluentd could not correctly determine which logging driver was being used for container logs by looking at the docker configuration file. Please try this:

oc set env ds/logging-fluentd DEBUG=true VERBOSE=true

This will restart all of your fluentd pods with tracing enabled so we can see what fluentd is doing. Also, please provide your /etc/docker/daemon.json and /etc/sysconfig/docker from one of the nodes where fluentd is running (an illustrative daemon.json is sketched at the end of this report).

Merged upstream: https://github.com/openshift/origin-aggregated-logging/commit/396764296721ca67a73799357ca2451d484f16dc

*** Bug 1711596 has been marked as a duplicate of this bug. ***

Needs rubygem-fluent-plugin-kubernetes_metadata_filter-1.2.1-1.el7; this is built and tagged into rhaos-3.11-rhel-7-candidate. NOTE: this rpm cannot be tagged into 3.10 and earlier. It requires that the fluentd config is using the separate merge json log parser, so a customer that needs this particular fix will have to upgrade to 3.11. Next step: a 3.11 compose needs to be built with this package, then the logging-fluentd 3.11 image built with this rpm.

ART says the 3.11 compose rebuild will happen in about a week from now.

The fix is in openshift3/ose-logging-fluentd:v3.11.130-1 or later: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=935020

The fix and the gem are in openshift3/ose-logging-fluentd:v3.11.135. The journald container logs are parsed automatically without USE_JOURNAL=true.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2352
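
For context on the daemon.json request above, a purely illustrative example (not an attachment from this bug): a node whose docker daemon uses the json-file log driver typically declares it in /etc/docker/daemon.json roughly like this. The underlying problem was that fluentd inferred the log driver from configuration files such as this one (or /etc/sysconfig/docker) rather than from each record's tag and fields, and guessed journald even when json-file was in use.

```json
{
  "log-driver": "json-file"
}
```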