Bug 1700787

Summary: Fluentd and record cannot use elasticsearch index name type project_full: record is missing kubernetes field
Product: OpenShift Container Platform
Reporter: Gabriel Stein <gferrazs>
Component: Logging
Assignee: Noriko Hosoi <nhosoi>
Status: CLOSED CURRENTRELEASE
QA Contact: Anping Li <anli>
Severity: medium
Priority: medium
Version: 3.9.0
CC: aos-bugs, jcantril, nhosoi, rmeggins
Target Milestone: ---
Target Release: 3.9.z
Hardware: x86_64
OS: Linux
Last Closed: 2019-06-26 15:38:33 UTC
Type: Bug

Comment 1 Jeff Cantrill 2019-04-17 12:17:43 UTC
Please attach the output of [1] so we may have additional information regarding the environment.

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-3.9/hack/logging-dump.sh
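
For reference, a rough sketch of gathering that dump (the raw URL and invocation are assumptions derived from [1]; check the script's own usage text before running it):

  # Sketch only: fetch the 3.9 dump script and run it while logged in to the
  # cluster with cluster-admin rights; it collects logs and configs from the
  # logging project (the dump referenced later in this bug is such an archive).
  curl -O https://raw.githubusercontent.com/openshift/origin-aggregated-logging/release-3.9/hack/logging-dump.sh
  bash logging-dump.sh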

Comment 4 Jeff Cantrill 2019-04-22 19:03:15 UTC
I would expect the log collection to hang while trying to get metadata from the API server, to log a message about being unable to contact the API server, or to ship these logs to the 'orphaned' index before one would see the listed message.  This message indicates that the record lacks the field needed to index messages properly, which is set upstream of the output plugin.

Reviewing the information, the best I can determine is that the logging stack is not using the latest 3.9 image [1] but the floating tag registry.access.redhat.com/openshift3/logging-fluentd:v3.9; the SHA is different from [1].  Additionally, the pull policy is such that the node will not necessarily have the latest available image, because it is set to pull only if the image is not present.  My first suggestion is to redeploy logging and pin the logging version in the ansible inventory file to the latest available [1].  This will at least ensure you are using the latest images, where I expect this issue to have been fixed.


[1] https://access.redhat.com/containers/?tab=docker-file#/registry.access.redhat.com/openshift3/logging-fluentd/images/v3.9.74-4
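
A hedged sketch of checking which image and pull policy the collector daemonset is actually using, and of pinning the version at redeploy time (the inventory variable names are assumptions about openshift-ansible; verify them against your installed playbooks):

  # Show the image reference and pull policy the fluentd daemonset is running with
  # (namespace/daemonset names as used by the 3.9 logging stack).
  oc -n logging get daemonset logging-fluentd \
    -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}{.spec.template.spec.containers[0].imagePullPolicy}{"\n"}'

  # Assumed inventory variables for pinning the logging image version when redeploying:
  #   openshift_logging_image_version=v3.9.74
  #   openshift_logging_fluentd_image_version=v3.9.74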

Comment 5 Rich Megginson 2019-04-22 19:13:14 UTC
(In reply to Jeff Cantrill from comment #4)
> I would expect the log collection to hang trying to get metadata info from
> the API server, log a message of being unable to contact the api server, or
> ship these logs to the 'orphaned' index before one would see the listed
> message.  This message is indicative of not having the necessary field to
> index messages properly which is set upstream of the output plugin.

There is no kubernetes field at all, and the record does not look like a container record, which means the error message "record is missing kubernetes field" is misleading - I'm guessing that the record is falling through all of the viaq formatter and index matches - https://github.com/openshift/origin-aggregated-logging/blob/release-3.9/fluentd/configs.d/openshift/filter-viaq-data-model.conf#L55.  That is, the tag did not match `tag "journal.system** system.var.log** audit.log** %OCP_FLUENTD_TAGS%"`, so it fell through to the `**` tag match.  It would be very interesting to see what the tag for this record was.  It means fluentd was not able to determine whether this record was an operations or a container record.  Unfortunately the error message does not give us the tag, so you'll have to add some sort of `@type stdout` filter to dump all records.
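
A minimal sketch of one way to do that on 3.9, assuming the collector config lives in the logging-fluentd configmap and the pods carry the component=fluentd label; adapt as needed:

  # Insert a catch-all stdout filter near the top of fluent.conf so every record
  # (including its tag) is echoed to the collector's own log:
  #
  #   <filter **>
  #     @type stdout
  #   </filter>
  #
  oc -n logging edit configmap/logging-fluentd

  # Restart the collectors so they pick up the change; the daemonset recreates the pods.
  oc -n logging delete pod -l component=fluentd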

> 
> Reviewing the information I am at best able to determine the logging stack
> is not using the latest 3.9 images [1] but a floating tag
> registry.access.redhat.com/openshift3/logging-fluentd:v3.9.  The SHA is
> different then [1].  Additionally, the pull policy is such that the node
> will not necessarily have the latest available image because it is set to
> pull only if it is not present.  My first suggestion is to redeploy logging
> and peg the version of logging in the ansible file to the latest available
> [1].  This will atleast ensure you are using the latest images where I
> expect this issue to have been fixed.

According to the logging dump:

Dockerfile info: /root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.71-3
"build-date"="2019-03-04T21:15:08.049829"

That is a very recent image.  So I would expect that, if there were a bug fix for this issue (we've seen this issue since 3.6), it would already have been in this image.  But in general, yes, please use the latest available images for the best supportability.
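
(For anyone checking their own collectors, a quick way to read that same Dockerfile info from a running pod; the label selector is an assumption about the usual 3.9 deployment:)

  # List the buildinfo files baked into a running fluentd image; the file name
  # encodes the image version, e.g. Dockerfile-openshift3-logging-fluentd-v3.9.71-3.
  fpod=$(oc -n logging get pods -l component=fluentd -o jsonpath='{.items[0].metadata.name}')
  oc -n logging exec "$fpod" -- ls /root/buildinfo/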

> 
> 
> [1]
> https://access.redhat.com/containers/?tab=docker-file#/registry.access.
> redhat.com/openshift3/logging-fluentd/images/v3.9.74-4

Comment 8 Noriko Hosoi 2019-04-23 20:31:42 UTC
(In reply to Rich Megginson from comment #5)
> There is no kubernetes field at all, and the record does not look like a
> container record, which means the error message "record is missing
> kubernetes field" is misleading - I'm guessing that the record is falling
> through all of the viaq formatter and index matches -
> https://github.com/openshift/origin-aggregated-logging/blob/release-3.9/
> fluentd/configs.d/openshift/filter-viaq-data-model.conf#L55.  That is, the
> tag did not match  `tag "journal.system** system.var.log** audit.log**
> %OCP_FLUENTD_TAGS%"` so it fell through to the `**` tag match.  It would be
> very interesting to see what was the tag for this record.  It means fluentd
> was not able to determine if this record was an operations or a container
> record.  Unfortunately the error message does not give us the tag, so you'll
> have to add some sort of type @stdout filter to dump all records.

Thank you, @Rich.  I'm still having a hard time understanding this behaviour...

In the log file logging-20190418_150736/fluentd/logs/logging-fluentd-67x8v.log, from 2019-02-20 07:29:50 through 2019-04-05 09:41:00, these rejection errors are logged:
  dump an error event: error_class=Fluent::ElasticsearchErrorHandler::ElasticsearchError error="400 - Rejected by Elasticsearch"
Then it switches to the missing-kubernetes-field error at 2019-04-15 09:19:14.

Records in both types of errors come from CMDLINE=/usr/bin/dockerd-current with --log-driver=journald, and they are processed by the input plugin "fluent-plugin-systemd". The former contain the kubernetes field, while the latter do not... Did something change when fluentd was restarted just before 2019-04-15 09:19:14?
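
For what it's worth, a small sketch of pulling both error signatures (with line numbers) out of that pod log, to see exactly where it switches from one to the other:

  grep -nE '400 - Rejected by Elasticsearch|record is missing kubernetes field' \
    logging-20190418_150736/fluentd/logs/logging-fluentd-67x8v.log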

Another interesting thing: if I grep for "Dockerfile-openshift3-logging-fluentd" in logging-20190418_150736/fluentd/, I see multiple fluentd versions, as follows (a sketch of the grep follows the list).

/root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.51-2
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.57-2
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.60-1
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.71-3
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.74-4
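
(A sketch of the grep, run from the directory containing the dump:)

  # Collect the distinct fluentd image builds mentioned anywhere in the dump.
  grep -rho 'Dockerfile-openshift3-logging-fluentd-v[0-9.]*-[0-9]*' \
    logging-20190418_150736/fluentd/ | sort -u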

But I could not see any apparent correlation between the version and the missing kubernetes field errors:
logging-fluentd-67x8v -- Dockerfile info: /root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.57-2
logging-fluentd-dlt4l -- Dockerfile info: /root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.71-3
logging-fluentd-dx6jv -- Dockerfile info: /root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.71-3
logging-fluentd-ltq94 -- Dockerfile info: /root/buildinfo/Dockerfile-openshift3-logging-fluentd-v3.9.60-1

Please note that all of the above versions contain this fix, which was included in v3.9.47-1 and newer:
Bug 1638900 - [3.9] Record is missing kubernetes field when use '--log-driver journald' in /etc/sysconfig/docker

(Also, since there is no whitespace in "--log-driver=journald", it should not matter, I'd think...)

Comment 9 Rich Megginson 2019-04-23 20:51:13 UTC
Ah - so they are using --log-driver=journald

Check the /etc/sysconfig/docker and /etc/docker/daemon.json files to see how the log driver is set there.

Also check the fluentd daemonset to see if they have hardcoded USE_JOURNAL=true or false.
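
A hedged sketch of those checks (paths and label/daemonset names are the usual 3.9 defaults; adjust as needed):

  # On each node, see how the docker log driver is configured.
  grep -Hs 'log-driver' /etc/sysconfig/docker /etc/docker/daemon.json

  # See whether USE_JOURNAL is hardcoded in the fluentd daemonset environment.
  oc -n logging set env daemonset/logging-fluentd --list | grep -i USE_JOURNAL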

Comment 10 Noriko Hosoi 2019-04-23 21:14:01 UTC
(In reply to Rich Megginson from comment #9)
> Ah - so they are using --log-driver=journald
> 
> Check the /etc/sysconfig/docker and /etc/docker/daemon.json files to see how
> the log driver is set there.
> 
> Also check the fluentd daemonset to see if they have hardcoded
> USE_JOURNAL=true or false.

Hi Gabriel, could you check them?

Comment 21 Jeff Cantrill 2019-05-14 13:40:53 UTC
@Gabriel

I would reset the log level to error.  If the error went away using the latest images as suggested, please close the issue accordingly.

Comment 22 Rich Megginson 2019-05-14 13:43:31 UTC
(In reply to Jeff Cantrill from comment #21)
> @Gabriel
> 
> I would reset the log level to error.  If the error went away using the
> latest images as suggested, please close the issue accordingly.

Also check whether the .orphaned.* indices in Elasticsearch are growing:

oc exec -c elasticsearch $espod -- es_util --query=_cat/indices | grep orphaned
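
For completeness, $espod above can be any Elasticsearch pod; a sketch of selecting one (the component=es label is an assumption about the usual 3.x logging labels):

  espod=$(oc -n logging get pods -l component=es -o jsonpath='{.items[0].metadata.name}')
  oc -n logging exec -c elasticsearch "$espod" -- es_util --query=_cat/indices | grep orphaned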

Comment 23 Jeff Cantrill 2019-05-17 15:50:57 UTC
@Gabriel, per #c20 I am inclined to close this issue. Please provide additional information if the issue still exists.