Bug 1481354 - fluentd log is filled up with KubeClient messages when using journal and output queue is full
Summary: fluentd log is filled up with KubeClient messages when using journal and output queue is full
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 3.7.0
Assignee: Rich Megginson
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1476731
Blocks:
 
Reported: 2017-08-14 16:53 UTC by Rich Megginson
Modified: 2018-04-25 09:38 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This is a bug in the fluentd filter_kubernetes_metadata plugin. Please ignore fluentd log messages from Kubeclient::Common::WatchNotice like this: #<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{...}, :spec=>{...}, :status=>{:phase=>"Active"}}>
Clone Of: 1476731
Environment:
Last Closed: 2017-11-28 22:07:08 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Rich Megginson 2017-08-14 16:53:17 UTC
+++ This bug was initially created as a clone of Bug #1476731 +++

Description of problem:
Observed in https://bugzilla.redhat.com/show_bug.cgi?id=1473788#c4

When fluentd is reading from journald and the output buffer queue is full, the fluentd log is filled up with KubeClient messages:
2017-07-28 03:36:20 -0400 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-07-28 03:36:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-07-28 03:36:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-public", :selfLink=>"/api/v1/namespaces/kube-public", :uid=>"55d90bad-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14110", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c3,c2", :"openshift.io/sa.scc.supplemental-groups"=>"1000010000/10000", :"openshift.io/sa.scc.uid-range"=>"1000010000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>
#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-system", :selfLink=>"/api/v1/namespaces/kube-system", :uid=>"55d84610-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14117", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c1,c0", :"openshift.io/sa.scc.supplemental-groups"=>"1000000000/10000", :"openshift.io/sa.scc.uid-range"=>"1000000000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>
#<Kubeclient::Common::WatchNotice type="ADDED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"project1", :selfLink=>"/api/v1/namespaces/project1", :uid=>"f3223a65-736a-11e7-98a0-fa163ec70700", :resourceVersion=>"14184", :creationTimestamp=>"2017-07-28T08:01:24Z", :annotations=>{:"openshift.io/description"=>"", :"openshift.io/display-name"=>"", :"openshift.io/requester"=>"xiazhao"}}, :spec=>{:finalizers=>["openshift.io/origin", "kubernetes"]}, :status=>{:phase=>"Active"}}>
#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"project1", :selfLink=>"/api/v1/namespaces/project1", :uid=>"f3223a65-736a-11e7-98a0-fa163ec70700", :resourceVersion=>"14187", :creationTimestamp=>"2017-07-28T08:01:24Z", :annotations=>{:"openshift.io/description"=>"", :"openshift.io/display-name"=>"", :"openshift.io/requester"=>"xiazhao", :"openshift.io/sa.scc.mcs"=>"s0:c9,c4", :"openshift.io/sa.scc.supplemental-groups"=>"1000080000/10000", :"openshift.io/sa.scc.uid-range"=>"1000080000/10000"}}, :spec=>{:finalizers=>["openshift.io/origin", "kubernetes"]}, :status=>{:phase=>"Active"}}>

Version-Release number of selected component (if applicable):
logging-fluentd            v3.6.172.0.4-1      424d76ec1b51        3 hours ago         231.7 MB

How reproducible:
Always

Steps to Reproduce:
1. Deployed logging with openshift_logging_fluentd_journal_read_from_head=true in the inventory file

2. To increase the journal read load on fluentd, created 30 projects with 38 pods running on OpenShift; each pod sent out logs continuously:

# for i in {1..30}
do
      oc new-project project${i}
      oc new-app chunyunchen/java-mainclass:2.3-SNAPSHOT
      sleep 30
done

# oc get po --all-namespaces | grep -i running | wc -l
38

3. Wait about 1 hour, until every index can be found in ES, then check the fluentd log

Actual results:
fluentd log is filled up with KubeClient messages

Expected results:
fluentd log shouldn't be filled up with KubeClient messages

Additional info:
full logs here: https://bugzilla.redhat.com/attachment.cgi?id=1305848

--- Additional comment from Noriko Hosoi on 2017-07-31 14:43:52 EDT ---

The current log level of fluentd is "warn".

$ cat openshift/system.conf
<system>
  log_level warn
</system>

That makes these [warn]-level messages visible in the fluentd log (fluentd's levels, from most to least verbose, are trace, debug, info, warn, error, and fatal):
> [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not ... 

We have another request from the RHV team to change the default log level:

On 07/31/2017 02:23 AM, Shirly Radco wrote:
> I would like to propose adding 
> 
> <system>
>   log_level error
> </system>
> 
>  to a default fluentd.conf provided with the fluentd package.
> 
> It is a general configuration and not oVirt specific.
> 
> We require it since, when the client fluentd can't connect to the mux, it starts logging many warning messages.

Do we want to change fluentd's default log level to error?  Please note that if we lose these warning logs, we may end up waiting a very long time for the process to complete without any indication of why.  I feel it might make administrators' jobs harder.

Regarding the "output buffer queue is full" issue itself, did you have a chance to tune the fluentd buffer chunk limit, Xia?  Could you please tell us what the current size is?  1m?
> openshift/output-es-config.conf:   buffer_chunk_limit "#{ENV['BUFFER_SIZE_LIMIT'] || '1m' }"

I assume you are using memory-type buffering.  Would it be possible to increase the memory limit?  By default it is 512M; could you try 2G, for instance?  Then set the environment variable BUFFER_SIZE_LIMIT to 16m and retry the test?
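For reference, a minimal sketch of the v0.12-style memory buffer settings being discussed, assuming the elasticsearch output in openshift/output-es-config.conf looks roughly like this (values are illustrative, not the shipped defaults, except for the buffer_chunk_limit line quoted above):

<match **>
  type elasticsearch
  # ... connection settings omitted ...
  buffer_type memory
  # size of each buffer chunk; BUFFER_SIZE_LIMIT=16m is the value suggested above
  buffer_chunk_limit "#{ENV['BUFFER_SIZE_LIMIT'] || '1m'}"
  # number of chunks kept in the queue; total buffered memory is roughly
  # buffer_chunk_limit * buffer_queue_limit, and it has to fit inside the
  # pod's memory limit (512M by default, 2G suggested above)
  buffer_queue_limit 32
  # 'block' is what produces the "'block' action stops input process" warnings
  buffer_queue_full_action block
</match>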

Thanks.

--- Additional comment from Rich Megginson on 2017-07-31 15:02:05 EDT ---

re: log levels - another option is to change the log level for individual plugins.  Some plugins support a `log_level` parameter.  That may allow us to sufficiently increase the signal-to-noise ratio in the fluentd logs.
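For example, something like this in the metadata filter section (a sketch, assuming the plugin honors fluentd's standard per-plugin @log_level parameter; note this only quiets messages that are routed through fluentd's logger):

 <filter kubernetes.**>
   type kubernetes_metadata
   # per-plugin log level override; leaves the global <system> log_level at warn
   @log_level error
   kubernetes_url "#{ENV['K8S_HOST_URL']}"
   bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
 </filter>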

If that's not good enough, then we may have to submit PRs upstream, to the plugins and perhaps fluentd itself, to get the log levels of those messages changed.



The problem in this bz is that we are now seeing messages like the following, which we have not seen before:

#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-public", :selfLink=>"/api/v1/namespaces/kube-public", :uid=>"55d90bad-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14110", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c3,c2", :"openshift.io/sa.scc.supplemental-groups"=>"1000010000/10000", :"openshift.io/sa.scc.uid-range"=>"1000010000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>

I'm not sure why we are seeing these now.

--- Additional comment from Noriko Hosoi on 2017-07-31 15:26:39 EDT ---

Sorry, I misunderstood the log output.  So this message does not have a timestamp, does it?

#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-public", :selfLink=>"/api/v1/namespaces/kube-public", :uid=>"55d90bad-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14110", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c3,c2", :"openshift.io/sa.scc.supplemental-groups"=>"1000010000/10000", :"openshift.io/sa.scc.uid-range"=>"1000010000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>

--- Additional comment from Rich Megginson on 2017-07-31 15:46:47 EDT ---

@nhosoi right - it doesn't look like a normal fluentd log message - I suspect the kubernetes metadata plugin is writing these messages directly to stdout/stderr instead of using `log.error("....")`

--- Additional comment from Noriko Hosoi on 2017-07-31 17:02:06 EDT ---

I see. @richm, thanks! 

I've been looking into the kubeclient code.  watch_stream.rb accumulates the watch response body into a buffer and yields one WatchNotice per line, and the lib is required by "filter_kubernetes_metadata.rb".

$ egrep -a4 WatchNotice lib/*/*
[...]
--
lib/kubeclient/watch_stream.rb-        buffer = ''
lib/kubeclient/watch_stream.rb-        response.body.each do |chunk|
lib/kubeclient/watch_stream.rb-          buffer << chunk
lib/kubeclient/watch_stream.rb-          while (line = buffer.slice!(/.+\n/))
lib/kubeclient/watch_stream.rb:            yield(@format == :json ? WatchNotice.new(JSON.parse(line)) : line.chomp)
lib/kubeclient/watch_stream.rb-          end
lib/kubeclient/watch_stream.rb-        end
lib/kubeclient/watch_stream.rb-      rescue IOError
lib/kubeclient/watch_stream.rb-        raise unless @finished

It looks like the watch code is enabled by default in "filter_kubernetes_metadata.rb":
    config_param :watch, :bool, default: true
        [...]
        if @watch
          thread = Thread.new(self) { |this| this.start_watch }

I haven't tried it myself yet, but is it possible to run the test with the watch disabled in /etc/fluentd/configs.d/openshift/filter-k8s-meta.conf, like this?

 <filter kubernetes.**>
   type kubernetes_metadata
+  @watch false
   kubernetes_url "#{ENV['K8S_HOST_URL']}"
   bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
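
A side note on the snippet above: since the plugin declares the option with `config_param :watch`, the config key is most likely plain `watch`, without the `@` prefix (fluentd reserves `@` for its own parameters such as @type and @log_level), i.e.:

 <filter kubernetes.**>
   type kubernetes_metadata
   watch false
   ...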

--- Additional comment from Rich Megginson on 2017-07-31 17:22:54 EDT ---

The watch code is for performance.  That is what allows us to use cached values rather than query the Kubernetes API server for every record - it is used to keep the cache up-to-date.  If we get rid of the watch, the performance will be bad.

Did you find some code in the kubeclient libraries that does a "puts" or otherwise writes directly to stdout/stderr?

--- Additional comment from Noriko Hosoi on 2017-07-31 17:34:15 EDT ---

Not in the kubeclient libraries; the only "puts" I found is this one in the namespace watch in "filter_kubernetes_metadata.rb", where "notice" appears to come from WatchNotice.

    def start_namespace_watch
      resource_version = @client.get_namespaces.resourceVersion
      watcher          = @client.watch_namespaces(resource_version)
      watcher.each do |notice|
        puts notice
        case notice.type
          when 'MODIFIED'
            cache_key = notice.object['metadata']['name']
            cached    = @namespace_cache[cache_key]
            if cached
              @namespace_cache[cache_key] = parse_namespace_metadata(notice.object)
            end

Continue investigating...

--- Additional comment from Rich Megginson on 2017-07-31 19:22:05 EDT ---

>        puts notice

This is a bug.  This should be `log.debug(notice)` - this should not use puts.  When was this introduced?
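
In diff terms, the suggested change would be roughly the following (note that the upstream fix eventually merged, quoted in a later comment, simply deletes the line rather than demoting it to debug):

-        puts notice
+        log.debug(notice)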

--- Additional comment from Noriko Hosoi on 2017-07-31 19:41:41 EDT ---

Well, it's pretty old...

> Date:   Tue Sep 8 15:31:45 2015 -0700

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/commit/692d95ea20810c96d231eccad93c059c1b8b92b3

Am I looking at a wrong branch?

--- Additional comment from Rich Megginson on 2017-07-31 20:01:33 EDT ---

(In reply to Noriko Hosoi from comment #9)
> Well, it's pretty old...
> 
> > Date:   Tue Sep 8 15:31:45 2015 -0700
> 
> https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/commit/
> 692d95ea20810c96d231eccad93c059c1b8b92b3
> 
> Am I looking at a wrong branch?

You are correct.

Wow - how did we miss this problem for so long?

--- Additional comment from Noriko Hosoi on 2017-07-31 20:11:20 EDT ---

Maybe, the outputs have been appreciated? :)

--- Additional comment from Noriko Hosoi on 2017-08-01 12:28:20 EDT ---

Filed an issue:
https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/issues/84

--- Additional comment from Noriko Hosoi on 2017-08-02 13:16:19 EDT ---

The master branch of fluent-plugin-kubernetes_metadata_filter has the fix:

  commit 004c74ab904c2450f74ecb4fcb9be229722917eb
  Merge: d50e98c 4e9aa57
  Author: Jimmi Dyson <jimmidyson>
  Date:   Tue Aug 1 22:55:46 2017 +0100

    Merge pull request #80 from flix-tech/remove-namespace-watch-logging
    
    Removed logging of namespace watch events

  diff --git a/lib/fluent/plugin/filter_kubernetes_metadata.rb b/lib/fluent/plugin/filter_kubernetes_metadata.rb
  index 7f4ede4..a3fb86b 100644
  --- a/lib/fluent/plugin/filter_kubernetes_metadata.rb
  +++ b/lib/fluent/plugin/filter_kubernetes_metadata.rb
  @@ -405,7 +405,6 @@ module Fluent
         resource_version = @client.get_namespaces.resourceVersion
         watcher          = @client.watch_namespaces(resource_version)
         watcher.each do |notice|
  -        puts notice
           case notice.type
             when 'MODIFIED'
               cache_key = notice.object['metadata']['name']

Is it ok to reset the status of this bug to POST?

--- Additional comment from Rich Megginson on 2017-08-02 13:21:31 EDT ---

(In reply to Noriko Hosoi from comment #13)
> The master branch of fluent-plugin-kubernetes_metadata_filter has the fix:
> 
>   commit 004c74ab904c2450f74ecb4fcb9be229722917eb
>   Merge: d50e98c 4e9aa57
>   Author: Jimmi Dyson <jimmidyson>
>   Date:   Tue Aug 1 22:55:46 2017 +0100
> 
>     Merge pull request #80 from flix-tech/remove-namespace-watch-logging
>     
>     Removed logging of namespace watch events
> 
>   diff --git a/lib/fluent/plugin/filter_kubernetes_metadata.rb
> b/lib/fluent/plugin/filter_kubernetes_metadata.rb
>   index 7f4ede4..a3fb86b 100644
>   --- a/lib/fluent/plugin/filter_kubernetes_metadata.rb
>   +++ b/lib/fluent/plugin/filter_kubernetes_metadata.rb
>   @@ -405,7 +405,6 @@ module Fluent
>          resource_version = @client.get_namespaces.resourceVersion
>          watcher          = @client.watch_namespaces(resource_version)
>          watcher.each do |notice|
>   -        puts notice
>            case notice.type
>              when 'MODIFIED'
>                cache_key = notice.object['metadata']['name']
> Is it ok to reset the status of this bug to POST?

Has this fix been released as a rubygem available from rubygems.org?  If so, then yes, please mark as POST, otherwise, MODIFIED.

--- Additional comment from Noriko Hosoi on 2017-08-02 13:35:07 EDT ---

(In reply to Rich Megginson from comment #14)
> Has this fix been released as a rubygem available from rubygems.org?  If so,
> then yes, please mark as POST, otherwise, MODIFIED.

Thanks, Rich.

On the rubygems.org site:
 https://rubygems.org/gems/fluent-plugin-kubernetes_metadata_filter
the latest available version is:
 0.27.0 - May 18, 2017 (20.5 KB) 

That is, it's not there yet.

Resetting the bug's status to MODIFIED.

(Interesting.  We use the MODIFIED status when the change is included in the built bits for the downstream release, while POST is for when the fix has been posted upstream.  And yes, in this case, fixing it downstream is simpler for us...  I need to adjust my mental model. :)

--- Additional comment from Rich Megginson on 2017-08-02 18:09:45 EDT ---

@jdyson - when is the next rubygems release of fluent-plugin-kubernetes_metadata_filter planned?

--- Additional comment from Rich Megginson on 2017-08-02 18:10:15 EDT ---

@nhosoi - yeah, I think POST is more appropriate in this case

--- Additional comment from Jimmi Dyson on 2017-08-03 04:00:29 EDT ---

0.28.0 just released including the fix.

--- Additional comment from Rich Megginson on 2017-08-14 12:51:56 EDT ---

commit 21c7f15ff6d95398bef044540f58064260f3ffca
Author: Rich Megginson <rmeggins>
Date:   Mon Aug 14 10:36:20 2017 -0600

    0.28.0
    Bug 1476731 - fluentd log is filled up with KubeClient messages
        when using journal and output queue is full

built and tagged in to rhaos-3.6-rhel-7-candidate and rhaos-3.7-rhel-7-candidate

Comment 1 Rich Megginson 2017-08-16 00:33:12 UTC
https://github.com/openshift/origin-aggregated-logging/commit/75dea649f637fea84d177ef9cd2a112dbb862d60

The fix will get picked up when we merge upstream into downstream and build fluentd

Comment 2 Rich Megginson 2017-09-08 16:14:33 UTC
Fixed in 3.6:

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=595912
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-82316-20170908145540
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.31-2

Waiting for 3.7 puddle rebuild to fix in 3.7

Comment 3 Junqi Zhao 2017-09-15 09:50:01 UTC
Test steps:

1. Created 3 projects and populated them with logs before logging was deployed
# for i in {1..3}
do
      oc new-project project${i}
      oc new-app chunyunchen/java-mainclass:2.3-SNAPSHOT
      sleep 10s
done

2. Deployed logging with openshift_logging_fluentd_journal_read_from_head=true in the inventory file

3. To increase the journal read load on fluentd, created 27 more projects, for a total of 38 pods running on OpenShift; each pod sent out logs continuously:

# for i in {4..30}
do
      oc new-project project${i}
      oc new-app chunyunchen/java-mainclass:2.3-SNAPSHOT
      sleep 30s
done


4. Waited a few hours, until every index could be found in ES, then searched the fluentd logs for "BufferQueueLimitError"; no findings.
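One way to run such a check (a sketch; assumes the default "logging" namespace and the component=fluentd pod label applied by the openshift_logging role):

# for pod in $(oc get pods -n logging -l component=fluentd -o name)
do
      # count BufferQueueLimitError / WatchNotice lines per fluentd pod; expect 0
      oc logs -n logging ${pod} | grep -c -E 'BufferQueueLimitError|Kubeclient::Common::WatchNotice'
done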


Test env:
# openshift version
openshift v3.7.0-0.126.1
kubernetes v1.7.0+80709908fd
etcd 3.2.1


Images tested with:
logging-kibana:v3.7.0-0.126.1.0
logging-fluentd:v3.7.0-0.126.1.0
logging-auth-proxy:v3.7.0-0.126.1.0
logging-curator:v3.7.0-0.126.1.0
logging-elasticsearch:v3.7.0-0.126.1.0

Comment 6 errata-xmlrpc 2017-11-28 22:07:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

