Bug 1476731 - fluentd log is filled up with KubeClient messages when using journal and output queue is full
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 3.6.z
Assignee: Jeff Cantrill
QA Contact: Xia Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1481354
Reported: 2017-07-31 10:00 UTC by Xia Zhao
Modified: 2017-09-08 03:15 UTC
CC: 7 users

Fixed In Version:
Doc Type: Release Note
Doc Text:
This is a bug in the fluentd filter_kubernetes_metadata plugin. Please ignore fluentd log messages from Kubeclient::Common::WatchNotice like this: #<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{...}, :spec=>{...}, :status=>{:phase=>"Active"}}>
Clone Of:
Cloned To: 1481354
Environment:
Last Closed: 2017-09-08 03:15:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHBA-2017:2642
Priority: normal
Status: SHIPPED_LIVE
Summary: OpenShift Container Platform 3.6.1 bug fix and enhancement update
Last Updated: 2017-09-08 07:14:52 UTC

Description Xia Zhao 2017-07-31 10:00:48 UTC
Description of problem:
Observed in https://bugzilla.redhat.com/show_bug.cgi?id=1473788#c4

When fluentd is reading from journald and the output buffer queue is full, the fluentd log is filled up with KubeClient messages:
2017-07-28 03:36:20 -0400 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-07-28 03:36:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-07-28 03:36:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-public", :selfLink=>"/api/v1/namespaces/kube-public", :uid=>"55d90bad-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14110", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c3,c2", :"openshift.io/sa.scc.supplemental-groups"=>"1000010000/10000", :"openshift.io/sa.scc.uid-range"=>"1000010000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>
#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-system", :selfLink=>"/api/v1/namespaces/kube-system", :uid=>"55d84610-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14117", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c1,c0", :"openshift.io/sa.scc.supplemental-groups"=>"1000000000/10000", :"openshift.io/sa.scc.uid-range"=>"1000000000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>
#<Kubeclient::Common::WatchNotice type="ADDED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"project1", :selfLink=>"/api/v1/namespaces/project1", :uid=>"f3223a65-736a-11e7-98a0-fa163ec70700", :resourceVersion=>"14184", :creationTimestamp=>"2017-07-28T08:01:24Z", :annotations=>{:"openshift.io/description"=>"", :"openshift.io/display-name"=>"", :"openshift.io/requester"=>"xiazhao"}}, :spec=>{:finalizers=>["openshift.io/origin", "kubernetes"]}, :status=>{:phase=>"Active"}}>
#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"project1", :selfLink=>"/api/v1/namespaces/project1", :uid=>"f3223a65-736a-11e7-98a0-fa163ec70700", :resourceVersion=>"14187", :creationTimestamp=>"2017-07-28T08:01:24Z", :annotations=>{:"openshift.io/description"=>"", :"openshift.io/display-name"=>"", :"openshift.io/requester"=>"xiazhao", :"openshift.io/sa.scc.mcs"=>"s0:c9,c4", :"openshift.io/sa.scc.supplemental-groups"=>"1000080000/10000", :"openshift.io/sa.scc.uid-range"=>"1000080000/10000"}}, :spec=>{:finalizers=>["openshift.io/origin", "kubernetes"]}, :status=>{:phase=>"Active"}}>

Version-Release number of selected component (if applicable):
logging-fluentd            v3.6.172.0.4-1      424d76ec1b51        3 hours ago         231.7 MB

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging with openshift_logging_fluentd_journal_read_from_head=true in the inventory file.

2. To increase fluentd's journal read load, create 30 projects on OpenShift with 38 pods running in total; each pod sends out logs continuously:

# for i in {1..30}
do
      oc new-project project${i}
      oc new-app chunyunchen/java-mainclass:2.3-SNAPSHOT
      sleep 30
done

# oc get po --all-namespaces | grep -i running | wc -l
38

3. Wait about 1 hour, until every index can be found in ES, then check the fluentd log.

Actual results:
The fluentd log is filled up with KubeClient messages.

Expected results:
The fluentd log should not be filled up with KubeClient messages.

Additional info:
full logs here: https://bugzilla.redhat.com/attachment.cgi?id=1305848

Comment 1 Noriko Hosoi 2017-07-31 18:43:52 UTC
The current log level of fluentd is "warn".

$ cat openshift/system.conf
<system>
  log_level warn
</system>

That makes these messages visible in the fluentd log.
> [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not ... 

We have another request from the RHV team to change the default log level:

On 07/31/2017 02:23 AM, Shirly Radco wrote:
> I would like to propose adding 
> 
> <system>
>   log_level error
> </system>
> 
>  to a default fluentd.conf provided with the fluentd package.
> 
> It is a general configuration and not oVirt specific.
> 
> We require it since when the client fluent can't connect to the mux it starts logging many warning messages.

Do we want to change fluentd's default log level to error?  Please note that if we lose these warning logs, we may end up waiting a very long time for the process to complete without any feedback.  I feel it might make administrators' jobs harder.

Regarding the "output buffer queue is full" issue itself, did you have a chance to tune the fluentd buffer chunk limit, Xia?  Could you please tell us what the current size is?  1m?
> openshift/output-es-config.conf:   buffer_chunk_limit "#{ENV['BUFFER_SIZE_LIMIT'] || '1m' }"

I assume you are using memory-type buffering.  Would it be possible to increase the memory limit?  By default it is 512M; could you try 2G, for instance?  Then set the environment variable BUFFER_SIZE_LIMIT to 16m and retry the test.
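
As a rough illustration only (a sketch, not this cluster's actual openshift/output-es-config.conf), the buffer-related parameters in the Elasticsearch output stanza would look something like this, assuming memory buffering and the BUFFER_SIZE_LIMIT environment variable referenced above:

 <match **>
   type elasticsearch
   # host/port/index settings omitted
   buffer_type memory
   # suggested bump from the 1m default, driven by the env var
   buffer_chunk_limit "#{ENV['BUFFER_SIZE_LIMIT'] || '16m'}"
   buffer_queue_limit 32
   # the 'block' action is what produces the warn messages quoted above
   buffer_queue_full_action block
   flush_interval 5s
 </match>

(The 512M/2G memory limit is presumably the logging-fluentd container's memory limit on the daemonset, not a setting in this file.)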

Thanks.

Comment 2 Rich Megginson 2017-07-31 19:02:05 UTC
re: log levels - another option is to change the log level for individual plugins.  Some plugins support a `log_level` parameter.  That may allow us to sufficiently increase the signal-to-noise ratio in the fluentd logs.
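
For example (a sketch only, not verified against this particular plugin), a per-plugin override in the filter stanza would look roughly like this, leaving the system-wide log_level in system.conf untouched:

 <filter kubernetes.**>
   type kubernetes_metadata
   # hypothetical per-plugin override; older fluentd releases spell this
   # parameter log_level rather than @log_level
   @log_level error
   kubernetes_url "#{ENV['K8S_HOST_URL']}"
   bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
 </filter>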

If that's not good enough, then we may have to submit PRs for the upstream code for the plugins and perhaps fluentd itself to get those log levels changed for those messages.



The problem in this bz is that we are now seeing messages like this, which we have not seen before:

#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-public", :selfLink=>"/api/v1/namespaces/kube-public", :uid=>"55d90bad-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14110", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c3,c2", :"openshift.io/sa.scc.supplemental-groups"=>"1000010000/10000", :"openshift.io/sa.scc.uid-range"=>"1000010000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>

I'm not sure why we are seeing these now.

Comment 3 Noriko Hosoi 2017-07-31 19:26:39 UTC
Sorry, I misunderstood the log output.  So, this message does not have the date, does it?

#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-public", :selfLink=>"/api/v1/namespaces/kube-public", :uid=>"55d90bad-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14110", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c3,c2", :"openshift.io/sa.scc.supplemental-groups"=>"1000010000/10000", :"openshift.io/sa.scc.uid-range"=>"1000010000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>

Comment 4 Rich Megginson 2017-07-31 19:46:47 UTC
@nhosoi right - it doesn't look like a normal fluentd log message - I suspect the kubernetes metadata plugin is writing these messages directly to stdout/stderr instead of using `log.error("....")`

Comment 5 Noriko Hosoi 2017-07-31 21:02:06 UTC
I see. @richm, thanks! 

I've been looking into the kubeclient code.  watch_stream.rb retrieves the contents of the buffer; the library is required by "filter_kubernetes_metadata.rb".

$ egrep -a4 WatchNotice lib/*/*
[...]
--
lib/kubeclient/watch_stream.rb-        buffer = ''
lib/kubeclient/watch_stream.rb-        response.body.each do |chunk|
lib/kubeclient/watch_stream.rb-          buffer << chunk
lib/kubeclient/watch_stream.rb-          while (line = buffer.slice!(/.+\n/))
lib/kubeclient/watch_stream.rb:            yield(@format == :json ? WatchNotice.new(JSON.parse(line)) : line.chomp)
lib/kubeclient/watch_stream.rb-          end
lib/kubeclient/watch_stream.rb-        end
lib/kubeclient/watch_stream.rb-      rescue IOError
lib/kubeclient/watch_stream.rb-        raise unless @finished

It looks like the watch code is enabled by default in "filter_kubernetes_metadata.rb":
    config_param :watch, :bool, default: true
        [...]
        if @watch
          thread = Thread.new(self) { |this| this.start_watch }

I haven't tried it myself yet, but is it possible to run the test with watch disabled in /etc/fluent/configs.d/openshift/filter-k8s-meta.conf, like this?

 <filter kubernetes.**>
   type kubernetes_metadata
+  watch false
   kubernetes_url "#{ENV['K8S_HOST_URL']}"
   bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token

Comment 6 Rich Megginson 2017-07-31 21:22:54 UTC
The watch code is for performance.  That is what allows us to use cached values rather than query the Kubernetes API server for every record - it is used to keep the cache up-to-date.  If we get rid of the watch, the performance will be bad.

Did you find some code in the kubeclient libraries that does a "puts" or otherwise writes directly to stdout/stderr?

Comment 7 Noriko Hosoi 2017-07-31 21:34:15 UTC
Not in the kubeclient libraries; the only "puts" I found is in this namespace watch in "filter_kubernetes_metadata.rb", where "notice" appears to come from WatchNotice.

    def start_namespace_watch
      resource_version = @client.get_namespaces.resourceVersion
      watcher          = @client.watch_namespaces(resource_version)
      watcher.each do |notice|
        puts notice
        case notice.type
          when 'MODIFIED'
            cache_key = notice.object['metadata']['name']
            cached    = @namespace_cache[cache_key]
            if cached
              @namespace_cache[cache_key] = parse_namespace_metadata(notice.object)
            end

Continue investigating...

Comment 8 Rich Megginson 2017-07-31 23:22:05 UTC
>        puts notice

This is a bug.  This should be `log.debug(notice)` - this should not use puts.  When was this introduced?

Comment 9 Noriko Hosoi 2017-07-31 23:41:41 UTC
Well, it's pretty old...

> Date:   Tue Sep 8 15:31:45 2015 -0700

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/commit/692d95ea20810c96d231eccad93c059c1b8b92b3

Am I looking at a wrong branch?

Comment 10 Rich Megginson 2017-08-01 00:01:33 UTC
(In reply to Noriko Hosoi from comment #9)
> Well, it's pretty old...
> 
> > Date:   Tue Sep 8 15:31:45 2015 -0700
> 
> https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/commit/
> 692d95ea20810c96d231eccad93c059c1b8b92b3
> 
> Am I looking at a wrong branch?

You are correct.

Wow - how did we miss this problem for so long?

Comment 11 Noriko Hosoi 2017-08-01 00:11:20 UTC
Maybe, the outputs have been appreciated? :)

Comment 13 Noriko Hosoi 2017-08-02 17:16:19 UTC
The master branch of fluent-plugin-kubernetes_metadata_filter has the fix:

  commit 004c74ab904c2450f74ecb4fcb9be229722917eb
  Merge: d50e98c 4e9aa57
  Author: Jimmi Dyson <jimmidyson>
  Date:   Tue Aug 1 22:55:46 2017 +0100

    Merge pull request #80 from flix-tech/remove-namespace-watch-logging
    
    Removed logging of namespace watch events

  diff --git a/lib/fluent/plugin/filter_kubernetes_metadata.rb b/lib/fluent/plugin/filter_kubernetes_metadata.rb
  index 7f4ede4..a3fb86b 100644
  --- a/lib/fluent/plugin/filter_kubernetes_metadata.rb
  +++ b/lib/fluent/plugin/filter_kubernetes_metadata.rb
  @@ -405,7 +405,6 @@ module Fluent
         resource_version = @client.get_namespaces.resourceVersion
         watcher          = @client.watch_namespaces(resource_version)
         watcher.each do |notice|
  -        puts notice
           case notice.type
             when 'MODIFIED'
               cache_key = notice.object['metadata']['name']

Is it ok to reset the status of this bug to POST?

Comment 14 Rich Megginson 2017-08-02 17:21:31 UTC
(In reply to Noriko Hosoi from comment #13)
> The master branch of fluent-plugin-kubernetes_metadata_filter has the fix:
> 
>   commit 004c74ab904c2450f74ecb4fcb9be229722917eb
>   Merge: d50e98c 4e9aa57
>   Author: Jimmi Dyson <jimmidyson>
>   Date:   Tue Aug 1 22:55:46 2017 +0100
> 
>     Merge pull request #80 from flix-tech/remove-namespace-watch-logging
>     
>     Removed logging of namespace watch events
> 
>   diff --git a/lib/fluent/plugin/filter_kubernetes_metadata.rb
> b/lib/fluent/plugin/filter_kubernetes_metadata.rb
>   index 7f4ede4..a3fb86b 100644
>   --- a/lib/fluent/plugin/filter_kubernetes_metadata.rb
>   +++ b/lib/fluent/plugin/filter_kubernetes_metadata.rb
>   @@ -405,7 +405,6 @@ module Fluent
>          resource_version = @client.get_namespaces.resourceVersion
>          watcher          = @client.watch_namespaces(resource_version)
>          watcher.each do |notice|
>   -        puts notice
>            case notice.type
>              when 'MODIFIED'
>                cache_key = notice.object['metadata']['name']
> Is it ok to reset the status of this bug to POST?

Has this fix been released as a rubygem available from rubygems.org?  If so, then yes, please mark as POST, otherwise, MODIFIED.

Comment 15 Noriko Hosoi 2017-08-02 17:35:07 UTC
(In reply to Rich Megginson from comment #14)
> Has this fix been released as a rubygem available from rubygems.org?  If so,
> then yes, please mark as POST, otherwise, MODIFIED.

Thanks, Rich.

On the rubygems.org site:
 https://rubygems.org/gems/fluent-plugin-kubernetes_metadata_filter
the latest available version is:
 0.27.0 - May 18, 2017 (20.5 KB) 

That is, it's not there yet.

Resetting the bug's status to MODIFIED.

(Interesting.  We are using the MODIFIED status when the change is included in the built bits for the downstream release; indeed, POST is for the upstream fix.  And yes, in this case, fixing it downstream is simpler for us...  I need to adjust my thinking. :)

Comment 16 Rich Megginson 2017-08-02 22:09:45 UTC
@jdyson - when is the next rubygems release of fluent-plugin-kubernetes_metadata_filter planned?

Comment 17 Rich Megginson 2017-08-02 22:10:15 UTC
@nhosoi - yeah, I think POST is more appropriate in this case

Comment 18 Jimmi Dyson 2017-08-03 08:00:29 UTC
0.28.0 just released including the fix.

Comment 19 Rich Megginson 2017-08-14 16:51:56 UTC
commit 21c7f15ff6d95398bef044540f58064260f3ffca
Author: Rich Megginson <rmeggins>
Date:   Mon Aug 14 10:36:20 2017 -0600

    0.28.0
    Bug 1476731 - fluentd log is filled up with KubeClient messages
        when using journal and output queue is full

Built and tagged into rhaos-3.6-rhel-7-candidate and rhaos-3.7-rhel-7-candidate.

Comment 20 Rich Megginson 2017-08-16 00:31:36 UTC
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=583049
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-77006-20170815231440
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5-6

Comment 21 Rich Megginson 2017-08-16 00:34:04 UTC
@smunilla - how do I get this bug/image attached to the 3.6.1 errata?

Comment 22 Xia Zhao 2017-08-16 09:24:23 UTC
The bug verification work is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1482002

Comment 23 Xia Zhao 2017-08-24 10:05:20 UTC
It's fixed with logging-fluentd v3.6.173.0.5-10.

Created 38 pods running on the OpenShift env, and the fluentd log is back to normal, with log entries showing up correctly in Kibana. Set to verified.

Image tested with:
logging-fluentd   v3.6.173.0.5-10     58ab4badc0b7        13 hours ago        235.1 MB

Comment 25 errata-xmlrpc 2017-09-08 03:15:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2642

