Description of problem:

Observed in https://bugzilla.redhat.com/show_bug.cgi?id=1473788#c4

When fluentd is reading from the journal and the output buffer queue is full, the fluentd log fills up with KubeClient messages:

2017-07-28 03:36:20 -0400 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-07-28 03:36:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-07-28 03:36:29 -0400 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-public", :selfLink=>"/api/v1/namespaces/kube-public", :uid=>"55d90bad-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14110", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c3,c2", :"openshift.io/sa.scc.supplemental-groups"=>"1000010000/10000", :"openshift.io/sa.scc.uid-range"=>"1000010000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>
#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-system", :selfLink=>"/api/v1/namespaces/kube-system", :uid=>"55d84610-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14117", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c1,c0", :"openshift.io/sa.scc.supplemental-groups"=>"1000000000/10000", :"openshift.io/sa.scc.uid-range"=>"1000000000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>
#<Kubeclient::Common::WatchNotice type="ADDED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"project1", :selfLink=>"/api/v1/namespaces/project1", :uid=>"f3223a65-736a-11e7-98a0-fa163ec70700", :resourceVersion=>"14184", :creationTimestamp=>"2017-07-28T08:01:24Z", :annotations=>{:"openshift.io/description"=>"", :"openshift.io/display-name"=>"", :"openshift.io/requester"=>"xiazhao"}}, :spec=>{:finalizers=>["openshift.io/origin", "kubernetes"]}, :status=>{:phase=>"Active"}}>
#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"project1", :selfLink=>"/api/v1/namespaces/project1", :uid=>"f3223a65-736a-11e7-98a0-fa163ec70700", :resourceVersion=>"14187", :creationTimestamp=>"2017-07-28T08:01:24Z", :annotations=>{:"openshift.io/description"=>"", :"openshift.io/display-name"=>"", :"openshift.io/requester"=>"xiazhao", :"openshift.io/sa.scc.mcs"=>"s0:c9,c4", :"openshift.io/sa.scc.supplemental-groups"=>"1000080000/10000", :"openshift.io/sa.scc.uid-range"=>"1000080000/10000"}}, :spec=>{:finalizers=>["openshift.io/origin", "kubernetes"]}, :status=>{:phase=>"Active"}}>

Version-Release number of selected component (if applicable):
logging-fluentd   v3.6.172.0.4-1   424d76ec1b51   3 hours ago   231.7 MB

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging with openshift_logging_fluentd_journal_read_from_head=true in the inventory file.
2. To increase fluentd's journal read load, create 30 projects with 38 pods running on OpenShift; each pod sends out logs continuously:

# for i in {1..30}; do
    oc new-project project${i}
    oc new-app chunyunchen/java-mainclass:2.3-SNAPSHOT
    sleep 30
  done

# oc get po --all-namespaces | grep -i running | wc -l
38

3. Wait about an hour until every index can be found in ES, then check the fluentd log.

Actual results:
fluentd log is filled up with KubeClient messages

Expected results:
fluentd log should not be filled up with KubeClient messages

Additional info:
Full logs here: https://bugzilla.redhat.com/attachment.cgi?id=1305848
The current log level of fluentd is "warn":

$ cat openshift/system.conf
<system>
  log_level warn
</system>

That makes these messages visible in the fluentd log:

> [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
...

We have another request from the RHV team to change the default log level:

On 07/31/2017 02:23 AM, Shirly Radco wrote:
> I would like to propose adding
>
> <system>
>   log_level error
> </system>
>
> to a default fluentd.conf provided with the fluentd package.
>
> It is a general configuration and not oVirt specific.
>
> We require it since when the client fluentd can't connect to the mux it starts logging many warning messages.

Do we want to change fluentd's default log level to error? Please note that if we lose these warning logs, we may end up waiting a very long time for the process to complete without any indication of what is happening. I feel it might make administrators' jobs harder.

Regarding the "output buffer queue is full" issue itself, did you have a chance to tune the fluentd buffer chunk limit, Xia? Could you please tell us the current size? 1m?

> openshift/output-es-config.conf: buffer_chunk_limit "#{ENV['BUFFER_SIZE_LIMIT'] || '1m' }"

I assume you are using memory-type buffering. Would it be possible to increase the memory limit? By default it is 512M. Could you try 2G, for instance? Then set the environment variable BUFFER_SIZE_LIMIT to 16m and retry the test? Thanks.
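For reference, a sketch of the memory-buffer tuning suggested above. The parameter names come from fluentd v0.12's built-in memory buffer; the surrounding match block and the queue length are illustrative assumptions, not the exact OpenShift template:

```
<match **>
  @type elasticsearch
  buffer_type memory
  # chunk size: raising BUFFER_SIZE_LIMIT from the 1m default to 16m
  buffer_chunk_limit "#{ENV['BUFFER_SIZE_LIMIT'] || '16m'}"
  # total buffer memory is roughly buffer_queue_limit * buffer_chunk_limit,
  # so 128 * 16m ~= 2G, matching the proposed memory limit
  buffer_queue_limit 128
</match>
```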
re: log levels - another option is to change the log level for individual plugins. Some plugins support a `log_level` parameter. That may allow us to sufficiently increase the signal-to-noise ratio in the fluentd logs. If that's not good enough, we may have to submit PRs against the upstream plugins, and perhaps fluentd itself, to get the log levels of those messages changed.

The problem in this bz is that we are now seeing messages like this that we have not seen before:

#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-public", :selfLink=>"/api/v1/namespaces/kube-public", :uid=>"55d90bad-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14110", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c3,c2", :"openshift.io/sa.scc.supplemental-groups"=>"1000010000/10000", :"openshift.io/sa.scc.uid-range"=>"1000010000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>

I'm not sure why we are seeing these now.
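If we go the per-plugin route, a sketch of what that could look like (assuming the plugin honors fluentd's standard `@log_level` parameter; the filter block here is illustrative, not the exact shipped config):

```
<filter kubernetes.**>
  @type kubernetes_metadata
  # quiet only this plugin; the global <system> log_level stays at warn
  @log_level error
</filter>
```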
Sorry, I misunderstood the log output. So this message does not have a date, does it?

#<Kubeclient::Common::WatchNotice type="MODIFIED", object={:kind=>"Namespace", :apiVersion=>"v1", :metadata=>{:name=>"kube-public", :selfLink=>"/api/v1/namespaces/kube-public", :uid=>"55d90bad-7315-11e7-98a0-fa163ec70700", :resourceVersion=>"14110", :creationTimestamp=>"2017-07-27T21:48:32Z", :annotations=>{:"openshift.io/sa.scc.mcs"=>"s0:c3,c2", :"openshift.io/sa.scc.supplemental-groups"=>"1000010000/10000", :"openshift.io/sa.scc.uid-range"=>"1000010000/10000"}}, :spec=>{:finalizers=>["kubernetes", "openshift.io/origin"]}, :status=>{:phase=>"Active"}}>
@nhosoi right - it doesn't look like a normal fluentd log message - I suspect the kubernetes metadata plugin is writing these messages directly to stdout/stderr instead of using `log.error("....")`
I see. @richm, thanks!

I've been looking into the kubeclient code. The watch_stream.rb is retrieving the contents of the buffer. The lib is required by "filter_kubernetes_metadata.rb".

$ egrep -a4 WatchNotice lib/*/*
[...]
--
lib/kubeclient/watch_stream.rb-        buffer = ''
lib/kubeclient/watch_stream.rb-        response.body.each do |chunk|
lib/kubeclient/watch_stream.rb-          buffer << chunk
lib/kubeclient/watch_stream.rb-          while (line = buffer.slice!(/.+\n/))
lib/kubeclient/watch_stream.rb:            yield(@format == :json ? WatchNotice.new(JSON.parse(line)) : line.chomp)
lib/kubeclient/watch_stream.rb-          end
lib/kubeclient/watch_stream.rb-        end
lib/kubeclient/watch_stream.rb-      rescue IOError
lib/kubeclient/watch_stream.rb-        raise unless @finished

It looks like the watch code is enabled by default in "filter_kubernetes_metadata.rb":

config_param :watch, :bool, default: true
[...]
if @watch
  thread = Thread.new(self) { |this| this.start_watch }

I haven't tried it myself yet, but is it possible to run the test with watch disabled in /etc/fluentd/configs.d/openshift/filter-k8s-meta.conf, like this?

<filter kubernetes.**>
  type kubernetes_metadata
+ watch false
  kubernetes_url "#{ENV['K8S_HOST_URL']}"
  bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
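As a side note, the chunk-splitting logic quoted from watch_stream.rb can be exercised in isolation. This is a minimal standalone sketch (the helper name is made up for illustration; the real code yields WatchNotice objects rather than collecting parsed lines):

```ruby
require 'json'

# Standalone sketch of kubeclient's watch_stream.rb buffering: HTTP chunks
# arrive at arbitrary boundaries, and complete newline-terminated lines are
# sliced off the front of a string buffer before being parsed.
def each_watch_line(chunks)
  buffer = +''
  lines = []
  chunks.each do |chunk|
    buffer << chunk
    # .+ stops before the newline, so each slice! removes exactly one full line
    while (line = buffer.slice!(/.+\n/))
      lines << JSON.parse(line.chomp)
    end
  end
  lines
end

# a JSON object split across two chunks is reassembled correctly
notices = each_watch_line(["{\"type\":\"MODIFIED\"}\n{\"ty", "pe\":\"ADDED\"}\n"])
```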
The watch code is for performance. That is what allows us to use cached values rather than query the Kubernetes API server for every record - it is used to keep the cache up-to-date. If we get rid of the watch, the performance will be bad. Did you find some code in the kubeclient libraries that does a "puts" or otherwise writes directly to stdout/stderr?
Not in the kubeclient libraries; the only "puts" I found is in this namespace watch in "filter_kubernetes_metadata.rb", where "notice" appears to come from WatchNotice:

def start_namespace_watch
  resource_version = @client.get_namespaces.resourceVersion
  watcher          = @client.watch_namespaces(resource_version)
  watcher.each do |notice|
    puts notice
    case notice.type
    when 'MODIFIED'
      cache_key = notice.object['metadata']['name']
      cached    = @namespace_cache[cache_key]
      if cached
        @namespace_cache[cache_key] = parse_namespace_metadata(notice.object)
      end

Continuing to investigate...
> puts notice This is a bug. This should be `log.debug(notice)` - this should not use puts. When was this introduced?
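A minimal sketch of what a leveled-logging version of that loop could look like (the WatchNotice struct, cache, and method name here are simplified stand-ins for the plugin's real objects, not the actual upstream fix):

```ruby
require 'logger'

# Simplified stand-in for Kubeclient::Common::WatchNotice
WatchNotice = Struct.new(:type, :object)

# The watch loop with `puts notice` replaced by a leveled log call:
# events are only printed when debug logging is enabled.
def handle_namespace_notices(notices, namespace_cache, log)
  notices.each do |notice|
    log.debug { notice.inspect }  # was: puts notice
    case notice.type
    when 'MODIFIED'
      cache_key = notice.object['metadata']['name']
      if namespace_cache.key?(cache_key)
        namespace_cache[cache_key] = notice.object
      end
    end
  end
end

log = Logger.new($stderr)
log.level = Logger::WARN  # default: debug messages are suppressed
cache = { 'kube-public' => { 'metadata' => { 'name' => 'kube-public' } } }
handle_namespace_notices(
  [WatchNotice.new('MODIFIED',
                   { 'metadata' => { 'name' => 'kube-public' },
                     'status' => { 'phase' => 'Active' } })],
  cache, log
)
```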
Well, it's pretty old... > Date: Tue Sep 8 15:31:45 2015 -0700 https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/commit/692d95ea20810c96d231eccad93c059c1b8b92b3 Am I looking at a wrong branch?
(In reply to Noriko Hosoi from comment #9) > Well, it's pretty old... > > > Date: Tue Sep 8 15:31:45 2015 -0700 > > https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/commit/ > 692d95ea20810c96d231eccad93c059c1b8b92b3 > > Am I looking at a wrong branch? You are correct. Wow - how did we miss this problem for so long?
Maybe the output has been appreciated by someone? :)
Filed an issue: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/issues/84
The master branch of fluent-plugin-kubernetes_metadata_filter has the fix:

commit 004c74ab904c2450f74ecb4fcb9be229722917eb
Merge: d50e98c 4e9aa57
Author: Jimmi Dyson <jimmidyson>
Date:   Tue Aug 1 22:55:46 2017 +0100

    Merge pull request #80 from flix-tech/remove-namespace-watch-logging

    Removed logging of namespace watch events

diff --git a/lib/fluent/plugin/filter_kubernetes_metadata.rb b/lib/fluent/plugin/filter_kubernetes_metadata.rb
index 7f4ede4..a3fb86b 100644
--- a/lib/fluent/plugin/filter_kubernetes_metadata.rb
+++ b/lib/fluent/plugin/filter_kubernetes_metadata.rb
@@ -405,7 +405,6 @@ module Fluent
         resource_version = @client.get_namespaces.resourceVersion
         watcher = @client.watch_namespaces(resource_version)
         watcher.each do |notice|
-          puts notice
           case notice.type
           when 'MODIFIED'
             cache_key = notice.object['metadata']['name']

Is it ok to reset the status of this bug to POST?
(In reply to Noriko Hosoi from comment #13)
> The master branch of fluent-plugin-kubernetes_metadata_filter has the fix:
> ...
> Is it ok to reset the status of this bug to POST?

Has this fix been released as a rubygem available from rubygems.org? If so, then yes, please mark as POST; otherwise, MODIFIED.
(In reply to Rich Megginson from comment #14)
> Has this fix been released as a rubygem available from rubygems.org? If so,
> then yes, please mark as POST, otherwise, MODIFIED.

Thanks, Rich. On the rubygems.org site:
https://rubygems.org/gems/fluent-plugin-kubernetes_metadata_filter
the latest available version is:
0.27.0 - May 18, 2017 (20.5 KB)

That is, it's not there yet. Resetting the bug's status to MODIFIED.

(Interesting. We use the MODIFIED status when the change is included in the built bits for the download release, while POST is for the upstream fix. And yes, in this case, fixing it downstream first is simpler for us... I need to adjust my assumptions. :)
@jdyson - when is the next rubygems release of fluent-plugin-kubernetes_metadata_filter planned?
@nhosoi - yeah, I think POST is more appropriate in this case
0.28.0 just released including the fix.
commit 21c7f15ff6d95398bef044540f58064260f3ffca
Author: Rich Megginson <rmeggins>
Date:   Mon Aug 14 10:36:20 2017 -0600

    0.28.0
    Bug 1476731 - fluentd log is filled up with KubeClient messages when using journal and output queue is full

Built and tagged into rhaos-3.6-rhel-7-candidate and rhaos-3.7-rhel-7-candidate.
koji_builds:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=583049

repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-77006-20170815231440
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.5-6
@smunilla - how do I get this bug/image attached to the 3.6.1 errata?
The bug verification work is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1482002
It's fixed with fluentd v3.6.173.0.5-10.

Created 38 pods running in the OpenShift environment; the fluentd log is back to normal, and log entries show up in Kibana as expected. Setting to VERIFIED.

Image tested with:
logging-fluentd   v3.6.173.0.5-10   58ab4badc0b7   13 hours ago   235.1 MB
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2642