Description of problem: Setup: split-fluentd (secure forward) [1] 20 enrichment fluentd pods 200 fluentd dispatcher pods While load testing logging split-fluentd configured with secure_forward we've found that most of the enrichment pods buffer_chunk_limits were being exceeded. How reproducible: Steps to Reproduce: 1. Install logging infra 2. Scale enrichment fluentd deploymentconfigs to 20 3. Label 200 nodes with logging-infra-fluentd=true Actual results: In most of the enrichment fluent pods 2017-01-04 13:31:15 -0500 [info]: reading config file path="/etc/fluent/fluent.conf" 2017-01-04 18:32:23 -0500 [warn]: Size of the emitted data exceeds buffer_chunk_limit. 2017-01-04 18:32:23 -0500 [warn]: This may occur problems in the output plugins ``at this server.`` 2017-01-04 18:32:23 -0500 [warn]: To avoid problems, set a smaller number to the buffer_chunk_limit 2017-01-04 18:32:23 -0500 [warn]: in the forward output ``at the log forwarding server.`` 2017-01-04 18:44:38 -0500 [warn]: Size of the emitted data exceeds buffer_chunk_limit. 2017-01-04 18:44:38 -0500 [warn]: This may occur problems in the output plugins ``at this server.`` 2017-01-04 18:44:38 -0500 [warn]: To avoid problems, set a smaller number to the buffer_chunk_limit 2017-01-04 18:44:38 -0500 [warn]: in the forward output ``at the log forwarding server.`` Expected results: Pod logs should display: 2017-01-04 13:31:14 -0500 [info]: reading config file path="/etc/fluent/fluent.conf" At 10wppn one of the fluentds had: http://pbench.perf.lab.eng.bos.redhat.com/results/ocp34-master-1/erd_fluentd_256ll_10wppn_20_200_3ES_1h/log/oc/post_test/pods/oc_logs-logging-enrichment-fluentd-2-8xufg.log Not sure if it is a bug per se but it is worth registering in order to mitigate buffer_chunk_limit and on how that limit is determined. Additional info: [1] https://github.com/ewolinetz/origin-aggregated-logging/blob/7a814f58513b03f1660290864e550efc9db1e7a6/deployer/README.md#splitting-your-fluentd-for-high-scale-capacity-advanced
Looks like the enrichment pods need a bigger chunk size to match what the dispatcher pods are emitting. See https://groups.google.com/d/msg/fluentd/tb_kiN80cNk/_-OnzhGEcykJ
Created https://trello.com/c/pHnZPpoE to make alterations.
Modifying priority/severity based on conversation with @jerms
Thanks Jeff -- the end of that thread was that we're not aware of this causing a loss of log data. If someone can prove that, then the prio of this would become urgent. If/until that's proven, this BZ is about correctness, optimization and cleaning up logs ... and can be dealt with as low severity.
Noriko is working on this.
(In reply to Peter Portante from comment #2) > Looks like the enrichment pods need a bigger chunk size to match what the > dispatcher pods are emitting. I agree with Peter. Could you find out the version(s) of fluentd? E.g., 0.12.29 Can we also have the buffer_chunk_size and buffer_queue_size values on the enrichment pods as well as on the dispatcher pods if they are configured? oc rsh $ENRICHMENT_FLUENTD_POD as well as $DISPATCHER_FLUENTD_POD grep buffer_chunk_limit /etc/fluent/configs.d/*/*.conf If you don't see the output, the default value in the version should be used... Note: In the upstream origin-aggregated-logging, we made the parameters configurable, that is, we could pass them via environment variables. Thanks.
Could you rerun the test using these container images? Thanks. repositories: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-20170525163637 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.84-2 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.84 brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6 koji_builds: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=560368
Is it ok to set "Target Release" to 3.6.0?
*** Bug 1442951 has been marked as a duplicate of this bug. ***
This bug is targeted for 3.6.0, but verification is blocked by bug 1469859 which is currently targeted for 3.6.1
tl;dr This message means that the size of a single log record is bigger than the internal fluentd chunk size. Fluentd will still accept this record. When this happens, it is really important to find out where this large record came from, exactly how large it is, and ideally, get a copy of the record, so we can see if this is just a fluke, or a case we need to handle regularly. Both the elasticsearch and secure_forward plugins are ObjectBufferedOutput plugins. emit method: https://github.com/fluent/fluentd/blob/v0.12.37/lib/fluent/output.rb#L463 if @buffer.emit(key, data, chain) submit_flush end If the @buffer.emit call returns true, it tries to flush immediately. The buffer.emit call returns true if the current chunk is full (buffer_chunk_limit reached) _and_ the queue is empty: https://github.com/fluent/fluentd/blob/v0.12.37/lib/fluent/buffer.rb#L190 # chunk unique id is generated in #new_chunk chunk = (@map[key] ||= new_chunk(key)) if storable?(chunk, data) chain.next chunk << data return false @map[key] is the current chunk being filled. Note that it _is not queued_ - it is not yet in the queue. If the new data fits in the current chunk, just add it and continue - return false - no flush yet. elsif @queue.size >= @buffer_queue_limit raise BufferQueueLimitError, "queue size exceeds limit" end if the data is not storable, that means we need to add the current chunk to the queue and start a new chunk. However, if we are already at the queue size limit, we raise an exception. rescue BufferQueueLimitError => e case @buffer_queue_full_action when :exception raise e when :block # This is rough implementation. New Buffer API should improve this routine by using wait/signal $log.debug "buffer queue is full. Wait 1 second to re-emit events" sleep 1 retry when :drop_oldest_chunk $log.debug "buffer queue is full. Dropping oldest chunk" pop(nil) end end The default case is :exception which means the input plugin must be able to handle it. if data.bytesize > @buffer_chunk_limit $log.warn "Size of the emitted data exceeds buffer_chunk_limit." $log.warn "This may occur problems in the output plugins ``at this server.``" $log.warn "To avoid problems, set a smaller number to the buffer_chunk_limit" $log.warn "in the forward output ``at the log forwarding server.``" ### TODO # raise BufferChunkLimitError, "received data too large" end This is the case when the size of a single record exceeds the buffer_chunk_limit - which so far has been very rare, and yet fluentd will accept it with the warning.
Verified on 3.7.0-0.126.4. The buffer_chunk_limit message is no longer thrown. As logging-mux pods reach the limit of scalability (primarily driven by message rates), they exhibit high cpu utilization but no longer throw errors or lose messages.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188