Description of problem:
split-fluentd (secure forward) 
20 enrichment fluentd pods
200 fluentd dispatcher pods
While load testing logging split-fluentd configured with secure_forward we've found that most of the enrichment pods buffer_chunk_limits were being exceeded.
Steps to Reproduce:
1. Install logging infra
2. Scale enrichment fluentd deploymentconfigs to 20
3. Label 200 nodes with logging-infra-fluentd=true
In most of the enrichment fluent pods
2017-01-04 13:31:15 -0500 [info]: reading config file path="/etc/fluent/fluent.conf"
2017-01-04 18:32:23 -0500 [warn]: Size of the emitted data exceeds buffer_chunk_limit.
2017-01-04 18:32:23 -0500 [warn]: This may occur problems in the output plugins ``at this server.``
2017-01-04 18:32:23 -0500 [warn]: To avoid problems, set a smaller number to the buffer_chunk_limit
2017-01-04 18:32:23 -0500 [warn]: in the forward output ``at the log forwarding server.``
2017-01-04 18:44:38 -0500 [warn]: Size of the emitted data exceeds buffer_chunk_limit.
2017-01-04 18:44:38 -0500 [warn]: This may occur problems in the output plugins ``at this server.``
2017-01-04 18:44:38 -0500 [warn]: To avoid problems, set a smaller number to the buffer_chunk_limit
2017-01-04 18:44:38 -0500 [warn]: in the forward output ``at the log forwarding server.``
Pod logs should display:
2017-01-04 13:31:14 -0500 [info]: reading config file path="/etc/fluent/fluent.conf"
At 10wppn one of the fluentds had:
Not sure if it is a bug per se but it is worth registering in order to mitigate buffer_chunk_limit and on how that limit is determined.
Looks like the enrichment pods need a bigger chunk size to match what the dispatcher pods are emitting.
Created https://trello.com/c/pHnZPpoE to make alterations.
Modifying priority/severity based on conversation with @jerms
Thanks Jeff -- the end of that thread was that we're not aware of this causing a loss of log data. If someone can prove that, then the prio of this would become urgent. If/until that's proven, this BZ is about correctness, optimization and cleaning up logs ... and can be dealt with as low severity.
Noriko is working on this.
(In reply to Peter Portante from comment #2)
> Looks like the enrichment pods need a bigger chunk size to match what the
> dispatcher pods are emitting.
I agree with Peter.
Could you find out the version(s) of fluentd? E.g., 0.12.29
Can we also have the buffer_chunk_size and buffer_queue_size values on the enrichment pods as well as on the dispatcher pods if they are configured?
oc rsh $ENRICHMENT_FLUENTD_POD as well as $DISPATCHER_FLUENTD_POD
grep buffer_chunk_limit /etc/fluent/configs.d/*/*.conf
If you don't see the output, the default value in the version should be used...
Note: In the upstream origin-aggregated-logging, we made the parameters configurable, that is, we could pass them via environment variables.
Could you rerun the test using these container images? Thanks.
Is it ok to set "Target Release" to 3.6.0?
*** Bug 1442951 has been marked as a duplicate of this bug. ***
This bug is targeted for 3.6.0, but verification is blocked by bug 1469859 which is currently targeted for 3.6.1
tl;dr This message means that the size of a single log record is bigger than the internal fluentd chunk size. Fluentd will still accept this record. When this happens, it is really important to find out where this large record came from, exactly how large it is, and ideally, get a copy of the record, so we can see if this is just a fluke, or a case we need to handle regularly.
Both the elasticsearch and secure_forward plugins are ObjectBufferedOutput plugins. emit method: https://github.com/fluent/fluentd/blob/v0.12.37/lib/fluent/output.rb#L463
if @buffer.emit(key, data, chain)
If the @buffer.emit call returns true, it tries to flush immediately. The buffer.emit call returns true if the current chunk is full (buffer_chunk_limit reached) _and_ the queue is empty: https://github.com/fluent/fluentd/blob/v0.12.37/lib/fluent/buffer.rb#L190
# chunk unique id is generated in #new_chunk
chunk = (@map[key] ||= new_chunk(key))
if storable?(chunk, data)
chunk << data
@map[key] is the current chunk being filled. Note that it _is not queued_ - it is not yet in the queue. If the new data fits in the current chunk, just add it and continue - return false - no flush yet.
elsif @queue.size >= @buffer_queue_limit
raise BufferQueueLimitError, "queue size exceeds limit"
if the data is not storable, that means we need to add the current chunk to the queue and start a new chunk. However, if we are already at the queue size limit, we raise an exception.
rescue BufferQueueLimitError => e
# This is rough implementation. New Buffer API should improve this routine by using wait/signal
$log.debug "buffer queue is full. Wait 1 second to re-emit events"
$log.debug "buffer queue is full. Dropping oldest chunk"
The default case is :exception which means the input plugin must be able to handle it.
if data.bytesize > @buffer_chunk_limit
$log.warn "Size of the emitted data exceeds buffer_chunk_limit."
$log.warn "This may occur problems in the output plugins ``at this server.``"
$log.warn "To avoid problems, set a smaller number to the buffer_chunk_limit"
$log.warn "in the forward output ``at the log forwarding server.``"
# raise BufferChunkLimitError, "received data too large"
This is the case when the size of a single record exceeds the buffer_chunk_limit - which so far has been very rare, and yet fluentd will accept it with the warning.
Verified on 3.7.0-0.126.4. The buffer_chunk_limit message is no longer thrown. As logging-mux pods reach the limit of scalability (primarily driven by message rates), they exhibit high cpu utilization but no longer throw errors or lose messages.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.