Description of problem:

Some fluentd pods are not sending logs. Their logs show:

2019-07-11 19:01:22 -0300 [warn]: fluent/output.rb:381:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2019-07-11 19:06:22 -0300 error_class="Elasticsearch::Transport::Transport::Errors::InternalServerError" error="[500] {\"error\":{\"root_cause\":[{\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: [B@656db2d1; line: 1, column: 173]\"}],\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: [B@656db2d1; line: 1, column: 173]\"},\"status\":500}" plugin_id="object:3fbfc7979034"

This is very similar to two previously opened bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1654704
https://bugzilla.redhat.com/show_bug.cgi?id=1625254

In those cases, it was determined with high confidence that the issue was caused by the buffers running out of disk space, which corrupted them. In this case, however, the disk does not appear to be full at all.

Version-Release number of selected component (if applicable):

3.9.85-1

Additional info:

I am not convinced this is a bug; we are still monitoring disk usage on the nodes. However, it seems possible that disk space is not the cause this time, and I want to cover all bases.
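For anyone triaging the same error, here is a minimal Python sketch of why Elasticsearch rejects the payload. Byte 0x92 has the bit pattern 10010010, which UTF-8 reserves for continuation bytes, so it can never start a character. The helper name `first_invalid_utf8` and the sample record are hypothetical and are not taken from the affected cluster. The Windows-1252 reading in the last line is only one plausible origin of such a byte, not something confirmed for this case.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded
        return exc.start, data[exc.start]
    return None


# Hypothetical log record containing a stray 0x92 byte
sample = b'{"message":"it\x92s broken"}'

# Offset and value of the offending byte in the sample
print(first_invalid_utf8(sample))

# Assumption: in Windows-1252 the same byte is a right single quotation mark
print(repr(bytes([0x92]).decode("cp1252")))
```

Running a check like this over the raw application logs on the node may help locate which container emits the bad bytes.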
This is a duplicate of the issues referenced. If fluentd is truly stuck spinning without pushing messages, you could try removing the bad messages with the sanitize_msg_chunks utility [1], which may be in the 3.9 fluentd image. It will require access to the node. The utility was merged into 3.9, but looking at HEAD of 3.9 I no longer see it and am unable to determine when it was removed. Regardless, you should be able to copy the code from the master branch into a debug fluentd pod and sanitize the buffers.

[1] https://github.com/openshift/origin-aggregated-logging/tree/master/fluentd#sanitize_msg_chunks
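To illustrate the general idea of sanitizing, here is a small Python sketch that replaces undecodable bytes so a record can be serialized as valid JSON. This is not the sanitize_msg_chunks implementation and it does not read the fluentd buffer chunk format; `sanitize_utf8` is a hypothetical helper that operates on the raw bytes of a single message.

```python
import json


def sanitize_utf8(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, replacing each invalid byte with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical message containing the byte reported in the error
message = sanitize_utf8(b"it\x92s broken")

# json.dumps emits ASCII by default, so the result is safe to send
payload = json.dumps({"message": message})
print(payload)
```

Replacing the byte loses the original character, so this trades fidelity for getting the rest of the buffer flushed.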
Hi, yes, I suspected it was a dupe of those bugs, as noted. I'll check with the customer about sanitizing the message chunks.
Closing as EOL, since this is unlikely to be addressed before the release reaches EOL in Oct 2019.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days