Bug 1734177

Summary: some fluentd buffers won't flush with Invalid UTF-8 start byte 0x92
Product: OpenShift Container Platform
Component: Logging
Version: 3.9.0
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED EOL
Severity: high
Priority: unspecified
Reporter: Steven Walter <stwalter>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Anping Li <anli>
CC: aos-bugs, rmeggins
Flags: jcantril: needinfo? (stwalter)
Target Milestone: ---
Last Closed: 2019-08-22 14:55:50 UTC
Type: Bug

Description Steven Walter 2019-07-29 21:08:13 UTC
Description of problem:

Some fluentd pods are not sending logs. Their logs show:

2019-07-11 19:01:22 -0300 [warn]: fluent/output.rb:381:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2019-07-11 19:06:22 -0300 error_class="Elasticsearch::Transport::Transport::Errors::InternalServerError" error="[500] {\"error\":{\"root_cause\":[{\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: [B@656db2d1; line: 1, column: 173]\"}],\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: [B@656db2d1; line: 1, column: 173]\"},\"status\":500}" plugin_id="object:3fbfc7979034"

This is very similar to some previously opened bugs. In those past cases, it was determined with high confidence that the issue was caused by the buffers running out of disk space, which corrupted them. In this case, however, disk space does not appear to be full at all.
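The error itself is straightforward to reproduce: 0x92 falls in the UTF-8 continuation range (0x80-0xBF), so it can never begin a character. One plausible origin is Windows-1252 text (where 0x92 is a curly apostrophe, U+2019) getting embedded verbatim in an otherwise UTF-8 log line. A minimal sketch of the failure:

```python
# 0x92 lies in the UTF-8 continuation range (0x80-0xBF), so it is
# invalid as the first byte of a character -- the same condition
# Elasticsearch's JSON parser reports in the log above.
try:
    b"\x92".decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # invalid start byte
```

Any single record containing such a byte will poison the whole buffer chunk, since the chunk is rejected as a unit on every flush retry.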

Additional info:
I am not convinced this is a bug; we are still monitoring disk usage on the nodes. However, it seems possible that disk space is not the cause here, and I want to cover all bases.

Comment 7 Jeff Cantrill 2019-07-30 14:25:05 UTC
This is a duplicate of the issues referenced. If fluent is truly stuck spinning without pushing messages, you could try removing the bad messages using the sanitize_msg_chunks utility [1], which may be in the 3.9 fluent image. It requires access to the node.

[1] https://github.com/openshift/origin-aggregated-logging/tree/master/fluentd#sanitize_msg_chunks

This utility was merged into 3.9 but looking at HEAD of 3.9 I no longer see it and am unable to determine when it was removed.  Regardless, you should be able to copy the code from the master branch into a debug fluentd pod and sanitize the buffers.
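At its core, sanitizing a chunk means decoding each buffered record with invalid bytes replaced and re-encoding it. A rough sketch of that scrubbing step (not the actual utility, which is a Ruby script; names here are illustrative, and real fluentd file-buffer chunks are msgpack-framed, so records must be unpacked before scrubbing):

```python
def scrub_utf8(data: bytes) -> bytes:
    # Replace any invalid UTF-8 sequences (such as a stray 0x92)
    # with U+FFFD so the record becomes parseable JSON again.
    return data.decode("utf-8", errors="replace").encode("utf-8")

# A Windows-1252 curly apostrophe (0x92) embedded in an otherwise
# UTF-8 payload -- the kind of byte that wedges the buffer flush.
bad = b'{"message": "can\x92t parse this"}'
print(scrub_utf8(bad))  # the 0x92 becomes U+FFFD (b'\xef\xbf\xbd')
```

The trade-off is that the corrupt byte is lost rather than recovered, but the rest of the chunk's records can then flush to Elasticsearch.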

Comment 8 Steven Walter 2019-07-30 20:04:49 UTC
Yes, I suspected it was a dupe of those bugs, as noted. I'll check with the customer on sanitizing the msg chunks.

Comment 9 Jeff Cantrill 2019-08-22 14:55:50 UTC
Closing as EOL, since this is unlikely to be addressed before the release reaches end of life in October 2019.