Bug 1734177 - some fluentd buffers won't flush with Invalid UTF-8 start byte 0x92 [NEEDINFO]
Summary: some fluentd buffers won't flush with Invalid UTF-8 start byte 0x92
Keywords:
Status: CLOSED EOL
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-07-29 21:08 UTC by Steven Walter
Modified: 2019-08-22 14:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-22 14:55:50 UTC
Target Upstream Version:
jcantril: needinfo? (stwalter)



Description Steven Walter 2019-07-29 21:08:13 UTC
Description of problem:

Some fluentd pods are not sending logs. Their logs show:

2019-07-11 19:01:22 -0300 [warn]: fluent/output.rb:381:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2019-07-11 19:06:22 -0300 error_class="Elasticsearch::Transport::Transport::Errors::InternalServerError" error="[500] {\"error\":{\"root_cause\":[{\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: [B@656db2d1; line: 1, column: 173]\"}],\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: [B@656db2d1; line: 1, column: 173]\"},\"status\":500}" plugin_id="object:3fbfc7979034"

This is very similar to some previously opened bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1654704
https://bugzilla.redhat.com/show_bug.cgi?id=1625254
In the past, it was determined with high confidence that the issue was caused by the buffers running out of disk space, which corrupted them. However, in this case, the disks do not appear to be full at all.
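
For context, the byte 0x92 has the bit pattern 10010010. UTF-8 reserves the 10xxxxxx pattern for continuation bytes, so 0x92 can never start a character, which is what the JSON parser is rejecting. The following is a minimal Python sketch, not code from fluentd or Elasticsearch, showing how such a byte can be located in a payload. The function name and the sample record are made up for illustration.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the decoder rejected.
        return exc.start, data[exc.start]
    return None


# Hypothetical record containing a stray 0x92 byte.
sample = b'{"message":"it\x92s broken"}'

print(first_invalid_utf8(sample))           # (14, 146), and 146 == 0x92
print(first_invalid_utf8(b'{"ok":true}'))   # None

# In Windows-1252, 0x92 is a right single quotation mark. That is one
# possible source of such a byte, not a confirmed cause in this bug.
print(bytes([0x92]).decode("cp1252") == "\u2019")  # True
```

The offset returned here plays the same role as the "column" reported in the Elasticsearch error: it points at the first byte the decoder could not accept.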

Version-Release number of selected component (if applicable):
3.9.85-1


Additional info:
I am not convinced this is a bug; we are still monitoring disk usage on the nodes. However, it seems possible that disk space is not the cause here, and I want to cover all bases.

Comment 7 Jeff Cantrill 2019-07-30 14:25:05 UTC
This is a duplicate of the issues referenced. If fluentd is truly stuck spinning without pushing messages, you could try removing the bad messages using [1] https://github.com/openshift/origin-aggregated-logging/tree/master/fluentd#sanitize_msg_chunks which may be in the 3.9 fluentd image. It will require access to the node.

This utility was merged into 3.9, but looking at HEAD of 3.9 I no longer see it, and I am unable to determine when it was removed. Regardless, you should be able to copy the code from the master branch into a debug fluentd pod and sanitize the buffers.
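
To illustrate what sanitizing means in principle, here is a minimal Python sketch that replaces undecodable bytes in a record's fields so the result serializes as valid UTF-8. This is not the sanitize_msg_chunks utility and it does not read fluentd buffer files. The function names and the sample record are hypothetical, and the sketch assumes the record has already been extracted from a chunk with its string fields held as raw bytes.

```python
import json


def sanitize_field(raw: bytes) -> str:
    """Decode as UTF-8, replacing each undecodable byte with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def sanitize_record(record: dict) -> dict:
    """Return a copy of the record with every bytes value made valid text."""
    return {
        key: sanitize_field(value) if isinstance(value, bytes) else value
        for key, value in record.items()
    }


# Hypothetical record; the 0x92 byte mirrors the one in the error message.
record = {"message": b"it\x92s broken", "level": "info"}
clean = sanitize_record(record)

# The cleaned record now round-trips through JSON as valid UTF-8.
payload = json.dumps(clean, ensure_ascii=False).encode("utf-8")
print(payload.decode("utf-8"))
```

Replacing the bad byte keeps the rest of the message, whereas dropping the whole record, as the comment above describes, loses it but avoids storing altered text. Which trade-off is acceptable depends on the deployment.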

Comment 8 Steven Walter 2019-07-30 20:04:49 UTC
Hi,
Yes, I suspected it was a dupe of those bugs, as noted. I'll check with the customer about sanitizing the message chunks.

Comment 9 Jeff Cantrill 2019-08-22 14:55:50 UTC
Closing as EOL, since this is unlikely to be addressed before the release reaches EOL in October 2019.

