Bug 1654704 - Invalid UTF-8 start byte 0x92
Summary: Invalid UTF-8 start byte 0x92
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Duplicates: 1716553
Depends On:
Blocks:
 
Reported: 2018-11-29 13:14 UTC by Nicolas Nosenzo
Modified: 2021-03-01 08:43 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-26 14:57:30 UTC
Target Upstream Version:


Attachments

Description Nicolas Nosenzo 2018-11-29 13:14:48 UTC
Created attachment 1509802 [details]
logging dump

Description of problem:
Fluentd shows the following message:
[warn]: temporarily failed to flush the buffer. next_retry=2018-11-05 14:40:27 -0500 error_class="Elasticsearch::Transport::Transport::Errors::InternalServerError" error="[500] {\"error\":{\"root_cause\":[{\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: [B@53d93e65; line: 1, column: 179]\"}],\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: [B@53d93e65; line: 1, column: 179]\"},\"status\":500}" plugin_id="object:3f9728980be4

Version-Release number of selected component (if applicable):
logging-fluentd-v3.9.41-2

How reproducible:
Customer environment

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1562004

Comment 1 Dan Yocum 2018-12-01 17:07:35 UTC
Carrying this comment over from https://bugzilla.redhat.com/show_bug.cgi?id=1625254


Jeff Cantrill 2018-10-17 14:17:42 EDT
RED HAT CONFIDENTIAL
Dan,

Reading through the attached customer case, it seems the customer is trying to use fluent to forward logs to splunk and was having issues routing messages there (i.e. a firewall issue)?  The previous cases I have examined where the utf-8 parse issue presents itself seem to be related to available disk: fluent tries to write buffers to the host filesystem, and a buffer gets corrupted when the host runs out of disk.  It is plausible that if fluent is unable to write its messages to the destination, they get pushed back into a buffer for retry.  This could eventually lead to disk issues if fluent is filling up the file system.
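
(For context, a rough way to check the disk-pressure theory on a given node. This is only a sketch: the buffer path below assumes the usual 3.9 default of /var/lib/fluentd mounted from the host, and the label selector is assumed, so verify both against the fluentd daemonset, e.g. its FILE_BUFFER_PATH setting.)

oc get pods -n logging -l component=fluentd -o wide   # find which node each fluentd pod runs on

Then, on the node in question:

df -h /var/lib/fluentd    # free space on the filesystem backing the fluentd buffers
ls -lh /var/lib/fluentd   # size and growth of the buffer chunk files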

Comment 2 Nicolas Nosenzo 2018-12-05 13:40:26 UTC
We verified that the docker disk usage sometimes reached 88%, but even so, the remaining 12% amounts to ~30G of free space. Not really sure this issue is due to disk pressure.

Comment 3 Rich Megginson 2018-12-05 15:33:50 UTC
(In reply to Nicolas Nosenzo from comment #2)
> We verified that the docker disk usage sometimes reached 88%, but even so,
> the remaining 12% amounts to ~30G of free space. Not really sure this issue
> is due to disk pressure.

If it isn't due to disk pressure, then we have absolutely no clue as to what the problem could possibly be, and we're going to need a lot of diagnostic help (i.e. changing fluentd configs to dump out the records before this error happens, dumping the fluentd buffer files, etc.) in order to figure out what else it could be.
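
(As an illustration of the "dumping the fluentd buffer files" part above, a stuck chunk can be dumped raw on the node to see whether the offending byte is already present on disk. The path is an assumption, the usual host mount being /var/lib/fluentd, and <chunk-file> is a placeholder for one of the buffer files found there.)

hexdump -C /var/lib/fluentd/<chunk-file> | less   # search for the 0x92 byte that Elasticsearch reports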

Comment 4 Nicolas Nosenzo 2018-12-12 11:31:31 UTC
(In reply to Rich Megginson from comment #3)

> changing fluentd configs to dump out the records before this
> error happens, dumping the fluentd buffer files, etc.) in order to figure
> out what else it could be.

Is this possible by setting fluentd to trace log level? And by just listing (ls) the buffer file folder?

Comment 5 Rich Megginson 2018-12-12 14:30:07 UTC
(In reply to Nicolas Nosenzo from comment #4)
> (In reply to Rich Megginson from comment #3)
> 
> > changing fluentd configs to dump out the records before this
> > error happens, dumping the fluentd buffer files, etc.) in order to figure
> > out what else it could be.
> 
> Is this possible by setting fluentd to trace log level?

No - trace mode will not dump the contents of the record.

> And by just listing
> (ls) the buffer file folder?

The problem is that we want to see the contents of the records _before_ they are written to the file buffers, so that we can rule out the records being corrupted in some other way.

Comment 8 Jeff Cantrill 2019-01-09 22:02:01 UTC
@Nicolas,

Is the customer still experiencing this issue?  If so, would it be possible to get some recent fluent and elasticsearch logs:

oc exec $ESPOD -- logs
oc exec $FLUENT -- logs

which is contingent on their version having:

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-3.9/elasticsearch/utils/logs
[2] https://github.com/openshift/origin-aggregated-logging/blob/release-3.9/fluentd/utils/logs
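
(For reference, a minimal sketch of how $ESPOD and $FLUENT might be resolved before running the commands above; the namespace and label selectors are assumptions to check against the actual deployment.)

ESPOD=$(oc get pods -n logging -l component=es -o jsonpath='{.items[0].metadata.name}')
FLUENT=$(oc get pods -n logging -l component=fluentd -o jsonpath='{.items[0].metadata.name}')
oc exec -n logging $ESPOD -- logs
oc exec -n logging $FLUENT -- logs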

We have recently seen an issue with records getting rejected by Elasticsearch with a 400 error.  My belief is the rejected entries cycle through fluent's buffers indefinitely which may lead to buffers never being flushed.
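
(One crude way to see whether the same chunks keep being retried is to count the flush failures in the fluentd pod log over time; a count that keeps growing with no successful flushes in between would support that theory. Sketch only, namespace assumed:)

oc logs -n logging $FLUENT | grep -c 'temporarily failed to flush the buffer'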

Comment 9 Nicolas Nosenzo 2019-01-10 11:50:56 UTC
(In reply to Jeff Cantrill from comment #8)
> @Nicolas,
> 
> Is the customer still experiencing this issue?  If so, would it be possible
> to get some recent fluent and elasticsearch logs:
> 
> oc exec $ESPOD -- logs
> oc exec $FLUENT -- logs
> 
> which is contingent on their version having:
> 
> [1]
> https://github.com/openshift/origin-aggregated-logging/blob/release-3.9/
> elasticsearch/utils/logs
> [2]
> https://github.com/openshift/origin-aggregated-logging/blob/release-3.9/
> fluentd/utils/logs
> 
> We have recently seen an issue with records getting rejected by
> Elasticsearch with a 400 error.  My belief is the rejected entries cycle
> through fluent's buffers indefinitely which may lead to buffers never being
> flushed.

Jeff, I haven't gotten a response to my last update; anyhow, I will ask them to gather these logs in case the issue is still there.

Comment 24 Jeff Cantrill 2019-04-22 19:23:49 UTC
I defer to the comments from Rich. We are confident these issues are related to a lack of buffer space on the node for fluent to write buffer files.  Speculating, I would bet the nodes where you see the utf-8 issues are also nodes with containers that generate high log volumes, which in 3.9 could only be confirmed by monitoring the node's disk space over time.

Comment 26 Jeff Cantrill 2019-04-24 13:09:16 UTC
Later installations of OpenShift (e.g. 3.11) do/can include Prometheus to grab node metrics, one of which is disk usage and capacity.  In 3.9 you may have Hawkular metrics deployed, which might provide you the same information.  I cannot speak definitively to what is available.

Alternatively, you would have to introduce a service on the node that periodically dumps information about the disk, which you would then have to aggregate and graph to see trends in its usage.
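
(A minimal sketch of such a periodic dump, assuming the buffer filesystem is the one mounted at /var/lib/fluentd; in practice it would run from cron or a systemd timer on each node, and the output files would be collected and graphed afterwards.)

#!/bin/bash
# Append a timestamped snapshot of the fluentd buffer filesystem usage every 5 minutes.
while true; do
    date -u '+%Y-%m-%dT%H:%M:%SZ' >> /var/log/fluentd-disk-usage.log
    df -h /var/lib/fluentd        >> /var/log/fluentd-disk-usage.log
    sleep 300
done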

Comment 27 Rich Megginson 2019-06-04 15:55:02 UTC
*** Bug 1716553 has been marked as a duplicate of this bug. ***

Comment 28 Jeff Cantrill 2019-07-26 14:57:30 UTC
Reclosing this issue as there has been no recent activity and it has been marked as a duplicate.

