Bug 1482002

Summary: Can't collect log entries due to fluentd error
Product: OpenShift Container Platform
Reporter: Xia Zhao <xiazhao>
Component: Logging
Assignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA
QA Contact: Xia Zhao <xiazhao>
Severity: high
Docs Contact:
Priority: high
Version: 3.6.0
CC: aos-bugs, jwozniak, mifiedle, nhosoi, rmeggins
Target Milestone: ---
Keywords: Regression
Target Release: 3.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Fluentd could not write the files it uses for buffering records due to a problem converting values from ASCII-8BIT to UTF-8.
Consequence: Fluentd emits a lot of errors and cannot add records to Elasticsearch.
Fix: Remove the patch that forced the UTF-8 conversion.
Result: Fluentd can write ASCII-8BIT encoded files for its buffer.
Story Points: ---
Clone Of:
: 1498999 (view as bug list)
Environment:
Last Closed: 2017-10-25 13:04:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1469859, 1498999    
Attachments:
inventory file used for logging deployment (flags: none)
fluentd log (flags: none)

Description Xia Zhao 2017-08-16 09:16:35 UTC
Description of problem:
Deployed the logging 3.6 stack on OCP 3.6; log entries can't be collected due to the following fluentd errors:
2017-08-16 04:52:01 -0400 [error]: Exception emitting record: "\x92" from ASCII-8BIT to UTF-8
2017-08-16 04:52:01 -0400 [warn]: emit transaction failed: error_class=Encoding::UndefinedConversionError error="\"\\x92\" from ASCII-8BIT to UTF-8" tag="journal.system"

The issue reproduces regardless of whether this parameter is specified in the inventory:
openshift_logging_fluentd_journal_read_from_head=true

[error]: Exception emitting record: "\x92" from ASCII-8BIT to UTF-8
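
For reference, the conversion error is plain Ruby behavior and can be reproduced outside fluentd; a minimal sketch:

# A byte such as \x92 has no mapping from ASCII-8BIT (binary) to UTF-8,
# so String#encode raises the same error seen in the fluentd log.
data = "\x92".force_encoding("ASCII-8BIT")
begin
  data.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  puts e.message   # "\x92" from ASCII-8BIT to UTF-8
end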

Version-Release number of selected component (if applicable):
logging-fluentd         v3.6.173.0.5-6      95dede9f3cb2        9 hours ago         235.1 MB

# openshift version
openshift v3.6.173.0.5
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging 3.6 stack on OCP 3.6 with the attached inventory file
2. Wait until the EFK pods are running
3. Check the fluentd logs (see the sketch below)
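
A quick way to do step 3, assuming the default "logging" project and the component=fluentd label the installer puts on the fluentd pods (both are assumptions about this particular deployment):

oc get pods -n logging -l component=fluentd
oc logs <fluentd-pod-name> -n logging | grep -i "ASCII-8BIT"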

Actual results:
Can't collect log entries due to fluentd error

Expected results:
fluentd should collect log entries without errors

Additional info:
full log of fluentd attached
inventory file of logging deployment attached

Comment 1 Xia Zhao 2017-08-16 09:20:07 UTC
Created attachment 1314021 [details]
inventory file used for logging deployment

Comment 2 Xia Zhao 2017-08-16 09:23:09 UTC
Created attachment 1314027 [details]
fluentd log

Comment 3 Xia Zhao 2017-08-16 09:23:55 UTC
This bz is currently blocking logging tests on OCP 3.6.0 envs.

Comment 4 Jan Wozniak 2017-08-16 14:45:53 UTC
The fluentd image in brew looks suspiciously small compared to one freshly built from the rhaos-3.6-rhel-7 branch. Perhaps an incorrect build got pushed to brew?

brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd         v3.6                95dede9f3cb2        15 hours ago        235.1 MB
local-reg:5000/openshift/logging-fluentd                                                <none>              0ac973960bdb        4 hours ago         360.6 MB

Comment 5 Rich Megginson 2017-08-16 15:09:27 UTC
Can we log into the system? I want to look at the journal and see if I can find which record is causing this problem.

Comment 10 Rich Megginson 2017-08-17 13:51:11 UTC
*** Bug 1482532 has been marked as a duplicate of this bug. ***

Comment 11 Mike Fiedler 2017-08-17 14:21:12 UTC
Installing logging with openshift_logging_image_version=v3.6.173.0.5 - this problem is seen.

Installing logging with openshift_logging_image_version=v3.6.171 - this problem is NOT seen.
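
For anyone repeating this comparison: the image tag is pinned in the Ansible inventory. A minimal sketch (host groups and the other logging variables omitted):

[OSEv3:vars]
# Pin the logging image tag used by the installer; swap between the two
# versions above to compare behavior.
openshift_logging_image_version=v3.6.173.0.5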

Comment 12 Noriko Hosoi 2017-08-17 16:37:06 UTC
(In reply to Mike Fiedler from comment #11)
> Installing logging with openshift_logging_image_version=v3.6.173.0.5 - this
> problem is seen.
> 
> Installing logging with openshift_logging_image_version=v3.6.171 - this
> problem is NOT seen.

Right.  Switching the buffer_type from "memory" to "file" happened after the version was bumped to v3.6.171.
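
For context, in fluentd 0.12 that switch is made in the output plugin's buffer settings; a minimal sketch of the relevant lines (the plugin and buffer path here are illustrative, not the exact shipped configuration):

<match **>
  @type elasticsearch
  # ... connection settings omitted ...
  # was: buffer_type memory
  buffer_type file
  buffer_path /var/lib/fluentd/buffer-output-es.*
</match>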

Comment 14 Rich Megginson 2017-08-18 02:24:50 UTC
This is easy to reproduce with flexy.

I'm thinking that the fluentd dependencies are conflicting - they are not up to date - once the 3.6 puddle is rebuilt I can build and test a new fluentd image.

Comment 16 Rich Megginson 2017-08-18 18:09:20 UTC
(In reply to Rich Megginson from comment #14)
> This is easy to reproduce with flexy.
> 
> I'm thinking that the fluentd dependencies are conflicting - they are not up
> to date - once the 3.6 puddle is rebuilt I can build and test a new fluentd
> image.

This did not help :-(

Now resorting to debugging the ruby code . . .

Comment 17 Rich Megginson 2017-08-18 21:44:00 UTC
The bug was introduced in logging-fluentd:v3.6.173.0.5-6 - logging-fluentd:v3.6.173.0.5-5 and earlier work.

These are the commits between -5 and -6:

http://pkgs.devel.redhat.com/cgit/rpms/logging-fluentd-docker/log/?h=rhaos-3.6-rhel-7

Impl fluentd file buffer.
remove USE_MUX_CLIENT; mux service always check for k8s metadata
fluentd 0.12.39; k8s filter 0.28.0; viaq 0.0.5

Comment 18 Rich Megginson 2017-08-19 01:08:33 UTC
The error doesn't appear to be related to the systemd input or the elasticsearch output - I tried fluentd secure_forward with a file buffer -> mux es with a file buffer.

In both fluentd and mux I see the conversion error, so it must have something to do with the file buffer, but I just don't know what it could be.
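
Per the Doc Text above, the eventual fix was to drop the patch that forced a UTF-8 conversion and let fluentd write its buffer files as ASCII-8BIT. A minimal Ruby sketch of the difference (illustrative only, not the actual buffer plugin code; the path is made up):

record = "\x92".force_encoding("ASCII-8BIT")

# Forcing a conversion fails for bytes that have no UTF-8 mapping:
#   record.encode("UTF-8")  # raises Encoding::UndefinedConversionError

# Writing the chunk in binary mode keeps the bytes as-is and succeeds:
File.open("/tmp/buffer-chunk", "ab") { |f| f.write(record) }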

Comment 22 Xia Zhao 2017-08-24 02:45:45 UTC
Thanks, fluentd is now working again. Checked with the fluentd:v3.6.173.0.5-10 image that log entries can be collected and show up in Kibana. Set to verified.

Image verified with:
logging-fluentd   v3.6.173.0.5-10     58ab4badc0b7        6 hours ago         235.1 MB

Comment 24 errata-xmlrpc 2017-10-25 13:04:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049