Bug 1552304 - CRI-O pod log lines are occasionally split mid-line
Summary: CRI-O pod log lines are occasionally split mid-line
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Noriko Hosoi
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On: 1642547
Blocks:
 
Reported: 2018-03-06 22:19 UTC by Mike Fiedler
Modified: 2021-12-10 15:45 UTC
CC List: 12 users

Fixed In Version: openshift3/ose-logging-fluentd:v3.11.23-1
Doc Type: Bug Fix
Doc Text:
Cause: Log lines written by a pod running under the CRI-O runtime can be split into partial records by the runtime.
Consequence: Partial log messages are indexed in Elasticsearch as separate documents.
Fix: A newer version of fluent-plugin-concat can merge the CRI-O-style split messages back into one, but it does not support the fluentd version (v0.12) that OpenShift logging 3.11 uses, so the functionality was backported to fluentd v0.12.
Result: CRI-O-style split log messages are merged back into the original full message.
Clone Of:
Environment:
Last Closed: 2018-11-20 03:10:43 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
pod log showing line splits (152.94 KB, application/x-gzip)
2018-03-06 22:22 UTC, Mike Fiedler


Links
System ID | Status | Summary | Last Updated
Github openshift origin-aggregated-logging pull 1386 | closed | [release-3.11] Including fluent-plugin-concat to support cri-o multiline logs. | 2021-02-19 12:21:19 UTC
Red Hat Product Errata RHBA-2018:3537 | None | None | 2018-11-20 03:11:48 UTC

Description Mike Fiedler 2018-03-06 22:19:03 UTC
Description of problem:

For our OpenShift logging tests we use a tool that creates flexible pod logs. The tool controls the rate, payload size, payload content, etc. See

When logging with the Docker json-file log driver and the Docker 1.12 runtime, the pod logs under /var/log/containers always contain exactly the number of lines emitted by the pod. There is no splitting of lines: 10000 lines out of the tool = 10000 lines in the pod log = 10000 lines in Elasticsearch.

When logging with the CRI-O runtime (registry.reg-aws.openshift.com:443/openshift3/cri-o   v3.9.0   1617b0aab7ce) there are always a few split lines. The number of splits is seemingly random, and splitting appears to happen slightly more often with larger message payloads. Recent runs:

10000 lines with 512 byte payload:  10006 lines in the pod log
10000 lines with 1024 byte payload: 10022 lines in the pod log (attached)
10000 lines with 4096 byte payload: 10573 lines in the pod log


Version-Release number of selected component (if applicable): 

OCP 3.9.2
registry.reg-aws.openshift.com:443/openshift3/cri-o   v3.9.0   1617b0aab7ce


How reproducible: Always when running enough messages.  Usually > 5000 or so


Steps to Reproduce:
1. With a cri-o run time, run the SVT logtest replication controller (see README above)
2. Use the following input yaml:

projects:
  - num: 2
    basename: logtest
    ifexists: delete
    tuning: default
    templates:
      - num: 1
        file: ./content/logtest/logtest-rc.json
        parameters:
         - REPLICAS: "1"
         - INITIAL_FLAGS: "--num-lines 10000 --line-length 1024 --word-length 9 --rate 6000 --fixed-line\n"

This will run the logging pod for about 2 minutes and produce 10000 lines of fixed content. Every line is identical except for the sequence number.

3.  Go to the node and find the pod log in /var/log/containers and run wc -l

Actual results:

More than 10000 lines in the pod log (some lines are split)

Expected results:

Exactly 10000 lines in the pod log

Additional info:

Comment 1 Mike Fiedler 2018-03-06 22:20:52 UTC
This could be considered high severity for applications whose log lines must be parsed by downstream business applications.

Comment 2 Mike Fiedler 2018-03-06 22:22:34 UTC
Created attachment 1405046 [details]
pod log showing line splits

In the attached file, see lines 164 and 165 for the split happening

Lines 645 and 646 show the split happening at the same place in the payload as line 164

Lines 837 and 838 have it happening in a different position in the payload

Comment 3 Mrunal Patel 2018-03-09 16:12:34 UTC
All of the split lines have the P annotation as designed. The log parser should look for P as a partial line and continue reading till it hits the next F in the output to complete the line.
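
For reference, each record in a CRI-O pod log has the form TIMESTAMP STREAM TAG MESSAGE, where the tag is P for a partial line and F for a full (final) line. The records below are illustrative placeholders, not taken from the attachment:

2018-03-06T22:19:03.123456789Z stdout P <first chunk of a long application line>
2018-03-06T22:19:03.123459876Z stdout F <remainder of the same line>
2018-03-06T22:19:03.200000000Z stdout F <a short line that fits in a single record>

A reader reassembles the first two records into one application line by concatenating the P message with the F message that follows on the same stream.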

Comment 4 Antonio Murdaca 2018-03-12 15:00:10 UTC
Looks like this should be moved to the Logging team according to Mrunal's comment. Moving it; move it back otherwise.

Comment 5 Rich Megginson 2018-03-12 15:06:30 UTC
(In reply to Mrunal Patel from comment #3)
> All of the split lines have the P annotation as designed. The log parser
> should look for P as a partial line and continue reading till it hits the
> next F in the output to complete the line.

What is the default maximum size of a log entry that crio writes?  Is this configurable?

Note that this is a huge feature i.e. 3.11 at the earliest.  We will need to figure out a way to mitigate this problem in the meantime.

Comment 6 Mike Fiedler 2018-03-12 15:28:51 UTC
Moving from POST back to NEW so this does not get lost.

Comment 7 Rich Megginson 2018-03-23 15:21:02 UTC
@mrunal - What is the default maximum size of a log entry that crio writes?  Is this configurable?

Comment 8 Mrunal Patel 2018-03-28 17:29:18 UTC
Rich, the line size today is 8kb. We are making some changes and might make it configurable. I will let you know if/when that goes in.
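
Since lines longer than that buffer arrive as one or more P records followed by a closing F record, a consumer reading /var/log/containers directly can reassemble them by buffering P fragments per stream until the next F, as described in comment 3. A minimal sketch in Python, assuming the TIMESTAMP STREAM TAG MESSAGE record format shown above; the file handling is illustrative and this is not the fluentd implementation:

# Minimal sketch: merge CRI-O partial (P) records back into full lines.
# Assumed record format: "TIMESTAMP STREAM TAG MESSAGE" (see comment 3).
import sys

def merge_crio_lines(lines):
    """Yield (timestamp, stream, message) with P fragments joined to the next F."""
    pending = {}  # stream -> (timestamp of first fragment, list of fragments)
    for line in lines:
        try:
            timestamp, stream, tag, message = line.rstrip("\n").split(" ", 3)
        except ValueError:
            continue  # skip records that do not match the expected format
        first_ts, parts = pending.setdefault(stream, (timestamp, []))
        parts.append(message)
        if tag == "F":  # the final record closes the line
            del pending[stream]
            yield first_ts, stream, "".join(parts)

if __name__ == "__main__":
    # Usage (path is illustrative): python merge_crio_lines.py <pod log under /var/log/containers>
    with open(sys.argv[1]) as pod_log:
        for ts, stream, message in merge_crio_lines(pod_log):
            print(ts, stream, message)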

Comment 9 Peter Portante 2018-05-12 20:49:16 UTC
This gets to the broken foundations of byte capture console logging: the interpretation of the byte stream is performed too early.

Mrunal, where is the proper upstream code base for CRI-O so that I might be able to make a suggested PR to address this?

Comment 10 Mike Fiedler 2018-05-31 01:31:34 UTC
@portante,  cri-o upstream is here:  https://github.com/kubernetes-incubator/cri-o

Confirming that the issue still exists in OCP 3.10 with cri-o.x86_64                               1.10.2-1.rhaos3.10.git095e88c.el7

Saw this while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1554407 is no longer an issue.

Comment 16 Jeff Cantrill 2018-10-15 18:16:50 UTC
Fixed in https://github.com/openshift/origin-aggregated-logging/pull/1386
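
The linked PR pulls fluent-plugin-concat into the fluentd image so that P records are concatenated until the closing F record before being shipped to Elasticsearch. For illustration, a minimal filter configuration using the plugin's documented partial_key/partial_value options; the match pattern and the assumption that the parsed record exposes the CRI-O tag in a "logtag" field are mine, and the configuration actually shipped in openshift3/ose-logging-fluentd may differ:

<filter kubernetes.**>
  @type concat
  key log                # record field that holds the message text
  partial_key logtag     # CRI-O tag field: "P" = partial, "F" = full/final
  partial_value P        # records with logtag == P are buffered
  separator ""           # fragments are joined with no extra separator
</filter>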

Comment 21 Mike Fiedler 2018-10-24 17:43:33 UTC
Verified on OCP 3.11.31 + CRIO 1.11.7 but I hit https://bugzilla.redhat.com/show_bug.cgi?id=1642547 along the way.   Holding off on final verification until that is fixed.

Comment 23 Mike Fiedler 2018-11-01 19:42:00 UTC
Docs in progress for https://bugzilla.redhat.com/show_bug.cgi?id=1642547

Marking this bz as verified on OCP 3.11.31 + CRIO 1.11.7

Comment 25 errata-xmlrpc 2018-11-20 03:10:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537

Comment 28 Rich Megginson 2019-03-18 15:22:07 UTC
We will not backport this fix to 3.9

