Bug 1552304
Summary: CRI-O pod log lines are occasionally split mid-line

Product: OpenShift Container Platform
Component: Logging
Reporter: Mike Fiedler <mifiedle>
Assignee: Noriko Hosoi <nhosoi>
QA Contact: Mike Fiedler <mifiedle>
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Version: 3.9.0
CC: amurdaca, aos-bugs, decarr, jcantril, jokerman, mmccomas, mpatel, nhosoi, pportant, rmeggins, sople, tibrahim
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openshift3/ose-logging-fluentd:v3.11.23-1
Doc Type: Bug Fix
Doc Text:
Cause: Log messages from CRI-O pods can be split mid-line by design.
Consequence: Partial log messages are indexed in Elasticsearch.
Fix: Newer versions of fluent-plugin-concat support merging the CRI-O style split messages into one, but that support is not available for fluentd v0.12, which OpenShift logging 3.11 uses. The functionality was backported to work with fluentd v0.12.
Result: The CRI-O style split log messages are merged back into the original full message.
Story Points: ---
Last Closed: 2018-11-20 03:10:43 UTC
Type: Bug
Regression: ---
Bug Depends On: 1642547
Created attachment 1405046 [details]: pod log showing line splits

This could be considered high severity for applications that produce log lines requiring parsing by business applications. In the attached file, see lines 164 and 165 for the split happening. Lines 645 and 646 show the split happening at the same place in the payload as line 164; lines 837 and 838 show it happening at a different position in the payload.

(Mrunal Patel, comment #3) All of the split lines have the P annotation as designed. The log parser should look for P as a partial line and continue reading until it hits the next F in the output to complete the line.

Looks like this should be moved to the Logging team according to Mrunal's comment. Moving it; move it back otherwise.

(In reply to Mrunal Patel from comment #3)
> All of the split lines have the P annotation as designed. The log parser
> should look for P as a partial line and continue reading till it hits the
> next F in the output to complete the line.

What is the default maximum size of a log entry that CRI-O writes? Is this configurable?

Note that this is a huge feature, i.e. 3.11 at the earliest. We will need to figure out a way to mitigate this problem in the meantime. Moving from POST back to NEW so this does not get lost.

@mrunal - What is the default maximum size of a log entry that crio writes? Is this configurable?

Rich, the line size today is 8 KB. We are making some changes and might make it configurable. I will let you know if/when that goes in.

This gets to the broken foundations of byte-capture console logging: the interpretation of the byte stream is performed too early. Mrunal, where is the proper upstream code base for CRI-O so that I might be able to make a suggested PR to address this?

@portante, cri-o upstream is here: https://github.com/kubernetes-incubator/cri-o

Confirming that the issue still exists in OCP 3.10 with cri-o.x86_64 1.10.2-1.rhaos3.10.git095e88c.el7. Saw this while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1554407 is no longer an issue.

Verified on OCP 3.11.31 + CRI-O 1.11.7, but I hit https://bugzilla.redhat.com/show_bug.cgi?id=1642547 along the way. Holding off on final verification until that is fixed.
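The parsing rule Mrunal describes above (accumulate P-tagged chunks until the next F) can be sketched as a small merger. This is an illustrative sketch, not the actual fluentd fix; it assumes the CRI-O on-disk line format `<timestamp> <stream> <P|F> <message>`:

```python
def merge_crio_lines(lines):
    """Yield full log messages, joining P (partial) chunks until an F (full) chunk."""
    buffers = {}  # accumulate per stream, since stdout/stderr entries can interleave
    for line in lines:
        timestamp, stream, tag, message = line.split(" ", 3)
        buffers.setdefault(stream, []).append(message)
        if tag == "F":  # F terminates the logical line
            yield "".join(buffers.pop(stream))

# Synthetic sample in CRI-O format: one whole line, then one line split in two.
sample = [
    "2018-03-01T12:00:00.0Z stdout F hello",
    "2018-03-01T12:00:01.0Z stdout P par",
    "2018-03-01T12:00:01.1Z stdout F tial",
]
print(list(merge_crio_lines(sample)))  # ['hello', 'partial']
```

Any consumer that instead counts raw file lines (as the reproduction steps below do with wc -l) will see the inflated count this bug reports.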
Docs in progress for https://bugzilla.redhat.com/show_bug.cgi?id=1642547.

Marking this bz as verified on OCP 3.11.31 + CRI-O 1.11.7.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537

We will not backport this fix to 3.9.
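The fix described in the Doc Text relies on fluent-plugin-concat's partial-line support. A minimal sketch of the relevant filter configuration, using the plugin's documented `partial_key`/`partial_value` options; the match pattern and the field name `logtag` are assumptions here, and the exact keys in the shipped OpenShift fluentd config may differ:

```
<filter kubernetes.**>
  @type concat
  key log
  partial_key logtag
  partial_value P
  separator ""
</filter>
```

With this in place, fluentd buffers records whose `logtag` field is P and emits a single merged record when the F-tagged chunk arrives, so Elasticsearch indexes whole messages instead of fragments.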
Description of problem:

Running OpenShift logging tests, we use a tool to create flexible pod logs. The tool controls rate, payload size, payload content, etc. (see the SVT logtest README).

When logging with Docker json-file logging and the Docker 1.12 runtime, the pod logs under /var/log/containers always contain exactly the number of lines emitted by the pod. There is no splitting of lines: 10000 lines out of the tool = 10000 lines in the pod log = 10000 lines in Elasticsearch.

When logging with the CRI-O runtime (registry.reg-aws.openshift.com:443/openshift3/cri-o v3.9.0 1617b0aab7ce) there are always a few split lines. The number is seemingly random and appears to happen slightly more often with larger message payloads. Recent runs:

10000 lines with 512 byte payload: 10006 lines in the pod log
10000 lines with 1024 byte payload: 10022 lines in the pod log (attached)
10000 lines with 4096 byte payload: 10573 lines in the pod log

Version-Release number of selected component (if applicable):
OCP 3.9.2
registry.reg-aws.openshift.com:443/openshift3/cri-o v3.9.0 1617b0aab7ce

How reproducible:
Always when running enough messages, usually > 5000 or so.

Steps to Reproduce:
1. With a CRI-O runtime, run the SVT logtest replication controller (see README above)
2. Use the following input yaml:

projects:
  - num: 2
    basename: logtest
    ifexists: delete
    tuning: default
    templates:
      - num: 1
        file: ./content/logtest/logtest-rc.json
        parameters:
          - REPLICAS: "1"
          - INITIAL_FLAGS: "--num-lines 10000 --line-length 1024 --word-length 9 --rate 6000 --fixed-line\n"

This will run the logging pod for about 2 minutes and produce 10000 lines of fixed content. Every line is the same with the exception of the sequence number.

3. Go to the node, find the pod log in /var/log/containers, and run wc -l

Actual results:
> 10000 lines in the pod log (split lines inflate the count)

Expected results:
10000 lines in the pod log

Additional info:
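Rather than only comparing wc -l counts in step 3, the pod log can be checked for partial-line entries directly. A sketch, assuming the CRI-O on-disk format `<timestamp> <stream> <P|F> <message>`; the demo writes a synthetic log, and in practice you would point count_tags at the real file under /var/log/containers:

```python
import tempfile

def count_tags(path):
    """Count partial (P) and full (F) entries in a CRI-O format pod log."""
    counts = {"P": 0, "F": 0}
    with open(path) as f:
        for line in f:
            parts = line.split(" ", 3)  # timestamp, stream, tag, message
            if len(parts) >= 3 and parts[2] in counts:
                counts[parts[2]] += 1
    return counts

# Demo on a synthetic three-line log: one whole line plus one split line.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("2018-03-01T12:00:00.0Z stdout F whole line\n")
    tmp.write("2018-03-01T12:00:01.0Z stdout P first ha\n")
    tmp.write("2018-03-01T12:00:01.1Z stdout F lf\n")
    path = tmp.name

print(count_tags(path))  # {'P': 1, 'F': 2}
```

Any nonzero P count confirms splitting occurred; the F count equals the number of logical lines, which should match the number emitted by the tool.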