Created attachment 1407281 [details]
Configuration and performance result details.

Description of problem:

Fluentd cannot reach the same throughput message rate when reading and forwarding pod logs to Elasticsearch with a CRI-O pod runtime as it can with Docker.

In OCP 3.9 fluentd performance testing, the performance data (see attached document) indicated that fluentd used considerably more CPU to process logs when the pod runtime was CRI-O. The I/O pattern of how CRI-O writes/batches logs to disk appears to be a primary cause.

Example: sending 1 KB messages at an aggregate rate of 500 messages/second across 100 pods on a single node. With a Docker runtime, fluentd CPU usage is 60% of one core; with a CRI-O runtime, it maxes out the core at 100%. Fluentd as configured for use with OCP can only use one core.

With a Docker runtime, we normally see maximum throughput of around 750 - 1000 1 KB messages/second in our standard AWS m4.xlarge compute node configuration. With a CRI-O runtime, the cap is somewhere below 500 1 KB messages/second.

Version-Release number of selected component (if applicable):
3.9.4

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster on AWS with 1 master, 3 infra and 3 compute nodes.
2. Deploy logging 3.9.4 with a 3-node Elasticsearch cluster.
3. Use the SVT logtest tool (https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md) with the configuration below to load the cluster with 300 projects and pods (100 pods/compute node) logging at an aggregate rate of 500 1 KB messages/second/node.
4. Monitor CPU usage for fluentd (a sampling sketch follows at the end of this comment).

Actual results:
fluentd uses 100% of a single core - the maximum it can use - on a CRI-O pod runtime. It uses ~60% CPU on a Docker pod runtime. See the attached document.

Expected results:
fluentd performance on a CRI-O runtime comparable to a Docker runtime.

Additional info:
Log test configuration:
- 300 projects and pods (100/node)
- 300 messages/minute/pod, which across 100 pods/node works out to 30,000 messages/minute = 500 messages/second aggregate on the node
- Each pod sends 6000 1 KB messages = ~20 minutes of run time. Can run longer if desired.

projects:
  - num: 300
    basename: logtest
    ifexists: delete
    tuning: default
    templates:
      - num: 1
        file: ./content/logtest/logtest-rc.json
        parameters:
          - REPLICAS: "1"
          - INITIAL_FLAGS: "--num-lines 6000 --line-length 1024 --word-length 9 --rate 300 --fixed-line\n"

tuningsets:
  - name: default
    pods:
      stepping:
        stepsize: 5
        pause: 0 min
      rate_limit:
        delay: 0 ms
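For step 4, the sketch below is one minimal way to watch fluentd's CPU consumption on a node during the test. It is illustrative only, not part of the original test harness: it assumes it runs directly on the compute node and that the fluentd process's command name is "fluentd" (adjust the -C match for your image). Note that ps reports %CPU averaged over the process lifetime; for true per-interval figures, pidstat(1) from sysstat is the better tool.

#!/usr/bin/env python
# Hypothetical helper: sample fluentd CPU usage on a compute node while the
# logtest is running. Run as root directly on the node.
import subprocess
import time

def fluentd_cpu_percent():
    # Sum %CPU across all matching processes; "%cpu=" suppresses the header.
    # Raises CalledProcessError if no fluentd process is found.
    out = subprocess.check_output(["ps", "-C", "fluentd", "-o", "%cpu="])
    return sum(float(v) for v in out.decode().split())

if __name__ == "__main__":
    # ~20 minutes of samples at 10-second intervals, matching the test length.
    for _ in range(120):
        print("fluentd CPU: %.1f%% of one core" % fluentd_cpu_percent())
        time.sleep(10)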
Hi Mike, please help verify this bug. Thanks!
Verified on OCP 3.9.30 + cri-o.x86_64 1.9.12-1.gitfa11beb.el7, and also on OCP 3.10.0-0.54.0 + cri-o.x86_64 1.10.2-1.rhaos3.10.git095e88c.el7.

The CPU and file I/O profile of fluentd + CRI-O is now virtually identical to that of fluentd + docker.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1796