Bug 1579018
| Summary: | 3.10 scalability: logging-fluentd wedged with buffer files in /var/lib/fluentd and missing messages for 300 nodes/1200 projects |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Logging |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | unspecified |
| Version: | 3.10.0 |
| Target Milestone: | --- |
| Target Release: | 3.10.0 |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | aos-scalability-310 |
| Doc Type: | No Doc Update |
| Reporter: | Mike Fiedler <mifiedle> |
| Assignee: | Jeff Cantrill <jcantril> |
| QA Contact: | Mike Fiedler <mifiedle> |
| CC: | aos-bugs, mifiedle, pportant, rmeggins |
| Story Points: | --- |
| Last Closed: | 2018-07-30 19:15:30 UTC |
| Type: | Bug |
| Regression: | --- |
Description (Mike Fiedler, 2018-05-16 19:17:14 UTC)
Created attachment 1437550 [details]: logs and buffer files

- logging-fluentd logs from two stuck nodes
- 3 ES server logs
- tar file of /var/lib/fluentd on a stuck node
---

tl;dr: Not sure if this is a "bug" or just an indication that we've hit the limit of what Elasticsearch can do with the out-of-the-box config, and we need to scale up Elasticsearch: increase the bulk queue size, RAM, CPU, and disk speed.

I don't see anything in the logs to indicate any problems. The slow log threshold indicates that it is taking a long time to send records to Elasticsearch even when they do get through (as opposed to being rejected due to bulk rejections), which is to be expected if Elasticsearch is completely overloaded. I haven't looked at the log buffer files, but I'm assuming that the messages missing from Elasticsearch are in them.

If you look at _cat/indices, do you see new indices for new projects/namespaces eventually being created? Do you see the doc counts going up? If you are monitoring the bulk thread_pool, do you see the bulk completed (bc) rate increasing, meaning that some of the bulk operations are being processed successfully? (Minimal sketches of these checks, and of inspecting the buffer backlog, are included at the end of this report.)

The pbench data for Elasticsearch will be interesting, but I'm assuming it will show that Elasticsearch is maxed out on RAM and/or CPU and/or disk I/O throughput.

---

Some more details, sorry.

ES memory limit is 62GB.
ES storage is on local NVMe.
The ES nodes are running on 40-core systems with 140GB RAM (m4.10xlarge equivalent).

pbench data to follow in a private comment.

---

(In reply to Mike Fiedler from comment #3)
> Some more details, sorry.
>
> ES memory limit is 62GB.
> ES storage is on local NVMe.

Logging inventory:

    # EFK logging stack variables
    openshift_logging_es_pvc_storage_class_name: "gluster-storage-block"

> The ES nodes are running on 40-core systems with 140GB RAM (m4.10xlarge equivalent).
>
> pbench data to follow in a private comment.

---

Re: comment 4. Initially deployed on gluster-block, but throughput problems were found. The PVCs were recreated on local NVMe (see the pbench data in comment 5).

---

On second thought: we don't yet have a 3.10 fluentd image build that has all of the latest retry logic in it. This testing will establish a good baseline, and we can use it to confirm that the new retry logic actually helps with, if not fixes, this issue.

---

The new image shows significant improvement. Successfully logged across 3000 namespaces with no lost messages. If we get the scale cluster back in 3.10, we will try to push it further.

With thousands of namespaces, it is critical to precreate the indices. Quickly creating even 1500 indices is very expensive and drives Elasticsearch CPU usage to 30+ cores (a precreation sketch is included at the end of this report). I will follow up with links to performance data.

---

Verified with logging-fluentd:v3.10.0-0.47.0.1

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
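
---

For reference, a minimal sketch of the Elasticsearch checks suggested in the comments above (new project indices appearing, doc counts rising, bulk thread pool completing rather than only rejecting work). The namespace, pod label, container name, cert paths, and _cat column names are assumptions based on a typical 3.10 logging deployment and the ES 2.x cat API; they are not confirmed anywhere in this report.

```bash
# Minimal sketch, assuming the default "logging" namespace, the component=es pod
# label, the "elasticsearch" container name, and the admin cert paths usually
# mounted in the ES container -- verify all of these on the actual cluster.
# The thread_pool column names assume the ES 2.x _cat API.
ES_POD=$(oc -n logging get pods -l component=es \
           -o jsonpath='{.items[0].metadata.name}')

es_get() {
  # Authenticated GET against the ES REST API from inside the ES container.
  oc -n logging exec -c elasticsearch "$ES_POD" -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
            --cert   /etc/elasticsearch/secret/admin-cert \
            --key    /etc/elasticsearch/secret/admin-key \
            "https://localhost:9200$1"
}

# Are indices for new projects/namespaces being created, and are doc counts rising?
es_get '/_cat/indices?v' | sort

# Is the bulk thread pool completing work (completed rising), or only queueing/rejecting?
es_get '/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected,bulk.completed'
```

Sampling the thread-pool line a few times apart distinguishes an overloaded-but-progressing cluster (completed keeps climbing) from one that is mostly rejecting bulk requests (rejected keeps climbing).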
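A similarly hedged sketch for sizing up the on-disk fluentd buffer backlog on a wedged node, as discussed in the summary and first comment. The /var/lib/fluentd path matches the attachment description; ssh access and the node name are assumptions/placeholders.

```bash
# Sketch: inspect the fluentd buffer backlog on a node suspected of being wedged.
# Assumes the /var/lib/fluentd buffer path from this report and ssh access to
# the node; the node name below is a hypothetical placeholder.
NODE=node-123.example.com   # substitute the node under investigation

# How many buffer chunks are queued, and how much disk do they occupy?
ssh "$NODE" 'ls /var/lib/fluentd | wc -l && du -sh /var/lib/fluentd'

# The oldest chunks indicate roughly how long output to Elasticsearch has been stalled.
ssh "$NODE" 'ls -ltr /var/lib/fluentd | head'
```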
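Finally, a rough sketch of what precreating project indices ahead of a large-scale run could look like. The project.&lt;name&gt;.&lt;uid&gt;.&lt;date&gt; index naming, the cert paths, creating indices with default settings, and iterating over every project are assumptions for illustration; the report does not describe the exact procedure used in the test.

```bash
# Rough sketch: precreate one index per project before the logging load starts,
# so index-creation cost is not paid during ingest. The project.<name>.<uid>.<date>
# naming follows the 3.x logging common data model; verify it, and filter the
# project list (e.g. skip system namespaces), against the real deployment.
ES_POD=$(oc -n logging get pods -l component=es \
           -o jsonpath='{.items[0].metadata.name}')
TODAY=$(date -u +%Y.%m.%d)

for project in $(oc get projects -o jsonpath='{.items[*].metadata.name}'); do
  uid=$(oc get project "$project" -o jsonpath='{.metadata.uid}')
  # PUT on the index URL creates it with default settings.
  oc -n logging exec -c elasticsearch "$ES_POD" -- \
    curl -s -XPUT --cacert /etc/elasticsearch/secret/admin-ca \
                  --cert   /etc/elasticsearch/secret/admin-cert \
                  --key    /etc/elasticsearch/secret/admin-key \
                  "https://localhost:9200/project.${project}.${uid}.${TODAY}"
done
```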