Bug 1579018 - 3.10 scalability: logging-fluentd wedged with buffer files in /var/lib/fluentd and missing messages for 300 nodes/1200 projects
Summary: 3.10 scalability: logging-fluentd wedged with buffer files in /var/lib/fluentd and missing messages for 300 nodes/1200 projects
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Jeff Cantrill
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-310
Depends On:
Blocks:
 
Reported: 2018-05-16 19:17 UTC by Mike Fiedler
Modified: 2018-07-30 19:15 UTC
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:15:30 UTC
Target Upstream Version:
Embargoed:


Attachments
logs and buffer files (12.23 MB, application/x-gzip)
2018-05-16 19:21 UTC, Mike Fiedler


Links
Red Hat Product Errata RHBA-2018:1816 (last updated 2018-07-30 19:15:54 UTC)

Description Mike Fiedler 2018-05-16 19:17:14 UTC
Description of problem:

Possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1541429, but opening a new bz since it is occurring at a different scale than that bug.

In a 300 node scalability environment, increase the number of namespaces and pods while keeping the overall message rate (per node and total in the cluster) constant.   Scaling to 300/600/900 projects works fine: no missing messages in Elasticsearch and no buffers stuck on the fluentd nodes.  The message rate is 100 messages/second/node.  As the number of projects/pods increases, the per-pod message rate is lowered to maintain 100 messages/second/node.

At 1200 projects, Elasticsearch starts to show significant increases in bulk.rejects and eventually many of the fluentd pods become completely wedged - /var/lib/fluentd contains buffer files which are never indexed and never go away.

Restarting fluentd does not seem to provide relief; only removing the contents of /var/lib/fluentd does.
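
For anyone reproducing this, a minimal sketch (mine, not part of the original report) for spotting the wedged state is to list buffer chunks under /var/lib/fluentd that have not been modified for a while. The path comes from this report; the 30-minute threshold and the assumption that any long-lived chunk indicates a stall are illustrative.

import os
import time

BUFFER_DIR = "/var/lib/fluentd"   # buffer path from this report
STALE_AFTER = 30 * 60             # seconds; illustrative threshold

now = time.time()
stale = []
for name in sorted(os.listdir(BUFFER_DIR)):
    path = os.path.join(BUFFER_DIR, name)
    if not os.path.isfile(path):
        continue
    age = now - os.path.getmtime(path)
    if age > STALE_AFTER:
        stale.append((name, os.path.getsize(path), int(age)))

for name, size, age in stale:
    print("%-60s %10d bytes  %6d s old" % (name, size, age))
print("%d buffer chunk(s) have not been touched in %d+ minutes" % (len(stale), STALE_AFTER // 60))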


Attaching logs from 2 fluentd pods in this condition, a tar of all of the "stuck" buffer files from one of those nodes, and logs from the elasticsearch pods (/elasticsearch/persistent/logging-es/logs).


Version-Release number of selected component (if applicable): 3.10.0-0.41.0


How reproducible: Always when running 1200 pods in 1200 projects on 300 nodes at a rate of 100 messages/second/node


Steps to Reproduce:
1.  Install 300 node cluster
2.  Run the SVT logging tool (https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md) with the following configuration

projects:
  - num: 1200
    basename: logtest
    ifexists: delete
    tuning: default
    templates:
      - num: 1
        file: ./content/logtest/logtest-rc.json
        parameters:
         - REPLICAS: "1"
         - INITIAL_FLAGS: "--num-lines 45000 --line-length 1024 --word-length 9 --rate 1500 --fixed-line\n"
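
For context (my own sanity check, not part of the original steps): the parameters above line up with the stated 100 messages/second/node target if the SVT tool's --rate flag is taken as lines per minute, which is an assumption on my part.

# Sanity check of the load parameters; assumes --rate is lines per minute.
nodes = 300
pods = 1200                    # 1200 projects x 1 replica each
rate_per_pod = 1500 / 60.0     # --rate 1500  ->  25 lines/second/pod (assumed)

pods_per_node = pods / nodes                 # 4 pods per node
print(pods_per_node * rate_per_pod)          # 100.0 messages/second/node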


Actual results:

Many indices in ES are missing messages, and logging-fluentd nodes have unsent buffer files in /var/lib/fluentd.   The ES bulk queues consistently show as full.  I will add a link to pbench results from a run.




Additional info:

Logging inventory:

# EFK logging stack variables
openshift_logging_install_logging: true
openshift_logging_es_cluster_size: 3
openshift_logging_es_pvc_dynamic: true
openshift_logging_es_pvc_size: 50Gi
openshift_logging_es_pvc_storage_class_name: "gluster-storage-block"
openshift_logging_fluentd_read_from_head: false
openshift_logging_use_mux: false
openshift_logging_curator_nodeselector: {"region": "infra"}
openshift_logging_kibana_nodeselector: {"region": "infra"}
openshift_logging_es_nodeselector: {"region": "infra"}

Comment 1 Mike Fiedler 2018-05-16 19:21:24 UTC
Created attachment 1437550 [details]
logs and buffer files

logging-fluentd logs from two stuck nodes
3 ES server logs
tar file of /var/lib/fluentd on a stuck node

Comment 2 Rich Megginson 2018-05-16 19:42:11 UTC
tl;dr: Not sure if this is a "bug" or just an indication that we've hit the limit of what Elasticsearch can do with the out-of-the-box config, and we need to scale up Elasticsearch - increase bulk queue size, increase RAM, increase CPU, increase disk speed.

I don't see anything in the logs to indicate any problems.  The slow log threshold being exceeded indicates that it is taking a long time to send records to Elasticsearch, even when they do get through (as opposed to being rejected due to bulk rejections), which is to be expected if Elasticsearch is completely overloaded.  I haven't looked at the buffer files, but I'm assuming that the messages missing from Elasticsearch are in them.

If you look at _cat/indices, do you see new indices for new projects/namespaces eventually being created?  Do you see the doc counts going up?  If you are monitoring the bulk thread_pool, do you see the bulk.completed (bc) count increasing, meaning that some of the bulk operations are being processed successfully?
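
A minimal polling sketch along those lines (my illustration, not from this report) could watch the _cat/thread_pool output. The column aliases assume the ES 2.x _cat API (later versions rename them), and the URL/credentials are placeholders that depend on how the cluster's auth is set up.

import time
import requests

ES_URL = "https://logging-es:9200"    # placeholder; point at the cluster's ES service
AUTH = ("<user>", "<password>")       # placeholder; real deployments may need client certs instead

# bc = bulk.completed, br = bulk.rejected, bq = bulk.queue (ES 2.x column names, assumed)
COLS = "host,bulk.active,bulk.queue,bulk.rejected,bulk.completed"

while True:
    r = requests.get(ES_URL + "/_cat/thread_pool",
                     params={"v": "true", "h": COLS},
                     auth=AUTH, verify=False, timeout=10)
    print(r.text.rstrip())
    time.sleep(30)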

The pbench for elasticsearch will be interesting, but I'm assuming it will show that Elasticsearch is maxed out on RAM and/or CPU and/or disk I/O throughput.

Comment 3 Mike Fiedler 2018-05-16 19:53:40 UTC
Some more details, sorry.

ES memory limit is 62GB
ES storage is on local NVME 
ES are running on 40 core systems with 140GB RAM (m4.10xlarge equivalent).

pbench data to follow in private comment

Comment 4 Rich Megginson 2018-05-16 19:59:16 UTC
(In reply to Mike Fiedler from comment #3)
> Some more details, sorry.
> 
> ES memory limit is 62GB
> ES storage is on local NVME 

Logging inventory:

# EFK logging stack variables

openshift_logging_es_pvc_storage_class_name: "gluster-storage-block"


> ES are running on 40 core systems with 140GB RAM (m4.10xlarge equivalent).
> 
> pbench data to follow in private comment

Comment 6 Mike Fiedler 2018-05-16 20:05:09 UTC
re:  comment 4.  Initially deployed on gluster-block, but throughput problems were found.  The PVCs were recreated on local NVME (see pbench data in comment 5).

Comment 7 Rich Megginson 2018-05-16 20:06:34 UTC
On second thought - we don't yet have a 3.10 fluentd image build which has all of the latest retry logic in it.  This testing will establish a good baseline, and we can use it to confirm that the new retry logic will actually help with, if not fix, this issue.

Comment 10 Mike Fiedler 2018-05-21 16:48:37 UTC
The new image shows significant improvement.   We successfully logged across 3000 namespaces with no lost messages.   If we get the scale cluster back during 3.10, we will try to push it further.


With thousands of namespaces, it is critical to pre-create the indices.   Quickly creating even 1500 indices is very expensive; it drives Elasticsearch to CPU usage of 30+ cores.   I will follow up with links to performance data.
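
To illustrate the pre-creation idea (my sketch, not from this report), indices can be created ahead of the test run with plain PUT requests. The index names below are hypothetical placeholders, since the real per-project index names in OpenShift logging embed a namespace UUID and date; the URL/credentials are placeholders as well.

import requests

ES_URL = "https://logging-es:9200"    # placeholder
AUTH = ("<user>", "<password>")       # placeholder; real deployments may need client certs

# Hypothetical index names: derive the real ones from the deployed naming convention.
for i in range(1, 1201):
    index = "project.logtest%d.example" % i
    r = requests.put("%s/%s" % (ES_URL, index), auth=AUTH, verify=False, timeout=30)
    print(index, r.status_code)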

Comment 12 Mike Fiedler 2018-06-05 18:55:29 UTC
Verified with logging-fluentd:v3.10.0-0.47.0.1

Comment 14 errata-xmlrpc 2018-07-30 19:15:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

