Description of problem:

Possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1541429, but opening a new bz since it is occurring at a different scale than that bug.

In a 300 node scalability environment, increase the number of namespaces and pods while keeping the overall message rate (per node and total in the cluster) constant. Scaling to 300/600/900 projects works fine: no missing messages in Elasticsearch and no buffers stuck on the fluentd nodes. The message rate is 100 messages/second/node. As the number of projects/pods increases, the per-pod message rate is lowered to maintain 100 messages/second/node.

At 1200 projects, Elasticsearch starts to show significant increases in bulk.rejected and eventually many of the fluentd pods become completely wedged - /var/lib/fluentd contains buffer files which are never indexed and never go away. Restarting fluentd does not provide relief; only removing the contents of /var/lib/fluentd clears the condition.

Attaching logs from 2 fluentd pods in this condition, a tar of all of the "stuck" buffer files from one of those nodes, and logs for the elasticsearch pods (/elasticsearch/persistent/logging-es/logs).

Version-Release number of selected component (if applicable):
3.10.0-0.41.0

How reproducible:
Always when running 1200 pods in 1200 projects on 300 nodes at a rate of 100 messages/second/node

Steps to Reproduce:
1. Install a 300 node cluster
2. Run the SVT logging tool (https://github.com/openshift/svt/blob/master/openshift_scalability/content/logtest/ocp_logtest-README.md) with the following configuration:

projects:
  - num: 1200
    basename: logtest
    ifexists: delete
    tuning: default
    templates:
      - num: 1
        file: ./content/logtest/logtest-rc.json
        parameters:
          - REPLICAS: "1"
          - INITIAL_FLAGS: "--num-lines 45000 --line-length 1024 --word-length 9 --rate 1500 --fixed-line\n"

Actual results:
Many indices in ES are missing messages, logging-fluentd nodes have unsent buffer files in /var/lib/fluentd, and the ES bulk queues consistently show full. I will add a link to pbench results from a run.

Additional info:
Logging inventory:

# EFK logging stack variables
openshift_logging_install_logging: true
openshift_logging_es_cluster_size: 3
openshift_logging_es_pvc_dynamic: true
openshift_logging_es_pvc_size: 50Gi
openshift_logging_es_pvc_storage_class_name: "gluster-storage-block"
openshift_logging_fluentd_read_from_head: false
openshift_logging_use_mux: false
openshift_logging_curator_nodeselector: {"region": "infra"}
openshift_logging_kibana_nodeselector: {"region": "infra"}
openshift_logging_es_nodeselector: {"region": "infra"}
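For anyone reproducing this, a rough sketch of the run and of the buffer check, assuming the cluster-loader.py entry point and flags from the SVT repo and the default "logging" namespace with the component=fluentd pod label (all assumptions - adjust to your deployment):

  # Hypothetical invocation - save the config above as logtest-1200.yaml;
  # the entry point and flags are assumed from the SVT repo layout.
  cd svt/openshift_scalability
  ./cluster-loader.py -vf logtest-1200.yaml

  # Check every fluentd node for stuck buffer files (the symptom above).
  for p in $(oc get pods -n logging -l component=fluentd \
             -o jsonpath='{.items[*].metadata.name}'); do
    echo "== $p"
    oc exec -n logging "$p" -- ls -l /var/lib/fluentd
  done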
Created attachment 1437550 [details]
logs and buffer files

Contents:
- logging-fluentd logs from two stuck nodes
- 3 ES server logs
- tar file of /var/lib/fluentd on a stuck node
tl;dr: Not sure if this is a "bug" or just an indication that we've hit the limit of what Elasticsearch can do with the out-of-the-box config, and we need to scale up Elasticsearch - increase the bulk queue size, RAM, CPU, and disk speed.

I don't see anything in the logs to indicate any problems. The slow log entries indicate that it is taking a long time to send records to Elasticsearch even when they do get through (as opposed to being rejected due to bulk rejection), which is to be expected if Elasticsearch is completely overloaded. I haven't looked at the log buffer files, but I'm assuming that the messages missing from Elasticsearch are in them.

If you look at _cat/indices, do you see new indices for new projects/namespaces eventually being created? Do you see the doc counts going up? If you are monitoring the bulk thread_pool, do you see the bulk completed (bc) rate increasing, meaning that some of the bulk operations are being processed successfully?

The pbench data for Elasticsearch will be interesting, but I'm assuming it will show that Elasticsearch is maxed out on RAM and/or CPU and/or disk I/O throughput.
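For reference, a minimal sketch of the checks suggested above, assuming the component=es pod label and the admin cert paths used by the OpenShift logging ES pods (both assumptions - adjust to your deployment):

  # Pick one ES pod and wrap the authenticated curl call.
  es_pod=$(oc get pods -n logging -l component=es \
           -o jsonpath='{.items[0].metadata.name}')
  curl_es() {
    oc exec -n logging "$es_pod" -- curl -s \
      --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      "https://localhost:9200/$1"
  }

  # Are new project indices being created, and are doc counts rising?
  curl_es "_cat/indices?v"

  # Watch bulk completions (bc) versus rejections (br) per node.
  curl_es "_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected,bulk.completed"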
Some more details, sorry.

ES memory limit is 62GB.
ES storage is on local NVMe.
ES pods are running on 40-core systems with 140GB RAM (m4.10xlarge equivalent).

pbench data to follow in a private comment.
(In reply to Mike Fiedler from comment #3)
> Some more details, sorry.
>
> ES memory limit is 62GB
> ES storage is on local NVME

The logging inventory in the description says otherwise:

# EFK logging stack variables
openshift_logging_es_pvc_storage_class_name: "gluster-storage-block"

> ES are running on 40 core systems with 140GB RAM (m4.10xlarge equivalent).
>
> pbench data to follow in private comment
re: comment 4 - Initially deployed on gluster-block, but throughput problems were found. The PVCs were recreated on local NVMe (see pbench data in comment 5).
On second thought - we don't yet have a 3.10 fluentd image build which has all of the latest retry logic in it. This testing will establish a good baseline, and we can use it to confirm whether the new retry logic will actually help, if not fix, this issue.
The new image shows significant improvement: we successfully logged across 3000 namespaces with no lost messages. If we get the scale cluster back on 3.10, we will try to push it further.

With thousands of namespaces, it is critical to pre-create the indices. Quickly creating even 1500 indices is very expensive - it drives Elasticsearch to CPU usage of 30+ cores. I will follow up with links to performance data.
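A rough sketch of what pre-creating the indices could look like, reusing the es_pod variable from the earlier sketch and assuming the project.<name>.<uid>.<date> index naming used by 3.10 logging plus the logtest0..logtest1199 project names implied by the config in the description (the naming and count are assumptions - verify against your deployment before use):

  # Hypothetical sketch: pre-create one index per logtest project before
  # the load starts, so index-creation cost is not paid during the run.
  today=$(date -u +%Y.%m.%d)
  for i in $(seq 0 1199); do
    ns="logtest${i}"
    uid=$(oc get namespace "$ns" -o jsonpath='{.metadata.uid}')
    oc exec -n logging "$es_pod" -- curl -s -XPUT \
      --cacert /etc/elasticsearch/secret/admin-ca \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      "https://localhost:9200/project.${ns}.${uid}.${today}"
  done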
Verified with logging-fluentd:v3.10.0-0.47.0.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816