Description of problem:

While logging at high rates in 4.4.5 (1000 pods, each logging at 2700 msgs/minute), fluentd consistently gets stuck and is unable to send messages to Elasticsearch. Messages flow for a while, then fluentd emits this pair of messages in the pod log:

2020-06-15 11:31:50 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block
2020-06-15 11:31:52 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=0 next_retry_seconds=2020-06-15 11:31:53 +0000 chunk="5a81da42735ed1155aef835175590e01" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Rejected Elasticsearch due to es_rejected_execution_exception"
2020-06-15 11:31:52 +0000 [warn]: suppressed same stacktrace
2020-06-15 11:32:06 +0000 [warn]: [clo_default_output_es] retry succeeded. chunk_id="5a81da42735ed1155aef835175590e01"
2020-06-15 11:32:07 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block
2020-06-15 11:32:07 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block
2020-06-15 11:32:09 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block

Version-Release number of selected component (if applicable):

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-05-27-032519   True        False         18d     Cluster version is 4.4.0-0.nightly-2020-05-27-032519

elasticsearch:
    image: registry.redhat.io/openshift4/ose-logging-elasticsearch5@sha256:64b32d1607b06a9378ca9bb299802742e27d11e71ba85e8455b44d67cb9ad1b2
fluentd:
    image: registry.redhat.io/openshift4/ose-logging-fluentd@sha256:3f06254368618499d20d9598789e7b9936d23bda44f28f6b72fc29b8a643e14b

How reproducible:

Always, with 1000 pods logging at 2700 msgs/minute (45,000 msgs/sec aggregate). Around 12 projects end up with significantly fewer messages indexed:

oc exec -n openshift-logging -c elasticsearch $POD -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v | grep logtest | grep -v 81000 | wc -l
12

Steps to Reproduce:
1. Kick off 1000 projects, 1 pod each, logging 81000 messages at a rate of 2700 msgs/minute (a rough sketch of such a loop follows these steps)
2. Wait for them to finish, then check the indices in ES to confirm every namespace has 81000 messages
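For reference, a rough sketch of the load-generation loop, not the exact harness used for this test; the busybox image and the shell logger are placeholders. 2700 msgs/minute works out to 45 msgs/sec, so 81000 messages take 30 minutes per pod:

for i in $(seq 1 1000); do
  oc new-project logtest$i
  # one pod per project; usleep 22222 ~= 1/45 sec between messages (45 msgs/sec = 2700 msgs/minute)
  oc -n logtest$i run logtest --image=busybox --restart=Never -- \
    sh -c 'n=0; while [ $n -lt 81000 ]; do echo "logtest message $n"; n=$((n+1)); usleep 22222; done'
done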
Actual results:

Around 12-15 projects have only roughly half of the expected messages:

oc exec -n openshift-logging -c elasticsearch $POD -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v | grep logtest | grep -v 81000

green open project.logtest999.29a045de-098f-4e97-bbb4-52208bfdddbd.2020.06.15 1BWQ5M3pQ7uV7Oq41luGlw 3 1 35046 0 48.9mb 24.4mb
green open project.logtest907.817193a8-b1ba-4543-a4b8-ff6a58ebd81c.2020.06.15 NgjNyT4LSFeXALsxxql0WQ 3 1 34793 0 48.8mb 24.5mb
green open project.logtest912.bf59f218-f083-4f80-82af-7a4d061f4776.2020.06.15 GSo1rnQ9Qmm_jdamSEjaeg 3 1 34862 0 48.9mb 24.4mb
green open project.logtest596.a032d051-3b32-45a8-9ae1-9bf48453db58.2020.06.15 rIty7I1KSZO6tvv57G1P0Q 3 1 46184 0 65.9mb 32.7mb
green open project.logtest904.c0d248c7-651a-44ff-a8d8-abf681f948df.2020.06.15 ZjWS6UrwSgeQiHmYUYKISQ 3 1 34839 0 49.8mb 24.5mb
green open project.logtest805.a9a5a947-f8a9-473b-a1de-d329c8d0b849.2020.06.15 MxianchGTyiZeUwAr9C16Q 3 1 34908 0 50.2mb 25.1mb
[...]

Expected results:

Every single project should have all of its messages indexed.

Additional info:

ES rejected a huge number of messages:

oc exec -n openshift-logging -c elasticsearch $POD -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool?v\&h=name,host,completed,active,queue,rejected

name                host          completed  active  queue  rejected
bulk                10.131.48.12    1468251       1      0     64673
fetch_shard_started 10.131.48.12          0       0      0         0
fetch_shard_store   10.131.48.12          1       0      0         0
flush               10.131.48.12      26145       0      0         0
force_merge         10.131.48.12          0       0      0         0
generic             10.131.48.12      28656       0      0         0
get                 10.131.48.12       1018       0      0         0
index               10.131.48.12          1       0      0         0
listener            10.131.48.12        324       0      0         0
management          10.131.48.12      27738       2      0         0
refresh             10.131.48.12    2371014       0      0         0
search              10.131.48.12       2694       0      0         0
snapshot            10.131.48.12          0       0      0         0
warmer              10.131.48.12    1044308       0      0         0
bulk                10.130.48.12    1496055       2      0     11468
fetch_shard_started 10.130.48.12          0       0      0         0
fetch_shard_store   10.130.48.12          1       0      0         0
flush               10.130.48.12      25892       0      0         0
force_merge         10.130.48.12          0       0      0         0
generic             10.130.48.12      53439       0      0         0
get                 10.130.48.12       2987       0      0         0
index               10.130.48.12          0       0      0         0
listener            10.130.48.12        324       0      0         0
management          10.130.48.12      25524       2      0         0
refresh             10.130.48.12    2374854       0      0         0
search              10.130.48.12       3383       0      0         0
snapshot            10.130.48.12          0       0      0         0
warmer              10.130.48.12    1026956       0      0         0
bulk                10.129.48.12    1921009       2      0       542
fetch_shard_started 10.129.48.12          0       0      0         0
fetch_shard_store   10.129.48.12          1       0      0         0
flush               10.129.48.12      25769       0      0         0
force_merge         10.129.48.12          0       0      0         0
generic             10.129.48.12      48500       0      0         0
get                 10.129.48.12       2600       0      0         0
index               10.129.48.12          0       0      0         0
listener            10.129.48.12          1       0      0         0
management          10.129.48.12      26125       2      0         0
refresh             10.129.48.12    2378830       0      0         0
search              10.129.48.12       4325       0      0         0
snapshot            10.129.48.12          0       0      0         0
warmer              10.129.48.12    1009057       0      0         0
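To watch only the bulk pool while the test runs, the _cat/thread_pool API can be narrowed to a single pool and extra columns requested; this uses the same $POD and certificates as above and assumes the per-pool path and queue_size column of the ES 5.x _cat API:

oc exec -n openshift-logging -c elasticsearch $POD -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool/bulk?v\&h=host,active,queue,queue_size,rejected

Re-running this every few seconds (e.g. under watch -n10) shows whether the rejected counter is still climbing and what queue_size ceiling the bursts are overflowing.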
Working back from the logging pod to the fluentd pod and the host it ran on, a debug pod on that host reveals:

sh-4.4# ls -lrt /var/lib/fluentd/clo_default_output_es/
total 80
-rw-r--r--. 1 root root  1641 Jun 15 12:26 buffer.q5a81e8a478809d9ad00161b5ab4f5d4a.log
-rw-r--r--. 1 root root    84 Jun 15 12:26 buffer.q5a81e8a478809d9ad00161b5ab4f5d4a.log.meta
-rw-r--r--. 1 root root  1186 Jun 15 12:26 buffer.q5a81e8a6beba11a59c86975e96057f9f.log
-rw-r--r--. 1 root root 51872 Jun 15 12:26 buffer.q5a81e8a56c7fb614edfd805ef6c9ef6c.log
-rw-r--r--. 1 root root   221 Jun 15 12:26 buffer.q5a81e8a6beba11a59c86975e96057f9f.log.meta
-rw-r--r--. 1 root root    84 Jun 15 12:26 buffer.q5a81e8a56c7fb614edfd805ef6c9ef6c.log.meta
-rw-r--r--. 1 root root   221 Jun 15 12:26 buffer.b5a81e8a7b2c65e7f2a2553acb9eca9f4.log.meta
-rw-r--r--. 1 root root  1186 Jun 15 12:26 buffer.b5a81e8a7b2c65e7f2a2553acb9eca9f4.log

The infra nodes are generously sized: they have 392 GiB of memory and 64 cores, and write to a locally attached NVMe drive. The network is a 25 GbE connection. During the test we see network throughput of around 1 Gbit/sec at 750k packets/sec, and the NVMe is at around 600 IOPS and 60 MB/sec. The Elasticsearch pods have 61 GB of memory and 16 cores assigned.
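As a follow-up on the fluentd side, the buffer settings in effect and the headroom under the file buffer path can be checked in one of the fluentd pods directly; the label selector, config path, and buffer path below are assumptions based on the 4.4 cluster-logging fluentd image:

FLUENTD=$(oc -n openshift-logging get pods -l component=fluentd -o name | head -1)
# dump the <buffer> section of the ES output to see the overflow/chunk/queue limits in effect
oc -n openshift-logging exec $FLUENTD -- grep -B2 -A12 '<buffer' /etc/fluent/fluent.conf
# confirm the file buffer volume itself is not out of space
oc -n openshift-logging exec $FLUENTD -- df -h /var/lib/fluentd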
Please attach information about the logging environment [1]

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/hack/logging-dump.sh
Must-gather is at http://file.str.redhat.com/mkarg/must-gather-logging.tgz
Logging dump is at http://file.str.redhat.com/mkarg/logging.tgz
Moving to UpcomingSprint as this is unlikely to be resolved before EOS.
Added the UpcomingSprint label; the investigation is ongoing as mentioned in [1], with no conclusive solution yet.

[1] https://coreos.slack.com/archives/CB3HXM2QK/p1593540395170500?thread_ts=1593308260.140300&cid=CB3HXM2QK
Closing per https://bugzilla.redhat.com/show_bug.cgi?id=1846174#c12