Description of problem:

While logging at high rates in 4.4.5 (1000 pods, each logging at 2700 msgs/minute), fluentd consistently gets stuck and is unable to send messages to Elasticsearch. Messages flow for a while, then fluentd emits this pair of messages in the pod log:

2020-06-15 11:31:50 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block
2020-06-15 11:31:52 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=0 next_retry_seconds=2020-06-15 11:31:53 +0000 chunk="5a81da42735ed1155aef835175590e01" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Rejected Elasticsearch due to es_rejected_execution_exception"
2020-06-15 11:31:52 +0000 [warn]: suppressed same stacktrace
2020-06-15 11:32:06 +0000 [warn]: [clo_default_output_es] retry succeeded. chunk_id="5a81da42735ed1155aef835175590e01"
2020-06-15 11:32:07 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block
2020-06-15 11:32:07 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block
2020-06-15 11:32:09 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block

Version-Release number of selected component (if applicable):

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-05-27-032519   True        False         18d     Cluster version is 4.4.0-0.nightly-2020-05-27-032519

elasticsearch:
    image: registry.redhat.io/openshift4/ose-logging-elasticsearch5@sha256:64b32d1607b06a9378ca9bb299802742e27d11e71ba85e8455b44d67cb9ad1b2
fluentd:
    image: registry.redhat.io/openshift4/ose-logging-fluentd@sha256:3f06254368618499d20d9598789e7b9936d23bda44f28f6b72fc29b8a643e14b

How reproducible:

Always, with 1000 pods logging at 2700 msgs/minute (45,000 msgs/sec aggregate). Around 12 projects end up with significantly fewer messages indexed:

oc exec -n openshift-logging -c elasticsearch $POD -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v | grep logtest | grep -v 81000 | wc -l
12

Steps to Reproduce:
1. Kick off 1000 projects, 1 pod each, logging 81000 messages at a rate of 2700 msgs/minute (a rough sketch of such a loop follows these steps)
2. Wait for them to finish, then check the indices in ES to confirm every namespace has 81000 messages
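For reference, a rough sketch of the load-generation loop, not the exact harness used for this test; the busybox image and the shell logger are placeholders. 2700 msgs/minute works out to 45 msgs/sec, so 81000 messages take 30 minutes per pod:

for i in $(seq 1 1000); do
  oc new-project logtest$i
  # one pod per project; usleep 22222 ~= 1/45 sec between messages (45 msgs/sec = 2700 msgs/minute)
  oc -n logtest$i run logtest --image=busybox --restart=Never -- \
    sh -c 'n=0; while [ $n -lt 81000 ]; do echo "logtest message $n"; n=$((n+1)); usleep 22222; done'
done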
Actual results:

Around 12-15 projects have only roughly half of the expected messages:

oc exec -n openshift-logging -c elasticsearch $POD -- curl --connect-timeout 2 -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/indices?v | grep logtest | grep -v 81000

green open project.logtest999.29a045de-098f-4e97-bbb4-52208bfdddbd.2020.06.15 1BWQ5M3pQ7uV7Oq41luGlw 3 1 35046 0 48.9mb 24.4mb
green open project.logtest907.817193a8-b1ba-4543-a4b8-ff6a58ebd81c.2020.06.15 NgjNyT4LSFeXALsxxql0WQ 3 1 34793 0 48.8mb 24.5mb
green open project.logtest912.bf59f218-f083-4f80-82af-7a4d061f4776.2020.06.15 GSo1rnQ9Qmm_jdamSEjaeg 3 1 34862 0 48.9mb 24.4mb
green open project.logtest596.a032d051-3b32-45a8-9ae1-9bf48453db58.2020.06.15 rIty7I1KSZO6tvv57G1P0Q 3 1 46184 0 65.9mb 32.7mb
green open project.logtest904.c0d248c7-651a-44ff-a8d8-abf681f948df.2020.06.15 ZjWS6UrwSgeQiHmYUYKISQ 3 1 34839 0 49.8mb 24.5mb
green open project.logtest805.a9a5a947-f8a9-473b-a1de-d329c8d0b849.2020.06.15 MxianchGTyiZeUwAr9C16Q 3 1 34908 0 50.2mb 25.1mb
[...]

Expected results:

Every single project should have all of its messages indexed.

Additional info:

ES rejected a huge number of messages:

oc exec -n openshift-logging -c elasticsearch $POD -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool?v\&h=name,host,completed,active,queue,rejected

name                host          completed  active  queue  rejected
bulk                10.131.48.12    1468251       1      0     64673
fetch_shard_started 10.131.48.12          0       0      0         0
fetch_shard_store   10.131.48.12          1       0      0         0
flush               10.131.48.12      26145       0      0         0
force_merge         10.131.48.12          0       0      0         0
generic             10.131.48.12      28656       0      0         0
get                 10.131.48.12       1018       0      0         0
index               10.131.48.12          1       0      0         0
listener            10.131.48.12        324       0      0         0
management          10.131.48.12      27738       2      0         0
refresh             10.131.48.12    2371014       0      0         0
search              10.131.48.12       2694       0      0         0
snapshot            10.131.48.12          0       0      0         0
warmer              10.131.48.12    1044308       0      0         0
bulk                10.130.48.12    1496055       2      0     11468
fetch_shard_started 10.130.48.12          0       0      0         0
fetch_shard_store   10.130.48.12          1       0      0         0
flush               10.130.48.12      25892       0      0         0
force_merge         10.130.48.12          0       0      0         0
generic             10.130.48.12      53439       0      0         0
get                 10.130.48.12       2987       0      0         0
index               10.130.48.12          0       0      0         0
listener            10.130.48.12        324       0      0         0
management          10.130.48.12      25524       2      0         0
refresh             10.130.48.12    2374854       0      0         0
search              10.130.48.12       3383       0      0         0
snapshot            10.130.48.12          0       0      0         0
warmer              10.130.48.12    1026956       0      0         0
bulk                10.129.48.12    1921009       2      0       542
fetch_shard_started 10.129.48.12          0       0      0         0
fetch_shard_store   10.129.48.12          1       0      0         0
flush               10.129.48.12      25769       0      0         0
force_merge         10.129.48.12          0       0      0         0
generic             10.129.48.12      48500       0      0         0
get                 10.129.48.12       2600       0      0         0
index               10.129.48.12          0       0      0         0
listener            10.129.48.12          1       0      0         0
management          10.129.48.12      26125       2      0         0
refresh             10.129.48.12    2378830       0      0         0
search              10.129.48.12       4325       0      0         0
snapshot            10.129.48.12          0       0      0         0
warmer              10.129.48.12    1009057       0      0         0
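To watch only the bulk pool while the test runs, the _cat/thread_pool API can be narrowed to a single pool and extra columns requested; this uses the same $POD and certificates as above and assumes the per-pool path and queue_size column of the ES 5.x _cat API:

oc exec -n openshift-logging -c elasticsearch $POD -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/thread_pool/bulk?v\&h=host,active,queue,queue_size,rejected

Re-running this every few seconds (e.g. under watch -n10) shows whether the rejected counter is still climbing and what queue_size ceiling the bursts are overflowing.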
Working back from the logging pod to the fluentd pod and the host it ran on, a debug pod on that host reveals:

sh-4.4# ls -lrt /var/lib/fluentd/clo_default_output_es/
total 80
-rw-r--r--. 1 root root  1641 Jun 15 12:26 buffer.q5a81e8a478809d9ad00161b5ab4f5d4a.log
-rw-r--r--. 1 root root    84 Jun 15 12:26 buffer.q5a81e8a478809d9ad00161b5ab4f5d4a.log.meta
-rw-r--r--. 1 root root  1186 Jun 15 12:26 buffer.q5a81e8a6beba11a59c86975e96057f9f.log
-rw-r--r--. 1 root root 51872 Jun 15 12:26 buffer.q5a81e8a56c7fb614edfd805ef6c9ef6c.log
-rw-r--r--. 1 root root   221 Jun 15 12:26 buffer.q5a81e8a6beba11a59c86975e96057f9f.log.meta
-rw-r--r--. 1 root root    84 Jun 15 12:26 buffer.q5a81e8a56c7fb614edfd805ef6c9ef6c.log.meta
-rw-r--r--. 1 root root   221 Jun 15 12:26 buffer.b5a81e8a7b2c65e7f2a2553acb9eca9f4.log.meta
-rw-r--r--. 1 root root  1186 Jun 15 12:26 buffer.b5a81e8a7b2c65e7f2a2553acb9eca9f4.log

The infra nodes are generously sized: they have 392 GiB of memory and 64 cores, and write to a locally attached NVMe drive. The network is a 25 GbE connection. During the test we see network throughput of around 1 Gbit/sec at 750k packets/sec, and the NVMe is at around 600 IOPS and 60 MB/sec. The Elasticsearch pods have 61 GB of memory and 16 cores assigned.
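As a follow-up on the fluentd side, the buffer settings in effect and the headroom under the file buffer path can be checked in one of the fluentd pods directly; the label selector, config path, and buffer path below are assumptions based on the 4.4 cluster-logging fluentd image:

FLUENTD=$(oc -n openshift-logging get pods -l component=fluentd -o name | head -1)
# dump the <buffer> section of the ES output to see the overflow/chunk/queue limits in effect
oc -n openshift-logging exec $FLUENTD -- grep -B2 -A12 '<buffer' /etc/fluent/fluent.conf
# confirm the file buffer volume itself is not out of space
oc -n openshift-logging exec $FLUENTD -- df -h /var/lib/fluentd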
Please attach information about the logging environment [1]

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/hack/logging-dump.sh
Must-gather is at http://file.str.redhat.com/mkarg/must-gather-logging.tgz
Logging dump is at http://file.str.redhat.com/mkarg/logging.tgz
Moving to UpcomingSprint as this is unlikely to be resolved before EOS.
Added the UpcomingSprint label; the investigation is ongoing as mentioned in [1], with no conclusive solution yet.

[1] https://coreos.slack.com/archives/CB3HXM2QK/p1593540395170500?thread_ts=1593308260.140300&cid=CB3HXM2QK
Closing per https://bugzilla.redhat.com/show_bug.cgi?id=1846174#c12