Bug 1479369 - "buffer flush took longer" warnings in fluentd logs - takes a long time to write records to Elasticsearch
Summary: "buffer flush took longer" warnings in fluentd logs - takes a long time to wr...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jeff Cantrill
QA Contact: Xia Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1724792
 
Reported: 2017-08-08 12:54 UTC by Thom Carlin
Modified: 2020-09-10 11:10 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-18 16:12:34 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1451370 0 urgent CLOSED [DOCS] Elasticsearch and NFS Supportability 2021-09-09 12:18:26 UTC
Red Hat Bugzilla 1483020 0 unspecified CLOSED [Docs] Install Guide: Section 2.6.3.16.1: Don't recommend NFS as logging storage 2021-02-22 00:41:40 UTC

Internal Links: 1451370 1483020

Description Thom Carlin 2017-08-08 12:54:48 UTC
Description of problem:

I'm encountering an issue with a fully patched OpenShift 3.5 Cluster.

Version-Release number of selected component (if applicable):

3.5

How reproducible:

100% on this cluster

Steps to Reproduce:
1. Run journalctl --verify on each node
2. Check for errors in /var/log/messages
3. oc get pods -n logging

Actual results:

journalctl --verify
27a698: Data object references invalid entry at 11e9eb8                                                                                                    
File corruption detected at /run/log/journal/7c4e05a4e9ab41b4b4ca12dd2c996fc7/system.journal:11e9d18 (of 25165824 bytes, 74%).          
----------------------------------
Aug  8 08:50:15 openshift-node1 journal: 2017-08-08 08:50:15 -0400 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error="queue size exceeds limit" tag="journal"
Aug  8 08:50:15 openshift-node1 journal:  2017-08-08 08:50:15 -0400 [warn]: suppressed same stacktrace
Aug  8 08:50:15 openshift-node1 journal: 2017-08-08 08:50:15 -0400 [error]: Exception emitting record: queue size exceeds limit
Aug  8 08:50:15 openshift-node1 journal: 2017-08-08 08:50:15 -0400 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error="queue size exceeds limit" tag="kubernetes.journal.container.fluentd"
Aug  8 08:50:15 openshift-node1 journal:  2017-08-08 08:50:15 -0400 [warn]: suppressed same stacktrace
-------------------------------------
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-10-s9nzx      1/1       Running   916        5d
logging-es-v1i1axoy-4-c5vp8   1/1       Running   0          5d
logging-fluentd-2nkwz         1/1       Running   5          13d
logging-fluentd-3h5x7         1/1       Running   5          13d
logging-fluentd-4pcvp         1/1       Running   5          13d
logging-fluentd-ln9vx         1/1       Running   5          13d
logging-kibana-6-9h9l5        2/2       Running   0          6d


Expected results:

No errors, no restarts

Additional info:

sosreport from currently affected node to follow.
The data corruption comes and goes but is frequent enough to warrant this bz.
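
For reference, the Fluent::BufferQueueLimitError messages above mean Fluentd's output buffer queue filled faster than it could be flushed to Elasticsearch. A minimal sketch for inspecting the buffer settings involved, assuming the standard logging-fluentd daemonset in the logging namespace, a placeholder pod name <fluentd-pod>, and the image keeping its config under /etc/fluent/:

# Any buffer-related env overrides set on the daemonset (names vary by image version)
oc set env ds/logging-fluentd -n logging --list | grep -i buffer

# The rendered Fluentd config carries the buffer_queue_limit parameter that
# triggers the "queue size exceeds limit" error when exceeded
oc exec -n logging <fluentd-pod> -- grep -ri buffer_queue_limit /etc/fluent/ 2>/dev/null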

Comment 13 Thom Carlin 2017-08-17 18:12:55 UTC
In my case, the Elasticsearch cluster status was red.  After correcting the issues with the primary shards, the cluster status switched to yellow.

Working on switching cluster status to green...
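
For reference, the cluster health and the per-shard allocation state can be read from inside the ES pod with the admin certificate; a sketch, with $espod standing in for the Elasticsearch pod name:

oc exec -n logging $espod -- \
  curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  https://localhost:9200/_cluster/health?pretty    # overall status: red/yellow/green

oc exec -n logging $espod -- \
  curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  https://localhost:9200/_cat/shards?v             # per-shard state, including UNASSIGNED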

Comment 14 Thom Carlin 2017-08-17 19:55:26 UTC
Correcting issues with replicas switched the cluster status to green

This eventually stopped the flood of "buffer flush took longer time than slow_flush_log_threshold" from 2 of the 3 pods.

The 3rd one is still coming....awaiting a bit to see if the backlog flushes through...
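
With a single ES node (see comment 21), replica shards can never be allocated, so one common way to get such a cluster to green is to set replicas to zero; a sketch only, not necessarily what was done here:

oc exec -n logging $espod -- \
  curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  -XPUT https://localhost:9200/_settings \
  -d '{"index":{"number_of_replicas":0}}'    # applies to all existing indices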

Comment 15 Rich Megginson 2017-08-17 20:13:19 UTC
(In reply to Thom Carlin from comment #14)
> Correcting issues with replicas switched the cluster status to green
> 
> This eventually stopped the flood of "buffer flush took longer time than
> slow_flush_log_threshold" from 2 of the 3 pods.
> 
> The 3rd one is still coming....awaiting a bit to see if the backlog flushes
> through...

How long does it take to do the buffer flush?  25 seconds? 250 seconds?  Is the number decreasing?

Comment 16 Thom Carlin 2017-08-17 20:22:47 UTC
2017-08-17 15:14:15 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=246.001323673 slow_flush_log_threshold=20.0
2017-08-17 15:16:31 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=149.990064719 slow_flush_log_threshold=20.0
2017-08-17 15:18:26 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=250.099273213 slow_flush_log_threshold=20.0
2017-08-17 15:19:06 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=153.89762791 slow_flush_log_threshold=20.0
2017-08-17 15:21:44 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.909664502 slow_flush_log_threshold=20.0
2017-08-17 15:22:32 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=244.299740517 slow_flush_log_threshold=20.0
2017-08-17 15:24:28 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=162.306563886 slow_flush_log_threshold=20.0
2017-08-17 15:26:51 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=257.803858187 slow_flush_log_threshold=20.0
2017-08-17 15:26:54 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=144.998355812 slow_flush_log_threshold=20.0
2017-08-17 15:29:31 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=155.90573981 slow_flush_log_threshold=20.0
2017-08-17 15:31:12 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.406557543 slow_flush_log_threshold=20.0
2017-08-17 15:32:02 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=150.295175093 slow_flush_log_threshold=20.0
2017-08-17 15:34:40 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.298758368 slow_flush_log_threshold=20.0
2017-08-17 15:35:24 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=250.604503891 slow_flush_log_threshold=20.0
2017-08-17 15:36:26 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=104.798759375 slow_flush_log_threshold=20.0
2017-08-17 15:38:37 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=192.095328529 slow_flush_log_threshold=20.0
2017-08-17 15:38:53 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=144.984546163 slow_flush_log_threshold=20.0
2017-08-17 15:41:28 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=154.500503306 slow_flush_log_threshold=20.0
2017-08-17 15:42:35 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=236.900520096 slow_flush_log_threshold=20.0
2017-08-17 15:44:03 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=153.003680078 slow_flush_log_threshold=20.0
2017-08-17 15:46:40 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.308895855 slow_flush_log_threshold=20.0
2017-08-17 15:46:50 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=253.894442444 slow_flush_log_threshold=20.0
2017-08-17 15:49:17 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=155.594732884 slow_flush_log_threshold=20.0
2017-08-17 15:50:51 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=239.999839277 slow_flush_log_threshold=20.0
2017-08-17 15:51:16 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=117.19267143 slow_flush_log_threshold=20.0
2017-08-17 15:53:32 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=135.401384603 slow_flush_log_threshold=20.0
2017-08-17 15:54:18 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=206.291535569 slow_flush_log_threshold=20.0
2017-08-17 15:56:00 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=146.192776582 slow_flush_log_threshold=20.0
2017-08-17 15:58:34 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=153.302029948 slow_flush_log_threshold=20.0
2017-08-17 15:58:38 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.603383885 slow_flush_log_threshold=20.0
2017-08-17 16:01:11 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=155.401761219 slow_flush_log_threshold=20.0
2017-08-17 16:02:58 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.600908303 slow_flush_log_threshold=20.0
2017-08-17 16:03:45 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.302507632 slow_flush_log_threshold=20.0
2017-08-17 16:06:25 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=158.702599969 slow_flush_log_threshold=20.0
2017-08-17 16:07:16 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=256.806270411 slow_flush_log_threshold=20.0
2017-08-17 16:08:59 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.804790642 slow_flush_log_threshold=20.0
2017-08-17 16:11:34 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=154.100538274 slow_flush_log_threshold=20.0
2017-08-17 16:11:34 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=257.194641157 slow_flush_log_threshold=20.0
2017-08-17 16:14:11 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.104507803 slow_flush_log_threshold=20.0
2017-08-17 16:15:51 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=255.596152934 slow_flush_log_threshold=20.0
2017-08-17 16:16:45 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.898364436 slow_flush_log_threshold=20.0
2017-08-17 16:19:20 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=153.394335116 slow_flush_log_threshold=20.0
2017-08-17 16:20:00 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=248.200898436 slow_flush_log_threshold=20.0
2017-08-17 16:21:47 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=146.108346553 slow_flush_log_threshold=20.0

As you can see, it's fluctuating.
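
A quick way to summarize these values (a sketch, assuming the warnings were saved to a file, here called fluentd.log):

grep slow_flush_log_threshold fluentd.log | \
  sed 's/.*elapsed_time=\([0-9.]*\).*/\1/' | \
  awk '{ n++; s+=$1; if ($1>max) max=$1 } END { if (n) printf "count=%d avg=%.1fs max=%.1fs\n", n, s/n, max }'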

Comment 18 Thom Carlin 2017-08-17 22:16:51 UTC
top - 22:13:43 up 6 days,  8:14,  0 users,  load average: 0.43, 0.75, 0.96
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  3.2 us,  3.2 sy,  0.0 ni, 93.4 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 20393808 total,  6646908 free,  5276676 used,  8470224 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13572440 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     1 1000050+  20   0 9935.5m 3.432g  20288 S  13.3 17.6  87:27.34 java
    13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
   189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
  2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
  4792 1000050+  20   0   51752   1828   1400 R   0.0  0.0   0:00.01 top

top - 22:13:46 up 6 days,  8:14,  0 users,  load average: 0.55, 0.77, 0.96
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  8.6 us,  7.3 sy,  0.0 ni, 83.6 id,  0.1 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 20393808 total,  6649796 free,  5276532 used,  8467480 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13572600 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     1 1000050+  20   0 9935.5m 3.432g  20288 S  28.9 17.6  87:28.21 java
    13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
   189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
  2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
  4792 1000050+  20   0   51760   1880   1432 R   0.0  0.0   0:00.01 top

top - 22:13:49 up 6 days,  8:14,  0 users,  load average: 0.55, 0.77, 0.96
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  5.8 us,  5.1 sy,  0.0 ni, 88.8 id,  0.1 wa,  0.0 hi,  0.2 si,  0.1 st
KiB Mem : 20393808 total,  6648944 free,  5277148 used,  8467716 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13571972 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     1 1000050+  20   0 9935.5m 3.432g  20288 S  23.3 17.6  87:28.91 java
  4792 1000050+  20   0   51760   1880   1432 R   0.3  0.0   0:00.02 top
    13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
   189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
  2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh

host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize 
<ip_address>          18849             0          0           0             50

Comment 19 Thom Carlin 2017-08-17 22:18:46 UTC
2017-08-17 16:24:20 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.697979682 slow_flush_log_threshold=20.0
2017-08-17 16:26:55 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=157.90796225 slow_flush_log_threshold=20.0
2017-08-17 16:28:40 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.902158075 slow_flush_log_threshold=20.0
2017-08-17 16:29:30 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=154.099400386 slow_flush_log_threshold=20.0
2017-08-17 16:32:08 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=157.300068405 slow_flush_log_threshold=20.0
2017-08-17 16:32:55 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=253.20043528 slow_flush_log_threshold=20.0
2017-08-17 16:34:42 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.50548935 slow_flush_log_threshold=20.0
2017-08-17 16:37:13 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=257.703941045 slow_flush_log_threshold=20.0
2017-08-17 16:37:16 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.298880483 slow_flush_log_threshold=20.0
2017-08-17 16:39:48 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=151.301171823 slow_flush_log_threshold=20.0
2017-08-17 16:41:33 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=257.70086703 slow_flush_log_threshold=20.0
2017-08-17 16:42:20 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=151.194774041 slow_flush_log_threshold=20.0
2017-08-17 16:44:58 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.099872266 slow_flush_log_threshold=20.0
2017-08-17 16:45:46 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=252.093410693 slow_flush_log_threshold=20.0
2017-08-17 16:47:25 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=145.899483185 slow_flush_log_threshold=20.0
2017-08-17 16:49:57 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=150.602060507 slow_flush_log_threshold=20.0
2017-08-17 16:50:00 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=251.99061783 slow_flush_log_threshold=20.0
2017-08-17 16:50:54 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=57.495522521 slow_flush_log_threshold=20.0
2017-08-17 16:53:20 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=199.403810753 slow_flush_log_threshold=20.0
2017-08-17 16:56:29 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=188.000272064 slow_flush_log_threshold=20.0
2017-08-17 16:59:38 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=187.291888091 slow_flush_log_threshold=20.0
2017-08-17 17:02:26 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=167.191999463 slow_flush_log_threshold=20.0
2017-08-17 17:04:56 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=148.805876286 slow_flush_log_threshold=20.0
2017-08-17 17:07:39 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=161.694220795 slow_flush_log_threshold=20.0
2017-08-17 17:10:44 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=184.203056929 slow_flush_log_threshold=20.0
2017-08-17 17:13:53 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=187.301750413 slow_flush_log_threshold=20.0
2017-08-17 17:16:56 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=182.298950691 slow_flush_log_threshold=20.0
2017-08-17 17:20:07 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=189.898434071 slow_flush_log_threshold=20.0
2017-08-17 17:23:16 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=187.900091319 slow_flush_log_threshold=20.0
2017-08-17 17:26:17 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=180.296828518 slow_flush_log_threshold=20.0
2017-08-17 17:29:23 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=184.399529706 slow_flush_log_threshold=20.0
2017-08-17 17:32:30 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=186.198004757 slow_flush_log_threshold=20.0
2017-08-17 17:35:40 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=188.598763135 slow_flush_log_threshold=20.0
2017-08-17 17:38:42 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=181.101618847 slow_flush_log_threshold=20.0
2017-08-17 17:41:46 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=183.000149539 slow_flush_log_threshold=20.0
2017-08-17 17:44:59 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=191.59765499 slow_flush_log_threshold=20.0
2017-08-17 17:48:10 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=189.896203059 slow_flush_log_threshold=20.0
2017-08-17 17:51:11 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=179.903857789 slow_flush_log_threshold=20.0
2017-08-17 17:54:21 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=189.501076467 slow_flush_log_threshold=20.0
2017-08-17 17:57:26 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=183.298757121 slow_flush_log_threshold=20.0
2017-08-17 18:00:29 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=182.499254473 slow_flush_log_threshold=20.0
2017-08-17 18:03:45 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=194.406760847 slow_flush_log_threshold=20.0
2017-08-17 18:06:53 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=186.698915934 slow_flush_log_threshold=20.0
2017-08-17 18:09:56 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=182.596800862 slow_flush_log_threshold=20.0
2017-08-17 18:13:09 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=191.899211195 slow_flush_log_threshold=20.0
2017-08-17 18:16:14 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=183.792888629 slow_flush_log_threshold=20.0

At the moment, the range appears to be smaller.

Comment 20 Rich Megginson 2017-08-17 22:27:00 UTC
Can you provide the ES dc configurations?

oc get -n logging dc -l component=es -o yaml


and also the following information from all of the es pods:

for espod in $( oc get -n logging pod -l component=es -o jsonpath='{.items[*].metadata.name}' ) ; do
    # container stdout plus the Elasticsearch log file from inside the pod
    oc logs -n logging $espod > $espod.log
    oc exec $espod -- cat /elasticsearch/logging-es/logs/logging-es.log >> $espod.log
    # three samples of top from inside the pod
    oc exec $espod -- top -b -n 3 > $espod.top
done

How many ES nodes/pods are there?

Comment 21 Thom Carlin 2017-08-17 23:16:27 UTC
apiVersion: v1
items:
- apiVersion: v1
  kind: DeploymentConfig
  metadata:
    creationTimestamp: 2017-06-08T11:57:01Z
    generation: 21
    labels:
      component: es
      deployment: logging-es-v1i1axoy
      logging-infra: elasticsearch
      provider: openshift
    name: logging-es-v1i1axoy
    namespace: logging
    resourceVersion: "12578995"
    selfLink: /oapi/v1/namespaces/logging/deploymentconfigs/logging-es-v1i1axoy
    uid: 94ed372d-4c41-11e7-8da6-001a4a160152
  spec:
    replicas: 1
    selector:
      component: es
      deployment: logging-es-v1i1axoy
      logging-infra: elasticsearch
      provider: openshift
    strategy:
      activeDeadlineSeconds: 21600
      recreateParams:
        timeoutSeconds: 600
      resources: {}
      type: Recreate
    template:
      metadata:
        creationTimestamp: null
        labels:
          component: es
          deployment: logging-es-v1i1axoy
          logging-infra: elasticsearch
          provider: openshift
        name: logging-es-v1i1axoy
      spec:
        containers:
        - env:
          - name: NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: KUBERNETES_TRUST_CERT
            value: "true"
          - name: SERVICE_DNS
            value: logging-es-cluster
          - name: CLUSTER_NAME
            value: logging-es
          - name: INSTANCE_RAM
            value: 8Gi
          - name: NODE_QUORUM
            value: "1"
          - name: RECOVER_AFTER_NODES
            value: "0"
          - name: RECOVER_EXPECTED_NODES
            value: "1"
          - name: RECOVER_AFTER_TIME
            value: 5m
          image: registry.access.redhat.com/openshift3/logging-elasticsearch:3.5.0
          imagePullPolicy: Always
          name: elasticsearch
          ports:
          - containerPort: 9200
            name: restapi
            protocol: TCP
          - containerPort: 9300
            name: cluster
            protocol: TCP
          resources:
            limits:
              memory: 8Gi
            requests:
              memory: 512Mi
          terminationMessagePath: /dev/termination-log
          volumeMounts:
          - mountPath: /etc/elasticsearch/secret
            name: elasticsearch
            readOnly: true
          - mountPath: /usr/share/java/elasticsearch/config
            name: elasticsearch-config
            readOnly: true
          - mountPath: /elasticsearch/persistent
            name: elasticsearch-storage
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        securityContext:
          supplementalGroups:
          - 65534
        serviceAccount: aggregated-logging-elasticsearch
        serviceAccountName: aggregated-logging-elasticsearch
        terminationGracePeriodSeconds: 30
        volumes:
        - name: elasticsearch
          secret:
            defaultMode: 420
            secretName: logging-elasticsearch
        - configMap:
            defaultMode: 420
            name: logging-elasticsearch
          name: elasticsearch-config
        - name: elasticsearch-storage
          persistentVolumeClaim:
            claimName: logging-es-0
    test: false
    triggers:
    - type: ConfigChange
  status:
    availableReplicas: 0
    conditions:
    - lastTransitionTime: 2017-06-08T11:57:01Z
      lastUpdateTime: 2017-06-08T11:57:01Z
      message: Deployment config does not have minimum availability.
      status: "False"
      type: Available
    - lastTransitionTime: 2017-08-17T12:28:15Z
      lastUpdateTime: 2017-08-17T12:28:26Z
      message: replication controller "logging-es-v1i1axoy-5" successfully rolled
        out
      reason: NewReplicationControllerAvailable
      status: "True"
      type: Progressing
    details:
      causes:
      - type: Manual
      message: manual change
    latestVersion: 5
    observedGeneration: 21
    readyReplicas: 1
    replicas: 1
    unavailableReplicas: 1
    updatedReplicas: 1
kind: List
metadata: {}
resourceVersion: ""
selfLink: ""
-------------------------------------------------------------------------------
top - 23:14:12 up 6 days,  9:14,  0 users,  load average: 1.00, 0.79, 0.67
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  3.2 us,  3.2 sy,  0.0 ni, 93.3 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 20393808 total,  6430736 free,  5282352 used,  8680720 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13558596 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     1 1000050+  20   0 9949.0m 3.432g  20276 S  20.0 17.6 102:51.36 java
    13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
   189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
  2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
  5155 1000050+  20   0   51752   1824   1400 R   0.0  0.0   0:00.01 top

top - 23:14:15 up 6 days,  9:14,  0 users,  load average: 1.00, 0.80, 0.67
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  4.4 us,  4.1 sy,  0.1 ni, 91.1 id,  0.1 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 20393808 total,  6430992 free,  5281836 used,  8680980 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13559084 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     1 1000050+  20   0 9949.0m 3.432g  20276 S  14.3 17.6 102:51.79 java
    13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
   189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
  2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
  5155 1000050+  20   0   51760   1876   1432 R   0.0  0.0   0:00.01 top

top - 23:14:18 up 6 days,  9:14,  0 users,  load average: 0.92, 0.78, 0.67
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  7.7 us,  5.2 sy,  0.1 ni, 86.4 id,  0.3 wa,  0.0 hi,  0.3 si,  0.1 st
KiB Mem : 20393808 total,  6427620 free,  5282520 used,  8683668 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13558384 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     1 1000050+  20   0 9949.0m 3.432g  20276 S  31.7 17.6 102:52.74 java
    13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
   189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
  2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
  5155 1000050+  20   0   51760   1876   1432 R   0.0  0.0   0:00.01 top
-------------------------------------------------------------------------------
The log is in a private attachment

1 es node/pod

Comment 23 Rich Megginson 2017-08-17 23:36:11 UTC
I'm not sure what else to look for at the moment - do any of you on the CC list have any ideas?

What is the storage being used in the pv?  Is it a local disk?  NFS?
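
For reference, the backing storage can be read off the PV bound to the claim in the dc above (claimName: logging-es-0); a sketch:

# The PV's source stanza (nfs:, hostPath:, glusterfs:, ...) identifies the storage type
oc get pvc logging-es-0 -n logging
oc get pv $(oc get pvc logging-es-0 -n logging -o jsonpath='{.spec.volumeName}') -o yaml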

Comment 24 Thom Carlin 2017-08-18 00:32:27 UTC
Currently, NFS.  We recently added CNS (Container-Native Storage) to the OpenShift cluster, so I could switch it over...

Comment 25 Rich Megginson 2017-08-18 02:09:21 UTC
Maybe NFS isn't fast enough to keep up with Elasticsearch (NFS is not supported as the backing filesystem for Elasticsearch), but if that were the case I would expect to see a higher queue size.  Run this curl periodically to see whether the queue size increases:

oc exec -n logging $espod -- \
  curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
https://localhost:9200/_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize 

This would be the "bulk.queue" and "bulk.active" fields.
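
A small polling loop along those lines (a sketch; the five-minute interval is arbitrary):

while true; do
  date
  oc exec -n logging $espod -- \
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
    --key /etc/elasticsearch/secret/admin-key \
    "https://localhost:9200/_cat/thread_pool?v&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize"
  sleep 300
done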

Comment 26 Thom Carlin 2017-08-18 02:10:29 UTC
ack, will do tomorrow and let you know.  Thanks for all your help today

Comment 27 Thom Carlin 2017-08-18 12:14:20 UTC
Fri Aug 18 07:49:57 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize 
<ip_address>         104183             0          0           0             50 
Fri Aug 18 07:55:07 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize 
<ip_address>         104724             0          0           0             50 
Fri Aug 18 08:00:19 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize 
<ip_address>         105266             0          0           0             50 
Fri Aug 18 08:06:57 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize 
<ip_address>         105955             0          0           0             50 
Fri Aug 18 08:12:53 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize 
<ip_address>         106569             0          0           0             50

Comment 28 Thom Carlin 2017-08-18 16:12:34 UTC
Rechecking all the nodes (including master), the last "[warn]: buffer flush took longer time than slow_flush_log_threshold" happened last night at 18:46:01.

Closing since fixing the underlying issue resolved the reported problem.

At the end, the buffer flush warnings suddenly dropped off:
2017-08-17 18:41:09 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=187.103059266 slow_flush_log_threshold=20.0
2017-08-17 18:44:19 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=188.895863691 slow_flush_log_threshold=20.0
2017-08-17 18:46:01 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=101.59663009 slow_flush_log_threshold=20.0 < Last error seen

So it may have had a huge backlog that simply took time to process through.

Closing this one as NOTABUG

