Description of problem:

I'm encountering an issue with a fully patched OpenShift 3.5 cluster.

Version-Release number of selected component (if applicable):

3.5

How reproducible:

100% on this cluster

Steps to Reproduce:
1. Run journalctl --verify on each node
2. Check for errors in /var/log/messages
3. oc get pods -n logging

Actual results:

journalctl --verify
27a698: Data object references invalid entry at 11e9eb8
File corruption detected at /run/log/journal/7c4e05a4e9ab41b4b4ca12dd2c996fc7/system.journal:11e9d18 (of 25165824 bytes, 74%).
----------------------------------
Aug 8 08:50:15 openshift-node1 journal: 2017-08-08 08:50:15 -0400 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error="queue size exceeds limit" tag="journal"
Aug 8 08:50:15 openshift-node1 journal: 2017-08-08 08:50:15 -0400 [warn]: suppressed same stacktrace
Aug 8 08:50:15 openshift-node1 journal: 2017-08-08 08:50:15 -0400 [error]: Exception emitting record: queue size exceeds limit
Aug 8 08:50:15 openshift-node1 journal: 2017-08-08 08:50:15 -0400 [warn]: emit transaction failed: error_class=Fluent::BufferQueueLimitError error="queue size exceeds limit" tag="kubernetes.journal.container.fluentd"
Aug 8 08:50:15 openshift-node1 journal: 2017-08-08 08:50:15 -0400 [warn]: suppressed same stacktrace
-------------------------------------
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-10-s9nzx      1/1       Running   916        5d
logging-es-v1i1axoy-4-c5vp8   1/1       Running   0          5d
logging-fluentd-2nkwz         1/1       Running   5          13d
logging-fluentd-3h5x7         1/1       Running   5          13d
logging-fluentd-4pcvp         1/1       Running   5          13d
logging-fluentd-ln9vx         1/1       Running   5          13d
logging-kibana-6-9h9l5        2/2       Running   0          6d

Expected results:

No errors and no pod restarts.

Additional info:

A sosreport from a currently affected node will follow. The data corruption comes and goes, but is frequent enough to warrant this bug.
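Step 1 of the reproduction can be scripted across the cluster. A minimal sketch, assuming passwordless SSH from the master to each node (the function name is mine, not part of the original report):

```shell
# verify_node_journals: run `journalctl --verify` on every node reported by
# `oc get nodes` and surface any lines that indicate journal corruption.
# Assumes passwordless SSH to each node (a hypothetical setup).
verify_node_journals() {
  for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
    echo "== $node =="
    ssh "$node" journalctl --verify 2>&1 \
      | grep -iE 'corrupt|invalid' || echo "no corruption reported"
  done
}

# Example:
# verify_node_journals
```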
In my case, the Elasticsearch cluster status was red. After correcting the issues with the primary shards, the cluster status switched to yellow. Now working on switching the cluster status to green...
Correcting issues with the replicas switched the cluster status to green.

This eventually stopped the flood of "buffer flush took longer time than slow_flush_log_threshold" warnings from 2 of the 3 pods.

The 3rd one is still logging them; waiting a bit to see if the backlog flushes through...
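For anyone following along, the red/yellow/green status can be checked from inside an ES pod. A minimal sketch, assuming the admin certs are mounted at the standard paths for OpenShift aggregated logging (the function name is mine):

```shell
# check_es_health: print the Elasticsearch cluster health (status plus
# shard counts) using the admin certs mounted inside the ES pod.
# $1 is the name of a logging-es pod.
check_es_health() {
  oc exec -n logging "$1" -- \
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         'https://localhost:9200/_cluster/health?pretty'
}

# Example (pod name is illustrative):
# check_es_health logging-es-v1i1axoy-4-c5vp8
```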
(In reply to Thom Carlin from comment #14)
> Correcting issues with replicas switched the cluster status to green
>
> This eventually stopped the flood of "buffer flush took longer time than
> slow_flush_log_threshold" from 2 of the 3 pods.
>
> The 3rd one is still coming....awaiting a bit to see if the backlog flushes
> through...

How long does it take to do the buffer flush? 25 seconds? 250 seconds? Is the number decreasing?
2017-08-17 15:14:15 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=246.001323673 slow_flush_log_threshold=20.0
2017-08-17 15:16:31 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=149.990064719 slow_flush_log_threshold=20.0
2017-08-17 15:18:26 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=250.099273213 slow_flush_log_threshold=20.0
2017-08-17 15:19:06 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=153.89762791 slow_flush_log_threshold=20.0
2017-08-17 15:21:44 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.909664502 slow_flush_log_threshold=20.0
2017-08-17 15:22:32 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=244.299740517 slow_flush_log_threshold=20.0
2017-08-17 15:24:28 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=162.306563886 slow_flush_log_threshold=20.0
2017-08-17 15:26:51 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=257.803858187 slow_flush_log_threshold=20.0
2017-08-17 15:26:54 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=144.998355812 slow_flush_log_threshold=20.0
2017-08-17 15:29:31 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=155.90573981 slow_flush_log_threshold=20.0
2017-08-17 15:31:12 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.406557543 slow_flush_log_threshold=20.0
2017-08-17 15:32:02 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=150.295175093 slow_flush_log_threshold=20.0
2017-08-17 15:34:40 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.298758368 slow_flush_log_threshold=20.0
2017-08-17 15:35:24 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=250.604503891 slow_flush_log_threshold=20.0
2017-08-17 15:36:26 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=104.798759375 slow_flush_log_threshold=20.0
2017-08-17 15:38:37 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=192.095328529 slow_flush_log_threshold=20.0
2017-08-17 15:38:53 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=144.984546163 slow_flush_log_threshold=20.0
2017-08-17 15:41:28 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=154.500503306 slow_flush_log_threshold=20.0
2017-08-17 15:42:35 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=236.900520096 slow_flush_log_threshold=20.0
2017-08-17 15:44:03 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=153.003680078 slow_flush_log_threshold=20.0
2017-08-17 15:46:40 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.308895855 slow_flush_log_threshold=20.0
2017-08-17 15:46:50 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=253.894442444 slow_flush_log_threshold=20.0
2017-08-17 15:49:17 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=155.594732884 slow_flush_log_threshold=20.0
2017-08-17 15:50:51 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=239.999839277 slow_flush_log_threshold=20.0
2017-08-17 15:51:16 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=117.19267143 slow_flush_log_threshold=20.0
2017-08-17 15:53:32 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=135.401384603 slow_flush_log_threshold=20.0
2017-08-17 15:54:18 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=206.291535569 slow_flush_log_threshold=20.0
2017-08-17 15:56:00 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=146.192776582 slow_flush_log_threshold=20.0
2017-08-17 15:58:34 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=153.302029948 slow_flush_log_threshold=20.0
2017-08-17 15:58:38 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.603383885 slow_flush_log_threshold=20.0
2017-08-17 16:01:11 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=155.401761219 slow_flush_log_threshold=20.0
2017-08-17 16:02:58 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.600908303 slow_flush_log_threshold=20.0
2017-08-17 16:03:45 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.302507632 slow_flush_log_threshold=20.0
2017-08-17 16:06:25 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=158.702599969 slow_flush_log_threshold=20.0
2017-08-17 16:07:16 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=256.806270411 slow_flush_log_threshold=20.0
2017-08-17 16:08:59 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.804790642 slow_flush_log_threshold=20.0
2017-08-17 16:11:34 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=154.100538274 slow_flush_log_threshold=20.0
2017-08-17 16:11:34 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=257.194641157 slow_flush_log_threshold=20.0
2017-08-17 16:14:11 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.104507803 slow_flush_log_threshold=20.0
2017-08-17 16:15:51 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=255.596152934 slow_flush_log_threshold=20.0
2017-08-17 16:16:45 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.898364436 slow_flush_log_threshold=20.0
2017-08-17 16:19:20 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=153.394335116 slow_flush_log_threshold=20.0
2017-08-17 16:20:00 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=248.200898436 slow_flush_log_threshold=20.0
2017-08-17 16:21:47 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=146.108346553 slow_flush_log_threshold=20.0

As you can see, it's fluctuating.
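To quantify whether the flush times are decreasing, one option is to aggregate the elapsed_time values out of the saved pod logs. A rough sketch (the function name and log filename are mine, not part of the original thread):

```shell
# summarize_slow_flushes: aggregate fluentd "buffer flush took longer"
# warnings from a saved log file, reporting per-plugin average and max
# flush time. $1 is whatever file the pod logs were saved to.
summarize_slow_flushes() {
  grep -o 'plugin_id="[^"]*" elapsed_time=[0-9.]*' "$1" \
    | awk -F'[="]' '{sum[$3] += $5; n[$3]++; if ($5 > max[$3]) max[$3] = $5}
        END {for (p in sum)
               printf "%s avg=%.1fs max=%.1fs n=%d\n", p, sum[p]/n[p], max[p], n[p]}'
}

# Example (pod name is illustrative):
# oc logs -n logging logging-fluentd-2nkwz > fluentd.log
# summarize_slow_flushes fluentd.log
```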
top - 22:13:43 up 6 days, 8:14, 0 users, load average: 0.43, 0.75, 0.96
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  3.2 us,  3.2 sy,  0.0 ni, 93.4 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 20393808 total,  6646908 free,  5276676 used,  8470224 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13572440 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 1000050+  20   0 9935.5m 3.432g  20288 S  13.3 17.6  87:27.34 java
   13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
  189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
 2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
 4792 1000050+  20   0   51752   1828   1400 R   0.0  0.0   0:00.01 top

top - 22:13:46 up 6 days, 8:14, 0 users, load average: 0.55, 0.77, 0.96
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  8.6 us,  7.3 sy,  0.0 ni, 83.6 id,  0.1 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 20393808 total,  6649796 free,  5276532 used,  8467480 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13572600 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 1000050+  20   0 9935.5m 3.432g  20288 S  28.9 17.6  87:28.21 java
   13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
  189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
 2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
 4792 1000050+  20   0   51760   1880   1432 R   0.0  0.0   0:00.01 top

top - 22:13:49 up 6 days, 8:14, 0 users, load average: 0.55, 0.77, 0.96
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  5.8 us,  5.1 sy,  0.0 ni, 88.8 id,  0.1 wa,  0.0 hi,  0.2 si,  0.1 st
KiB Mem : 20393808 total,  6648944 free,  5277148 used,  8467716 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13571972 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 1000050+  20   0 9935.5m 3.432g  20288 S  23.3 17.6  87:28.91 java
 4792 1000050+  20   0   51760   1880   1432 R   0.3  0.0   0:00.02 top
   13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
  189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
 2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh

host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
<ip_address>          18849             0          0           0             50
2017-08-17 16:24:20 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.697979682 slow_flush_log_threshold=20.0
2017-08-17 16:26:55 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=157.90796225 slow_flush_log_threshold=20.0
2017-08-17 16:28:40 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=258.902158075 slow_flush_log_threshold=20.0
2017-08-17 16:29:30 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=154.099400386 slow_flush_log_threshold=20.0
2017-08-17 16:32:08 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=157.300068405 slow_flush_log_threshold=20.0
2017-08-17 16:32:55 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=253.20043528 slow_flush_log_threshold=20.0
2017-08-17 16:34:42 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.50548935 slow_flush_log_threshold=20.0
2017-08-17 16:37:13 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=257.703941045 slow_flush_log_threshold=20.0
2017-08-17 16:37:16 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=152.298880483 slow_flush_log_threshold=20.0
2017-08-17 16:39:48 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=151.301171823 slow_flush_log_threshold=20.0
2017-08-17 16:41:33 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=257.70086703 slow_flush_log_threshold=20.0
2017-08-17 16:42:20 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=151.194774041 slow_flush_log_threshold=20.0
2017-08-17 16:44:58 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=156.099872266 slow_flush_log_threshold=20.0
2017-08-17 16:45:46 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=252.093410693 slow_flush_log_threshold=20.0
2017-08-17 16:47:25 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=145.899483185 slow_flush_log_threshold=20.0
2017-08-17 16:49:57 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=150.602060507 slow_flush_log_threshold=20.0
2017-08-17 16:50:00 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=251.99061783 slow_flush_log_threshold=20.0
2017-08-17 16:50:54 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:fc7344" elapsed_time=57.495522521 slow_flush_log_threshold=20.0
2017-08-17 16:53:20 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=199.403810753 slow_flush_log_threshold=20.0
2017-08-17 16:56:29 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=188.000272064 slow_flush_log_threshold=20.0
2017-08-17 16:59:38 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=187.291888091 slow_flush_log_threshold=20.0
2017-08-17 17:02:26 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=167.191999463 slow_flush_log_threshold=20.0
2017-08-17 17:04:56 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=148.805876286 slow_flush_log_threshold=20.0
2017-08-17 17:07:39 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=161.694220795 slow_flush_log_threshold=20.0
2017-08-17 17:10:44 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=184.203056929 slow_flush_log_threshold=20.0
2017-08-17 17:13:53 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=187.301750413 slow_flush_log_threshold=20.0
2017-08-17 17:16:56 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=182.298950691 slow_flush_log_threshold=20.0
2017-08-17 17:20:07 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=189.898434071 slow_flush_log_threshold=20.0
2017-08-17 17:23:16 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=187.900091319 slow_flush_log_threshold=20.0
2017-08-17 17:26:17 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=180.296828518 slow_flush_log_threshold=20.0
2017-08-17 17:29:23 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=184.399529706 slow_flush_log_threshold=20.0
2017-08-17 17:32:30 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=186.198004757 slow_flush_log_threshold=20.0
2017-08-17 17:35:40 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=188.598763135 slow_flush_log_threshold=20.0
2017-08-17 17:38:42 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=181.101618847 slow_flush_log_threshold=20.0
2017-08-17 17:41:46 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=183.000149539 slow_flush_log_threshold=20.0
2017-08-17 17:44:59 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=191.59765499 slow_flush_log_threshold=20.0
2017-08-17 17:48:10 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=189.896203059 slow_flush_log_threshold=20.0
2017-08-17 17:51:11 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=179.903857789 slow_flush_log_threshold=20.0
2017-08-17 17:54:21 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=189.501076467 slow_flush_log_threshold=20.0
2017-08-17 17:57:26 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=183.298757121 slow_flush_log_threshold=20.0
2017-08-17 18:00:29 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=182.499254473 slow_flush_log_threshold=20.0
2017-08-17 18:03:45 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=194.406760847 slow_flush_log_threshold=20.0
2017-08-17 18:06:53 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=186.698915934 slow_flush_log_threshold=20.0
2017-08-17 18:09:56 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=182.596800862 slow_flush_log_threshold=20.0
2017-08-17 18:13:09 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=191.899211195 slow_flush_log_threshold=20.0
2017-08-17 18:16:14 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=183.792888629 slow_flush_log_threshold=20.0

At the moment, the range appears to be smaller.
Can you provide the ES dc configurations?

oc get -n logging dc -l component=es -o yaml

and also the following information from all of the es pods:

for espod in $( oc get -n logging pod -l component=es -o jsonpath='{.items[*].metadata.name}' ) ; do
    oc logs -n logging $espod > $espod.log
    oc exec $espod -- cat /elasticsearch/logging-es/logs/logging-es.log >> $espod.log
    oc exec $espod -- top -b -n 3 > $espod.top
done

How many ES nodes/pods are there?
apiVersion: v1
items:
- apiVersion: v1
  kind: DeploymentConfig
  metadata:
    creationTimestamp: 2017-06-08T11:57:01Z
    generation: 21
    labels:
      component: es
      deployment: logging-es-v1i1axoy
      logging-infra: elasticsearch
      provider: openshift
    name: logging-es-v1i1axoy
    namespace: logging
    resourceVersion: "12578995"
    selfLink: /oapi/v1/namespaces/logging/deploymentconfigs/logging-es-v1i1axoy
    uid: 94ed372d-4c41-11e7-8da6-001a4a160152
  spec:
    replicas: 1
    selector:
      component: es
      deployment: logging-es-v1i1axoy
      logging-infra: elasticsearch
      provider: openshift
    strategy:
      activeDeadlineSeconds: 21600
      recreateParams:
        timeoutSeconds: 600
      resources: {}
      type: Recreate
    template:
      metadata:
        creationTimestamp: null
        labels:
          component: es
          deployment: logging-es-v1i1axoy
          logging-infra: elasticsearch
          provider: openshift
        name: logging-es-v1i1axoy
      spec:
        containers:
        - env:
          - name: NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: KUBERNETES_TRUST_CERT
            value: "true"
          - name: SERVICE_DNS
            value: logging-es-cluster
          - name: CLUSTER_NAME
            value: logging-es
          - name: INSTANCE_RAM
            value: 8Gi
          - name: NODE_QUORUM
            value: "1"
          - name: RECOVER_AFTER_NODES
            value: "0"
          - name: RECOVER_EXPECTED_NODES
            value: "1"
          - name: RECOVER_AFTER_TIME
            value: 5m
          image: registry.access.redhat.com/openshift3/logging-elasticsearch:3.5.0
          imagePullPolicy: Always
          name: elasticsearch
          ports:
          - containerPort: 9200
            name: restapi
            protocol: TCP
          - containerPort: 9300
            name: cluster
            protocol: TCP
          resources:
            limits:
              memory: 8Gi
            requests:
              memory: 512Mi
          terminationMessagePath: /dev/termination-log
          volumeMounts:
          - mountPath: /etc/elasticsearch/secret
            name: elasticsearch
            readOnly: true
          - mountPath: /usr/share/java/elasticsearch/config
            name: elasticsearch-config
            readOnly: true
          - mountPath: /elasticsearch/persistent
            name: elasticsearch-storage
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        securityContext:
          supplementalGroups:
          - 65534
        serviceAccount: aggregated-logging-elasticsearch
        serviceAccountName: aggregated-logging-elasticsearch
        terminationGracePeriodSeconds: 30
        volumes:
        - name: elasticsearch
          secret:
            defaultMode: 420
            secretName: logging-elasticsearch
        - configMap:
            defaultMode: 420
            name: logging-elasticsearch
          name: elasticsearch-config
        - name: elasticsearch-storage
          persistentVolumeClaim:
            claimName: logging-es-0
    test: false
    triggers:
    - type: ConfigChange
  status:
    availableReplicas: 0
    conditions:
    - lastTransitionTime: 2017-06-08T11:57:01Z
      lastUpdateTime: 2017-06-08T11:57:01Z
      message: Deployment config does not have minimum availability.
      status: "False"
      type: Available
    - lastTransitionTime: 2017-08-17T12:28:15Z
      lastUpdateTime: 2017-08-17T12:28:26Z
      message: replication controller "logging-es-v1i1axoy-5" successfully rolled out
      reason: NewReplicationControllerAvailable
      status: "True"
      type: Progressing
    details:
      causes:
      - type: Manual
      message: manual change
    latestVersion: 5
    observedGeneration: 21
    readyReplicas: 1
    replicas: 1
    unavailableReplicas: 1
    updatedReplicas: 1
kind: List
metadata: {}
resourceVersion: ""
selfLink: ""
-------------------------------------------------------------------------------
top - 23:14:12 up 6 days, 9:14, 0 users, load average: 1.00, 0.79, 0.67
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  3.2 us,  3.2 sy,  0.0 ni, 93.3 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 20393808 total,  6430736 free,  5282352 used,  8680720 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13558596 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 1000050+  20   0 9949.0m 3.432g  20276 S  20.0 17.6 102:51.36 java
   13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
  189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
 2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
 5155 1000050+  20   0   51752   1824   1400 R   0.0  0.0   0:00.01 top

top - 23:14:15 up 6 days, 9:14, 0 users, load average: 1.00, 0.80, 0.67
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  4.4 us,  4.1 sy,  0.1 ni, 91.1 id,  0.1 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 20393808 total,  6430992 free,  5281836 used,  8680980 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13559084 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 1000050+  20   0 9949.0m 3.432g  20276 S  14.3 17.6 102:51.79 java
   13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
  189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
 2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
 5155 1000050+  20   0   51760   1876   1432 R   0.0  0.0   0:00.01 top

top - 23:14:18 up 6 days, 9:14, 0 users, load average: 0.92, 0.78, 0.67
Tasks:   5 total,   1 running,   3 sleeping,   0 stopped,   1 zombie
%Cpu(s):  7.7 us,  5.2 sy,  0.1 ni, 86.4 id,  0.3 wa,  0.0 hi,  0.3 si,  0.1 st
KiB Mem : 20393808 total,  6427620 free,  5282520 used,  8683668 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13558384 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 1000050+  20   0 9949.0m 3.432g  20276 S  31.7 17.6 102:52.74 java
   13 1000050+  20   0       0      0      0 Z   0.0  0.0   0:00.01 sh
  189 1000050+  20   0   11764   1836   1448 S   0.0  0.0   0:00.21 sh
 2158 1000050+  20   0   11764   1860   1460 S   0.0  0.0   0:00.34 sh
 5155 1000050+  20   0   51760   1876   1432 R   0.0  0.0   0:00.01 top
-------------------------------------------------------------------------------
The log is in a private attachment.

There is 1 ES node/pod.
I'm not sure what else to look for at the moment; do any of you on the CC list have any ideas?

What storage is backing the PV? Is it a local disk? NFS?
Currently, NFS. We recently added CNS to the OpenShift cluster, so I could switch it over...
Maybe NFS isn't fast enough to keep up with Elasticsearch (NFS is not supported as the backing filesystem for Elasticsearch), but if so I would expect to see a higher queue size. Keep running this curl periodically to see if the queue size increases:

oc exec -n logging $espod -- \
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
    --key /etc/elasticsearch/secret/admin-key \
    https://localhost:9200/_cat/thread_pool?v\&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize

Watch the "bulk.queue" and "bulk.active" fields.
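Rather than running the curl by hand, the check can be wrapped in a small polling loop. A sketch, assuming $espod holds an ES pod name as in the earlier collection loop (the function name and interval are mine):

```shell
# poll_bulk_stats: print a timestamp followed by the ES bulk thread-pool
# stats, $1 times, sleeping $2 seconds between samples. Relies on $espod
# being set to the name of a logging-es pod.
poll_bulk_stats() {
  samples=$1; delay=$2; i=0
  while [ "$i" -lt "$samples" ]; do
    date
    oc exec -n logging "$espod" -- \
      curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
           --key /etc/elasticsearch/secret/admin-key \
           'https://localhost:9200/_cat/thread_pool?v&h=host,bulk.completed,bulk.rejected,bulk.queue,bulk.active,bulk.queueSize'
    sleep "$delay"
    i=$((i + 1))
  done
}

# Example: 12 samples, 5 minutes apart
# poll_bulk_stats 12 300
```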
Ack, will do tomorrow and let you know. Thanks for all your help today.
Fri Aug 18 07:49:57 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
<ip_address>         104183             0          0           0             50

Fri Aug 18 07:55:07 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
<ip_address>         104724             0          0           0             50

Fri Aug 18 08:00:19 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
<ip_address>         105266             0          0           0             50

Fri Aug 18 08:06:57 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
<ip_address>         105955             0          0           0             50

Fri Aug 18 08:12:53 EDT 2017
host         bulk.completed bulk.rejected bulk.queue bulk.active bulk.queueSize
<ip_address>         106569             0          0           0             50
Rechecking all the nodes (including the master), the last "[warn]: buffer flush took longer time than slow_flush_log_threshold" happened last night at 18:46:01.

At the end, the buffer flush warnings suddenly dropped off:

2017-08-17 18:41:09 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=187.103059266 slow_flush_log_threshold=20.0
2017-08-17 18:44:19 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=188.895863691 slow_flush_log_threshold=20.0
2017-08-17 18:46:01 -0400 [warn]: buffer flush took longer time than slow_flush_log_threshold: plugin_id="object:10be6d0" elapsed_time=101.59663009 slow_flush_log_threshold=20.0 <- last error seen

So it may have had a huge backlog that took time to process through.

Closing this one as NOTABUG, since fixing the underlying issue resolved the reported problem.