On starter-ca-central-1, v3.7.0-0.178.2, we encountered the following error in a fluentd pod: 2017-11-08 01:24:46 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-11-08 01:24:47 +0000 error_class="MessagePack::MalformedFormatError" error="invalid byte" plugin_id="object:1387ed4" 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:123:in `each' 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:123:in `block in msgpack_each' 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:71:in `open' 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:120:in `msgpack_each' 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5.1/lib/fluent/plugin/out_elasticsearch.rb:296:in `write_objects' 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:490:in `write' 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:354:in `write_chunk' 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:333:in `pop' 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:342:in `try_flush' 2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:149:in `run' 2017-11-08 01:24:47 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-11-08 01:24:49 +0000 error_class="MessagePack::MalformedFormatError" error="invalid byte" plugin_id="object:1387ed4" 2017-11-08 01:24:47 +0000 [warn]: suppressed same stacktrace 2017-11-08 01:24:51 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-11-08 01:24:53 +0000 error_class="MessagePack::MalformedFormatError" error="invalid byte" plugin_id="object:1387ed4" 2017-11-08 01:24:51 +0000 [warn]: suppressed same stacktrace 2017-11-08 01:24:53 +0000 [error]: fluentd main process died unexpectedly. restarting.
We found the bad on-disk queue entry, but using msgpack-inspect [1] did not show anything bad about the data. Removing that file does allow the rest of the data to be indexed. Debugging this will require understanding why msgpack-inspect does not report the same error condition as fluentd does. The mystery deepens... [1] gem install msgpack-inspect
Looks like some sort of corruption. The msgpack-inspect tool was used to dump the data from the problematic buffer file: { "key": { "format": "fixstr", "header": "0xaf", "length": 15, "data": "0x53454c494e55585f434f92ce59f6db", "value": "SELINUX_CO\x92\xCEY\xF6\xDB" }, "value": { "format": "never_used", "header": "0xc1", , "data": "0xc1" } }, The value of the "key" should be SELINUX_CONTEXT, but instead it looks as though the beginning of the buffer was inserted. The value 0xc1 should never be used as a msgpack type, and it appears that this is causing the MessagePack::MalformedFormatError exception. Could be a race condition or some other sort of corruption. msgpack encodes a buffer chunk as an array consisting of a timestamp - either unix time_t 32-bit integer (ce means a 32-bit unsigned int value follows) # od -t x1 buffer-output-es-ops-config.output_ops_tag.b55d93458837dabc1.log | head -2 0000000 92 ce 5a 04 c4 e7 87 a7 73 79 73 74 65 6d 64 82 0000020 a1 74 8e a7 42 4f 4f 54 5f 49 44 d9 20 61 35 31 0x5a04c4e7 == 1510261991 == Thu Nov 9 21:13:11 UTC 2017 - or a 64-bit floating point value with microseconds as the decimal). e.g. # od -t x1 buffer-output-es-ops-config.output_ops_tag.b55d84bc61d0870e5.log | head -2 0000000 92 cb 41 d6 80 f4 42 49 26 ea 86 a8 68 6f 73 74 0000020 6e 61 6d 65 bc 69 70 2d 31 37 32 2d 31 38 2d 34 0x41d680f4424926ea == 1510199561.1430001 == Thu Nov 9 03:52:41.1430001 UTC 2017
https://github.com/msgpack/msgpack-ruby/blob/7d8317c5dbe77a8f8f2ff95612e5a99445b17e0a/ext/msgpack/unpacker.c#L628 https://github.com/msgpack/msgpack-ruby/blob/7d8317c5dbe77a8f8f2ff95612e5a99445b17e0a/doclib/msgpack/unpacker.rb#L81
*** Bug 1567265 has been marked as a duplicate of this bug. ***
This will be resolved by Fixed with fluent-plugin-elasticsearch v0.14.0 Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1562004
@Jeff, All buffer file are swallowed, but I couldn't find any file named "dlq", Could you confirm my test step ? logging-fluentd/images/v3.7.44-4 # gem list |grep elastic elasticsearch (2.0.2) elasticsearch-api (2.0.2) elasticsearch-transport (2.0.2) fluent-plugin-elasticsearch (1.14.0 1. Turn off fluentd, put the attached archive log under buffer direct /var/lib/fluentd and clear the .pos files. There are 34 buffer files. 2. Turn on fluentd. 3. Waiting for a while, check the index and buffer file. Index are created and all buffer file are swallowed.
the dlq hander have been revert to use router.emit_error_event instead. The fix is not in logging-fluentd/images/v3.7.44-4
Really need fluent-plugin-elasticsearch 1.16.1 along with https://github.com/openshift/origin-aggregated-logging/pull/1180
Verified on openshift3-logging-fluentd-v3.7.61-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2337
*** Bug 1714716 has been marked as a duplicate of this bug. ***