Bug 1511116 - error_class="MessagePack::MalformedFormatError" error="invalid byte" encountered in fluentd pod
Summary: error_class="MessagePack::MalformedFormatError" error="invalid byte" encounte...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.7.0
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
: 3.7.z
Assignee: Jeff Cantrill
QA Contact: Qiaoling Tang
URL:
Whiteboard:
: 1567265 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-11-08 17:43 UTC by Peter Portante
Modified: 2021-12-10 15:23 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Corrupt File Chunk Buffers Consequence: Fluentd is blocked processing message until the buffer is removed Fix: Introduce handler to remove corrupt messages from the processing work flow Result: Corrupt messages are sent to a 'dead letter queue' while continuing to process other messages and the pipleline is no longer blocked
Clone Of:
Environment:
Last Closed: 2018-08-09 22:14:04 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2337 0 None None None 2018-08-09 22:15:05 UTC

Description Peter Portante 2017-11-08 17:43:01 UTC
On starter-ca-central-1, v3.7.0-0.178.2, we encountered the following error in a fluentd pod:

2017-11-08 01:24:46 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-11-08 01:24:47 +0000 error_class="MessagePack::MalformedFormatError" error="invalid byte" plugin_id="object:1387ed4"
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:123:in `each'
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:123:in `block in msgpack_each'
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/plugin/buf_file.rb:71:in `open'
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:120:in `msgpack_each'
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-elasticsearch-1.9.5.1/lib/fluent/plugin/out_elasticsearch.rb:296:in `write_objects'
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:490:in `write'
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:354:in `write_chunk'
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/buffer.rb:333:in `pop'
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:342:in `try_flush'
  2017-11-08 01:24:46 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.39/lib/fluent/output.rb:149:in `run'
2017-11-08 01:24:47 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-11-08 01:24:49 +0000 error_class="MessagePack::MalformedFormatError" error="invalid byte" plugin_id="object:1387ed4"
  2017-11-08 01:24:47 +0000 [warn]: suppressed same stacktrace
2017-11-08 01:24:51 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2017-11-08 01:24:53 +0000 error_class="MessagePack::MalformedFormatError" error="invalid byte" plugin_id="object:1387ed4"
  2017-11-08 01:24:51 +0000 [warn]: suppressed same stacktrace
2017-11-08 01:24:53 +0000 [error]: fluentd main process died unexpectedly. restarting.

Comment 1 Peter Portante 2017-11-09 04:17:20 UTC
We found the bad on-disk queue entry, but using msgpack-inspect [1] did not show anything bad about the data.

Removing that file does allow the rest of the data to be indexed.

Debugging this will require understanding why msgpack-inspect does not report the same error condition as fluentd does.

The mystery deepens...

[1] gem install msgpack-inspect

Comment 3 Rich Megginson 2017-11-09 21:56:41 UTC
Looks like some sort of corruption.  The msgpack-inspect tool was used to dump the data from the problematic buffer file:

{ "key":
    { "format": "fixstr",
      "header": "0xaf",
      "length": 15,
      "data": "0x53454c494e55585f434f92ce59f6db",
      "value": "SELINUX_CO\x92\xCEY\xF6\xDB"
    },
  "value":
    { "format": "never_used",
      "header": "0xc1",
,
      "data": "0xc1"
    }
},

The value of the "key" should be SELINUX_CONTEXT, but instead it looks as though the beginning of the buffer was inserted.  The value 0xc1 should never be used as a msgpack type, and it appears that this is causing the MessagePack::MalformedFormatError exception.  Could be a race condition or some other sort of corruption.

msgpack encodes a buffer chunk as an array consisting of a timestamp - either unix time_t 32-bit integer (ce means a 32-bit unsigned int value follows)

# od -t x1 buffer-output-es-ops-config.output_ops_tag.b55d93458837dabc1.log | head -2
0000000 92 ce 5a 04 c4 e7 87 a7 73 79 73 74 65 6d 64 82
0000020 a1 74 8e a7 42 4f 4f 54 5f 49 44 d9 20 61 35 31

0x5a04c4e7 == 1510261991 == Thu Nov  9 21:13:11 UTC 2017

- or a 64-bit floating point value with microseconds as the decimal).  e.g.

# od -t x1 buffer-output-es-ops-config.output_ops_tag.b55d84bc61d0870e5.log | head -2
0000000 92 cb 41 d6 80 f4 42 49 26 ea 86 a8 68 6f 73 74
0000020 6e 61 6d 65 bc 69 70 2d 31 37 32 2d 31 38 2d 34

0x41d680f4424926ea == 1510199561.1430001 == Thu Nov  9 03:52:41.1430001 UTC 2017

Comment 5 Jeff Cantrill 2018-04-13 17:27:06 UTC
*** Bug 1567265 has been marked as a duplicate of this bug. ***

Comment 6 Jeff Cantrill 2018-04-13 17:27:55 UTC
This will be resolved by Fixed with fluent-plugin-elasticsearch v0.14.0  Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1562004

Comment 8 Anping Li 2018-04-19 11:20:47 UTC
@Jeff, All buffer file are swallowed, but I couldn't find any file named "dlq", Could you confirm my test step ?

logging-fluentd/images/v3.7.44-4

# gem list |grep elastic
elasticsearch (2.0.2)
elasticsearch-api (2.0.2)
elasticsearch-transport (2.0.2)
fluent-plugin-elasticsearch (1.14.0



1.  Turn off fluentd, put the attached archive log under buffer direct /var/lib/fluentd and clear the .pos files. 
There are 34 buffer files.
 
2. Turn on fluentd.

3. Waiting for a while, check the index and buffer file. 
Index are created and all buffer file are swallowed.

Comment 9 Anping Li 2018-04-20 15:12:25 UTC
the dlq hander have been revert to use router.emit_error_event instead. The fix is not in logging-fluentd/images/v3.7.44-4

Comment 12 Jeff Cantrill 2018-05-18 18:50:03 UTC
Really need fluent-plugin-elasticsearch 1.16.1 along with https://github.com/openshift/origin-aggregated-logging/pull/1180

Comment 14 Qiaoling Tang 2018-08-02 04:41:55 UTC
Verified on openshift3-logging-fluentd-v3.7.61-1

Comment 16 errata-xmlrpc 2018-08-09 22:14:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2337

Comment 17 Jeff Cantrill 2019-05-28 17:40:49 UTC
*** Bug 1714716 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.