Bug 1511116
Summary: | error_class="MessagePack::MalformedFormatError" error="invalid byte" encountered in fluentd pod | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Peter Portante <pportant> |
Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
Status: | CLOSED ERRATA | QA Contact: | Qiaoling Tang <qitang> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.7.0 | CC: | anli, aos-bugs, jcantril, pweil, rkant, rmeggins, smunilla |
Target Milestone: | --- | ||
Target Release: | 3.7.z | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: Corrupt File Chunk Buffers
Consequence: Fluentd is blocked processing message until the buffer is removed
Fix: Introduce handler to remove corrupt messages from the processing work flow
Result: Corrupt messages are sent to a 'dead letter queue' while continuing to process other messages and the pipleline is no longer blocked
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2018-08-09 22:14:04 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Peter Portante
2017-11-08 17:43:01 UTC
We found the bad on-disk queue entry, but using msgpack-inspect [1] did not show anything bad about the data. Removing that file does allow the rest of the data to be indexed. Debugging this will require understanding why msgpack-inspect does not report the same error condition as fluentd does. The mystery deepens... [1] gem install msgpack-inspect Looks like some sort of corruption. The msgpack-inspect tool was used to dump the data from the problematic buffer file: { "key": { "format": "fixstr", "header": "0xaf", "length": 15, "data": "0x53454c494e55585f434f92ce59f6db", "value": "SELINUX_CO\x92\xCEY\xF6\xDB" }, "value": { "format": "never_used", "header": "0xc1", , "data": "0xc1" } }, The value of the "key" should be SELINUX_CONTEXT, but instead it looks as though the beginning of the buffer was inserted. The value 0xc1 should never be used as a msgpack type, and it appears that this is causing the MessagePack::MalformedFormatError exception. Could be a race condition or some other sort of corruption. msgpack encodes a buffer chunk as an array consisting of a timestamp - either unix time_t 32-bit integer (ce means a 32-bit unsigned int value follows) # od -t x1 buffer-output-es-ops-config.output_ops_tag.b55d93458837dabc1.log | head -2 0000000 92 ce 5a 04 c4 e7 87 a7 73 79 73 74 65 6d 64 82 0000020 a1 74 8e a7 42 4f 4f 54 5f 49 44 d9 20 61 35 31 0x5a04c4e7 == 1510261991 == Thu Nov 9 21:13:11 UTC 2017 - or a 64-bit floating point value with microseconds as the decimal). e.g. # od -t x1 buffer-output-es-ops-config.output_ops_tag.b55d84bc61d0870e5.log | head -2 0000000 92 cb 41 d6 80 f4 42 49 26 ea 86 a8 68 6f 73 74 0000020 6e 61 6d 65 bc 69 70 2d 31 37 32 2d 31 38 2d 34 0x41d680f4424926ea == 1510199561.1430001 == Thu Nov 9 03:52:41.1430001 UTC 2017 https://github.com/msgpack/msgpack-ruby/blob/7d8317c5dbe77a8f8f2ff95612e5a99445b17e0a/ext/msgpack/unpacker.c#L628 https://github.com/msgpack/msgpack-ruby/blob/7d8317c5dbe77a8f8f2ff95612e5a99445b17e0a/doclib/msgpack/unpacker.rb#L81 *** Bug 1567265 has been marked as a duplicate of this bug. *** This will be resolved by Fixed with fluent-plugin-elasticsearch v0.14.0 Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1562004 @Jeff, All buffer file are swallowed, but I couldn't find any file named "dlq", Could you confirm my test step ? logging-fluentd/images/v3.7.44-4 # gem list |grep elastic elasticsearch (2.0.2) elasticsearch-api (2.0.2) elasticsearch-transport (2.0.2) fluent-plugin-elasticsearch (1.14.0 1. Turn off fluentd, put the attached archive log under buffer direct /var/lib/fluentd and clear the .pos files. There are 34 buffer files. 2. Turn on fluentd. 3. Waiting for a while, check the index and buffer file. Index are created and all buffer file are swallowed. the dlq hander have been revert to use router.emit_error_event instead. The fix is not in logging-fluentd/images/v3.7.44-4 Really need fluent-plugin-elasticsearch 1.16.1 along with https://github.com/openshift/origin-aggregated-logging/pull/1180 Verified on openshift3-logging-fluentd-v3.7.61-1 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2337 *** Bug 1714716 has been marked as a duplicate of this bug. *** |