Bug 1593310

Summary: Fluent pipeline stuck because records in request do not equal response
Product: OpenShift Container Platform
Reporter: Jeff Cantrill <jcantril>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.9.0
CC: anli, aos-bugs, juzhao, rmeggins, smunilla
Target Milestone: ---
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: fluent-plugin-elasticsearch improperly handled bookkeeping of the records being submitted. Consequence: Fluentd was stuck processing a chunk even though there was a valid request and response. Fix: Properly account for the records submitted to Elasticsearch (a minimal sketch of the idea follows the metadata block below). Result: The pipeline is no longer stuck.
Story Points: ---
Clone Of: 1593297
Environment:
Last Closed: 2018-07-18 09:18:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1593297, 1593312
Bug Blocks: 1562004
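
To make the Doc Text concrete, here is a minimal standalone sketch of the bookkeeping idea: count only the records that are actually serialized into the bulk request body, then compare that count against the number of items in the Elasticsearch bulk response. This is an illustration under that assumption, not the fluent-plugin-elasticsearch source; the names BulkBookkeeper, record_submitted and verify! are hypothetical.

# Minimal sketch (not the plugin source) of the record bookkeeping idea.
class RecordCountMismatch < StandardError; end

class BulkBookkeeper
  def initialize
    @submitted = 0
  end

  # Call once per record appended to the bulk body. Records rejected before
  # serialization (e.g. invalid UTF-8) must not be counted, otherwise the
  # submitted count drifts ahead of what Elasticsearch actually received.
  def record_submitted
    @submitted += 1
  end

  # Compare against the items in the bulk response; raise on mismatch so the
  # chunk is retried instead of being silently committed.
  def verify!(bulk_response)
    returned = bulk_response.fetch('items', []).size
    return if returned == @submitted

    raise RecordCountMismatch,
          "The number of records submitted #{@submitted} do not match " \
          "the number returned #{returned}. Unable to process bulk response."
  end
end

# Example: two records make it into the bulk body and the response has two items.
bookkeeper = BulkBookkeeper.new
2.times { bookkeeper.record_submitted }
bookkeeper.verify!({ 'items' => [{ 'index' => { 'status' => 201 } },
                                 { 'index' => { 'status' => 201 } }] })
puts 'counts match, chunk can be committed'

If records dropped before serialization (for example the invalid UTF-8 record in the log below) are still counted as submitted, the two numbers can never reconcile and the chunk is retried forever, which is consistent with the repeating 6837 vs. 6835 mismatch shown in the Description.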

Description Jeff Cantrill 2018-06-20 14:12:58 UTC
+++ This bug was initially created as a clone of Bug #1593297 +++

The buffer queue is still blocked in openshift3/logging-fluentd/images/v3.9.31-2.

Steps:
1. Put the attached buffer file under /var/lib/fluentd
2. oc set env ds/logging-fluentd ENABLE_UTF8_FILTER=true
   oc set env ds/logging-fluentd DEBUG=true VERBOSE=true DEBUG_TRANSPORTER=true TRANSPORTER_LOG_LEVEL=debug
   oc set env ds/logging-fluentd BUFFER_QUEUE_LIMIT=40  FILE_BUFFER_LIMIT=320Mi
3. Watch the buffer queue size:
# ls /var/lib/fluentd/buffer-output-es-config.output_tag.* |wc -l
36

4. Watch the logs:
    
2018-06-19 15:30:32 +0800 [warn]: fluent/root_agent.rb:187:emit_error_event: dump an error event: error_class=ArgumentError error="invalid byte sequence in UTF-8" location="/usr/share/gems/gems/fluent-plugin-elasticsearch-1.16.1/lib/fluent/plugin/out_elasticsearch.rb:383:in `downcase'" tag="output_tag" time=1522079957 record={"systemd"=>{"t"=>{"BOOT_ID"=>"f4becb7fb2ae42b8abb3076654c926cc", "CAP_EFFECTIVE"=>"1fffffffff", "CMDLINE"=>"/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --seccomp-profile=/etc/docker/seccomp.json --selinux-enabled --log-driver json-file --log-opt max-size=50m --signature-verification=False --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true --storage-opt dm.basesize=3G --mtu=8951 --add-registry registry.access.redhat.com --add-registry registry.access.redhat.com", "COMM"=>"dockerd-current", "EXE"=>"/usr/bin/dockerd-current", "GID"=>"0", "MACHINE_ID"=>"d52c597d0f1a42aeb01b5a7d71e63f24", "PID"=>"2961", "SELINUX_CONTEXT"=>"system_u:system_r:init_t:s0", "SYSTEMD_CGROUP"=>"/system.slice/docker.service", "SYSTEMD_SLICE"=>"system.slice", "SYSTEMD_UNIT"=>"docker.service", "TRANSPORT"=>"stdout", "UID"=>"0"}, "u"=>{"SYSLOG_FACILITY"=>"3", "SYSLOG_IDENTIFIER"=>"dockerd-current"}}, "level"=>"info", "message"=>"time=\"2018-03-26T15:59:17.716724325Z\" level=error msg=\"Failed to log msg \\\"  2018-03-26 15:58:49 +0000 [warn]: suppressed same stacktrace\\\" for logger json-file: write /var/lib/docker/containers/a3fca2c0a2f4f44321a7d8165b0cd70c041360ae31960e0959ac8cc7f25dae46/a3fca2c0a2f4f44321a7d8165b0cd70c041360ae31960e0959ac8cc7f25dae46-json.log: no space left on device\"", "hostname"=>"ip-172-31-31-141.us-west-1.compute.internal", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"10.131.60.182", "ipaddr6"=>"fe80::c802:5fff:fe7a:629a", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2018-03-26T16:00:23.009320+00:00", "version"=>"0.12.42 1.6.0"}}, "@timestamp"=>"2018-03-26T15:59:17.000000+00:00", 90=>"\u0018\u05C8\xA7systemd\x82\xA1t\x8E\xA7BOOT_ID\xC4 "}
2018-06-19 15:30:34 +0800 [warn]: fluent/output.rb:381:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2018-06-19 15:35:31 +0800 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="The number of records submitted 6837 do not match the number returned 6835. Unable to process bulk response." plugin_id="object:3fdebe407fa4"
  2018-06-19 15:30:34 +0800 [warn]: suppressed same stacktrace
2018-06-19 15:30:43 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:31:13 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:31:43 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:32:13 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:32:43 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:33:13 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:33:43 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:34:13 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:34:43 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:35:13 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:35:32 +0800 [warn]: fluent/root_agent.rb:187:emit_error_event: dump an error event: error_class=ArgumentError error="invalid byte sequence in UTF-8" location="/usr/share/gems/gems/fluent-plugin-elasticsearch-1.16.1/lib/fluent/plugin/out_elasticsearch.rb:383:in `downcase'" tag="output_tag" time=1522079957 record={"systemd"=>{"t"=>{"BOOT_ID"=>"f4becb7fb2ae42b8abb3076654c926cc", "CAP_EFFECTIVE"=>"1fffffffff", "CMDLINE"=>"/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --seccomp-profile=/etc/docker/seccomp.json --selinux-enabled --log-driver json-file --log-opt max-size=50m --signature-verification=False --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/docker_vg-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true --storage-opt dm.basesize=3G --mtu=8951 --add-registry registry.access.redhat.com --add-registry registry.access.redhat.com", "COMM"=>"dockerd-current", "EXE"=>"/usr/bin/dockerd-current", "GID"=>"0", "MACHINE_ID"=>"d52c597d0f1a42aeb01b5a7d71e63f24", "PID"=>"2961", "SELINUX_CONTEXT"=>"system_u:system_r:init_t:s0", "SYSTEMD_CGROUP"=>"/system.slice/docker.service", "SYSTEMD_SLICE"=>"system.slice", "SYSTEMD_UNIT"=>"docker.service", "TRANSPORT"=>"stdout", "UID"=>"0"}, "u"=>{"SYSLOG_FACILITY"=>"3", "SYSLOG_IDENTIFIER"=>"dockerd-current"}}, "level"=>"info", "message"=>"time=\"2018-03-26T15:59:17.716724325Z\" level=error msg=\"Failed to log msg \\\"  2018-03-26 15:58:49 +0000 [warn]: suppressed same stacktrace\\\" for logger json-file: write /var/lib/docker/containers/a3fca2c0a2f4f44321a7d8165b0cd70c041360ae31960e0959ac8cc7f25dae46/a3fca2c0a2f4f44321a7d8165b0cd70c041360ae31960e0959ac8cc7f25dae46-json.log: no space left on device\"", "hostname"=>"ip-172-31-31-141.us-west-1.compute.internal", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"10.131.60.182", "ipaddr6"=>"fe80::c802:5fff:fe7a:629a", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2018-03-26T16:00:23.009320+00:00", "version"=>"0.12.42 1.6.0"}}, "@timestamp"=>"2018-03-26T15:59:17.000000+00:00", 90=>"\u0018\u05C8\xA7systemd\x82\xA1t\x8E\xA7BOOT_ID\xC4 "}
2018-06-19 15:35:34 +0800 [warn]: fluent/output.rb:381:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2018-06-19 15:40:31 +0800 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="The number of records submitted 6837 do not match the number returned 6835. Unable to process bulk response." plugin_id="object:3fdebe407fa4"
  2018-06-19 15:35:34 +0800 [warn]: suppressed same stacktrace
2018-06-19 15:35:43 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:36:13 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:36:43 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:37:13 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:37:43 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-19 15:38:13 +0800 [info]: plugin/filter_kubernetes_metadata.rb:109:dump_stats: stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3

--- Additional comment from Jeff Cantrill on 2018-06-20 10:12:15 EDT ---

Requires an official image build with rubygem-fluent-plugin-elasticsearch-1.17.0-1.el7.

Comment 2 Anping Li 2018-06-28 09:58:56 UTC
Waiting for new packages.

Comment 4 Anping Li 2018-07-11 09:58:49 UTC
The fix was not in openshift3/logging-fluentd/images/v3.9.33-2. 

2018-07-11 05:56:15 -0400 [warn]: fluent/output.rb:381:rescue in try_flush: temporarily failed to flush the buffer. next_retry=2018-07-11 05:57:14 -0400 error_class="Fluent::ElasticsearchErrorHandler::ElasticsearchError" error="The number of records submitted 6837 do not match the number returned 6835. Unable to process bulk response." plugin_id="object:3fdb3c8de274"

Comment 5 Jeff Cantrill 2018-07-11 15:01:28 UTC
Moving this back to MODIFIED in anticipation of a corrected image.

Comment 7 Junqi Zhao 2018-07-13 06:44:29 UTC
Used the buffer file from https://bugzilla.redhat.com/attachment.cgi?id=1417859
and followed the steps in Comment 0. All buffer records were read without the "records submitted do not match the number returned" error in the fluentd pod logs. Moved to VERIFIED.

logging images version: v3.9.33-3

Comment 9 errata-xmlrpc 2018-07-18 09:18:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2213