https://github.com/openshift/origin-aggregated-logging/pull/991
@anli - have you tried 3.7.44 yet? I am using my usual inventory to install (see below) and logging-fluentd is stuck in a crash loop. The log looks like this:

2018-04-19 18:57:31 +0000 [info]: reading config file path="/etc/fluent/fluent.conf"
2018-04-19 18:57:32 +0000 [error]: unexpected error error="Connection refused - connect(2)"
2018-04-19 18:57:32 +0000 [error]: /usr/share/ruby/net/http.rb:878:in `initialize'
2018-04-19 18:57:32 +0000 [error]: /usr/share/ruby/net/http.rb:878:in `open'
2018-04-19 18:57:32 +0000 [error]: /usr/share/ruby/net/http.rb:878:in `block in connect'
2018-04-19 18:57:32 +0000 [error]: /usr/share/ruby/timeout.rb:52:in `timeout'
2018-04-19 18:57:32 +0000 [error]: /usr/share/ruby/net/http.rb:877:in `connect'
2018-04-19 18:57:32 +0000 [error]: /usr/share/ruby/net/http.rb:862:in `do_start'
2018-04-19 18:57:32 +0000 [error]: /usr/share/ruby/net/http.rb:851:in `start'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/resource.rb:51:in `get'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:328:in `block in api'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:58:in `handle_exception'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:327:in `api'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:322:in `api_valid?'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:227:in `configure'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:145:in `add_filter'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:62:in `block in configure'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:57:in `each'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/agent.rb:57:in `configure'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/root_agent.rb:83:in `block in configure'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/root_agent.rb:83:in `each'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/root_agent.rb:83:in `configure'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/engine.rb:129:in `configure'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/engine.rb:103:in `run_configure'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:489:in `run_configure'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:174:in `block in start'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:366:in `call'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:366:in `main_process'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/supervisor.rb:170:in `start'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/command/fluentd.rb:173:in `<top (required)>'
2018-04-19 18:57:32 +0000 [error]: /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
2018-04-19 18:57:32 +0000 [error]: /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:55:in `require'
2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluentd-0.12.42/bin/fluentd:8:in `<top (required)>'
2018-04-19 18:57:32 +0000 [error]: /usr/bin/fluentd:23:in `load'
2018-04-19 18:57:32 +0000 [error]: /usr/bin/fluentd:23:in `<main>'

/etc/fluent/fluent.conf inside the pod looks OK and I am able to connect to Elasticsearch OK.
All 3 ES pods show Ready:

NAME                                      READY     STATUS             RESTARTS   AGE
logging-curator-1-f6jkv                   1/1       Running            0          16m
logging-es-data-master-6x48bl74-1-wxx59   2/2       Running            0          15m
logging-es-data-master-d38zgrey-1-njcdl   2/2       Running            0          15m
logging-es-data-master-rpldk38n-1-swr6t   2/2       Running            0          15m
logging-fluentd-c7p7g                     0/1       CrashLoopBackOff   7          15m
logging-fluentd-n2cwg                     0/1       CrashLoopBackOff   7          16m
logging-fluentd-n46f2                     0/1       CrashLoopBackOff   7          15m
logging-fluentd-t7sfq                     0/1       CrashLoopBackOff   7          15m
logging-fluentd-tc496                     0/1       CrashLoopBackOff   7          16m
logging-fluentd-wgl4v                     0/1       CrashLoopBackOff   7          15m
logging-kibana-1-4mb4b                    2/2       Running            0          17m

Inventory:

[OSEv3:children]
masters
etcd

[masters]
ip-172-31-29-37.us-west-2.compute.internal

[etcd]
ip-172-31-29-37.us-west-2.compute.internal

[OSEv3:vars]
deployment_type=openshift-enterprise
openshift_deployment_type=openshift-enterprise
openshift_release=v3.7
openshift_docker_additional_registries=registry.reg-aws.openshift.com
openshift_logging_install_logging=true
openshift_logging_master_url=https://ec2-34-215-227-156.us-west-2.compute.amazonaws.com:8443
openshift_logging_master_public_url=https://ec2-34-215-227-156.us-west-2.compute.amazonaws.com:8443
openshift_logging_kibana_hostname=kibana.18.236.141.70.xip.io
openshift_logging_namespace=logging
openshift_logging_image_prefix=registry.reg-aws.openshift.com:443/openshift3/
openshift_logging_image_version=v3.7
openshift_logging_es_cluster_size=3
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_size=25Gi
openshift_logging_es_pvc_storage_class_name=gp2
openshift_logging_fluentd_read_from_head=false
openshift_logging_use_mux=false
openshift_logging_curator_nodeselector={"region": "infra"}
openshift_logging_kibana_nodeselector={"region": "infra"}
openshift_logging_es_nodeselector={"region": "infra"}
I'm going to hand QA back temporarily for verification that 3.7.44 is working ok for the logging QE team. If you can recreate the issue in comment 4, please re-open the bz. Thanks.
Since the error is coming from:

2018-04-19 18:57:32 +0000 [error]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:227:in `configure'

that means fluentd is unable to connect to the Kubernetes API server. I'm assuming the API server is indeed running. Is this the "too many connections to k8s" problem?
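For context, that frame is the point where the kubernetes_metadata filter validates its API connection at configure time, using the kubeclient gem that also appears in the stack trace. A minimal sketch of that check (the URL and token/CA paths here are illustrative assumptions, not the plugin's exact configuration):

# Minimal sketch of the API check the filter performs in `configure'.
# Assumed: in-cluster API URL and service-account paths.
require 'kubeclient'

token = File.read('/var/run/secrets/kubernetes.io/serviceaccount/token')
client = Kubeclient::Client.new(
  'https://kubernetes.default.svc.cluster.local:443/api',  # assumed API URL; the plugin takes this from its kubernetes_url setting
  'v1',
  auth_options: { bearer_token: token },
  ssl_options:  { ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' }
)

# api_valid? issues a GET against the API root; a refused connection here is
# exactly the "Connection refused - connect(2)" seen in the fluentd log.
puts client.api_valid?

If that probe cannot reach the configured host/port, the filter fails in configure and the pod crash-loops, which matches the behavior above.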
Sorry, user error. I hadn't realized this cluster's API server was installed on port 443 instead of 8443, while my inventory still pointed the logging master URL at 8443. False alarm - thanks for the pointer, @richm. Taking QA back for verification of the original issue.
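For anyone hitting the same symptom: a quick way to spot the port mismatch is to probe the master URL on both ports from inside a fluentd pod. A rough sketch using the Ruby Net::HTTP stack the pod already ships (host taken from the inventory above; /healthz is the standard API server health endpoint; certificate verification is skipped because this is only a reachability probe):

require 'net/http'
require 'openssl'
require 'uri'

['https://ec2-34-215-227-156.us-west-2.compute.amazonaws.com:8443/healthz',
 'https://ec2-34-215-227-156.us-west-2.compute.amazonaws.com:443/healthz'].each do |url|
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.verify_mode = OpenSSL::SSL::VERIFY_NONE  # reachability check only
  http.open_timeout = 5
  begin
    puts "#{url} -> #{http.get(uri.path).code}"
  rescue => e
    # The wrong port shows up as Errno::ECONNREFUSED, matching the fluentd error.
    puts "#{url} -> #{e.class}: #{e.message}"
  end
end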
Verified on 3.7.44. Sent 6 million messages across 360 namespaces at an aggregate rate of 600 messages/second. This triggered 10,985 bulk request rejections, as seen in the ES bulk index API, and many of the messages below in the logging-fluentd logs. All 6 million messages were received with no duplicates and no message loss.

2018-04-20 14:44:18 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-20 14:44:19 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:12d2f34"
2018-04-20 14:44:18 +0000 [warn]: suppressed same stacktrace
2018-04-20 14:44:20 +0000 [warn]: retry succeeded. plugin_id="object:12d2f34"
2018-04-20 14:44:35 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-20 14:44:35 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:12d2f34"
2018-04-20 14:44:35 +0000 [warn]: suppressed same stacktrace
2018-04-20 14:44:36 +0000 [warn]: retry succeeded. plugin_id="object:12d2f34"
2018-04-20 14:44:44 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-20 14:44:44 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:12d2f34"
2018-04-20 14:44:44 +0000 [warn]: suppressed same stacktrace
2018-04-20 14:44:45 +0000 [warn]: retry succeeded. plugin_id="object:12d2f34"
2018-04-20 14:45:25 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-20 14:45:26 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:12d2f34"
2018-04-20 14:45:25 +0000 [warn]: suppressed same stacktrace
2018-04-20 14:45:26 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-20 14:45:28 +0000 error_class="Fluent::ElasticsearchErrorHandler::BulkIndexQueueFull" error="Bulk index queue is full, retrying" plugin_id="object:12d2f34"
2018-04-20 14:45:26 +0000 [warn]: suppressed same stacktrace
2018-04-20 14:45:28 +0000 [warn]: retry succeeded. plugin_id="object:12d2f34"
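For reference, the behavior being exercised here is the Elasticsearch error handler raising on bulk-queue rejections so fluentd re-queues the chunk instead of dropping it. A simplified sketch of that idea (class name and response shape are modeled on the log messages above, not copied from the plugin's actual source):

# Simplified illustration of the retry path; the real handler ships with the
# fluentd elasticsearch plugin used by logging-fluentd.
class BulkIndexQueueFull < StandardError; end

# `response` is a parsed ES _bulk reply: { "errors" => true/false, "items" => [...] }
def check_bulk_response(response)
  return unless response['errors']
  response['items'].each do |item|
    op = item.values.first                    # the per-document "index"/"create" result
    if op['error'].to_s.include?('rejected_execution')  # ES bulk thread pool queue was full
      # Raising makes fluentd keep the chunk in its buffer and retry later,
      # which is why the log shows "retry succeeded" and no messages are lost.
      raise BulkIndexQueueFull, 'Bulk index queue is full, retrying'
    end
  end
end

The verification above (10,985 rejections, zero loss, zero duplicates) is consistent with that retry-on-rejection behavior working as intended.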
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:1231