It looks like fetching pod metadata has hung up the fluentd pipeline. The last log entry for the pod, which appears to be 8 hours old, is below. This situation will cause logs to be lost.

2018-03-24 02:36:22 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-ns9jl_logging_fluentd-elasticsearch-65a4462cbb51d220d0b1708fe0ea7a043a71f991a8ab6834d12d2aed8466eb5a.log"
2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:878:in `initialize'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:878:in `open'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:878:in `block in connect'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/timeout.rb:52:in `timeout'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:877:in `connect'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:862:in `do_start'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:851:in `start'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/resource.rb:51:in `get'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:204:in `block in get_entity'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:58:in `handle_exception'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:202:in `get_entity'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:103:in `block (2 levels) in define_entity_methods'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:78:in `fetch_pod_metadata'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/kubernetes_metadata_cache_strategy.rb:42:in `get_pod_metadata'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:267:in `get_metadata_for_record'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:297:in `filter_stream_from_files'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:152:in `block in emit'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:151:in `each'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:151:in `emit'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:90:in `emit_stream'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:332:in `receive_lines'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:450:in `call'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:450:in `wrap_receive_lines'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:661:in `call'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:661:in `on_notify'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:476:in `on_notify'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:577:in `call'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:577:in `on_timer'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run_once'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run'
2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:316:in `run'
2018-03-24 02:36:42 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-ns9jl_logging_fluentd-elasticsearch-65a4462cbb51d220d0b1708fe0ea7a043a71f991a8ab6834d12d2aed8466eb5a.log"
2018-03-24 02:36:42 +0000 [warn]: suppressed same stacktrace
What environment was this happening on? I don't think there is anywhere near enough information here to understand what went wrong. Should a fluentd pod stuck in this state be passing its health checks? Is that at least something we could fix?
Peter,

What would the expected action for Fluentd be in cases where it is unable to fetch metadata from K8S?

I think we would need to update the logic in the kubernetes_metadata_filter plugin...
(In reply to ewolinet from comment #2)
> Peter,
>
> What would the expected action for Fluentd be in cases where it is unable to
> fetch metadata from K8S?
>
> I think we would need to update the logic in the kubernetes_metadata_filter
> plugin...

I think it should add as much metadata as it can from the context and write the record to the .orphaned index. I believe there is already logic in the fluent-plugin-kubernetes_metadata_filter plugin to do this.
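For illustration only, here is a minimal standalone sketch (plain Ruby, not the plugin's actual code) of the fallback described above; the method name and the exact field values are assumptions based on the orphaned records shown later in this bug:

    # Minimal sketch of the fallback described above (not the plugin's real code):
    # try the API lookup, and on a connection failure build "orphaned" metadata
    # from what is already known from the log file name, so the record can be
    # routed to the .orphaned index instead of blocking the pipeline.
    require 'socket'

    def metadata_with_orphan_fallback(namespace, pod_name, container_name)
      # In the real plugin this would call the Kubernetes API via kubeclient;
      # here the lookup is passed in as a block so the sketch stands alone.
      yield(namespace, pod_name)
    rescue SocketError, Errno::ECONNREFUSED, Errno::ETIMEDOUT => e
      warn "falling back to orphaned metadata: #{e.message}"
      {
        'pod_name'           => pod_name,
        'container_name'     => container_name,
        'namespace_name'     => '.orphaned',  # routes the record to the .orphaned.* index
        'orphaned_namespace' => namespace,    # preserve the original namespace for later re-indexing
        'namespace_id'       => 'orphaned'
      }
    end

    # Example: simulate an unreachable API server.
    meta = metadata_with_orphan_fallback('kube-service-catalog', 'controller-manager-hvhvh', 'controller-manager') do
      raise SocketError, 'getaddrinfo: Name or service not known'
    end
    puts meta['namespace_name']  # => ".orphaned"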
And after doing what Rich suggests, we could have a tool that re-reads the .orphaned index, fetches what it can from k8s, and re-indexes the records correctly.
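Purely as a sketch of that idea (this is not shipped tooling): something along these lines could walk the .orphaned.* indices with the elasticsearch and kubeclient gems and re-home records whose pods still exist. The client setup, error handling, and target index naming below are all assumptions.

    # Rough sketch of the re-indexing tool suggested above (illustrative only).
    require 'openssl'
    require 'elasticsearch'
    require 'kubeclient'

    es   = Elasticsearch::Client.new(url: ENV.fetch('ES_URL', 'http://localhost:9200'))
    kube = Kubeclient::Client.new(ENV.fetch('K8S_URL', 'https://kubernetes.default.svc'), 'v1',
                                  auth_options: { bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token' },
                                  ssl_options:  { verify_ssl: OpenSSL::SSL::VERIFY_NONE })

    resp = es.search(index: '.orphaned.*', body: { size: 100, query: { match_all: {} } })
    resp['hits']['hits'].each do |hit|
      src = hit['_source']
      k8s = src['kubernetes'] || {}
      ns  = k8s['orphaned_namespace']
      pod = k8s['pod_name']
      next unless ns && pod

      begin
        kube.get_pod(pod, ns)  # raises if the pod (or the API) is not available
      rescue StandardError
        next                   # leave the record in .orphaned for now
      end

      # Re-home the record: restore the real namespace and write it back to a
      # per-project index (the index naming scheme here is an assumption).
      k8s['namespace_name'] = ns
      k8s.delete('orphaned_namespace')
      date = hit['_index'].split('.').last(3).join('.')  # ".orphaned.2018.06.04" -> "2018.06.04"
      es.index(index: "project.#{ns}.#{date}", type: hit['_type'], id: hit['_id'], body: src)
    end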
https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/127
PR for version we depend on: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/132
Locking gem PR https://github.com/openshift/origin-aggregated-logging/pull/1167
@Peter We don't know how to verify this defect; we have not encountered this error before. Can you share more details?
The container logs are sent to the .orphaned* index when the master is not reachable. The documents look like the following.

{
  "_index" : ".orphaned.2018.06.04",
  "_type" : "com.redhat.viaq.common",
  "_id" : "N2UyN2IxYTEtZWY0NC00ZmJkLTk1YWUtNGI0ZGUxZjRmNDE3",
  "_score" : 1.0,
  "_source" : {
    "docker" : {
      "container_id" : "f7c343828361330a2321401dfc61b53d7af8180daec9213c9db1d22800a6994d"
    },
    "kubernetes" : {
      "container_name" : "controller-manager",
      "namespace_name" : ".orphaned",
      "pod_name" : "controller-manager-hvhvh",
      "orphaned_namespace" : "kube-service-catalog",
      "namespace_id" : "orphaned"
    },
    "message" : "E0604 02:31:40.365882 1 leaderelection.go:224] error retrieving resource lock kube-service-catalog/service-catalog-controller-manager: Get https://172.30.0.1:443/api/v1/namespaces/kube-service-catalog/configmaps/service-catalog-controller-manager: dial tcp 172.30.0.1:443: getsockopt: connection refused\n",
    "level" : "err",
    "hostname" : "ip-172-18-6-64.ec2.internal",
    "pipeline_metadata" : {
      "collector" : {
        "ipaddr4" : "10.128.0.14",
        "ipaddr6" : "fe80::68d9:5bff:fe23:ebc6",
        "inputname" : "fluent-plugin-systemd",
        "name" : "fluentd",
        "received_at" : "2018-06-04T02:32:14.081351+00:00",
        "version" : "0.12.42 1.6.0"
      }
    },
    "@timestamp" : "2018-06-04T02:31:40.366050+00:00",
    "viaq_msg_id" : "N2UyN2IxYTEtZWY0NC00ZmJkLTk1YWUtNGI0ZGUxZjRmNDE3"
  }
}

{
  "_index" : ".orphaned.2018.06.04",
  "_type" : "com.redhat.viaq.common",
  "_id" : "YWI5ZjNkZDItMmFjYy00NzIyLTg0ZDQtY2NmY2JkMWRhNzU5",
  "_score" : 1.0,
  "_source" : {
    "docker" : {
      "container_id" : "9a966c9086f0732df2c7f6d054c4186f43e11bc6a5e0d1092b31a765bcc0db09"
    },
    "kubernetes" : {
      "container_name" : "centos-logtest",
      "namespace_name" : ".orphaned",
      "pod_name" : "centos-logtest-nghwr",
      "orphaned_namespace" : "systlog",
      "namespace_id" : "orphaned"
    },
    "message" : "2018-06-04 03:34:38,142 - SVTLogger - INFO - centos-logtest-nghwr : 5168 : uFq2mniRE Hns9HxlWt kIOiEStXm uhj5klKKN ADELqK3xD byadZV8U0 u2lsJ7KpJ anMUsV8P7 lbIqoZ7kq ogbrJ8kcu k2Kkq3aYr x6t65Uj7E weKVL2Mml CAEQGxpiM cKsgSuEzT CqDLBrVJ5 DaDarYAVC Uj4EhGa6K cJT5XUk4I pWcOYko4q \n",
    "level" : "err",
    "hostname" : "ip-172-18-14-187.ec2.internal",
    "pipeline_metadata" : {
      "collector" : {
        "ipaddr4" : "10.129.0.21",
        "ipaddr6" : "fe80::c029:beff:fe09:48fa",
        "inputname" : "fluent-plugin-systemd",
        "name" : "fluentd",
        "received_at" : "2018-06-04T03:34:38.694839+00:00",
        "version" : "0.12.42 1.6.0"
      }
    },
    "@timestamp" : "2018-06-04T03:34:38.143207+00:00",
    "viaq_msg_id" : "YWI5ZjNkZDItMmFjYy00NzIyLTg0ZDQtY2NmY2JkMWRhNzU5"
  }
}
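If it helps anyone else verifying this, here is a small Ruby example of pulling a per-namespace count of the orphaned records shown above. It assumes Elasticsearch is reachable at ES_URL with no extra authentication, and the exact field mapping may differ in a real deployment.

    # Count orphaned records per original namespace (verification helper sketch).
    require 'elasticsearch'

    es = Elasticsearch::Client.new(url: ENV.fetch('ES_URL', 'http://localhost:9200'))
    resp = es.search(index: '.orphaned.*', body: {
      size: 0,
      aggs: { by_namespace: { terms: { field: 'kubernetes.orphaned_namespace' } } }
    })
    resp['aggregations']['by_namespace']['buckets'].each do |bucket|
      puts "#{bucket['key']}: #{bucket['doc_count']} orphaned records"
    end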
The following messages are reported in the fluentd logs.

2018-06-03 22:35:44 -0400 [debug]: plugin/filter_kubernetes_metadata.rb:140:rescue in fetch_namespace_metadata: Exception 'Connection refused - connect(2)' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc.cluster.local
2018-06-03 22:35:45 -0400 [debug]: plugin/filter_kubernetes_metadata.rb:98:rescue in fetch_pod_metadata: Exception 'Connection refused - connect(2)' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc.cluster.local
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1796
*** Bug 1591452 has been marked as a duplicate of this bug. ***