Bug 1560170

Summary:	Fluentd unable to send logs to Elasticsearch with socket errors talking to Kube
Product:	OpenShift Container Platform	Reporter:	Peter Portante <pportant>
Component:	Logging	Assignee:	Jeff Cantrill <jcantril>
Status:	CLOSED ERRATA	QA Contact:	Anping Li <anli>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.9.0	CC:	aos-bugs, ewolinet, jforrest, juzhao, pportant, rmeggins, smunilla, stwalter
Target Milestone:	---
Target Release:	3.9.z
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Plugin only caught KubeException but not more general exceptions Consequence: Consumers are stuck cycling until API server can be contacted. Fix: Catch more general exception and return gracefully Result: Metadata fetch is more relaxed and will gracefully catch exception, returning no meta, and subsequently the record will be orphaned.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-06-06 15:46:20 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Peter Portante 2018-03-24 12:05:49 UTC

It looks like fetching pod metadata has hung up the pipeline of fluentd.  The last log entry for the pod is below, which appears to be 8 hours old.

This situation will cause logs to be lost.

2018-03-24 02:36:22 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-ns9jl_logging_fluentd-elasticsearch-65a4462cbb51d220d0b1708fe0ea7a043a71f991a8ab6834d12d2aed8466eb5a.log"
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:878:in `initialize'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:878:in `open'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:878:in `block in connect'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/timeout.rb:52:in `timeout'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:877:in `connect'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:862:in `do_start'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:851:in `start'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/resource.rb:51:in `get'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:204:in `block in get_entity'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:58:in `handle_exception'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:202:in `get_entity'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:103:in `block (2 levels) in define_entity_methods'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:78:in `fetch_pod_metadata'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/kubernetes_metadata_cache_strategy.rb:42:in `get_pod_metadata'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:267:in `get_metadata_for_record'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:297:in `filter_stream_from_files'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:152:in `block in emit'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:151:in `each'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:151:in `emit'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:90:in `emit_stream'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:332:in `receive_lines'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:450:in `call'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:450:in `wrap_receive_lines'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:661:in `call'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:661:in `on_notify'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:476:in `on_notify'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:577:in `call'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:577:in `on_timer'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run_once'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:316:in `run'
2018-03-24 02:36:42 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-ns9jl_logging_fluentd-elasticsearch-65a4462cbb51d220d0b1708fe0ea7a043a71f991a8ab6834d12d2aed8466eb5a.log"
  2018-03-24 02:36:42 +0000 [warn]: suppressed same stacktrace

Comment 1 Jessica Forrester 2018-03-28 14:46:24 UTC

what environment was this happening on? I don't think there is anywhere near enough information here to understand what went wrong.

Should a fluent pod stuck in this state be passing its health checks? Is that at least something we could fix?

Comment 2 ewolinet 2018-04-06 21:04:09 UTC

Peter,

What would the expected action for Fluentd be in cases where it is unable to fetch metadata from K8S?

I think we would need to update the logic in the kubernetes_metadata_filter plugin...

Comment 3 Rich Megginson 2018-04-06 22:03:56 UTC

(In reply to ewolinet from comment #2)
> Peter,
> 
> What would the expected action for Fluentd be in cases where it is unable to
> fetch metadata from K8S?
> 
> I think we would need to update the logic in the kubernetes_metadata_filter
> plugin...

I think it should add as much metadata as it can from the context and write the record to the .orphaned index.  I believe there is already logic in the fluent-plugin-kubernetes_metadata_plugin to do this.

Comment 4 Peter Portante 2018-04-09 14:49:25 UTC

And after doing what Rich suggests, we could have a tool that re-reads the .orphan index and fetches k8s for what it can to re-index correctly.

Comment 5 Jeff Cantrill 2018-05-11 20:00:29 UTC

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/127

Comment 6 Jeff Cantrill 2018-05-14 17:57:33 UTC

PR for version we depend on: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/132

Comment 7 Jeff Cantrill 2018-05-14 18:02:47 UTC

Locking gem PR https://github.com/openshift/origin-aggregated-logging/pull/1167

Comment 10 Junqi Zhao 2018-06-01 06:47:36 UTC

@Peter

We don't know how to verify this defect, we did not meet this error before, can you share more details?

Comment 11 Anping Li 2018-06-04 04:42:46 UTC

The container logs are sent to .orphaned* index when master is not reachable.  the document are as following. 


{
      "_index" : ".orphaned.2018.06.04",
      "_type" : "com.redhat.viaq.common",
      "_id" : "N2UyN2IxYTEtZWY0NC00ZmJkLTk1YWUtNGI0ZGUxZjRmNDE3",
      "_score" : 1.0,
      "_source" : {
        "docker" : {
          "container_id" : "f7c343828361330a2321401dfc61b53d7af8180daec9213c9db1d22800a6994d"
        },
        "kubernetes" : {
          "container_name" : "controller-manager",
          "namespace_name" : ".orphaned",
          "pod_name" : "controller-manager-hvhvh",
          "orphaned_namespace" : "kube-service-catalog",
          "namespace_id" : "orphaned"
        },
        "message" : "E0604 02:31:40.365882       1 leaderelection.go:224] error retrieving resource lock kube-service-catalog/service-catalog-controller-manager: Get https://172.30.0.1:443/api/v1/namespaces/kube-service-catalog/configmaps/service-catalog-controller-manager: dial tcp 172.30.0.1:443: getsockopt: connection refused\n",
        "level" : "err",
        "hostname" : "ip-172-18-6-64.ec2.internal",
        "pipeline_metadata" : {
          "collector" : {
            "ipaddr4" : "10.128.0.14",
            "ipaddr6" : "fe80::68d9:5bff:fe23:ebc6",
            "inputname" : "fluent-plugin-systemd",
            "name" : "fluentd",
            "received_at" : "2018-06-04T02:32:14.081351+00:00",
            "version" : "0.12.42 1.6.0"
          }
        },
        "@timestamp" : "2018-06-04T02:31:40.366050+00:00",
        "viaq_msg_id" : "N2UyN2IxYTEtZWY0NC00ZmJkLTk1YWUtNGI0ZGUxZjRmNDE3"
      }
    }

{
      "_index" : ".orphaned.2018.06.04",
      "_type" : "com.redhat.viaq.common",
      "_id" : "YWI5ZjNkZDItMmFjYy00NzIyLTg0ZDQtY2NmY2JkMWRhNzU5",
      "_score" : 1.0,
      "_source" : {
        "docker" : {
          "container_id" : "9a966c9086f0732df2c7f6d054c4186f43e11bc6a5e0d1092b31a765bcc0db09"
        },
        "kubernetes" : {
          "container_name" : "centos-logtest",
          "namespace_name" : ".orphaned",
          "pod_name" : "centos-logtest-nghwr",
          "orphaned_namespace" : "systlog",
          "namespace_id" : "orphaned"
        },
        "message" : "2018-06-04 03:34:38,142 - SVTLogger - INFO - centos-logtest-nghwr : 5168 : uFq2mniRE Hns9HxlWt kIOiEStXm uhj5klKKN ADELqK3xD byadZV8U0 u2lsJ7KpJ anMUsV8P7 lbIqoZ7kq ogbrJ8kcu k2Kkq3aYr x6t65Uj7E weKVL2Mml CAEQGxpiM cKsgSuEzT CqDLBrVJ5 DaDarYAVC Uj4EhGa6K cJT5XUk4I pWcOYko4q \n",
        "level" : "err",
        "hostname" : "ip-172-18-14-187.ec2.internal",
        "pipeline_metadata" : {
          "collector" : {
            "ipaddr4" : "10.129.0.21",
            "ipaddr6" : "fe80::c029:beff:fe09:48fa",
            "inputname" : "fluent-plugin-systemd",
            "name" : "fluentd",
            "received_at" : "2018-06-04T03:34:38.694839+00:00",
            "version" : "0.12.42 1.6.0"
          }
        },
        "@timestamp" : "2018-06-04T03:34:38.143207+00:00",
        "viaq_msg_id" : "YWI5ZjNkZDItMmFjYy00NzIyLTg0ZDQtY2NmY2JkMWRhNzU5"
      }
    }

Comment 12 Anping Li 2018-06-04 05:11:20 UTC

The follow message are reported in fluentd logs.
2018-06-03 22:35:44 -0400 [debug]: plugin/filter_kubernetes_metadata.rb:140:rescue in fetch_namespace_metadata: Exception 'Connection refused - connect(2)' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc.cluster.local
2018-06-03 22:35:45 -0400 [debug]: plugin/filter_kubernetes_metadata.rb:98:rescue in fetch_pod_metadata: Exception 'Connection refused - connect(2)' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc.cluster.local

Comment 14 errata-xmlrpc 2018-06-06 15:46:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1796

Comment 15 Jeff Cantrill 2018-06-14 18:41:42 UTC

*** Bug 1591452 has been marked as a duplicate of this bug. ***