Bug 1560170 - Fluentd unable to send logs to Elasticsearch with socket errors talking to Kube
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assigned To: Jeff Cantrill
QA Contact: Anping Li
Depends On:
Blocks:
 
Reported: 2018-03-24 08:05 EDT by Peter Portante
Modified: 2018-06-14 14:41 EDT
CC: 8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The plugin only caught KubeException, not more general exceptions.
Consequence: Consumers were stuck cycling until the API server could be contacted.
Fix: Catch the more general exception and return gracefully.
Result: The metadata fetch gracefully catches the exception and returns no metadata; the record is subsequently orphaned.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-06 11:46:20 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID                                                            Last Updated
Github fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/127    2018-05-11 16:00 EDT
Red Hat Product Errata RHBA-2018:1796                                 2018-06-06 11:47 EDT

Description Peter Portante 2018-03-24 08:05:49 EDT
It looks like fetching pod metadata has hung up the fluentd pipeline.  The last log entry for the pod, shown below, appears to be 8 hours old.

This situation will cause logs to be lost.

2018-03-24 02:36:22 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-ns9jl_logging_fluentd-elasticsearch-65a4462cbb51d220d0b1708fe0ea7a043a71f991a8ab6834d12d2aed8466eb5a.log"
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:878:in `initialize'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:878:in `open'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:878:in `block in connect'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/timeout.rb:52:in `timeout'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:877:in `connect'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:862:in `do_start'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/ruby/net/http.rb:851:in `start'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/rest-client-2.0.2/lib/restclient/resource.rb:51:in `get'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:204:in `block in get_entity'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:58:in `handle_exception'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:202:in `get_entity'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/kubeclient-1.1.4/lib/kubeclient/common.rb:103:in `block (2 levels) in define_entity_methods'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:78:in `fetch_pod_metadata'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/kubernetes_metadata_cache_strategy.rb:42:in `get_pod_metadata'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:267:in `get_metadata_for_record'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluent-plugin-kubernetes_metadata_filter-1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb:297:in `filter_stream_from_files'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:152:in `block in emit'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:151:in `each'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:151:in `emit'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/event_router.rb:90:in `emit_stream'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:332:in `receive_lines'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:450:in `call'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:450:in `wrap_receive_lines'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:661:in `call'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:661:in `on_notify'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:476:in `on_notify'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:577:in `call'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:577:in `on_timer'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run_once'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run'
  2018-03-24 02:36:22 +0000 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/in_tail.rb:316:in `run'
2018-03-24 02:36:42 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-ns9jl_logging_fluentd-elasticsearch-65a4462cbb51d220d0b1708fe0ea7a043a71f991a8ab6834d12d2aed8466eb5a.log"
  2018-03-24 02:36:42 +0000 [warn]: suppressed same stacktrace
Comment 1 Jessica Forrester 2018-03-28 10:46:24 EDT
What environment was this happening on? I don't think there is anywhere near enough information here to understand what went wrong.

Should a fluent pod stuck in this state be passing its health checks? Is that at least something we could fix?
Comment 2 ewolinet 2018-04-06 17:04:09 EDT
Peter,

What would the expected action for Fluentd be in cases where it is unable to fetch metadata from K8S?

I think we would need to update the logic in the kubernetes_metadata_filter plugin...
Comment 3 Rich Megginson 2018-04-06 18:03:56 EDT
(In reply to ewolinet from comment #2)
> Peter,
> 
> What would the expected action for Fluentd be in cases where it is unable to
> fetch metadata from K8S?
> 
> I think we would need to update the logic in the kubernetes_metadata_filter
> plugin...

I think it should add as much metadata as it can from the context and write the record to the .orphaned index.  I believe there is already logic in fluent-plugin-kubernetes_metadata_filter to do this.
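
Roughly, that orphan fallback amounts to tagging the record before it is emitted. A sketch (field names match the orphaned documents in comment 11; the method itself is illustrative, not the plugin's actual code):

  # Illustrative sketch: tag a record for the .orphaned index when no
  # pod/namespace metadata could be fetched.
  def orphan_metadata(record)
    kube = record['kubernetes'] || {}
    # Preserve the original namespace so the record can be re-indexed later.
    kube['orphaned_namespace'] = kube.delete('namespace_name')
    kube['namespace_name']     = '.orphaned'
    kube['namespace_id']       = 'orphaned'
    record['kubernetes'] = kube
    record
  end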
Comment 4 Peter Portante 2018-04-09 10:49:25 EDT
And after doing what Rich suggests, we could have a tool that re-reads the .orphaned index and queries k8s for what it can, to re-index the records correctly.
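
No such tool exists yet; purely as a sketch of the idea, assuming the elasticsearch and kubeclient gems, hypothetical env var names, and the project.<name>.<uid>.<date> index naming convention:

  # Hypothetical re-index sketch: walk the .orphaned indices, look each
  # record's original namespace back up in Kubernetes, and re-index it
  # under its proper project index.
  require 'elasticsearch'
  require 'kubeclient'

  es   = Elasticsearch::Client.new(url: ENV['ES_URL'])
  kube = Kubeclient::Client.new(ENV['K8S_URL'], 'v1',
                                auth_options: { bearer_token: ENV['K8S_TOKEN'] })

  res = es.search(index: '.orphaned.*', scroll: '5m', size: 100)
  loop do
    hits = res['hits']['hits']
    break if hits.empty?
    hits.each do |hit|
      src = hit['_source']
      ns  = src['kubernetes'] && src['kubernetes']['orphaned_namespace']
      next unless ns
      begin
        meta = kube.get_namespace(ns)  # raises KubeException if still unknown
        date = hit['_index'].sub('.orphaned.', '')
        src['kubernetes']['namespace_name'] = ns
        src['kubernetes']['namespace_id']   = meta.metadata.uid
        es.index(index: "project.#{ns}.#{meta.metadata.uid}.#{date}",
                 type: hit['_type'], id: hit['_id'], body: src)
      rescue KubeException
        next  # the namespace really is gone; leave the record orphaned
      end
    end
    res = es.scroll(scroll_id: res['_scroll_id'], scroll: '5m')
  end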
Comment 6 Jeff Cantrill 2018-05-14 13:57:33 EDT
PR for version we depend on: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/132
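
Roughly, the change broadens the rescue around the metadata fetch so that network-level errors (like the SocketError in the description) are handled the same way as API errors instead of propagating up and stalling the emit pipeline. A minimal sketch, with illustrative method and logger names (see the PR above for the actual change):

  def fetch_pod_metadata(namespace_name, pod_name)
    pod = @client.get_pod(pod_name, namespace_name)
    parse_pod_metadata(pod)
  rescue KubeException => e
    # Previously the only exception handled: Kubernetes API errors.
    log.debug "Exception '#{e.message}' encountered fetching pod metadata"
    {}
  rescue StandardError => e
    # Also catches SocketError, Errno::ECONNREFUSED, etc., so the record
    # falls through with no metadata and is routed to the .orphaned index.
    log.debug "Exception '#{e.message}' encountered fetching pod metadata"
    {}
  end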
Comment 7 Jeff Cantrill 2018-05-14 14:02:47 EDT
Locking gem PR https://github.com/openshift/origin-aggregated-logging/pull/1167
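
Locking the gem in the image build would look something like this Gemfile entry (the version shown is illustrative; see the PR for the actual pin):

  gem 'fluent-plugin-kubernetes_metadata_filter', '= 1.0.2'  # hypothetical version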
Comment 10 Junqi Zhao 2018-06-01 02:47:36 EDT
@Peter

We don't know how to verify this defect; we have not encountered this error before. Can you share more details?
Comment 11 Anping Li 2018-06-04 00:42:46 EDT
The container logs are sent to the .orphaned* index when the master is not reachable.  The documents are as follows.


{
      "_index" : ".orphaned.2018.06.04",
      "_type" : "com.redhat.viaq.common",
      "_id" : "N2UyN2IxYTEtZWY0NC00ZmJkLTk1YWUtNGI0ZGUxZjRmNDE3",
      "_score" : 1.0,
      "_source" : {
        "docker" : {
          "container_id" : "f7c343828361330a2321401dfc61b53d7af8180daec9213c9db1d22800a6994d"
        },
        "kubernetes" : {
          "container_name" : "controller-manager",
          "namespace_name" : ".orphaned",
          "pod_name" : "controller-manager-hvhvh",
          "orphaned_namespace" : "kube-service-catalog",
          "namespace_id" : "orphaned"
        },
        "message" : "E0604 02:31:40.365882       1 leaderelection.go:224] error retrieving resource lock kube-service-catalog/service-catalog-controller-manager: Get https://172.30.0.1:443/api/v1/namespaces/kube-service-catalog/configmaps/service-catalog-controller-manager: dial tcp 172.30.0.1:443: getsockopt: connection refused\n",
        "level" : "err",
        "hostname" : "ip-172-18-6-64.ec2.internal",
        "pipeline_metadata" : {
          "collector" : {
            "ipaddr4" : "10.128.0.14",
            "ipaddr6" : "fe80::68d9:5bff:fe23:ebc6",
            "inputname" : "fluent-plugin-systemd",
            "name" : "fluentd",
            "received_at" : "2018-06-04T02:32:14.081351+00:00",
            "version" : "0.12.42 1.6.0"
          }
        },
        "@timestamp" : "2018-06-04T02:31:40.366050+00:00",
        "viaq_msg_id" : "N2UyN2IxYTEtZWY0NC00ZmJkLTk1YWUtNGI0ZGUxZjRmNDE3"
      }
    }

{
      "_index" : ".orphaned.2018.06.04",
      "_type" : "com.redhat.viaq.common",
      "_id" : "YWI5ZjNkZDItMmFjYy00NzIyLTg0ZDQtY2NmY2JkMWRhNzU5",
      "_score" : 1.0,
      "_source" : {
        "docker" : {
          "container_id" : "9a966c9086f0732df2c7f6d054c4186f43e11bc6a5e0d1092b31a765bcc0db09"
        },
        "kubernetes" : {
          "container_name" : "centos-logtest",
          "namespace_name" : ".orphaned",
          "pod_name" : "centos-logtest-nghwr",
          "orphaned_namespace" : "systlog",
          "namespace_id" : "orphaned"
        },
        "message" : "2018-06-04 03:34:38,142 - SVTLogger - INFO - centos-logtest-nghwr : 5168 : uFq2mniRE Hns9HxlWt kIOiEStXm uhj5klKKN ADELqK3xD byadZV8U0 u2lsJ7KpJ anMUsV8P7 lbIqoZ7kq ogbrJ8kcu k2Kkq3aYr x6t65Uj7E weKVL2Mml CAEQGxpiM cKsgSuEzT CqDLBrVJ5 DaDarYAVC Uj4EhGa6K cJT5XUk4I pWcOYko4q \n",
        "level" : "err",
        "hostname" : "ip-172-18-14-187.ec2.internal",
        "pipeline_metadata" : {
          "collector" : {
            "ipaddr4" : "10.129.0.21",
            "ipaddr6" : "fe80::c029:beff:fe09:48fa",
            "inputname" : "fluent-plugin-systemd",
            "name" : "fluentd",
            "received_at" : "2018-06-04T03:34:38.694839+00:00",
            "version" : "0.12.42 1.6.0"
          }
        },
        "@timestamp" : "2018-06-04T03:34:38.143207+00:00",
        "viaq_msg_id" : "YWI5ZjNkZDItMmFjYy00NzIyLTg0ZDQtY2NmY2JkMWRhNzU5"
      }
    }
Comment 12 Anping Li 2018-06-04 01:11:20 EDT
The following messages are reported in the fluentd logs.
2018-06-03 22:35:44 -0400 [debug]: plugin/filter_kubernetes_metadata.rb:140:rescue in fetch_namespace_metadata: Exception 'Connection refused - connect(2)' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc.cluster.local
2018-06-03 22:35:45 -0400 [debug]: plugin/filter_kubernetes_metadata.rb:98:rescue in fetch_pod_metadata: Exception 'Connection refused - connect(2)' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc.cluster.local
Comment 14 errata-xmlrpc 2018-06-06 11:46:20 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1796
Comment 15 Jeff Cantrill 2018-06-14 14:41:42 EDT
*** Bug 1591452 has been marked as a duplicate of this bug. ***
