Bug 1560170
Summary: | Fluentd unable to send logs to Elasticsearch with socket errors talking to Kube | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Peter Portante <pportant> |
Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3.9.0 | CC: | aos-bugs, ewolinet, jforrest, juzhao, pportant, rmeggins, smunilla, stwalter |
Target Milestone: | --- | ||
Target Release: | 3.9.z | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: The plugin only caught KubeException, not more general exceptions.
Consequence: Consumers are stuck cycling until the API server can be contacted.
Fix: Catch more general exceptions and return gracefully.
Result: The metadata fetch now catches the exception gracefully and returns no metadata, and the record is subsequently orphaned (a rough sketch follows the table below).
|
Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2018-06-06 15:46:20 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
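
As a rough illustration of the fix described in the Doc Text above (not the actual upstream patch; see the PR linked in the comments below), broadening the rescue in the plugin's metadata fetch helper might look like the sketch below. The method name, log message format, `@client`, `@apiVersion`, and `@kubernetes_url` are taken from the plugin debug output quoted later in this report; `parse_metadata` is a hypothetical helper used only for this sketch.

```ruby
# Sketch only: the change described in the Doc Text, approximated.
# Before: only KubeException was rescued, so a plain socket error while the
# API server was unreachable escaped the rescue and left the consumer stuck
# cycling. After: any StandardError is caught, logged, and empty metadata is
# returned so the record can be orphaned instead of blocking the pipeline.
def fetch_pod_metadata(namespace_name, pod_name)
  pod = @client.get_pod(pod_name, namespace_name)
  parse_metadata(pod)                    # hypothetical helper for this sketch
rescue StandardError => e                # was: rescue KubeException only
  log.debug "Exception '#{e}' encountered fetching pod metadata from " \
            "Kubernetes API #{@apiVersion} endpoint #{@kubernetes_url}"
  {}                                     # no metadata; the record gets orphaned
end
```
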
Description
Peter Portante
2018-03-24 12:05:49 UTC
What environment was this happening on? I don't think there is anywhere near enough information here to understand what went wrong.

Should a fluentd pod stuck in this state be passing its health checks? Is that at least something we could fix?

Peter,

What would the expected action for Fluentd be in cases where it is unable to fetch metadata from K8S?

I think we would need to update the logic in the kubernetes_metadata_filter plugin...

(In reply to ewolinet from comment #2)
> Peter,
>
> What would the expected action for Fluentd be in cases where it is unable to
> fetch metadata from K8S?
>
> I think we would need to update the logic in the kubernetes_metadata_filter
> plugin...

I think it should add as much metadata as it can from the context and write the record to the .orphaned index. I believe there is already logic in the fluent-plugin-kubernetes_metadata_filter plugin to do this.

And after doing what Rich suggests, we could have a tool that re-reads the .orphaned index and fetches what it can from k8s to re-index the records correctly.

PR for the version we depend on: https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/pull/132

@Peter We don't know how to verify this defect; we have not hit this error before. Can you share more details?

The container logs are sent to the .orphaned* index when the master is not reachable. The documents are as follows:

    {
      "_index" : ".orphaned.2018.06.04",
      "_type" : "com.redhat.viaq.common",
      "_id" : "N2UyN2IxYTEtZWY0NC00ZmJkLTk1YWUtNGI0ZGUxZjRmNDE3",
      "_score" : 1.0,
      "_source" : {
        "docker" : {
          "container_id" : "f7c343828361330a2321401dfc61b53d7af8180daec9213c9db1d22800a6994d"
        },
        "kubernetes" : {
          "container_name" : "controller-manager",
          "namespace_name" : ".orphaned",
          "pod_name" : "controller-manager-hvhvh",
          "orphaned_namespace" : "kube-service-catalog",
          "namespace_id" : "orphaned"
        },
        "message" : "E0604 02:31:40.365882 1 leaderelection.go:224] error retrieving resource lock kube-service-catalog/service-catalog-controller-manager: Get https://172.30.0.1:443/api/v1/namespaces/kube-service-catalog/configmaps/service-catalog-controller-manager: dial tcp 172.30.0.1:443: getsockopt: connection refused\n",
        "level" : "err",
        "hostname" : "ip-172-18-6-64.ec2.internal",
        "pipeline_metadata" : {
          "collector" : {
            "ipaddr4" : "10.128.0.14",
            "ipaddr6" : "fe80::68d9:5bff:fe23:ebc6",
            "inputname" : "fluent-plugin-systemd",
            "name" : "fluentd",
            "received_at" : "2018-06-04T02:32:14.081351+00:00",
            "version" : "0.12.42 1.6.0"
          }
        },
        "@timestamp" : "2018-06-04T02:31:40.366050+00:00",
        "viaq_msg_id" : "N2UyN2IxYTEtZWY0NC00ZmJkLTk1YWUtNGI0ZGUxZjRmNDE3"
      }
    }

    {
      "_index" : ".orphaned.2018.06.04",
      "_type" : "com.redhat.viaq.common",
      "_id" : "YWI5ZjNkZDItMmFjYy00NzIyLTg0ZDQtY2NmY2JkMWRhNzU5",
      "_score" : 1.0,
      "_source" : {
        "docker" : {
          "container_id" : "9a966c9086f0732df2c7f6d054c4186f43e11bc6a5e0d1092b31a765bcc0db09"
        },
        "kubernetes" : {
          "container_name" : "centos-logtest",
          "namespace_name" : ".orphaned",
          "pod_name" : "centos-logtest-nghwr",
          "orphaned_namespace" : "systlog",
          "namespace_id" : "orphaned"
        },
        "message" : "2018-06-04 03:34:38,142 - SVTLogger - INFO - centos-logtest-nghwr : 5168 : uFq2mniRE Hns9HxlWt kIOiEStXm uhj5klKKN ADELqK3xD byadZV8U0 u2lsJ7KpJ anMUsV8P7 lbIqoZ7kq ogbrJ8kcu k2Kkq3aYr x6t65Uj7E weKVL2Mml CAEQGxpiM cKsgSuEzT CqDLBrVJ5 DaDarYAVC Uj4EhGa6K cJT5XUk4I pWcOYko4q \n",
        "level" : "err",
        "hostname" : "ip-172-18-14-187.ec2.internal",
        "pipeline_metadata" : {
          "collector" : {
            "ipaddr4" : "10.129.0.21",
            "ipaddr6" : "fe80::c029:beff:fe09:48fa",
            "inputname" : "fluent-plugin-systemd",
            "name" : "fluentd",
            "received_at" : "2018-06-04T03:34:38.694839+00:00",
            "version" : "0.12.42 1.6.0"
          }
        },
        "@timestamp" : "2018-06-04T03:34:38.143207+00:00",
        "viaq_msg_id" : "YWI5ZjNkZDItMmFjYy00NzIyLTg0ZDQtY2NmY2JkMWRhNzU5"
      }
    }
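The kubernetes block in those documents shows how a record without metadata is redirected: namespace_name becomes ".orphaned", namespace_id becomes "orphaned", and the original namespace is preserved in orphaned_namespace. A minimal sketch of that rewrite follows; the field names match the documents above, but the helper name is hypothetical and this is not the shipped OpenShift logging code.

```ruby
# Sketch of the orphaning rewrite visible in the verified documents above.
def orphan_if_no_metadata(record)
  kube = record['kubernetes'] ||= {}
  return record if kube['namespace_id']       # metadata fetch succeeded

  kube['orphaned_namespace'] = kube['namespace_name'] || '.orphaned'
  kube['namespace_name']     = '.orphaned'
  kube['namespace_id']       = 'orphaned'
  record       # the Elasticsearch output then indexes it under .orphaned.YYYY.MM.DD
end
```
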
"fluentd", "received_at" : "2018-06-04T03:34:38.694839+00:00", "version" : "0.12.42 1.6.0" } }, "@timestamp" : "2018-06-04T03:34:38.143207+00:00", "viaq_msg_id" : "YWI5ZjNkZDItMmFjYy00NzIyLTg0ZDQtY2NmY2JkMWRhNzU5" } } The follow message are reported in fluentd logs. 2018-06-03 22:35:44 -0400 [debug]: plugin/filter_kubernetes_metadata.rb:140:rescue in fetch_namespace_metadata: Exception 'Connection refused - connect(2)' encountered fetching namespace metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc.cluster.local 2018-06-03 22:35:45 -0400 [debug]: plugin/filter_kubernetes_metadata.rb:98:rescue in fetch_pod_metadata: Exception 'Connection refused - connect(2)' encountered fetching pod metadata from Kubernetes API v1 endpoint https://kubernetes.default.svc.cluster.local Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1796 *** Bug 1591452 has been marked as a duplicate of this bug. *** |