Bug 1768762

Summary: Fluentd: "Could not communicate to Elasticsearch" when an HTTP proxy is enabled in the cluster.
Product: OpenShift Container Platform Reporter: Qiaoling Tang <qitang>
Component: Logging Assignee: ewolinet
Status: CLOSED ERRATA QA Contact: Anping Li <anli>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.3.0 CC: aos-bugs, bparees, dkulkarn, jcantril, nhosoi, rmeggins
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1777477 (view as bug list) Environment:
Last Closed: 2020-05-04 11:14:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1777477    
Attachments:
The logs and resource out (flags: none)

Comment 2 Qiaoling Tang 2019-11-06 00:36:14 UTC
HTTP_PROXY was enabled in the test cluster, and there was no trustedCA in the proxy.

Regarding the trustedCA, it seems this field is always present in the proxy resource. I checked a cluster without any proxy enabled, and it still has a trustedCA field.

$ oc get proxy cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2019-11-05T23:23:02Z"
  generation: 1
  name: cluster
  resourceVersion: "433"
  selfLink: /apis/config.openshift.io/v1/proxies/cluster
  uid: d47d9096-65c2-409c-ba94-989a034d6caf
spec:
  trustedCA:
    name: ""
status: {}
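
For reference, trustedCA.name is empty here; when it is set, it points at a CA-bundle configmap in the openshift-config namespace. A quick way to check whether such a bundle is actually populated (the configmap name below is only an example) might be:

$ oc -n openshift-config get configmap user-ca-bundle -o yaml | head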

Comment 8 Anping Li 2019-11-25 12:53:43 UTC
Fluentd could connect to Elasticsearch within the expected number of retries. However, I didn't see the HTTP_PROXY env vars set in the fluentd pod, and I didn't see the configmap fluentd-trusted-ca-bundle mounted into the pod.

[anli@preserve-docker-slave anli43gcp]$ oc exec fluentd-p4jlb -- logs
2019-11-25 12:40:21 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:21 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 14. Retry to communicate after 2 second(s).
2019-11-25 12:40:26 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:26 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 13. Retry to communicate after 4 second(s).
2019-11-25 12:40:35 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:35 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 12. Retry to communicate after 8 second(s).
2019-11-25 12:40:52 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:52 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 11. Retry to communicate after 16 second(s).
2019-11-25 12:42:13 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-cpc-host-port-943a07f12db859b475b8cab9138dd89ee45fa3c7471e1d565b0a09c71eab9036.log:default" location=nil tag="kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-cpc-host-port-943a07f12db859b475b8cab9138dd89ee45fa3c7471e1d565b0a09c71eab9036.log" time=2019-11-25 12:42:13.066705727 +0000 record={"time"=>"2019-11-25T12:16:22.288733868+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Waiting for port :10357 to be released."}
2019-11-25 12:42:14 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.kube-apiserver-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-apiserver_setup-061062f41b239d2501c350c533ef9b88302d061d70f999092c54a9cceb520d55.log:default" location=nil tag="kubernetes.var.log.containers.kube-apiserver-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-apiserver_setup-061062f41b239d2501c350c533ef9b88302d061d70f999092c54a9cceb520d55.log" time=2019-11-25 12:42:14.065368339 +0000 record={"time"=>"2019-11-25T12:29:16.346910916+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Fixing audit permissions.\\'\\'Waiting for port :6443 and :6080 to be released."}
2019-11-25 12:42:14 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.openshift-kube-scheduler-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-scheduler_wait-for-host-port-bf98d93bd2c648a0d4e4a2fc6ff3711ea7d62c1cb06e666ce9ffd998adb860f7.log:default" location=nil tag="kubernetes.var.log.containers.openshift-kube-scheduler-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-scheduler_wait-for-host-port-bf98d93bd2c648a0d4e4a2fc6ff3711ea7d62c1cb06e666ce9ffd998adb860f7.log" time=2019-11-25 12:42:14.065368339 +0000 record={"time"=>"2019-11-25T12:15:23.698894271+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Waiting for port :10259 and :10251 to be released."}
2019-11-25 12:42:14 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-host-port-c411e5e2c2fa1ded3e330a22829d69cb5ccbf0629c92b72ce95bb9255b7d58cd.log:default" location=nil tag="kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-host-port-c411e5e2c2fa1ded3e330a22829d69cb5ccbf0629c92b72ce95bb9255b7d58cd.log" time=2019-11-25 12:42:14.065368339 +0000 record={"time"=>"2019-11-25T12:15:34.254766719+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Waiting for port :10257 to be released."}
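
For completeness, the checks behind the observations above could look roughly like this (pod name taken from this cluster, namespace assumed to be openshift-logging):

$ oc -n openshift-logging exec fluentd-p4jlb -- env | grep -i proxy
$ oc -n openshift-logging get pod fluentd-p4jlb -o jsonpath='{.spec.volumes[*].name}' | tr ' ' '\n' | grep trusted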

Comment 9 Anping Li 2019-11-25 12:55:48 UTC
Created attachment 1639470 [details]
The logs and resource out

Comment 10 Anping Li 2019-11-25 13:39:13 UTC
When should we configure HTTP_PROXY and mount the configmap fluentd-trusted-ca-bundle?
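
For context, once the operator wires this up, I would expect to see the proxy env vars and a fluentd-trusted-ca-bundle volume on the fluentd daemonset; a sketch of how to check (daemonset name assumed):

$ oc -n openshift-logging get ds fluentd -o jsonpath='{.spec.template.spec.containers[0].env[*].name}'
$ oc -n openshift-logging get ds fluentd -o jsonpath='{.spec.template.spec.volumes[*].name}' | tr ' ' '\n' | grep trusted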

Comment 12 Ben Parees 2019-11-25 20:11:26 UTC
> 2019-11-25 12:40:21 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)

This connection should not be going through the proxy. There should be a NO_PROXY env var set on the fluentd pods that includes the 172.30.0.0/16 network.

1) Need to confirm this is getting set on the fluentd pods correctly (the CL operator gets the value from the proxy config resource, so you can also confirm the proxy config resource on the cluster has the correct value in status.noProxy)

2) Need to confirm the fluentd transport libraries support CIDR format noproxy values.  If they do not, we are going to need to find another way to ensure the requests from fluentd to elastic are not proxied.


The fact that you didn't see an HTTP_PROXY var or configmap mounted is definitely a problem though.  Can you share the proxy config resource from your cluster?  (oc get proxy -o yaml)
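
A sketch of the two checks above (resource names as observed on this cluster; adjust as needed):

$ oc get proxy cluster -o jsonpath='{.status.noProxy}'
$ oc -n openshift-logging exec fluentd-p4jlb -- sh -c 'echo $NO_PROXY'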

Comment 13 Anping Li 2019-11-26 01:50:49 UTC
The proxy is as follows. By the way, the proxy env vars have been configured in the CLO.

[anli@preserve-docker-slave anli43gcp]$ oc get proxy -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Proxy
  metadata:
    creationTimestamp: "2019-11-25T11:57:59Z"
    generation: 1
    name: cluster
    resourceVersion: "431"
    selfLink: /apis/config.openshift.io/v1/proxies/cluster
    uid: a5f9e2c6-d55f-4deb-96a4-6995fe1cb6a3
  spec:
    httpProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    httpsProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    trustedCA:
      name: ""
  status:
    httpProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    httpsProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    noProxy: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.qeanlip2.qe.gcp.devcluster.openshift.com,etcd-0.qeanlip2.qe.gcp.devcluster.openshift.com,etcd-1.qeanlip2.qe.gcp.devcluster.openshift.com,etcd-2.qeanlip2.qe.gcp.devcluster.openshift.com,localhost,metadata,metadata.google.internal,metadata.google.internal.
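
For reference, the CLO proxy env vars mentioned above could be confirmed with something like (deployment name assumed):

$ oc -n openshift-logging get deployment cluster-logging-operator -o jsonpath='{.spec.template.spec.containers[0].env[*].name}'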

Comment 18 Qiaoling Tang 2019-12-09 03:03:58 UTC
Verified with ose-cluster-logging-operator-v4.4.0-201912080435

Comment 21 errata-xmlrpc 2020-05-04 11:14:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581