Bug 1768762 - Fluentd: "Could not communicate to Elasticsearch" when http proxy enabled in the cluster.
Summary: Fluentd: "Could not communicate to Elasticsearch" when http proxy enabled in the cluster.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1777477
 
Reported: 2019-11-05 08:22 UTC by Qiaoling Tang
Modified: 2023-09-07 20:55 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1777477
Environment:
Last Closed: 2020-05-04 11:14:43 UTC
Target Upstream Version:
Embargoed:


Attachments
The logs and resource out (210.00 KB, application/x-tar), 2019-11-25 12:55 UTC, Anping Li


Links
GitHub openshift/cluster-logging-operator pull 255 (closed): Bug 1752725: Log into kibana console get `504 Gateway Time-out The server didn't respond in time. ` when http_proxy enab... (last updated 2021-02-18 09:25:16 UTC)
GitHub openshift/cluster-logging-operator pull 305 (closed): Bug 1768762: Persisting proxy config and trustbundle CM info (last updated 2021-02-18 09:25:16 UTC)
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:15:11 UTC)

Comment 2 Qiaoling Tang 2019-11-06 00:36:14 UTC
The test cluster had HTTP_PROXY enabled, and there was no trustedCA in the proxy.

About the trustedCA: it seems this field is always present in the proxy resource. I checked a cluster without any proxy enabled, and it still has a trustedCA field (see the one-liner after the YAML below).

$ oc get proxy cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2019-11-05T23:23:02Z"
  generation: 1
  name: cluster
  resourceVersion: "433"
  selfLink: /apis/config.openshift.io/v1/proxies/cluster
  uid: d47d9096-65c2-409c-ba94-989a034d6caf
spec:
  trustedCA:
    name: ""
status: {}
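
For reference, the same field can be read without dumping the whole object; a minimal sketch using oc's jsonpath output (it prints an empty line when trustedCA.name is the empty string):

$ oc get proxy cluster -o jsonpath='{.spec.trustedCA.name}{"\n"}'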

Comment 8 Anping Li 2019-11-25 12:53:43 UTC
Fluentd could connect to Elasticsearch within the expected number of retries. However, I didn't see the HTTP_PROXY env vars set in the fluentd pod, and I didn't see the configmap fluentd-trusted-ca-bundle mounted into the pod (checked as sketched after the log output below).

[anli@preserve-docker-slave anli43gcp]$ oc exec fluentd-p4jlb -- logs
2019-11-25 12:40:21 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:21 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 14. Retry to communicate after 2 second(s).
2019-11-25 12:40:26 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:26 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 13. Retry to communicate after 4 second(s).
2019-11-25 12:40:35 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:35 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 12. Retry to communicate after 8 second(s).
2019-11-25 12:40:52 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:52 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 11. Retry to communicate after 16 second(s).
2019-11-25 12:42:13 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-cpc-host-port-943a07f12db859b475b8cab9138dd89ee45fa3c7471e1d565b0a09c71eab9036.log:default" location=nil tag="kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-cpc-host-port-943a07f12db859b475b8cab9138dd89ee45fa3c7471e1d565b0a09c71eab9036.log" time=2019-11-25 12:42:13.066705727 +0000 record={"time"=>"2019-11-25T12:16:22.288733868+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Waiting for port :10357 to be released."}
2019-11-25 12:42:14 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.kube-apiserver-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-apiserver_setup-061062f41b239d2501c350c533ef9b88302d061d70f999092c54a9cceb520d55.log:default" location=nil tag="kubernetes.var.log.containers.kube-apiserver-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-apiserver_setup-061062f41b239d2501c350c533ef9b88302d061d70f999092c54a9cceb520d55.log" time=2019-11-25 12:42:14.065368339 +0000 record={"time"=>"2019-11-25T12:29:16.346910916+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Fixing audit permissions.\\'\\'Waiting for port :6443 and :6080 to be released."}
2019-11-25 12:42:14 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.openshift-kube-scheduler-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-scheduler_wait-for-host-port-bf98d93bd2c648a0d4e4a2fc6ff3711ea7d62c1cb06e666ce9ffd998adb860f7.log:default" location=nil tag="kubernetes.var.log.containers.openshift-kube-scheduler-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-scheduler_wait-for-host-port-bf98d93bd2c648a0d4e4a2fc6ff3711ea7d62c1cb06e666ce9ffd998adb860f7.log" time=2019-11-25 12:42:14.065368339 +0000 record={"time"=>"2019-11-25T12:15:23.698894271+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Waiting for port :10259 and :10251 to be released."}
2019-11-25 12:42:14 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-host-port-c411e5e2c2fa1ded3e330a22829d69cb5ccbf0629c92b72ce95bb9255b7d58cd.log:default" location=nil tag="kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-host-port-c411e5e2c2fa1ded3e330a22829d69cb5ccbf0629c92b72ce95bb9255b7d58cd.log" time=2019-11-25 12:42:14.065368339 +0000 record={"time"=>"2019-11-25T12:15:34.254766719+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Waiting for port :10257 to be released."}
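
For reference, a minimal sketch of how one might check both points from the pod spec (it assumes the openshift-logging namespace and reuses the pod name above):

# Proxy env vars in the container spec:
$ oc -n openshift-logging get pod fluentd-p4jlb -o jsonpath='{range .spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}'

# Configmaps mounted as volumes (fluentd-trusted-ca-bundle would show up here if mounted):
$ oc -n openshift-logging get pod fluentd-p4jlb -o jsonpath='{.spec.volumes[*].configMap.name}{"\n"}'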

Comment 9 Anping Li 2019-11-25 12:55:48 UTC
Created attachment 1639470 [details]
The logs and resource out

Comment 10 Anping Li 2019-11-25 13:39:13 UTC
When should we configure HTTP_PROXY and mount the configmap fluentd-trusted-ca-bundle?

Comment 12 Ben Parees 2019-11-25 20:11:26 UTC
> 2019-11-25 12:40:21 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)

This connection should not be going through the proxy. There should be a NO_PROXY env var set on the fluentd pods that includes the 172.30.0.0/16 network.

1) Need to confirm this is getting set on the fluentd pods correctly (the CL operator gets the value from the proxy config resource, so you can also confirm the proxy config resource on the cluster has the correct value in status.noProxy); see the checks sketched after this list.

2) Need to confirm the fluentd transport libraries support CIDR-format no_proxy values. If they do not, we are going to need to find another way to ensure the requests from fluentd to Elasticsearch are not proxied.
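
For 1), a quick sketch of the checks (it assumes the fluentd daemonset name and the openshift-logging namespace used by cluster logging; adjust for your cluster):

# Proxy-related env vars the daemonset injects into its pods:
$ oc -n openshift-logging set env ds/fluentd --list | grep -i proxy

# The noProxy value the operator should be reading:
$ oc get proxy cluster -o jsonpath='{.status.noProxy}{"\n"}'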


The fact that you didn't see an HTTP_PROXY var or configmap mounted is definitely a problem, though. Can you share the proxy config resource from your cluster? (oc get proxy -o yaml)

Comment 13 Anping Li 2019-11-26 01:50:49 UTC
The proxy config is as follows. By the way, the proxy env vars have been configured in the CLO.

[anli@preserve-docker-slave anli43gcp]$ oc get proxy -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Proxy
  metadata:
    creationTimestamp: "2019-11-25T11:57:59Z"
    generation: 1
    name: cluster
    resourceVersion: "431"
    selfLink: /apis/config.openshift.io/v1/proxies/cluster
    uid: a5f9e2c6-d55f-4deb-96a4-6995fe1cb6a3
  spec:
    httpProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    httpsProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    trustedCA:
      name: ""
  status:
    httpProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    httpsProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    noProxy: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.qeanlip2.qe.gcp.devcluster.openshift.com,etcd-0.qeanlip2.qe.gcp.devcluster.openshift.com,etcd-1.qeanlip2.qe.gcp.devcluster.openshift.com,etcd-2.qeanlip2.qe.gcp.devcluster.openshift.com,localhost,metadata,metadata.google.internal,metadata.google.internal.
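
Following up on point 2) from comment 12, one way to probe whether the Ruby runtime in the fluentd image honors CIDR entries in no_proxy is to ask the stdlib URI#find_proxy directly. This is only a sketch: it assumes ruby is on the image's PATH, CIDR matching depends on the Ruby version, and the Elasticsearch transport gem may resolve proxies differently from the stdlib (pod name reused from comment 8).

# nil means the 172.30.0.0/16 entry matched and the request would bypass the proxy:
$ oc -n openshift-logging exec fluentd-p4jlb -- ruby -ruri -e 'ENV["http_proxy"]="http://10.0.0.2:3128"; ENV["no_proxy"]="172.30.0.0/16"; p URI("http://172.30.73.24:9200").find_proxy'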

Comment 18 Qiaoling Tang 2019-12-09 03:03:58 UTC
Verified with ose-cluster-logging-operator-v4.4.0-201912080435

Comment 21 errata-xmlrpc 2020-05-04 11:14:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

