Bug 1768762
| Summary: | Fluentd: "Could not communicate to Elasticsearch" when http proxy enabled in the cluster. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qiaoling Tang <qitang> |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | aos-bugs, bparees, dkulkarn, jcantril, nhosoi, rmeggins |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1777477 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-04 11:14:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1777477 | | |
| Attachments: | | | |
Fluentd could not connect to Elasticsearch within the expected number of retries. I didn't see the HTTP_PROXY env vars set in the fluentd pod, and the configmap fluentd-trusted-ca-bundle was not mounted into the pod.
[anli@preserve-docker-slave anli43gcp]$ oc exec fluentd-p4jlb -- logs
2019-11-25 12:40:21 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:21 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 14. Retry to communicate after 2 second(s).
2019-11-25 12:40:26 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:26 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 13. Retry to communicate after 4 second(s).
2019-11-25 12:40:35 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:35 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 12. Retry to communicate after 8 second(s).
2019-11-25 12:40:52 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
2019-11-25 12:40:52 +0000 [warn]: [retry_clo_default_output_es] Remaining retry: 11. Retry to communicate after 16 second(s).
2019-11-25 12:42:13 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-cpc-host-port-943a07f12db859b475b8cab9138dd89ee45fa3c7471e1d565b0a09c71eab9036.log:default" location=nil tag="kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-cpc-host-port-943a07f12db859b475b8cab9138dd89ee45fa3c7471e1d565b0a09c71eab9036.log" time=2019-11-25 12:42:13.066705727 +0000 record={"time"=>"2019-11-25T12:16:22.288733868+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Waiting for port :10357 to be released."}
2019-11-25 12:42:14 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.kube-apiserver-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-apiserver_setup-061062f41b239d2501c350c533ef9b88302d061d70f999092c54a9cceb520d55.log:default" location=nil tag="kubernetes.var.log.containers.kube-apiserver-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-apiserver_setup-061062f41b239d2501c350c533ef9b88302d061d70f999092c54a9cceb520d55.log" time=2019-11-25 12:42:14.065368339 +0000 record={"time"=>"2019-11-25T12:29:16.346910916+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Fixing audit permissions.\\'\\'Waiting for port :6443 and :6080 to be released."}
2019-11-25 12:42:14 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.openshift-kube-scheduler-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-scheduler_wait-for-host-port-bf98d93bd2c648a0d4e4a2fc6ff3711ea7d62c1cb06e666ce9ffd998adb860f7.log:default" location=nil tag="kubernetes.var.log.containers.openshift-kube-scheduler-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-scheduler_wait-for-host-port-bf98d93bd2c648a0d4e4a2fc6ff3711ea7d62c1cb06e666ce9ffd998adb860f7.log" time=2019-11-25 12:42:14.065368339 +0000 record={"time"=>"2019-11-25T12:15:23.698894271+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Waiting for port :10259 and :10251 to be released."}
2019-11-25 12:42:14 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-host-port-c411e5e2c2fa1ded3e330a22829d69cb5ccbf0629c92b72ce95bb9255b7d58cd.log:default" location=nil tag="kubernetes.var.log.containers.kube-controller-manager-qeanli-zg8tp-m-1.c.openshift-qe.internal_openshift-kube-controller-manager_wait-for-host-port-c411e5e2c2fa1ded3e330a22829d69cb5ccbf0629c92b72ce95bb9255b7d58cd.log" time=2019-11-25 12:42:14.065368339 +0000 record={"time"=>"2019-11-25T12:15:34.254766719+00:00", "stream"=>"stdout", "logtag"=>"P", "log"=>"Waiting for port :10257 to be released."}
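One way to double-check the missing HTTP_PROXY env vars and the fluentd-trusted-ca-bundle mount mentioned above (a minimal sketch: the pod name is taken from the transcript, while the openshift-logging namespace and the jsonpath queries are assumptions, not from this report):
$ oc -n openshift-logging get pod fluentd-p4jlb -o jsonpath='{.spec.containers[0].env}{"\n"}'
$ oc -n openshift-logging get pod fluentd-p4jlb -o jsonpath='{range .spec.volumes[*]}{.configMap.name}{"\n"}{end}' | grep trusted-ca-bundle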
Created attachment 1639470 [details]
The logs and resource output
When should we configure HTTP_PROXY and mount the configmap fluentd-trusted-ca-bundle?

> 2019-11-25 12:40:21 +0000 [warn]: [retry_clo_default_output_es] Could not communicate to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) for 172.30.73.24:9200 (Errno::ECONNREFUSED)
this connection should not be going through the proxy. there should be a NO_PROXY env var set on the fluentd pods that includes the 172.30.0.0/16 network.
1) Need to confirm this is getting set on the fluentd pods correctly (the CL operator gets the value from the proxy config resource, so you can also confirm the proxy config resource on the cluster has the correct value in status.noProxy)
2) Need to confirm the fluentd transport libraries support CIDR format noproxy values. If they do not, we are going to need to find another way to ensure the requests from fluentd to elastic are not proxied.
The fact that you didn't see an HTTP_PROXY var or configmap mounted is definitely a problem though. Can you share the proxy config resource from your cluster? (oc get proxy -o yaml)
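For points 1) and 2) above, a quick way to compare the cluster's noProxy with what the fluentd pods actually carry could look like this (a sketch only; the label selector and namespace are assumptions based on a default logging deployment, not taken from this report):
$ oc get proxy/cluster -o jsonpath='{.status.noProxy}{"\n"}'
$ oc -n openshift-logging get pods -l component=fluentd -o jsonpath='{.items[0].spec.containers[0].env[?(@.name=="NO_PROXY")].value}{"\n"}'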
The proxy config is as follows. By the way, the proxy env vars have been configured in the CLO.
[anli@preserve-docker-slave anli43gcp]$ oc get proxy -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Proxy
  metadata:
    creationTimestamp: "2019-11-25T11:57:59Z"
    generation: 1
    name: cluster
    resourceVersion: "431"
    selfLink: /apis/config.openshift.io/v1/proxies/cluster
    uid: a5f9e2c6-d55f-4deb-96a4-6995fe1cb6a3
  spec:
    httpProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    httpsProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    trustedCA:
      name: ""
  status:
    httpProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    httpsProxy: http://proxy-user1:JYgU8qRZV4DY4PXJbxJK@10.0.0.2:3128
    noProxy: .cluster.local,.svc,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.qeanlip2.qe.gcp.devcluster.openshift.com,etcd-0.qeanlip2.qe.gcp.devcluster.openshift.com,etcd-1.qeanlip2.qe.gcp.devcluster.openshift.com,etcd-2.qeanlip2.qe.gcp.devcluster.openshift.com,localhost,metadata,metadata.google.internal,metadata.google.internal.
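As a cross-check of the note above that the proxy env vars reached the CLO but not fluentd, something like the following could show the difference (a sketch; the deployment/daemonset names and namespace are assumed from a default 4.x logging install, not confirmed in this report):
$ oc -n openshift-logging set env deployment/cluster-logging-operator --list | grep -i proxy
$ oc -n openshift-logging set env ds/fluentd --list | grep -i proxy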
Verified with ose-cluster-logging-operator-v4.4.0-201912080435

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
The test cluster had HTTP_PROXY enabled, and there was no trustedCA in the proxy. Regarding the trustedCA: it seems this field is always present in the proxy resource. I checked a cluster without any proxy enabled, and it still has a trustedCA field.

$ oc get proxy cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2019-11-05T23:23:02Z"
  generation: 1
  name: cluster
  resourceVersion: "433"
  selfLink: /apis/config.openshift.io/v1/proxies/cluster
  uid: d47d9096-65c2-409c-ba94-989a034d6caf
spec:
  trustedCA:
    name: ""
status: {}
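A quick way to look at just that field on any cluster, proxy-enabled or not, might be (the jsonpath query is an assumption about how one could check, not taken from this report):
$ oc get proxy cluster -o jsonpath='{.spec.trustedCA.name}{"\n"}'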