Bug 1915448

Summary: Collector blocked pushing logs to internal storage because of global proxy settings
Product: OpenShift Container Platform
Reporter: Apurva Nisal <anisal>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED NEXTRELEASE
QA Contact: Anping Li <anli>
Severity: high
Priority: medium
Version: 4.5
CC: aos-bugs, bleanhar, cpassare, igreen, jcantril, lmartinh, mmohan, nnosenzo, npentche, ocasalsa, puraut, stwalter, ykarajag
Target Milestone: ---
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: logging-core
Doc Type: Bug Fix
Doc Text:
* If you did not set `.proxy` in the cluster installation configuration, and then configured a global proxy on the installed cluster, a bug prevented Fluentd from forwarding logs to Elasticsearch. To work around this issue, in the proxy/cluster configuration, set `no_proxy` to `.svc.cluster.local` so it skips internal traffic. The current release fixes the proxy configuration issue. Now, if you configure the global proxy after installing an OpenShift cluster, Fluentd forwards logs to Elasticsearch. (*LOG-1187*, *BZ#1915448*)
Last Closed: 2021-02-12 20:31:57 UTC
Type: Bug

Comment 3 Jeff Cantrill 2021-01-12 20:41:10 UTC
What happens when you bounce the fluentd pods (e.g. oc delete pods -l component=fluentd)?  Do the logs begin to flow again?  There is [1], which reduces the backoff time when logs cannot be pushed, but it has not landed in 4.5 yet.  Research into the issue shows some indicators that this is caused by a disconnect at the remote end.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1894640
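
For reference, a minimal sketch of that bounce-and-check, assuming the default openshift-logging namespace and the component=fluentd label:

~~~
# Restart the collector pods (assumes the default openshift-logging namespace)
oc -n openshift-logging delete pods -l component=fluentd

# Watch the replacements come up, then tail their logs for retry/EOF messages
oc -n openshift-logging get pods -l component=fluentd -w
oc -n openshift-logging logs -l component=fluentd --tail=50
~~~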

Comment 8 Jeff Cantrill 2021-01-14 18:21:16 UTC
The artifacts from https://bugzilla.redhat.com/show_bug.cgi?id=1915448#c2 suggest the global proxy is enabled.  I cannot find the reference now, but I thought I saw indicators of 403 responses in the artifacts, as if the traffic is going through the proxy even though it should not.  Is it possible to remove the global proxy config from the cluster to see if the issue goes away and further isolate the problem?  Otherwise the default ES store looks healthy, with plenty of CPU and RAM.

The artifacts from https://bugzilla.redhat.com/show_bug.cgi?id=1915448#c1 show similar issues, but I am unable to verify whether the global proxy is configured.  There are indicators of network issues, but I cannot determine whether those are transient, perhaps from the ES nodes being upgraded or rolled by the EO.

The EOFError is a Ruby IO error in the network stack, which leads me to believe the connection is being dropped on the store end.  Some of these artifacts show indicators of high GC, but again CPU and memory seem acceptable, especially given the number of shards is low.  Can you describe the write characteristics of the attached volumes?  Maybe the PVs have poor write performance, or the customer is using an NFS volume?
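
For reference, a quick way to see what backs the ES data volumes, assuming the default namespace; the output should show the storage class and provisioner in use:

~~~
# List the PVCs used by Elasticsearch and the storage classes behind them
oc -n openshift-logging get pvc -o wide
oc get storageclass
~~~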

Comment 11 Yogiraj Karajagi 2021-01-15 11:21:53 UTC
(In reply to Jeff Cantrill from comment #9)
> The following may also assist you in testing connectivity from fluent to ES:
> https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-fluentd-to-es-connectivity
> or https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-kibana-to-es-connectivity from kibana to ES

Hello Jeff, 

I have asked the customer to send the outputs and will update the Bugzilla once I receive them.
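
For reference, a rough manual equivalent of the connectivity checks linked in comment 9, assuming curl is available in the collector image and the default namespace/labels; even a TLS or authorization error here at least confirms the service is reachable, whereas a timeout or a proxy response points at the traffic being proxied:

~~~
# From a collector pod, check basic reachability of the ES service
POD=$(oc -n openshift-logging get pods -l component=fluentd \
  -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-logging exec $POD -- \
  curl -kvs --max-time 10 https://elasticsearch.openshift-logging.svc:9200/
~~~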

Comment 14 Nicolas Nosenzo 2021-01-21 12:46:14 UTC
Some notes,
The Elasticsearch proxy container complains about the following:

2021-01-13T07:21:32.491109601Z 2021/01/13 07:21:32 http: TLS handshake error from 100.127.10.1:46620: remote error: tls: unknown certificate authority

Jeff, could the fact that the injected CA is formatted with Windows line endings be what is causing the EOFError?

Comment 17 Jeff Cantrill 2021-01-27 21:02:05 UTC
(In reply to Nicolas Nosenzo from comment #15)
> E.g. CU ca:
> 
> 
> ...
>   ca-bundle.crt: "-----BEGIN

Who is providing this CA? If it is the customer, might they try removing those EOL characters and see if that resolves the issue?
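
A minimal sketch of how that could be checked and cleaned up; the secret name and key below are assumptions and may differ in this environment:

~~~
# Count lines in the CA bundle that end with a carriage return (CRLF)
# (secret name/key are assumptions; adjust to wherever the CA bundle is stored)
oc -n openshift-logging get secret fluentd \
  -o jsonpath='{.data.ca-bundle\.crt}' | base64 -d | grep -c $'\r'

# Strip the CR characters from a local copy before re-injecting it
sed -i 's/\r$//' ca-bundle.crt
~~~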

Comment 18 Nicolas Nosenzo 2021-01-28 13:59:28 UTC
(In reply to Jeff Cantrill from comment #17)
> (In reply to Nicolas Nosenzo from comment #15)
> > E.g. CU ca:
> > 
> > 
> > ...
> >   ca-bundle.crt: "-----BEGIN
> 
> Who is providing this CA? If it is the customer, might they try removing
> those EOL characters and see if that resolves the issue?

Actually, I realized they're using the same CA in other deployments running the same version without any problem. Also, it seems some of the logs are indeed being delivered.

That said, looking at the Elasticsearch Operator logs, I found some timeouts when trying to reach the ES service, followed by EOF errors:

~~~
time="2021-01-27T12:41:46Z" level=warning msg="Unable to clear transient shard allocation for \"openshift-logging\" \"openshift-logging\": Response: , Error: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: dial tcp 100.126.136.53:9200: i/o timeout"
time="2021-01-27T12:44:28Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: EOF\r\n"
~~~

I'm wondering whether those EOF errors are the same ones we see in the fluentd pod. There is a chance we are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1908707

I will try to confirm this.
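
For reference, the same health query can be run from inside an ES pod to take fluentd and the operator out of the picture; the pod label and the es_util helper are assumptions based on a default deployment:

~~~
# Query cluster health directly from an Elasticsearch pod
POD=$(oc -n openshift-logging get pods -l component=elasticsearch \
  -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-logging exec -c elasticsearch $POD -- \
  es_util --query=_cluster/health?pretty
~~~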

Comment 19 Jeff Cantrill 2021-01-28 17:31:30 UTC
@Nicolas I see you are also involved in https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32, which I suspect may be the same issue.  Are you able to confirm whether the EOF issues disappear when you revert the proxy image?

Comment 20 Nicolas Nosenzo 2021-01-28 18:07:12 UTC
(In reply to Jeff Cantrill from comment #19)
> @Nicolas I see you are also involved in
> https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32  which I suspect may
> be the same issue.  Are you able to confirm if the EOF issues disappeared or
> do disappear when you revert the proxy image?

@Jeff, I have yet to confirm that; https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32 was referring to another case/environment. I've asked them to switch back to the elasticsearch-proxy image from before the offending code change and will let you know how it goes.

Comment 21 Jeff Cantrill 2021-02-03 19:37:08 UTC
Nicolas,

Any updates? The changes to the proxy that address https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32 have merged and are pending release. Additionally, if we think the timeouts are still an issue, it is possible to go "unmanaged" and adjust them by hand.
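
For reference, a minimal sketch of switching the stack to unmanaged so manual tweaks are not reverted; the resource names are the defaults and may differ:

~~~
# Stop the operator from reconciling the logging stack so timeouts can be adjusted manually
oc -n openshift-logging patch clusterlogging/instance --type=merge \
  -p '{"spec":{"managementState":"Unmanaged"}}'
~~~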

Comment 22 Nicolas Nosenzo 2021-02-03 21:28:46 UTC
Jeff,
We tried setting the elasticsearch-proxy back to the latest image version from before the hardcoded timeout was introduced, but that was not the real problem. The CLO kept complaining about connection refused attempts.

It seems we finally managed to solve it by adding 'elasticsearch' to the no_proxy value of the cluster-wide proxy, which seems a bit odd, to be honest; I thought only the service FQDN was in use, although this is the second time I have run into this scenario.
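
A quick way to inspect what the cluster-wide proxy currently exposes, assuming the standard proxy/cluster object; the status field shows the effective list handed to components:

~~~
# Show the user-set and the effective (status) no_proxy lists
oc get proxy/cluster -o jsonpath='{.spec.noProxy}{"\n"}{.status.noProxy}{"\n"}'
~~~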

Comment 24 Jeff Cantrill 2021-02-05 17:34:15 UTC
(In reply to Nicolas Nosenzo from comment #22)
> jeff, 
> We tried setting back the elasticsearch-proxy to the latest image version
> before the hardcoded timeout was introduced, but it was not the real
> problem. The CLO kept complaining about connection refused attempts. 

You mean fluentd kept complaining that it was unable to connect?

> 
> Seems we finally managed to solve it by adding 'elasticsearch' to the
> no_proxy value of the cluster-wide proxy, which seems a bit weird tbh, I
> thought only the service FQDN was in use, although this is the second time I
> found the same scenario.

What is the exact value you added to no_proxy? Was it 'elasticsearch'?

Comment 25 Jeff Cantrill 2021-02-05 21:45:21 UTC
Some investigation around connecting to managed ES and the 'no_proxy' setting leads me to the following commentary, reproduced here from Slack:


* likely broken by https://github.com/openshift/cluster-logging-operator/pull/749 which enabled typhoeus as a backend
* faraday is a wrapper around several HTTP client libs; this is what the ES transport library uses.
* https://github.com/typhoeus/typhoeus/issues/528 claims the no_proxy setting is not honored and was never implemented because httpclient supports it.  
* latest version of https://github.com/uken/fluent-plugin-elasticsearch/blob/master/lib/fluent/plugin/out_elasticsearch.rb#L154 does not support it as a backend

The workaround is to add 'elasticsearch' to no_proxy, as advised in https://bugzilla.redhat.com/show_bug.cgi?id=1915448#c22

Comment 26 Jeff Cantrill 2021-02-12 18:20:42 UTC
Following up on #c25: that information is immaterial.  Tested by:

* Deploy logging with ephemeral ES
* Set unmanaged
* Update ds/fluentd to set the HTTPS_PROXY var
* Delete the ES pod and verify no documents are indexed
* Update ds/fluentd so NO_PROXY includes the domain ".svc.cluster.local"
* Delete the ES pod and verify documents are indexed.

This confirms that the proxy settings work at the component level.  Following up with the installer team to see if we should expect "no_proxy" to be auto-populated with the internal domain for internal traffic.
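
A rough sketch of the test steps above, assuming the collector daemonset is named fluentd, the operators are unmanaged, and proxy.example.com is a placeholder:

~~~
# Point fluentd at a proxy (the set env change rolls the daemonset's pods)
oc -n openshift-logging set env ds/fluentd HTTPS_PROXY=http://proxy.example.com:3128
# Delete the ephemeral ES pod and verify no new documents get indexed
oc -n openshift-logging delete pods -l component=elasticsearch

# Exempt internal service traffic and repeat; documents should be indexed again
oc -n openshift-logging set env ds/fluentd NO_PROXY=.svc.cluster.local
oc -n openshift-logging delete pods -l component=elasticsearch
~~~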

Comment 28 Jeff Cantrill 2021-02-12 20:26:59 UTC
Updated the title to reflect the actual cause.  Users can work around this on earlier releases by updating the global proxy config so that NO_PROXY covers internal traffic.
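
For earlier releases, a hedged sketch of that workaround on the cluster-wide proxy; note that a merge patch replaces spec.noProxy, so any existing entries need to be preserved in the value:

~~~
# Exempt internal service traffic from the global proxy
# (merge any existing spec.noProxy entries into this value rather than overwriting them)
oc patch proxy/cluster --type=merge -p '{"spec":{"noProxy":".svc.cluster.local"}}'
~~~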

Comment 29 Jeff Cantrill 2021-02-12 20:31:57 UTC
Closing in favor of https://issues.redhat.com/browse/LOG-1125, which will be the issue tracker for logging 5.0 and later.

Comment 30 Yogiraj Karajagi 2021-02-15 10:24:26 UTC
Hello Jeff, 

In my case, the customer is not using any global proxy, yet they are still facing the issue.

Comment 33 Yogiraj Karajagi 2021-02-18 15:11:28 UTC
(In reply to Yogiraj Karajagi from comment #30)
> Hello Jeff, 
> 
> In my case, the Customer is not using any global proxy. Still, he's facing
> the issue.

Hello Jeff, 

Let me know your thoughts on it. Waiting for your response.

Comment 34 Nicolas Nosenzo 2021-02-19 12:44:57 UTC
Jeff, the issue reported here ("end of file reached (EOFError)") is still there even after applying the following workarounds:

- Removed 'http_backend typhoeus' as per https://bugzilla.redhat.com/show_bug.cgi?id=1930360
- ES-proxy image was set to ose-elasticsearch-proxy:v4.5.0-202012120433.p0 as per https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32
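
One way to confirm whether the rendered collector config still selects the typhoeus backend; the configmap name and key are assumptions for this release:

~~~
# Look for the http_backend setting in the generated fluentd configuration
# (configmap name/key are assumptions; adjust if the config lives elsewhere)
oc -n openshift-logging get configmap fluentd \
  -o jsonpath='{.data.fluent\.conf}' | grep -n 'http_backend'
~~~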