What happens when you bounce the fluent pods (e.g. oc delete pods -l component=fluentd)? Do the logs begin to flow again? There is a pending change which reduces the backoff time for when logs are unable to be pushed, but it has not yet landed in 4.5. Researching the issue, there are some indicators this is caused by a disconnect at the remote end.
The artifacts from https://bugzilla.redhat.com/show_bug.cgi?id=1915448#c2 look like the global proxy is enabled. I am unable to find the reference, but I thought I saw in the artifacts indicators of 403s, as if the traffic is going through the proxy even though it should not. Is it possible to remove the global proxy config from the cluster, to see if this issue goes away and further isolate the problem? The default ES store otherwise looks healthy, with plenty of CPU and RAM.
The artifacts from https://bugzilla.redhat.com/show_bug.cgi?id=1915448#c1 show similar issues, but I am unable to verify whether the global proxy is configured. There are indicators of network issues, but I cannot determine whether those are transient, perhaps related to the ES nodes being upgraded or rolled by the EO.
The EOFError is a Ruby IO error in the network stack, which leads me to believe the connection is being dropped on the store end. Some of these artifacts have indicators of high GC, but again the CPU and memory seem acceptable, especially given that the number of shards is low. Can you describe the write characteristics of the attached volumes? Maybe the PVs have poor write performance, or the customer is using an NFS volume?
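A rough way to gauge raw write throughput on the ES data mount (e.g. from a debug shell inside the pod) is to time a synced sequential write. This is only a sketch, not a substitute for a proper fio run, and the probe path shown in the comment is a placeholder:

```python
import os
import time

def write_throughput_mb_s(path, total_mb=64, block_kb=256):
    """Time a synced sequential write and return an approximate MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the flush-to-disk cost in the timing
    elapsed = time.monotonic() - start
    os.unlink(path)
    return total_mb / elapsed

# e.g. against the ES data volume (path is a placeholder):
# print(write_throughput_mb_s("/elasticsearch/persistent/probe.tmp"))
```

Anything in the low single-digit MB/s range on the PV would support the slow-storage theory.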
The following may also assist you in testing connectivity from fluent to ES https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-fluentd-to-es-connectivity or https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-kibana-to-es-connectivity from kibana to ES
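Outside of those scripts, a plain TCP probe from a fluentd pod toward the ES service can at least distinguish connection refused/timeout from TLS-level failures. A minimal sketch; the host and port in the comment assume the default in-cluster ES service name:

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. from a pod in openshift-logging:
# can_connect("elasticsearch.openshift-logging.svc", 9200)
```

If this returns True but the fluentd EOFError persists, the drop is happening after the TCP handshake, pointing at TLS or the proxy layer rather than basic reachability.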
(In reply to Jeff Cantrill from comment #9)
> The following may also assist you in testing connectivity from fluent to ES
> fluentd-to-es-connectivity or
> kibana-to-es-connectivity from kibana to ES
I have asked the customer to send the outputs and will update the Bugzilla once I hear back.
Elasticsearch proxy container complains about the following:
2021-01-13T07:21:32.491109601Z 2021/01/13 07:21:32 http: TLS handshake error from 100.127.10.1:46620: remote error: tls: unknown certificate authority
I'm wondering if the fact that the injected CA contains Windows line endings is causing the EOFError, Jeff?
(In reply to Nicolas Nosenzo from comment #15)
> E.g. CU ca:
> ca-bundle.crt: "-----BEGIN
Who is providing this CA? If it is the customer, might they try removing those EOL characters and see if that resolves the issue?
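If CRLF line endings in the injected ca-bundle.crt are the suspect, a quick way to detect and normalize them (a sketch; the bundle content is illustrative):

```python
def has_crlf(pem_bytes):
    """Detect Windows-style (CRLF) line endings in a PEM bundle."""
    return b"\r\n" in pem_bytes

def normalize(pem_bytes):
    """Rewrite CRLF (and any stray CR) to plain LF line endings."""
    return pem_bytes.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# illustrative bundle with Windows line endings:
bundle = b"-----BEGIN CERTIFICATE-----\r\nMIIB...\r\n-----END CERTIFICATE-----\r\n"
clean = normalize(bundle)
```

Running the normalized bundle back through the secret would rule the line endings in or out without touching the customer's CA itself.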
(In reply to Jeff Cantrill from comment #17)
> (In reply to Nicolas Nosenzo from comment #15)
> > E.g. CU ca:
> > ...
> > ca-bundle.crt: "-----BEGIN
> Who is providing this CA? If it is the customer, might they try removing
> those EOL characters and see if that resolves the issue?
Actually, I realized they're using the same CA in other deployments running the same version, and there it is not a problem. Also, it seems some of the logs are indeed being delivered.
That said, looking at the Elasticsearch Operator logs, I found some timeouts when trying to reach the ES service, followed by EOF errors:
time="2021-01-27T12:41:46Z" level=warning msg="Unable to clear transient shard allocation for \"openshift-logging\" \"openshift-logging\": Response: , Error: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: dial tcp 100.126.136.53:9200: i/o timeout"
time="2021-01-27T12:44:28Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: EOF\r\n"
I'm wondering whether those EOF errors are the same ones we see in the fluentd pod. There is some chance we are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1908707
I will try to confirm that.
@Nicolas I see you are also involved in https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32 which I suspect may be the same issue. Are you able to confirm whether the EOF issues disappeared, or whether they disappear when you revert the proxy image?
(In reply to Jeff Cantrill from comment #19)
> @Nicolas I see you are also involved in
> https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32 which I suspect may
> be the same issue. Are you able to confirm if the EOF issues disappeared or
> do disappear when you revert the proxy image?
@Jeff, I have yet to confirm that; https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32 was referring to another case/environment. I've requested switching back to the elasticsearch-proxy image prior to the offending code change and will let you know how it goes.
Any updates? The changes to the proxy addressing https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32 have merged and are pending release. Additionally, it is possible to go "unmanaged" and adjust the timeouts if we think those are still an issue.
We tried setting the elasticsearch-proxy back to the latest image version before the hardcoded timeout was introduced, but that was not the real problem. The CLO kept complaining about connection refused attempts.
It seems we finally managed to solve it by adding 'elasticsearch' to the no_proxy value of the cluster-wide proxy, which seems a bit weird tbh; I thought only the service FQDN was in use, although this is the second time I have found the same scenario.
(In reply to Nicolas Nosenzo from comment #22)
> We tried setting back the elasticsearch-proxy to the latest image version
> before the hardcoded timeout was introduced, but it was not the real
> problem. The CLO kept complaining about connection refused attempts.
You mean fluentd kept complaining of being unable to connect?
> Seems we finally managed to solve it by adding 'elasticsearch' to the
> no_proxy value of the cluster-wide proxy, which seems a bit weird tbh, I
> thought only the service FQDN was in use, although this is the second time I
> found the same scenario.
What is the exact value you added to no_proxy? Was it 'elasticsearch' ?
Doing some investigation around connecting to managed ES and the 'no_proxy' setting leads me to the following commentary, reproduced here from Slack:
* likely broken by https://github.com/openshift/cluster-logging-operator/pull/749 which enabled typhoeus as a backend
* faraday is a wrapper around several HTTP client libs; this is what is used by the ES transport library.
* https://github.com/typhoeus/typhoeus/issues/528 claims the no_proxy setting is not honored and was never implemented because httpclient supports it.
* latest version of https://github.com/uken/fluent-plugin-elasticsearch/blob/master/lib/fluent/plugin/out_elasticsearch.rb#L154 does not support it as a backend
The workaround is to add elasticsearch to no_proxy, as advised in https://bugzilla.redhat.com/show_bug.cgi?id=1915448#c22
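The surprising part, that the bare name 'elasticsearch' had to be added, comes down to how no_proxy is typically matched: clients compare the literal request host (which for in-cluster traffic may be the short service name, not the FQDN) against each entry by exact or suffix match. A minimal sketch of that common matching rule, not the exact algorithm of any one client:

```python
def bypass_proxy(host, no_proxy):
    """Common no_proxy semantics: exact match on an entry, or suffix
    match for entries naming a domain (leading dot optional)."""
    for entry in (e.strip() for e in no_proxy.split(",") if e.strip()):
        bare = entry.lstrip(".")
        if host == bare or host.endswith("." + bare):
            return True
    return False

# A domain entry covers FQDNs but not the short service name:
bypass_proxy("elasticsearch.openshift-logging.svc.cluster.local", ".svc.cluster.local")  # True
bypass_proxy("elasticsearch", ".svc.cluster.local")  # False: short name needs its own entry
bypass_proxy("elasticsearch", "elasticsearch,.svc.cluster.local")  # True
```

This is consistent with the observed fix: if fluentd connects to the short hostname 'elasticsearch', only an explicit 'elasticsearch' entry bypasses the proxy.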
Following up on #c25: that information is immaterial. Tested by:
* Deploy logging with ephemeral ES
* Set unmanaged
* update ds/fluentd to have HTTPS_PROXY var
* Delete ES pod and verify no indexed documents
* update ds/fluentd to have NO_PROXY to include domain ".svc.cluster.local"
* Delete ES pod and verify documents are indexed.
This confirms that proxy settings work at the component level. Following up with the installer team to see whether we should expect internal traffic to have "no_proxy" auto-populated with the internal domain.
Updated the title to reflect the actual cause. Users can work around this on earlier releases by updating the global proxy config to add internal traffic to NO_PROXY.
Closing in favor of https://issues.redhat.com/browse/LOG-1125, which will be the issue tracker for logging 5.0 and later.
In my case, the customer is not using any global proxy, yet they are still facing the issue.
(In reply to Yogiraj Karajagi from comment #30)
> Hello Jeff,
> In my case, the Customer is not using any global proxy. Still, he's facing
> the issue.
Let me know your thoughts on it. Waiting for your response.
Jeff, the issue reported here ("end of file reached (EOFError)") is still present even after applying the following workarounds:
- Removed 'http_backend typhoeus' as per https://bugzilla.redhat.com/show_bug.cgi?id=1930360
- ES-proxy image was set to ose-elasticsearch-proxy:v4.5.0-202012120433.p0 as per https://bugzilla.redhat.com/show_bug.cgi?id=1908707#c32