Bug 1908707
| Field | Value |
|---|---|
| Summary | fluentd fails to deliver message with Server returned nothing (no headers, no data) |
| Product | OpenShift Container Platform |
| Component | Logging |
| Version | 4.6 |
| Target Release | 4.7.0 |
| Target Milestone | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Whiteboard | logging-core |
| Reporter | Nicolas Nosenzo <nnosenzo> |
| Assignee | Sergey Yedrikov <syedriko> |
| QA Contact | Qiaoling Tang <qitang> |
| Docs Contact | Rolfe Dlugy-Hegwer <rdlugyhe> |
| CC | akaris, andcosta, anli, aos-bugs, apaladug, cruhm, hchatter, igreen, jcantril, jswensso, mabajodu, mmohan, openshift-bugs-escalate, periklis, puraut, rcarrier, rdlugyhe, stwalter, syedriko, ykarajag |
| Flags | puraut: needinfo-, puraut: needinfo- |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | Previously, in some bulk insertion situations, the Elasticsearch proxy timed out connections between fluentd and Elasticsearch. As a result, fluentd failed to deliver messages and logged a `Server returned nothing (no headers, no data)` error. The current release fixes this issue: It increases the default HTTP read and write timeouts in the Elasticsearch proxy from five seconds to one minute. It also provides command-line options in the Elasticsearch proxy to control HTTP timeouts in the field. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1908707[*BZ#1908707*]) |
| Story Points | --- |
| Clone Of | |
| | 1924258 (view as bug list) |
| Last Closed | 2021-02-24 11:22:30 UTC |
| Type | Bug |
| Regression | --- |
| Bug Depends On | |
| Bug Blocks | 1924258 |
(In reply to Jeff Cantrill from comment #2)
> Proxy issue? https://bugzilla.redhat.com/show_bug.cgi?id=1825923#c13

I don't think so; I reproduced the error message without deploying a cluster-wide proxy. In any case, I will check whether disabling the ClusterLogForwarder API in the affected cluster makes a difference.

The customer tried redeploying the stack again but is still seeing the same results.

-------------------

We also tried with the initial memory request and limit set to 32 GB:

1. Removed the logging and Elasticsearch operators.
2. Removed the CRDs.
3. Deleted the local PVs and flushed the journal.
4. Removed the openshift-logging namespace.
5. Rebooted the infra nodes.
6. Reinstalled the operators with 32 GB and default cluster logging settings.

-------------------

I have asked whether the proxy has any rate limiting configured that could cause intermittent issues like this, and I am waiting for the customer's response. Is there anything else we can try to isolate the issue?

Thanks,
Anand

-------------------

Hi Sergey,
- The logging stack has been redeployed.
- The fluentd resources have been increased.

However, logs stop coming through some time after the pods are started. We can confirm that logs are delivered at the beginning, which would rule out a proxy issue.
Fluentd resources:

    resources:
      limits:
        cpu: 800m
        memory: 1Gi
      requests:
        cpu: 800m
        memory: 1Gi
ES resources:

    resources:
      limits:
        cpu: '1'
        memory: 32Gi
      requests:
        cpu: '1'
        memory: 32Gi
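For reference, these requests and limits are normally set through the ClusterLogging custom resource rather than on the pods directly. The following is a minimal sketch of how values like the above are typically applied; the resource name `instance`, the namespace, and the field paths follow the 4.6 ClusterLogging API, and `nodeCount: 3` matches the three-node ES cluster discussed later in this bug. It intentionally omits storage and redundancy settings, which would have to be carried over from the existing CR before applying.

~~~
# Sketch only: applying this replaces the whole ClusterLogging spec, so merge in
# the existing storage/redundancyPolicy fields from "oc get clusterlogging instance -o yaml" first.
oc apply -f - <<'EOF'
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  logStore:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      resources:
        limits:
          cpu: '1'
          memory: 32Gi
        requests:
          cpu: '1'
          memory: 32Gi
  collection:
    logs:
      type: fluentd
      fluentd:
        resources:
          limits:
            cpu: 800m
            memory: 1Gi
          requests:
            cpu: 800m
            memory: 1Gi
EOF
~~~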
Fluentd keeps failing with:
~~~
2021-01-19T12:13:03.945959470Z 2021-01-19 12:13:03 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=1527 next_retry_seconds=2021-01-19 12:13:04 +0000 chunk="5b8d132c4425f5cfdc314bef0431c7ba" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Failure when receiving data from the peer"
~~~
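Since the error is a transport-level failure on the connection between fluentd and the Elasticsearch service, the Elasticsearch proxy sidecar that fronts ES is another place worth looking. A rough sketch follows; the `component=elasticsearch` label and the `proxy` container name are assumptions based on the default openshift-logging deployment layout.

~~~
# Pick one Elasticsearch pod and scan its proxy sidecar log around the failure window.
ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n 1)
oc -n openshift-logging logs "$ES_POD" -c proxy | grep -iE 'timeout|error' | tail -n 20
~~~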
It is worth highlighting that this is a relatively small cluster (14 nodes).
Hi Nicolas,

From going through the latest must-gather, I don't think this is a proxy issue either. The ES cluster appears to be in a RED state and I'm looking into why that is.

Regards, Sergey.

-------------------

Hi Sergey,

As per the latest must-gather, the cluster state seems to be "green", and looking at the logs I can't find any change to "red" status:

    $ grep red indices; cat health
    epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
    1611063156 13:32:36  elasticsearch green  3          3         232    116 0    0    0        0             -                  100.0%

-------------------

I was looking at:

    quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/namespaces/openshift-logging/pods/elasticsearch-cdm-lwys2i79-1-76fd8b584f-zf5g9/elasticsearch/elasticsearch/logs/current.log
    2021-01-14T12:49:59.143078593Z Clusterstate: RED

    quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/namespaces/openshift-logging/pods/elasticsearch-cdm-lwys2i79-2-68b7cd968d-lvfw5/elasticsearch/elasticsearch/logs/current.log
    2021-01-14T12:49:55.970216249Z Clusterstate: RED

    quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/namespaces/openshift-logging/pods/elasticsearch-cdm-lwys2i79-3-5449cd75f-6mzqg/elasticsearch/elasticsearch/logs/current.log
    2021-01-14T12:50:07.300625692Z Clusterstate: RED

and there's no transition out of the red state logged. But yes, quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/cluster-logging/es/cluster-elasticsearch suggests the ES cluster is in OK shape. I'm finding quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/namespaces/openshift-logging/pods/kibana-7d499b554d-bb7sl/kibana/kibana/logs/current.log interesting; it is as if ES is refusing new connections or randomly closing existing ones.

-------------------

Hi Sergey,

There is a new must-gather report with the fluentd log level set to trace. Following a chunk that reports the failure mentioned above, I can see the following log records:
$ omg logs fluentd-2qqtz -c fluentd | grep 5b87fdf012402e688865ec290b496424
2021-01-19T19:32:57.671581806Z 2021-01-19 19:32:57 +0000 [debug]: [retry_default] taking back chunk for errors. chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:32:57.671668146Z 2021-01-19 19:32:57 +0000 [trace]: [retry_default] taking back a chunk instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424"
2021-01-19T19:32:57.671701331Z 2021-01-19 19:32:57 +0000 [trace]: [retry_default] chunk taken back instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="retry_default", variables=nil>
2021-01-19T19:32:57.671816412Z 2021-01-19 19:32:57 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=37 next_retry_seconds=2021-01-19 19:32:58 +0000 chunk="5b87fdf012402e688865ec290b496424" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Failure when receiving data from the peer"
2021-01-19T19:32:58.688589250Z 2021-01-19 19:32:58 +0000 [trace]: [retry_default] trying flush for a chunk chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:32:58.688628957Z 2021-01-19 19:32:58 +0000 [trace]: [retry_default] executing sync write chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:33:26.717765013Z 2021-01-19 19:33:26 +0000 [debug]: [retry_default] taking back chunk for errors. chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:33:26.717778218Z 2021-01-19 19:33:26 +0000 [trace]: [retry_default] taking back a chunk instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424"
2021-01-19T19:33:26.717822522Z 2021-01-19 19:33:26 +0000 [trace]: [retry_default] chunk taken back instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="retry_default", variables=nil>
2021-01-19T19:33:26.717959471Z 2021-01-19 19:33:26 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=38 next_retry_seconds=2021-01-19 19:33:27 +0000 chunk="5b87fdf012402e688865ec290b496424" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Failure when receiving data from the peer"
2021-01-19T19:33:27.728119921Z 2021-01-19 19:33:27 +0000 [trace]: [retry_default] trying flush for a chunk chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:33:27.728274969Z 2021-01-19 19:33:27 +0000 [trace]: [retry_default] executing sync write chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:34:01.721093919Z 2021-01-19 19:34:01 +0000 [debug]: [retry_default] taking back chunk for errors. chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:34:01.721126606Z 2021-01-19 19:34:01 +0000 [trace]: [retry_default] taking back a chunk instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424"
2021-01-19T19:34:01.721181296Z 2021-01-19 19:34:01 +0000 [trace]: [retry_default] chunk taken back instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="retry_default", variables=nil>
2021-01-19T19:34:01.721329778Z 2021-01-19 19:34:01 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=42 next_retry_seconds=2021-01-19 19:34:02 +0000 chunk="5b87fdf012402e688865ec290b496424" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Failure when receiving data from the peer"
2021-01-19T19:34:02.799551658Z 2021-01-19 19:34:02 +0000 [trace]: [retry_default] trying flush for a chunk chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:34:02.799587287Z 2021-01-19 19:34:02 +0000 [trace]: [retry_default] executing sync write chunk="5b87fdf012402e688865ec290b496424"
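Note that in the trace above, each `executing sync write` line is followed roughly 28 to 34 seconds later by a `taking back chunk for errors` line for the same chunk, so each write sits open for tens of seconds before it fails. A quick way to surface just those two events from the same must-gather, reusing the `omg` command and chunk ID already quoted above, is:

~~~
# Filter the flush attempts and their failures for the chunk discussed above,
# to eyeball the gap between each write attempt and its error.
omg logs fluentd-2qqtz -c fluentd \
  | grep 5b87fdf012402e688865ec290b496424 \
  | grep -E 'executing sync write|taking back chunk for errors'
~~~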
Hi Nicolas,

The root cause has been identified; a fix is coming up.

Regards, Sergey.

-------------------

Tested with elasticsearch-operator.5.0.0-43, which contains the fix; the issue is gone.

-------------------

*** Bug 1929139 has been marked as a duplicate of this bug. ***

-------------------

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652
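For anyone verifying this in the field, a minimal sketch of the checks involved is shown below. The operator namespace, the `proxy` container name, and the deployment name (taken from the must-gather paths quoted earlier in this bug) are assumptions; the new proxy timeout option names are deliberately not spelled out here because they are not quoted in this bug.

~~~
# Confirm the installed elasticsearch-operator build (5.0.0-43 or later carries the fix).
# The operator namespace is an assumption; adjust to your install.
oc get csv -n openshift-operators-redhat | grep elasticsearch-operator

# Inspect the arguments the operator passes to the Elasticsearch proxy sidecar,
# which is where any timeout-related options would appear. The deployment name
# comes from this cluster's must-gather; the container name "proxy" is assumed.
oc -n openshift-logging get deployment elasticsearch-cdm-lwys2i79-1 \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="proxy")].args}'
~~~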
Description of problem:

Fluentd fails to deliver logs with the following message:

    2020-12-17 10:55:20 +0000 [warn]: [default] failed to flush the buffer. retry_time=0 next_retry_seconds=2020-12-17 10:55:21 +0000 chunk="5b6a6c201a4409bc30579ed2a74b3f6f" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Server returned nothing (no headers, no data)"

Sometimes it retries and succeeds:

    2020-12-17 10:55:23 +0000 [warn]: [default] retry succeeded. chunk_id="5b6a6c201a4409bc30579ed2a74b3f6f"

But most of the time it fails again and again:

    $ cat fluentd-qsqd6/fluentd/fluentd/logs/current.log | grep 5b62db722e5837bf6c2dc9dce9f22126
    [....]
    2020-12-15T06:20:29.915373137Z 2020-12-15 06:20:29 +0000 [warn]: [default] failed to flush the buffer. retry_time=130 next_retry_seconds=2020-12-15 06:25:40 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:25:52.223400327Z 2020-12-15 06:25:52 +0000 [warn]: [default] failed to flush the buffer. retry_time=131 next_retry_seconds=2020-12-15 06:30:24 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:30:42.120947177Z 2020-12-15 06:30:42 +0000 [warn]: [default] failed to flush the buffer. retry_time=132 next_retry_seconds=2020-12-15 06:35:42 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:40:25.422546235Z 2020-12-15 06:40:25 +0000 [warn]: [default] failed to flush the buffer. retry_time=136 next_retry_seconds=2020-12-15 06:46:02 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:46:19.343552411Z 2020-12-15 06:46:19 +0000 [warn]: [default] failed to flush the buffer. retry_time=137 next_retry_seconds=2020-12-15 06:50:46 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:52:39.195238539Z 2020-12-15 06:52:39 +0000 [warn]: [default] failed to flush the buffer. retry_time=140 next_retry_seconds=2020-12-15 06:57:31 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Server returned nothing (no headers, no data)"
    [....]

    $ cat fluentd-qsqd6/fluentd/fluentd/logs/current.log | grep 5b62db722e5837bf6c2dc9dce9f22126 | wc -l
    70

Version-Release number of selected component (if applicable):

4.6

How reproducible:

100%

Steps to Reproduce:

1. Deploy a fresh 4.6 cluster.
2. Check the fluentd logs.

Expected:

Fluentd should flush logs to ES without issues.

Actual:

Fluentd stops sending logs to ES, showing the above error message.

Additional info:

The cluster has been set up with a custom CA.
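To check for this reproduction signature on a cluster, the sketch below scans the collector pods for the two error patterns quoted above; the `component=fluentd` label selector is an assumption based on the default fluentd daemonset labels in openshift-logging.

~~~
# Scan every fluentd pod for the failure signatures reported in this bug.
for p in $(oc -n openshift-logging get pods -l component=fluentd -o name); do
  echo "== $p"
  oc -n openshift-logging logs "$p" \
    | grep -E 'Server returned nothing \(no headers, no data\)|failed to flush the buffer' \
    | tail -n 5
done
~~~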