Bug 1908707
| Field | Value |
|---|---|
| Summary | fluentd fails to deliver message with Server returned nothing (no headers, no data) |
| Product | OpenShift Container Platform |
| Component | Logging |
| Version | 4.6 |
| Target Release | 4.7.0 |
| Target Milestone | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Whiteboard | logging-core |
| Reporter | Nicolas Nosenzo <nnosenzo> |
| Assignee | Sergey Yedrikov <syedriko> |
| QA Contact | Qiaoling Tang <qitang> |
| Docs Contact | Rolfe Dlugy-Hegwer <rdlugyhe> |
| CC | akaris, andcosta, anli, aos-bugs, apaladug, cruhm, hchatter, igreen, jcantril, jswensso, mabajodu, mmohan, openshift-bugs-escalate, periklis, puraut, rcarrier, rdlugyhe, stwalter, syedriko, ykarajag |
| Flags | puraut: needinfo-, puraut: needinfo- |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | Previously, in some bulk insertion situations, the Elasticsearch proxy timed out connections between fluentd and Elasticsearch. As a result, fluentd failed to deliver messages and logged a `Server returned nothing (no headers, no data)` error. The current release fixes this issue: It increases the default HTTP read and write timeouts in the Elasticsearch proxy from five seconds to one minute. It also provides command-line options in the Elasticsearch proxy to control HTTP timeouts in the field. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1908707[*BZ#1908707*]) |
| Story Points | --- |
| Clone Of | |
| | 1924258 (view as bug list) |
| Last Closed | 2021-02-24 11:22:30 UTC |
| Type | Bug |
| Regression | --- |
| Bug Depends On | |
| Bug Blocks | 1924258 |
(In reply to Jeff Cantrill from comment #2)
> Proxy issue? https://bugzilla.redhat.com/show_bug.cgi?id=1825923#c13

I don't think so; I reproduced the error message without deploying a cluster-wide proxy. In any case, I will check whether disabling the ClusterLogForwarder API in the affected cluster makes a difference.

The customer tried redeploying the stack again but is still seeing the same results.

-------------------

We also tried with the initial memory request and limit set to 32 GB:

1. Removed the logging and Elasticsearch operators.
2. Removed the CRDs.
3. Deleted the local PVs and flushed the journal.
4. Removed the openshift-logging namespace.
5. Rebooted the infra nodes.
6. Reinstalled the operators with 32 GB and default cluster logging settings.

-------------------

I have asked whether the proxy has any rate limiting configured that could cause intermittent issues like this, and I am waiting for the customer's response. Is there anything else we can try to isolate the issue?

Thanks,
Anand

-------------------

Hi Sergey,
- The logging stack has been redeployed.
- The fluentd resources have been increased.

However, logs stop coming through some time after the pods are started. We can confirm that logs are delivered at the beginning, which would rule out a proxy issue.
Fluentd resources:

    resources:
      limits:
        cpu: 800m
        memory: 1Gi
      requests:
        cpu: 800m
        memory: 1Gi
ES resources:

    resources:
      limits:
        cpu: '1'
        memory: 32Gi
      requests:
        cpu: '1'
        memory: 32Gi
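For reference, these requests and limits are normally set through the ClusterLogging custom resource rather than on the pods directly. The following is a minimal sketch of how values like the above are typically applied; the resource name `instance`, the namespace, and the field paths follow the 4.6 ClusterLogging API, and `nodeCount: 3` matches the three-node ES cluster discussed later in this bug. It intentionally omits storage and redundancy settings, which would have to be carried over from the existing CR before applying.

~~~
# Sketch only: applying this replaces the whole ClusterLogging spec, so merge in
# the existing storage/redundancyPolicy fields from "oc get clusterlogging instance -o yaml" first.
oc apply -f - <<'EOF'
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  logStore:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      resources:
        limits:
          cpu: '1'
          memory: 32Gi
        requests:
          cpu: '1'
          memory: 32Gi
  collection:
    logs:
      type: fluentd
      fluentd:
        resources:
          limits:
            cpu: 800m
            memory: 1Gi
          requests:
            cpu: 800m
            memory: 1Gi
EOF
~~~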
Fluentd keeps failing with:
~~~
2021-01-19T12:13:03.945959470Z 2021-01-19 12:13:03 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=1527 next_retry_seconds=2021-01-19 12:13:04 +0000 chunk="5b8d132c4425f5cfdc314bef0431c7ba" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Failure when receiving data from the peer"
~~~
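Since the error is a transport-level failure on the connection between fluentd and the Elasticsearch service, the Elasticsearch proxy sidecar that fronts ES is another place worth looking. A rough sketch follows; the `component=elasticsearch` label and the `proxy` container name are assumptions based on the default openshift-logging deployment layout.

~~~
# Pick one Elasticsearch pod and scan its proxy sidecar log around the failure window.
ES_POD=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -n 1)
oc -n openshift-logging logs "$ES_POD" -c proxy | grep -iE 'timeout|error' | tail -n 20
~~~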
It is worth highlighting that this is a relatively small cluster (14 nodes).
Hi Nicolas,

From going through the latest must-gather, I don't think this is a proxy issue either. The ES cluster appears to be in a RED state and I'm looking into why that is.

Regards, Sergey.

-------------------

Hi Sergey,

As per the latest must-gather, the cluster state seems to be "green", and looking at the logs I can't find any change to "red" status:

    $ grep red indices; cat health
    epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
    1611063156 13:32:36  elasticsearch green  3          3         232    116 0    0    0        0             -                  100.0%

-------------------

I was looking at:

    quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/namespaces/openshift-logging/pods/elasticsearch-cdm-lwys2i79-1-76fd8b584f-zf5g9/elasticsearch/elasticsearch/logs/current.log
    2021-01-14T12:49:59.143078593Z Clusterstate: RED

    quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/namespaces/openshift-logging/pods/elasticsearch-cdm-lwys2i79-2-68b7cd968d-lvfw5/elasticsearch/elasticsearch/logs/current.log
    2021-01-14T12:49:55.970216249Z Clusterstate: RED

    quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/namespaces/openshift-logging/pods/elasticsearch-cdm-lwys2i79-3-5449cd75f-6mzqg/elasticsearch/elasticsearch/logs/current.log
    2021-01-14T12:50:07.300625692Z Clusterstate: RED

and there's no transition out of the red state logged. But yes, quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/cluster-logging/es/cluster-elasticsearch suggests the ES cluster is in OK shape. I'm finding quay-io-openshift-origin-cluster-logging-operator-sha256-3ec304f2d7261249c094d33361a3fa2729cba68e4573ccd334a91793534e6535/namespaces/openshift-logging/pods/kibana-7d499b554d-bb7sl/kibana/kibana/logs/current.log interesting; it is as if ES is refusing new connections or randomly closing existing ones.

-------------------

Hi Sergey,

There is a new must-gather report with the fluentd log level set to trace. Following a chunk that reports the failure mentioned above, I can see the following log records:
$ omg logs fluentd-2qqtz -c fluentd | grep 5b87fdf012402e688865ec290b496424
2021-01-19T19:32:57.671581806Z 2021-01-19 19:32:57 +0000 [debug]: [retry_default] taking back chunk for errors. chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:32:57.671668146Z 2021-01-19 19:32:57 +0000 [trace]: [retry_default] taking back a chunk instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424"
2021-01-19T19:32:57.671701331Z 2021-01-19 19:32:57 +0000 [trace]: [retry_default] chunk taken back instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="retry_default", variables=nil>
2021-01-19T19:32:57.671816412Z 2021-01-19 19:32:57 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=37 next_retry_seconds=2021-01-19 19:32:58 +0000 chunk="5b87fdf012402e688865ec290b496424" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Failure when receiving data from the peer"
2021-01-19T19:32:58.688589250Z 2021-01-19 19:32:58 +0000 [trace]: [retry_default] trying flush for a chunk chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:32:58.688628957Z 2021-01-19 19:32:58 +0000 [trace]: [retry_default] executing sync write chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:33:26.717765013Z 2021-01-19 19:33:26 +0000 [debug]: [retry_default] taking back chunk for errors. chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:33:26.717778218Z 2021-01-19 19:33:26 +0000 [trace]: [retry_default] taking back a chunk instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424"
2021-01-19T19:33:26.717822522Z 2021-01-19 19:33:26 +0000 [trace]: [retry_default] chunk taken back instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="retry_default", variables=nil>
2021-01-19T19:33:26.717959471Z 2021-01-19 19:33:26 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=38 next_retry_seconds=2021-01-19 19:33:27 +0000 chunk="5b87fdf012402e688865ec290b496424" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Failure when receiving data from the peer"
2021-01-19T19:33:27.728119921Z 2021-01-19 19:33:27 +0000 [trace]: [retry_default] trying flush for a chunk chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:33:27.728274969Z 2021-01-19 19:33:27 +0000 [trace]: [retry_default] executing sync write chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:34:01.721093919Z 2021-01-19 19:34:01 +0000 [debug]: [retry_default] taking back chunk for errors. chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:34:01.721126606Z 2021-01-19 19:34:01 +0000 [trace]: [retry_default] taking back a chunk instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424"
2021-01-19T19:34:01.721181296Z 2021-01-19 19:34:01 +0000 [trace]: [retry_default] chunk taken back instance=69822409041840 chunk_id="5b87fdf012402e688865ec290b496424" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag="retry_default", variables=nil>
2021-01-19T19:34:01.721329778Z 2021-01-19 19:34:01 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=42 next_retry_seconds=2021-01-19 19:34:02 +0000 chunk="5b87fdf012402e688865ec290b496424" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Failure when receiving data from the peer"
2021-01-19T19:34:02.799551658Z 2021-01-19 19:34:02 +0000 [trace]: [retry_default] trying flush for a chunk chunk="5b87fdf012402e688865ec290b496424"
2021-01-19T19:34:02.799587287Z 2021-01-19 19:34:02 +0000 [trace]: [retry_default] executing sync write chunk="5b87fdf012402e688865ec290b496424"
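Note that in the trace above, each `executing sync write` line is followed roughly 28 to 34 seconds later by a `taking back chunk for errors` line for the same chunk, so each write sits open for tens of seconds before it fails. A quick way to surface just those two events from the same must-gather, reusing the `omg` command and chunk ID already quoted above, is:

~~~
# Filter the flush attempts and their failures for the chunk discussed above,
# to eyeball the gap between each write attempt and its error.
omg logs fluentd-2qqtz -c fluentd \
  | grep 5b87fdf012402e688865ec290b496424 \
  | grep -E 'executing sync write|taking back chunk for errors'
~~~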
Hi Nicolas,

The root cause has been identified; a fix is coming up.

Regards, Sergey.

-------------------

Tested with elasticsearch-operator.5.0.0-43, which contains the fix; the issue is gone.

-------------------

*** Bug 1929139 has been marked as a duplicate of this bug. ***

-------------------

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652
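For anyone verifying this in the field, a minimal sketch of the checks involved is shown below. The operator namespace, the `proxy` container name, and the deployment name (taken from the must-gather paths quoted earlier in this bug) are assumptions; the new proxy timeout option names are deliberately not spelled out here because they are not quoted in this bug.

~~~
# Confirm the installed elasticsearch-operator build (5.0.0-43 or later carries the fix).
# The operator namespace is an assumption; adjust to your install.
oc get csv -n openshift-operators-redhat | grep elasticsearch-operator

# Inspect the arguments the operator passes to the Elasticsearch proxy sidecar,
# which is where any timeout-related options would appear. The deployment name
# comes from this cluster's must-gather; the container name "proxy" is assumed.
oc -n openshift-logging get deployment elasticsearch-cdm-lwys2i79-1 \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="proxy")].args}'
~~~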
Description of problem:

Fluentd fails to deliver logs with the following message:

    2020-12-17 10:55:20 +0000 [warn]: [default] failed to flush the buffer. retry_time=0 next_retry_seconds=2020-12-17 10:55:21 +0000 chunk="5b6a6c201a4409bc30579ed2a74b3f6f" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Server returned nothing (no headers, no data)"

Sometimes it retries and succeeds:

    2020-12-17 10:55:23 +0000 [warn]: [default] retry succeeded. chunk_id="5b6a6c201a4409bc30579ed2a74b3f6f"

But most of the time it fails again and again:

    $ cat fluentd-qsqd6/fluentd/fluentd/logs/current.log | grep 5b62db722e5837bf6c2dc9dce9f22126
    [....]
    2020-12-15T06:20:29.915373137Z 2020-12-15 06:20:29 +0000 [warn]: [default] failed to flush the buffer. retry_time=130 next_retry_seconds=2020-12-15 06:25:40 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:25:52.223400327Z 2020-12-15 06:25:52 +0000 [warn]: [default] failed to flush the buffer. retry_time=131 next_retry_seconds=2020-12-15 06:30:24 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:30:42.120947177Z 2020-12-15 06:30:42 +0000 [warn]: [default] failed to flush the buffer. retry_time=132 next_retry_seconds=2020-12-15 06:35:42 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:40:25.422546235Z 2020-12-15 06:40:25 +0000 [warn]: [default] failed to flush the buffer. retry_time=136 next_retry_seconds=2020-12-15 06:46:02 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:46:19.343552411Z 2020-12-15 06:46:19 +0000 [warn]: [default] failed to flush the buffer. retry_time=137 next_retry_seconds=2020-12-15 06:50:46 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [502] "
    2020-12-15T06:52:39.195238539Z 2020-12-15 06:52:39 +0000 [warn]: [default] failed to flush the buffer. retry_time=140 next_retry_seconds=2020-12-15 06:57:31 +0000 chunk="5b62db722e5837bf6c2dc9dce9f22126" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): Server returned nothing (no headers, no data)"
    [....]

    $ cat fluentd-qsqd6/fluentd/fluentd/logs/current.log | grep 5b62db722e5837bf6c2dc9dce9f22126 | wc -l
    70

Version-Release number of selected component (if applicable):

4.6

How reproducible:

100%

Steps to Reproduce:

1. Deploy a fresh 4.6 cluster.
2. Check the fluentd logs.

Expected:

Fluentd should flush logs to ES without issues.

Actual:

Fluentd stops sending logs to ES, showing the above error message.

Additional info:

The cluster has been set up with a custom CA.
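To check for this reproduction signature on a cluster, the sketch below scans the collector pods for the two error patterns quoted above; the `component=fluentd` label selector is an assumption based on the default fluentd daemonset labels in openshift-logging.

~~~
# Scan every fluentd pod for the failure signatures reported in this bug.
for p in $(oc -n openshift-logging get pods -l component=fluentd -o name); do
  echo "== $p"
  oc -n openshift-logging logs "$p" \
    | grep -E 'Server returned nothing \(no headers, no data\)|failed to flush the buffer' \
    | tail -n 5
done
~~~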