Bug 1559404

Summary: Change the fluentd request timeout to be essentially infinite
Product: OpenShift Container Platform
Reporter: Peter Portante <pportant>
Component: Logging
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: high
Priority: high
Version: 3.9.0
Target Release: 3.9.z
Hardware: All
OS: Linux
CC: aos-bugs, jcantril, nhosoi, rmeggins, tkatarki, willemveerman
Fixed In Version: logging-fluentd-docker-v3.9.22-2
Doc Type: No Doc Update
Type: Bug
Last Closed: 2018-05-17 06:43:34 UTC
Bug Blocks: 1569548, 1569550

Description Peter Portante 2018-03-22 13:34:43 UTC
Currently we default the indexing request timeout for fluentd to 600 seconds, which is 10 minutes.  In certain situations, this timeout is not long enough.  If we make it essentially infinite, we'll avoid fluentd pods re-submitting requests unnecessarily.

There should be no concerns about the infinite timeout.  If Elasticsearch is not responding, but has the connection open, then resubmitting the request will not help.

The current 10-minute timeout can adversely affect large clusters with lots of indices, where certain maintenance operations can take longer than that to complete.
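
For illustration only, the change amounts to something like the following fluent-plugin-elasticsearch output stanza (host, port, and tag pattern are placeholders; the shipped OpenShift configuration differs in detail):

    <match **>
      @type elasticsearch
      host elasticsearch.example.svc
      port 9200
      # Raise the per-request timeout from the previous 600s to an effectively
      # infinite value (2^31 seconds) so fluentd never abandons an accepted request.
      request_timeout 2147483648
    </match>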

Comment 4 Anping Li 2018-04-17 04:52:19 UTC
The fix is not in logging-fluentd/images/v3.9.22-1

Comment 10 errata-xmlrpc 2018-05-17 06:43:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1566

Comment 11 Willem Veerman 2020-09-30 07:49:29 UTC
Sorry to dig up such an old issue.

Can you explain why exactly you need to set the timeout to such a large figure? This will mean that fluentd's requests essentially never time out.

The bug report suggests that sometimes the timeout needs to be > 10 minutes, but it doesn't explain why.

I know that when it's congested, ES can take a long time to respond, but I would never expect it to take longer than 10 minutes to respond to a request.

We have seen some issues around the fluentd ES plugin's retry behaviour, and we're wondering whether we should also apply this setting so its requests never time out.

Comment 12 Peter Portante 2020-09-30 11:45:21 UTC
If an Elasticsearch instance has 10 clients sending it data and, for whatever reason, it takes longer for Elasticsearch to respond to a PUT or POST operation, then when the client times out its request, Elasticsearch has to take one of two actions:  1. detect the dropped connection, find all the outstanding operations in flight related to that request (which could be spread across multiple nodes), and try to cancel them; or 2. allow the in-flight operation to proceed to completion (I believe this is what all versions of Elasticsearch do today).

Depending on the reason for the slow operation, cleaning up in-flight requests is only going to add more work to what Elasticsearch has to do, and allowing in-flight operations to proceed means the existing request work will have to be completed before more capacity is available for a new request.

This is normal and expected behavior; the Elasticsearch instance is behaving within reasonable expectations of how to handle resource constraints.

The problem comes with what the *client* decides to do after timing out the connection.

If the client retries the request, now Elasticsearch has to process N+1 operations.

If the client backs off for a period of time and then retries the request, there is no guarantee that capacity to process that new request will be available, so the client might as well just keep waiting on the first request.  Either way, we are back to the Elasticsearch instance dealing with N+1 operations.

Now compound this with all 10 clients submitting requests, and then retrying at the same time.

When Elasticsearch takes a long time to respond to a request, it is applying natural "back-pressure" on the client operations.  Retrying, or adding more work, is not going to make anything go faster.  Slowness like this typically points to some other bottleneck in how the Elasticsearch instance is architected (too few shards, too few nodes, disks too slow; basically not enough I/O paths to handle the parallelism expected from the clients).

The only timeout that an Elasticsearch client should use is when making its connection to the Elasticsearch instance, since no operation will be added to the workload that Elasticsearch is currently handling.  So if it takes too long to connect, retrying the connection is very reasonable.

Does that help?

Comment 13 Willem Veerman 2020-09-30 16:18:46 UTC
That does help... essentially a client should not re-send a request that elasticsearch is in the course of processing.

It's annoying that ES won't send an initial acknowledgement, just to notify the client that the request has been received and is being processed. Seemingly it only responds once all the records in the request have been indexed or rejected.

Without that acknowledgement, the client doesn't know whether the request was received or not.

However, a timeout of 10 minutes should still be adequate, shouldn't it? I can't imagine a circumstance in which ES would take over 10 minutes to respond; Logstash has a default timeout of 60s.

I can see in your config that you use the retry_tag feature of the fluentd ES plugin.

Does that mean by setting the request timeout to 2^31 you're intending to effectively remove the request_timeout parameter so fluentd never retries, except when you receive a response for which it is valid to attempt a retry in which case the retry_tag mechanism kicks in?

Also, you mention connections - so you want bulk API HTTP requests to never be retried, but if fluentd detects no tcp connection to ES, it should pause all requests and constantly try to reach the ES tcp endpoint (which is presumably the default behaviour)?

Comment 14 Peter Portante 2020-09-30 17:39:41 UTC
(In reply to Willem Veerman from comment #13)
> That does help... essentially a client should not re-send a request that
> elasticsearch is in the course of processing.

Exactly.

> It's annoying that ES won't send an initial acknowledgement, just to notify
> the client that the request has been received and is being processed.
> Seemingly it only responds once all the records in the request have been
> indexed or rejected.

Well, unless you move to a different protocol altogether, the response for
bulk indexing APIs can't be constructed until all individual actions have a
completion status.

> Without that acknowledgement, the client doesn't know whether the request
> was received or not.

If the client is connected to the Elasticsearch server, then the client has
to assume the server is working on the PUT/POST request.  If the connection
gets dropped, then that is a different story, and the client has no other
recourse but to retry the request.

But if the connection is not dropped, waiting is the only thing a client can
reasonably do.

> However, a timeout of 10 minutes should still be adequate, shouldn't it?

Unfortunately, especially with Kubernetes and SDNs, sometimes delays can be
extremely long, and then all nodes in the cluster pig-pile on Elasticsearch
with retries.

> I can't imagine a circumstance in which ES would take over 10 minutes to
> respond; Logstash has a default timeout of 60s.

Most Elasticsearch clients are *Absolutely Horrible* in the face of problems.

They work great when the system is well balanced, but when errors are introduced
to the system, they fall down quickly and wreak havoc.

This includes Logstash, fluent-bit, rsyslog, etc.  All of them should be
configured to NOT retry on request timeouts, only on timeouts that happen
while making a TCP connection to the Elasticsearch server.
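
In fluent-plugin-elasticsearch terms, that roughly means something like the following sketch (host and port are placeholders, not the exact shipped configuration):

    <match **>
      @type elasticsearch
      host elasticsearch.example.svc
      port 9200
      # Never time out a request that Elasticsearch has accepted...
      request_timeout 2147483648
      # ...but do react to broken or refused TCP connections by reconnecting.
      reconnect_on_error true
      reload_on_failure true
      reload_connections false
    </match>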

> I can see in your config that you use the retry_tag feature of the fluentd
> ES plugin.
> 
> Does that mean by setting the request timeout to 2^31 you're intending to
> effectively remove the request_timeout parameter so fluentd never retries,
> except when you receive a response for which it is valid to attempt a retry
> in which case the retry_tag mechanism kicks in?

Yes.
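
Roughly, the mechanism in question looks something like this (tag name, host, and port are illustrative placeholders; the production OpenShift configuration, linked later in comment 17, differs in detail):

    # Records that come back in the bulk response with a retriable error are
    # re-emitted under the "retry_es" tag and picked up by this dedicated match.
    <match retry_es>
      @type elasticsearch
      host elasticsearch.example.svc
      port 9200
      request_timeout 2147483648
    </match>

    <match **>
      @type elasticsearch
      host elasticsearch.example.svc
      port 9200
      request_timeout 2147483648
      retry_tag retry_es
    </match>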

> Also, you mention connections - so you want bulk API HTTP requests to never
> be retried, but if fluentd detects no tcp connection to ES, it should pause
> all requests and constantly try to reach the ES tcp endpoint (which is
> presumably the default behaviour)?

Yes, exactly.

Comment 15 Willem Veerman 2020-10-01 08:04:59 UTC
OK, thank you for the clarifications.

I agree that the logging agents cause hell when ES gets congested -  when our ES cluster started to respond slowly, we saw fluentd get into an endless retry loop which very quickly took down the entire ES cluster.

Our thinking was that it would be optimal to place a highly-available queue like redis between the logging agent daemonset and ES, and then have logstash pull from the queue and push to ES. We assume that logstash will be better at backing off from ES when it detects congestion from ES - i.e. delayed request responses.

I just want to make one final clarification.

If the tcp connection remains open, requests should never be retried. But if the agent detects that the tcp connection is closed (e.g. connection refused because nothing listening on port or intermediate connection problem between agent and ES server so connection cannot reach ES and just hangs) then, presumably, all requests which have not yet received a response should be retried, because the agent is unable to know how many of said requests did not reach ES.

My question is, does the fluentd ES plugin actually do this? If you remove the request_timeout parameter by setting it to colossal number, does fluentd still re-send pending requests when it detects the tcp connection is lost?

Comment 16 Peter Portante 2020-10-01 10:00:20 UTC
(In reply to Willem Veerman from comment #15)

> Our thinking was that it would be optimal to place a highly-available queue
> like redis between the logging agent daemonset and ES, and then have
> logstash pull from the queue and push to ES. We assume that logstash will be
> better at backing off from ES when it detects congestion from ES - i.e.
> delayed request responses.

Yes, this is the classic pattern that is used: some people use redis, some
Kafka, and some use Logstash directly.

This pattern gives you the ability to control the number of clients that talk
to Elasticsearch, which helps you balance the system more easily.  With Kafka,
it scales to many, many clients because Kafka does not have to do the indexing
heavy lifting that Elasticsearch has to do.  And Kafka provides persistent
queues, so you still have data saved should the system fail.
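
As a sketch of that shape on the fluentd side, using fluent-plugin-kafka (broker address, topic, and buffer path are placeholders, not values from this bug):

    <match **>
      @type kafka2
      brokers kafka.example.svc:9092
      default_topic app-logs
      <format>
        @type json
      </format>
      <buffer topic>
        @type file
        path /var/lib/fluentd/buffer-kafka
        flush_interval 3s
      </buffer>
    </match>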

But the thing doing the indexing into Elasticsearch from Kafka still has to
play nice with Elasticsearch.  I have been working with another team using
this pattern, and they still see this issue every now and then when their ES
instance has a problem, because even with fewer clients, those clients can
still wreak havoc with short timeouts.

We just published a Python3 module, pyesbulk:

    https://github.com/distributed-system-analysis/py-es-bulk

It tries to implement a "play nice" client: no bulk request timeouts,
retries on connection failures, and encouragement to use "create" method bulk
actions instead of "index" actions (using "create" gives you a 409 when you
do end up finding that a record was already indexed, whereas "index" either
adds a new document (not good) or updates an existing one (even worse)).

> 
> I just want to make one final clarification.
> 
> If the tcp connection remains open, requests should never be retried. But if
> the agent detects that the tcp connection is closed (e.g. connection refused
> because nothing listening on port or intermediate connection problem between
> agent and ES server so connection cannot reach ES and just hangs) then,
> presumably, all requests which have not yet received a response should be
> retried, because the agent is unable to know how many of said requests did
> not reach ES.

+1

See "create" versus "index" method above.

> 
> My question is, does the fluentd ES plugin actually do this? If you remove
> the request_timeout parameter by setting it to colossal number, does fluentd
> still re-send pending requests when it detects the tcp connection is lost?

I don't think so, I think you have to construct the fluentd configuration
properly to do this.

Comment 17 Peter Portante 2020-10-01 13:07:16 UTC
(In reply to Peter Portante from comment #16)
> (In reply to Willem Veerman from comment #15)
>
> > My question is, does the fluentd ES plugin actually do this? If you remove
> > the request_timeout parameter by setting it to colossal number, does fluentd
> > still re-send pending requests when it detects the tcp connection is lost?
> 
> I don't think so, I think you have to construct the fluentd configuration
> properly to do this.

From Jeff Cantrill: "this is the current version [of the fluentd pipeline that
OpenShift logging uses], but its the same concept":

https://github.com/openshift/cluster-logging-operator/blob/master/pkg/generators/forwarding/fluentd/fluent_conf_test.go#L1583-L1678

Comment 18 Willem Veerman 2020-10-01 15:41:31 UTC
That python module looks really good. I thought it might be a good idea to create a new fluentd ES plugin which follows this nicer retry behaviour and which has many of the unnecessary features stripped out - exactly like what you've done in that module.

I thought a 'create' operation would be better than index because we are using the plugin's elasticsearch_genid plugin to set the primary key for each record based on its sha256 digest, so performing an upsert each time a duplicate is sent is just pointless (and costly).

However, I've realised that this elasticsearch_genid feature is essentially just a hack to get over the fact that the plugin can unnecessarily send retries when it's misconfigured. Now that we're using the retry_tag feature there should be less need for it. Although I guess if you use both elasticsearch_genid and write_operation create then you essentially ensure that records appear only once but without excess computation cost on the ES cluster.
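
A sketch of that combination with fluent-plugin-elasticsearch (field name, host, and port are illustrative; support for hashing the record content in the genid filter depends on the plugin version):

    <filter **>
      @type elasticsearch_genid
      # Derive the id from the record content so a re-sent duplicate produces
      # the same document id rather than a fresh one.
      use_entire_record true
      hash_type sha256
      hash_id_key viaq_msg_id
    </filter>

    <match **>
      @type elasticsearch
      host elasticsearch.example.svc
      port 9200
      # "create" makes a duplicate fail with a 409 instead of upserting it.
      write_operation create
      id_key viaq_msg_id
      remove_keys viaq_msg_id
      request_timeout 2147483648
    </match>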

I've already been checking your fluentd config with interest - it's the reason that I posted my first comment.

I notice that you have set the ES output plugin's buffer to retry. I assumed that the retry configuration within the buffer section pertained to the application level - in other words each chunk in the buffer is sent in a separate HTTP request and it will keep retrying until it receives an HTTP response - for this reason I set our fluentd agents to only retry a limited number of times.

Am I wrong: do those settings within the buffer section actually pertain to the tcp connection? In other words, if fluentd can't reach the ES endpoint in tcp terms, it keeps on retrying each chunk forever, but if it detects the tcp connection is working, it sends the request once, and only retries via the retry_tag mechanism?

Comment 19 Willem Veerman 2020-10-01 16:49:49 UTC
I think possibly what might happen is that when ES is congested it can even fail to respond to tcp requests, so it seems like there's nothing listening on that port, and hence fluentd re-sends chunks, but some of the chunks sent in the first place do actually get received, and hence duplicates appear. Maybe.

Comment 20 Peter Portante 2020-10-01 22:08:39 UTC
(In reply to Willem Veerman from comment #18)
> I thought a 'create' operation would be better than index because we are
> using the plugin's elasticsearch_genid plugin to set the primary key for
> each record based on its sha256 digest, so performing an upsert each time a
> duplicate is sent is just pointless (and costly).

+1

> However, I've realised that this elasticsearch_genid feature is essentially
> just a hack to get over the fact that the plugin can unnecessarily send
> retries when it's misconfigured. Now that we're using the retry_tag feature
> there should be less need for it. Although I guess if you use both
> elasticsearch_genid and write_operation create then you essentially ensure
> that records appear only once but without excess computation cost on the ES
> cluster.

+1
 
> I've already been checking your fluentd config with interest - it's the
> reason that I posted my first comment.
> 
> I notice that you have set the ES output plugin's buffer to retry. I assumed
> that the retry configuration within the buffer section pertained to the
> application level - in other words each chunk in the buffer is sent in a
> separate HTTP request and it will keep retrying until it receives an HTTP
> response - for this reason I set our fluentd agents to only retry a limited
> number of times.
> 
> Am I wrong: do those settings within the buffer section actually pertain to
> the tcp connection? In other words, if fluentd can't reach the ES endpoint
> in tcp terms, it keeps on retrying each chunk forever, but if it detects the
> tcp connection is working, it sends the request once, and only retries via
> the retry_tag mechanism?

I think so, but you might want to file an issue asking that same question in
the git repo for the fluentd config.

Comment 21 Peter Portante 2020-10-01 22:09:29 UTC
(In reply to Willem Veerman from comment #19)
> I think possibly what might happen is that when ES is congested it can even
> fail to respond to tcp requests, so it seems like there's nothing listening
> on that port, and hence fluentd re-sends chunks, but some of the chunks sent
> in the first place do actually get received, and hence duplicates appear.
> Maybe.

+1, seems likely, and why using "create" is so important.

Comment 22 Willem Veerman 2020-10-02 07:49:09 UTC
I will ask that question in the fluentd ES plugin repo - or in the fluentd repo itself.

We did try 'create' but for some reason when we used it, it made our problems worse - it took down the cluster immediately - although at that time I think there were still a lot of other things wrong with our fluentd config. I'll try it again.

I see now that your pyesbulk module doesn't need to be written as a fluentd plugin; the module should be placed between the queue and ES - so fluentd pushes to the HA queue, and pyesbulk pulls from queue and pushes to ES without overloading it. That definitely seems like a useful application that would have a lot of uptake. Just needs some mechanism to be bolted onto and fed by another module which can pull from redis, kafka, etc.... but I guess you have that already.

Comment 23 Peter Portante 2020-10-02 11:49:19 UTC
(In reply to Willem Veerman from comment #22)
> I see now that your pyesbulk module doesn't need to be written as a fluentd
> plugin; the module should be placed between the queue and ES - so fluentd
> pushes to the HA queue, and pyesbulk pulls from queue and pushes to ES
> without overloading it.

Exactly.

> That definitely seems like a useful application that would have a lot of uptake.

Hopefully!

> Just needs some mechanism to be bolted onto and fed by another module which can
> pull from redis, kafka, etc.... but I guess you have that already.

We don't have a generic mechanism.  The interface is a simple generator pattern
which lends itself to many uses.

Comment 24 Willem Veerman 2020-11-12 10:03:56 UTC
Thought you might be interested in this

I raised an issue with the fluentd ES plugin maintainer, and he confirmed that the buffer retry parameters can override the plugin's request_timeout parameter:

https://github.com/uken/fluent-plugin-elasticsearch/issues/829

I _think_ that the buffer retries can be triggered by absence of HTTP response *and* by a failed TCP connection, i.e. target port closed.

I'm pretty sure that when ES gets severely strained it fails to respond on its port, which, if my presumption is correct, would cause the buffer to retry immediately, making the problem much worse.
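
For reference, the buffer-level retry parameters being discussed live in the output's <buffer> section; a sketch follows (values are illustrative, not a recommendation from this bug):

    <match **>
      @type elasticsearch
      host elasticsearch.example.svc
      port 9200
      request_timeout 2147483648
      <buffer>
        @type file
        path /var/lib/fluentd/buffer-es
        # Chunk re-submission is governed here and, per the linked issue, can
        # override request_timeout.
        retry_type exponential_backoff
        retry_wait 1s
        retry_max_interval 60s
        # retry_forever keeps retrying indefinitely; retry_max_times or
        # retry_timeout bounds it, as mentioned in comment 18.
        retry_forever true
      </buffer>
    </match>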