Bug 1563804

Summary: Client can create denial of service (DOS) conditions on server
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: noah davids <ndavids>
Component: glusterfs
Assignee: Milind Changire <mchangir>
Status: CLOSED ERRATA
QA Contact: Bala Konda Reddy M <bmekala>
Severity: medium
Priority: medium
Version: rhgs-3.3
CC: amukherj, nchilaka, rgowdapp, rhinduja, rhs-bugs, sheggodu, vbellur
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.12.2-9
Doc Type: If docs needed, set a value
Clone Of:
: 1564600 (view as bug list)
Last Closed: 2018-09-04 06:46:01 UTC
Type: Bug
Bug Depends On: 1564600
Bug Blocks: 1503137

Description noah davids 2018-04-04 18:25:23 UTC
Description of problem:
Clients with large numbers of bricks can create a denial of service condition on port 24007 when they start up: they overflow the server's listen queue with connection requests, so subsequent connection attempts go unanswered.


Version-Release number of selected component (if applicable):


How reproducible:
Unclear, but this has been seen in a number of cases: 02047946 and 02055527.

Steps to Reproduce:
1.
2.
3.

Actual results:

Analysis of one trace from the customer shows a client sending up to 13 connection attempts per second to a single server.

$ tshark -r tcpdump-crp-prod-glusterfs03.srv.allianz.log -Y "tcp.dstport == 24007 && ip.src == 10.16.77.22 && not tcp.analysis.retransmission && tcp.flags.syn == 1" -T fields -e ip.src -e ip.dst 2>/dev/null | sort | uniq -c | awk '{print $1 " " $2 " -> " $3 " that is " $1/154.864538 " connection attempts per second"}' | column -t
963   10.16.77.22  ->  10.16.77.20  that  is  6.21834  connection  attempts  per  second
172   10.16.77.22  ->  10.16.77.21  that  is  1.11065  connection  attempts  per  second
233   10.16.77.22  ->  10.16.77.23  that  is  1.50454  connection  attempts  per  second
1907  10.16.77.22  ->  10.16.77.24  that  is  12.314   connection  attempts  per  second
2039  10.16.77.22  ->  10.16.77.25  that  is  13.1663  connection  attempts  per  second
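
On the server side, one way to confirm that the listen queue on port 24007 is actually overflowing (a diagnostic sketch, not part of the original capture analysis) is to watch the kernel's listen-drop counters:

    # These counters increase while the accept queue is overflowing.
    netstat -s | grep -iE "listen queue|SYNs to LISTEN"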

Expected results:
Is it really necessary to send so many connection requests per second? 

Additional info:
I believe that this has been touched on in at least 2 other bugs:

    Bug 1552928 - [GSS] glusterd services have a lot of disconnects
and

    Bug 1535732 - [GSS] Bricks Show Online - Error says Offline

However, both bugs dealt with the effects of this and not the root cause. Based on the discussion in these bugs, specifically comment 32 in 1535732 and comment 23 of 1552928, it would seem that this very large number of connections per second is a design feature. From comment #23: "...every individual bricks of the volumes would try to fetch the volfile from glusterd to complete the handshake by sending a fetchspec request on this port."

I submit that a feature that prevents correct operation (by overflowing the listen queue) is in fact either a bug in the software, for not preventing the large number of connections per second, or a bug in the documentation, for not explicitly stating the limitations of the design so that users know when they have created a system so large that this feature will create problems. At a minimum, design documents should state that for certain environment sizes the listen queue needs to be at least X.

Comment 2 Atin Mukherjee 2018-04-05 04:47:42 UTC
We have already initiated a discussion within engineering about having a higher value of transport.listen-backlog set by default to address this issue.
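
For context, a minimal sketch of where this option can already be set manually on a server node; the 1024 value is illustrative and the file layout assumes a typical RHGS install:

    # Excerpt from /etc/glusterfs/glusterd.vol (other option lines omitted);
    # glusterd must be restarted for the change to take effect.
    volume management
        type mgmt/glusterd
        option transport.listen-backlog 1024
    end-volume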

Comment 3 noah davids 2018-04-05 16:53:32 UTC
I would like to see some documented guidance on what the listen backlog should be set to for various client configurations. Just setting a larger value, without any details on what client configuration that larger value will support, only moves the problem; it doesn't really resolve it.

Comment 4 Atin Mukherjee 2018-04-05 17:18:19 UTC
I’d appreciate your patience while we work on a patch. The plan is to auto-tune this configuration based on the number of volumes.

Milind can add further details here.

Comment 5 noah davids 2018-04-05 20:10:32 UTC
Auto-tuning would be good, but keep in mind that if the auto-tuned value is larger than the current net.core.somaxconn value, you will need to change that value as well. Any auto-magic change to that value needs to be documented so people understand why a system control value was suddenly changed and do not change it back. Also, the value may be controlled by system configuration tools like Puppet, so a change to the local file will not persist even if no one changes it manually. It might be worthwhile to add a sanity check at startup so that if the transport.listen-backlog value is greater than the net.core.somaxconn value, some kind of warning message is issued.
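
To make the interaction concrete, a sketch of the two pieces discussed above; the sysctl value, the drop-in file name, and the warning script are illustrative assumptions, not an existing glusterd feature:

    # Raise and persist the kernel accept-queue cap (net.core.somaxconn
    # defaults to 128); configuration-management tools such as Puppet may
    # need the same change so it is not reverted.
    sysctl -w net.core.somaxconn=1024
    echo "net.core.somaxconn = 1024" > /etc/sysctl.d/90-gluster-backlog.conf

    # Hypothetical startup sanity check along the lines suggested above:
    # the kernel silently clamps listen() backlogs to net.core.somaxconn.
    BACKLOG=$(awk '/transport.listen-backlog/ {print $3}' /etc/glusterfs/glusterd.vol)
    SOMAXCONN=$(sysctl -n net.core.somaxconn)
    if [ -n "$BACKLOG" ] && [ "$BACKLOG" -gt "$SOMAXCONN" ]; then
        echo "WARNING: transport.listen-backlog ($BACKLOG) exceeds net.core.somaxconn ($SOMAXCONN)"
    fi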

Comment 6 Raghavendra G 2018-04-06 03:22:06 UTC
(In reply to noah davids from comment #5)
> Auto-tuning would be good, but keep in mind that if the auto-tuned value
> is larger than the current net.core.somaxconn value, you will need to
> change that value as well. Any auto-magic change to that value needs to
> be documented so people understand why a system control value was
> suddenly changed and do not change it back. Also, the value may be
> controlled by system configuration tools like Puppet, so a change to the
> local file will not persist even if no one changes it manually. It might
> be worthwhile to add a sanity check at startup so that if the
> transport.listen-backlog value is greater than the net.core.somaxconn
> value, some kind of warning message is issued.

Are you aware of any known details of how auto-tuning can work? Is there a way to figure out whether the current listen backlog is insufficient? What parameters (say, from /proc or through ioctls) should we be checking for signs of an insufficient backlog value?

Just wondering how other software that deals with high traffic, such as web servers, solves this. If you can point to any informative discussion around this topic, it will be very helpful for us.

Comment 7 noah davids 2018-04-06 14:22:23 UTC
The current depth of the listen queue is displayed by the ss command in the Recv-Q column for the listening socket. I have not checked how it gets that value. Were you planning on changing the backlog dynamically if the queue starts to build up? I am not sure you can do that. You would have to dynamically change the entry in the socket structure, and I do not believe that there is currently a way to do that.
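
For example (a diagnostic sketch based on the point above, not part of the original report), on a gluster server node:

    # Inspect the glusterd listening socket. On a listening socket, Recv-Q
    # is the number of connections currently waiting to be accepted and
    # Send-Q is the configured backlog limit (exact column semantics can
    # vary with kernel and iproute2 versions).
    ss -ltn 'sport = :24007'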

I am not aware of any environments that auto-tune their listen backlog queue. That doesn't mean that there are none, just that I am unaware of them. I was assuming you could look at the configuration of the environment, i.e. the number of clients, the number of bricks, etc., and calculate the number of connections, assuming that some percentage of clients all tried to connect at once for some percentage of bricks.

Comment 8 Raghavendra G 2018-04-07 14:07:09 UTC
(In reply to noah davids from comment #7)
> The current depth of the listen queue is displayed by the ss command in
> the Recv-Q column for the listening socket. I have not checked how it
> gets that value. Were you planning on changing the backlog dynamically
> if the queue starts to build up?

That was one option we were considering. Another question: are there any drawbacks to having a large default value? We are thinking of 1024 as the default value (along with some code changes that make the accept() path in gluster faster [1]). So if a user/sysadmin starts seeing these errors, they should tune the following (preferably in the order given; a sketch follows the list):

* Tune the system-wide configuration (net.core.somaxconn) to a higher value (maybe 1024?).
* Increase the option server.event-threads to a higher value. This will increase the number of threads working on the accept() path. Note that the same set of threads processes normal traffic _and_ accepts incoming connections.
* Tune transport.listen-backlog to a value higher than 1024.
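
As an illustration of the second item (the net.core.somaxconn and glusterd.vol changes are sketched under comments 5 and 2 above), with VOLNAME as a placeholder volume name and 4 as an illustrative thread count:

    # Check the current value, then raise the number of server-side event
    # threads that handle both normal traffic and incoming connections.
    gluster volume get VOLNAME server.event-threads
    gluster volume set VOLNAME server.event-threads 4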

The idea as of now, as far as tuning backlog values goes, is to have default values high enough to suit most scenarios. If not, it would be the responsibility of sysadmins to experiment and arrive at a value suitable for their environments.

What do you think about this approach?

[1] https://review.gluster.org/19833

Comment 9 noah davids 2018-04-09 15:01:49 UTC
The drawback to a large listen backlog is the resources that it may use when the backlog is full (or filling). Assuming no possibility of either a DOS attack or a misbehaving client application, the drawbacks of a large listen queue are minimal for the server. However, a client with a connection accepted but still in the listen queue may send data and expect a response, or expect an initial message from the service application. If these do not arrive within a timeout, the client may report an error and/or close the connection. The larger the backlog, the greater the probability that this could happen. It would be a different and perhaps more confusing type of error than the cannot-connect type of error generated when the listen queue is full.

I think what you are saying is that the default transport.listen-backlog will be 1024. This is still limited by the system default, which is 128, but the administrator could tune it by changing only the net.core.somaxconn value and not have to worry about transport.listen-backlog as well. Is that correct?

I realize that it is always the responsibility of the system administrators to tune the system for their environment, but what I am saying is that if the design requires such a large number of simultaneous connections, the system administrator should be given guidance on at least a reasonable starting number for the backlog. Telling them to experiment isn't really very helpful, since it may not be possible to duplicate the production environment for experimentation.

Comment 10 Milind Changire 2018-04-10 07:06:06 UTC
(In reply to noah davids from comment #9)
> The drawback to a large listen backlog is the resources that it may use
> when the backlog is full (or filling). Assuming no possibility of either
> a DOS attack or a misbehaving client application, the drawbacks of a
> large listen queue are minimal for the server. However, a client with a
> connection accepted but still in the listen queue may send data and
> expect a response, or expect an initial message from the service
> application. If these do not arrive within a timeout, the client may
> report an error and/or close the connection. The larger the backlog, the
> greater the probability that this could happen. It would be a different
> and perhaps more confusing type of error than the cannot-connect type of
> error generated when the listen queue is full.
> 
> I think what you are saying is that the default transport.listen-backlog
> will be 1024. This is still limited by the system default, which is 128,
> but the administrator could tune it by changing only the
> net.core.somaxconn value and not have to worry about
> transport.listen-backlog as well. Is that correct?

This is correct.

> 
> I realize that it is always the responsibility of the system
> administrators to tune the system for their environment, but what I am
> saying is that if the design requires such a large number of
> simultaneous connections, the system administrator should be given
> guidance on at least a reasonable starting number for the backlog.
> Telling them to experiment isn't really very helpful, since it may not
> be possible to duplicate the production environment for experimentation.

Assuming all volumes are replicated, i.e. requiring self-heal daemons, the listen-backlog value should be twice the maximum number of bricks hosted by any single node in the cluster. This will take care of the glusterd as well as the glusterfsd backlog queues. System administrators need to be aware that as the number of volumes and bricks in the cluster grows, so does the need to monitor and tweak the listen-backlog value.
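
As a rough illustration of that rule of thumb (the grep pattern is an assumption about how bricks were named; adjust it if FQDNs or IP addresses were used in the brick paths):

    # Count the bricks hosted by this node across all volumes and double it
    # to get a starting listen-backlog value per the guideline above.
    BRICKS=$(gluster volume info | grep -c "^Brick[0-9]*: $(hostname -s):")
    echo "bricks on this node: $BRICKS -> suggested transport.listen-backlog: $((BRICKS * 2))"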

This will be documented in the User's Guide along with Raghavendra's suggestions mentioned in comment #8.

Comment 11 noah davids 2018-04-10 14:15:43 UTC
OK that sounds good.

Comment 12 Raghavendra G 2018-04-13 02:35:46 UTC
> However, a client with a connection accepted but still in the listen queue may send data and expect a response, or expect an initial message from the service application. If these do not arrive within a timeout, the client may report an error and/or close the connection. The larger the backlog, the greater the probability that this could happen. It would be a different and perhaps more confusing type of error than the cannot-connect type of error generated when the listen queue is full.

At least at the level of the Glusterfs protocol, the timeout is 30 minutes, so this is not an issue. But lower layers like TCP might time out and retry.

Comment 19 Bala Konda Reddy M 2018-05-14 12:18:11 UTC
Build: 3.12.2-9

Followed the steps mentioned in comment 14.
On an 8-node setup, created 200 volumes (100 EC (4+2) volumes and 100 replicate volumes).
Performed a reboot on one of the nodes. Did not see any peer stuck in the connecting state in the peer status output; all peers are in the connected state and all bricks are online.
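
For reference, the kind of post-reboot checks implied above (a sketch; the exact procedure is in comment 14, which is not shown here):

    # Every peer should report "Peer in Cluster (Connected)".
    gluster peer status

    # Every brick should show Y in the Online column and a real TCP port;
    # offline bricks show N and N/A.
    gluster volume status all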

Hence marking it as verified.

Comment 21 errata-xmlrpc 2018-09-04 06:46:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607