Description of problem:

A customer is testing CRS scaling with brick multiplexing, with and without SSL encryption. They are encountering unexpected behavior and high CPU/memory usage, and are unable to provision volumes with SSL encryption enabled.

These are the most recent tests, run after disabling monitoring to avoid locking issues:

* rhgs-az[b,c,d]-1 - 3-node cluster with SSL encryption enabled - 500 volumes
* rhgs-az[b,c,d]-2 - 3-node cluster without encryption - 500 volumes
* rhs-c01-n0[1-6] - 6-node cluster without encryption - 800 volumes

Results:

- The glusterd process consumes a huge amount of memory (all available) during volume provisioning, whether SSL is enabled or disabled. Memory consumption is roughly 50% lower after bouncing (restarting) all gluster processes.
- Provisioning succeeds only on the non-encrypted clusters.
- Volumes are not started properly: roughly 30 brick processes are not running after provisioning. They need "gluster volume start <vol> force" or a glusterd restart on all nodes (see the sketch at the end of this description).
- The SSL-encrypted cluster fails to provision volumes after some time and ends up in an inconsistent state.
- The SSL-encrypted cluster consumes more resources (RAM, CPU), but that is expected due to the encryption layer.
- SSL errors like these appear in the logs:

/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.859095] E [socket.c:2510:socket_poller] 0-socket.management: server setup failed
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.859779] E [socket.c:358:ssl_setup_connection] 0-socket.management: SSL connect error (client: 10.3.107.49:1013) (server: 10.3.104.154:24007)
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.859984] E [socket.c:2510:socket_poller] 0-socket.management: server setup failed
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.954674] E [socket.c:358:ssl_setup_connection] 0-socket.management: SSL connect error (client: 10.3.104.154:974) (server: 10.3.104.154:24007)
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.955464] E [socket.c:2510:socket_poller] 0-socket.management: server setup failed
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.955829] E [socket.c:358:ssl_setup_connection] 0-socket.management: SSL connect error (client: 10.3.104.154:996) (server: 10.3.104.154:24007)

Are there any recommended settings for cluster.max-bricks-per-process? The customer tried using a single brick process for all volumes but encountered high memory/CPU usage. Values in the 15-25 range appear to work fine for the non-encrypted clusters.

Version-Release number of selected component (if applicable):
RHGS 3.3, CRS 3.6

Please let me know what information would be useful for investigating these issues. The customer has uploaded many log files to the case, which can be accessed in collab-shell.
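For reference, a minimal sketch of the tunables and workaround mentioned above, assuming the standard gluster CLI. The value 20 is only an illustration from the 15-25 range the customer found workable, not a verified recommendation, and the "Online" parsing assumes the usual "gluster volume status ... detail" output format:

    # Enable brick multiplexing and cap how many bricks share one
    # glusterfsd process (both are cluster-wide options, set on "all"):
    gluster volume set all cluster.brick-multiplex on
    gluster volume set all cluster.max-bricks-per-process 20

    # Workaround for bricks left offline after provisioning: find
    # volumes with any brick reporting "Online : N" and force-start them
    for vol in $(gluster volume list); do
        if gluster volume status "$vol" detail | grep -q 'Online.*: N'; then
            gluster volume start "$vol" force
        fi
    done

    # For the SSL connect errors, one quick sanity check is that each
    # node's certificate verifies against the shared CA file (paths
    # below are the GlusterFS defaults):
    openssl verify -CAfile /etc/ssl/glusterfs.ca /etc/ssl/glusterfs.pem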
From comment 29:

Build: 3.12.2-9

On a three-node cluster with brick-mux enabled, tested with and without SSL enabled. Created 500+ volumes of type replicate (2x3). Performed node reboot scenarios and a volume set operation while a node was down. After the node reboot, all bricks are online. No significant hike in glusterd memory was seen: it increased from 11 MB to around 300 MB, which is acceptable, on all nodes in the cluster. Hence marking this as verified.
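As a side note, glusterd memory figures like those quoted above can be tracked during a provisioning run by periodically sampling the daemon's resident set size; a minimal sketch, assuming a Linux ps that supports -C (the 30-second interval is arbitrary):

    # Print a timestamped glusterd RSS reading (in MiB) every 30 seconds
    while true; do
        printf '%s glusterd ' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        ps -C glusterd -o rss= | awk '{printf "%.1f MiB\n", $1/1024}'
        sleep 30
    done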
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607