Description of problem:

A customer is testing CRS scaling with brick multiplexing, with and without SSL encryption. They are encountering unexpected behavior and high CPU/memory usage, and are unable to provision volumes with SSL encryption enabled.

These are the most recent tests, run after disabling monitoring to avoid locking issues:

* rhgs-az[b,c,d]-1 - 3-node cluster with SSL encryption enabled - 500 volumes
* rhgs-az[b,c,d]-2 - 3-node cluster without encryption - 500 volumes
* rhs-c01-n0[1-6] - 6-node cluster without encryption - 800 volumes

Results:

- The glusterd process consumes a huge amount of memory (all available) during volume provisioning, whether SSL is enabled or disabled. Memory consumption is roughly 50% lower after bouncing (restarting) all gluster processes.
- Provisioning succeeds only on the non-encrypted clusters.
- Volumes are not started properly: roughly 30 brick processes are not running after provisioning. They need "gluster volume start <vol> force" or a glusterd restart on all nodes (see the sketch at the end of this description).
- The SSL-encrypted cluster fails to provision volumes after some time and ends up in an inconsistent state.
- The SSL-encrypted cluster consumes more resources (RAM, CPU), but that is expected due to the encryption layer.
- SSL errors like these appear in the logs:

/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.859095] E [socket.c:2510:socket_poller] 0-socket.management: server setup failed
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.859779] E [socket.c:358:ssl_setup_connection] 0-socket.management: SSL connect error (client: 10.3.107.49:1013) (server: 10.3.104.154:24007)
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.859984] E [socket.c:2510:socket_poller] 0-socket.management: server setup failed
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.954674] E [socket.c:358:ssl_setup_connection] 0-socket.management: SSL connect error (client: 10.3.104.154:974) (server: 10.3.104.154:24007)
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.955464] E [socket.c:2510:socket_poller] 0-socket.management: server setup failed
/var/log/glusterfs/glusterd.log:[2017-11-17 17:49:11.955829] E [socket.c:358:ssl_setup_connection] 0-socket.management: SSL connect error (client: 10.3.104.154:996) (server: 10.3.104.154:24007)

Are there any recommended settings for cluster.max-bricks-per-process? The customer tried using a single brick process for all volumes but encountered high memory/CPU usage. Values in the 15-25 range appear to work fine for the non-encrypted clusters.

Version-Release number of selected component (if applicable):
RHGS 3.3, CRS 3.6

Please let me know what information would be useful for investigating these issues. The customer has uploaded many log files to the case, which can be accessed in collab-shell.
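For reference, a minimal sketch of the tunables and workaround mentioned above, assuming the standard gluster CLI. The value 20 is only an illustration from the 15-25 range the customer found workable, not a verified recommendation, and the "Online" parsing assumes the usual "gluster volume status ... detail" output format:

    # Enable brick multiplexing and cap how many bricks share one
    # glusterfsd process (both are cluster-wide options, set on "all"):
    gluster volume set all cluster.brick-multiplex on
    gluster volume set all cluster.max-bricks-per-process 20

    # Workaround for bricks left offline after provisioning: find
    # volumes with any brick reporting "Online : N" and force-start them
    for vol in $(gluster volume list); do
        if gluster volume status "$vol" detail | grep -q 'Online.*: N'; then
            gluster volume start "$vol" force
        fi
    done

    # For the SSL connect errors, one quick sanity check is that each
    # node's certificate verifies against the shared CA file (paths
    # below are the GlusterFS defaults):
    openssl verify -CAfile /etc/ssl/glusterfs.ca /etc/ssl/glusterfs.pem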
From comment 29:

Build: 3.12.2-9

On a three-node cluster with brick-mux enabled, tested with and without SSL enabled. Created 500+ volumes of type replicate (2x3). Performed node reboot scenarios and a volume set operation while a node was down. After the node reboot, all bricks are online. No significant hike in glusterd memory was seen: it increased from 11 MB to around 300 MB, which is acceptable, on all nodes in the cluster. Hence marking this as verified.
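As a side note, glusterd memory figures like those quoted above can be tracked during a provisioning run by periodically sampling the daemon's resident set size; a minimal sketch, assuming a Linux ps that supports -C (the 30-second interval is arbitrary):

    # Print a timestamped glusterd RSS reading (in MiB) every 30 seconds
    while true; do
        printf '%s glusterd ' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        ps -C glusterd -o rss= | awk '{printf "%.1f MiB\n", $1/1024}'
        sleep 30
    done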
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607