Description of problem:
Many volumes don't start when the gluster service is started. The volumes usually do start on most bricks, but at least one brick typically fails to start the volume. For instance:

Status of volume: keys
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick labstore1:/a/keys                                 24329   Y       27864
Brick labstore2:/a/keys                                 24329   Y       4778
Brick labstore3:/a/keys                                 24329   N       N/A
Brick labstore4:/a/keys                                 24329   Y       27413
NFS Server on localhost                                 38467   Y       7919
Self-heal Daemon on localhost                           N/A     N       7925
NFS Server on labstore4.pmtpa.wmnet                     38467   Y       28194
Self-heal Daemon on labstore4.pmtpa.wmnet               N/A     N       28202
NFS Server on labstore1.pmtpa.wmnet                     38467   Y       28590
Self-heal Daemon on labstore1.pmtpa.wmnet               N/A     N       28596
NFS Server on labstore2.pmtpa.wmnet                     38467   Y       4784
Self-heal Daemon on labstore2.pmtpa.wmnet               N/A     N       4790

Version-Release number of selected component (if applicable):
3.3.1 running on Ubuntu Precise

How reproducible:
Reproducible by doing stop/start on volumes or by restarting the gluster processes.

Additional info:
I have roughly 350 volumes.
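As an illustrative sketch (not part of the original report), bricks that failed to come up can be spotted by filtering the status output for an "N" in the Online column. The awk pattern assumes the 3.3-era column layout shown above (Brick line, port, Online flag, PID); in practice you would pipe the live `gluster volume status` output into it rather than the embedded sample:

```shell
# List bricks reported offline ("N" in the Online column).
# Field layout assumed: "Brick <host:path> <port> <online> <pid>".
offline_bricks() {
  awk '/^Brick/ && $(NF-1) == "N" { print $2 }'
}

# Sample lines copied from the status output above, for illustration;
# normally: gluster volume status | offline_bricks
sample='Brick labstore3:/a/keys 24329 N N/A
Brick labstore4:/a/keys 24329 Y 27413'

printf '%s\n' "$sample" | offline_bricks   # prints labstore3:/a/keys
```

In my experience, `gluster volume start VOLNAME force` respawns a missing brick process, though that is a workaround rather than a fix for the startup failure described here.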
Here's a brick log on a failing brick:

[2013-02-04 15:59:14.010995] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.3.1
[2013-02-04 15:59:14.013328] W [socket.c:410:__socket_keepalive] 0-socket: failed to set keep idle on socket 8
[2013-02-04 15:59:14.013426] W [socket.c:1876:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2013-02-04 16:00:17.146181] E [socket.c:1715:socket_connect_finish] 0-glusterfs: connection to  failed (Connection timed out)
[2013-02-04 16:00:17.146232] E [glusterfsd-mgmt.c:1787:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: Transport endpoint is not connected
[2013-02-04 16:00:17.146255] I [glusterfsd-mgmt.c:1790:mgmt_rpc_notify] 0-glusterfsd-mgmt: -1 connect attempts left
[2013-02-04 16:00:17.146422] W [glusterfsd.c:831:cleanup_and_exit] (-->/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x28) [0x7fb8c98d18b8] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xc0) [0x7fb8c98d6090] (-->/usr/sbin/glusterfsd(+0xd3b6) [0x7fb8c9f863b6]))) 0-: received signum (1), shutting down
[2013-02-04 16:00:17.146477] W [rpc-clnt.c:1496:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x1x Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
[2013-02-04 16:00:17.147392] E [rpcsvc.c:1155:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
Additional info:

13:33 < johnmark> ok
13:33 < Ryan_Lane> also, gluster volume start/stop/create take forever, eat tons of memory and cpu, and cause glusterd to become completely unresponsive for 20-30 seconds
13:33 < johnmark> Ryan_Lane: did you report that bug?
13:34 < johnmark> that's... interesting
13:34 < Ryan_Lane> I've had 3 outages in the past two weeks

I'm guessing that glusterd's single-threadedness is probably hurting here, given the sheer number of volumes, resulting in slow responsiveness from glusterd.
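To put numbers on the 20-30 second stalls mentioned above, a simple (hypothetical, not from the original report) timing wrapper can be used around any glusterd CLI call; the wrapper itself is generic, and `gluster volume info` is only an example invocation:

```shell
# Report wall-clock seconds taken by an arbitrary command.
elapsed() {
  start=$(date +%s)
  "$@" > /dev/null 2>&1
  end=$(date +%s)
  echo $(( end - start ))
}

# Example (assumes glusterd is running on this host):
#   elapsed gluster volume info
```

Running this around volume operations while glusterd is under load would show whether responsiveness degrades with the number of volumes.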
Will the new multi-threaded glusterd help for this type of use case?
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will get automatically closed.