Bug 907540 - Gluster fails to start many volumes
Summary: Gluster fails to start many volumes
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: GlusterFS
Classification: Community
Component: unclassified
Version: 3.3.0
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-02-04 16:44 UTC by Ryan Lane
Modified: 2014-12-14 19:40 UTC (History)
5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-12-14 19:40:30 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Ryan Lane 2013-02-04 16:44:12 UTC
Description of problem:

Many volumes fail to start when the gluster service is started. A volume usually does start on most of its bricks, but at least one brick typically fails to start it. For instance:

Status of volume: keys
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick labstore1:/a/keys                                 24329   Y       27864
Brick labstore2:/a/keys                                 24329   Y       4778
Brick labstore3:/a/keys                                 24329   N       N/A
Brick labstore4:/a/keys                                 24329   Y       27413
NFS Server on localhost                                 38467   Y       7919
Self-heal Daemon on localhost                           N/A     N       7925
NFS Server on labstore4.pmtpa.wmnet                     38467   Y       28194
Self-heal Daemon on labstore4.pmtpa.wmnet               N/A     N       28202
NFS Server on labstore1.pmtpa.wmnet                     38467   Y       28590
Self-heal Daemon on labstore1.pmtpa.wmnet               N/A     N       28596
NFS Server on labstore2.pmtpa.wmnet                     38467   Y       4784
Self-heal Daemon on labstore2.pmtpa.wmnet               N/A     N       4790
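
For reference, the output above comes from the volume status command; when a brick shows Online "N", it can usually be brought back without disturbing the bricks that are already running by forcing a start. A minimal sketch, assuming the volume name keys from the output above:

# Per-brick status for the volume (this is the output pasted above)
gluster volume status keys

# Start only the brick processes that are not running; bricks that are
# already online are left alone
gluster volume start keys force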

Version-Release number of selected component (if applicable):

3.3.1 running on Ubuntu Precise

How reproducible:

Reproducible by stopping and starting volumes, or by restarting the gluster processes.
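
A minimal reproduction sketch, assuming one of the affected volumes (keys) and the stock Ubuntu packaging, where the daemon runs under the glusterfs-server service name:

# Stop and start a single volume; with ~350 volumes defined, at least
# one brick usually fails to come back online
gluster volume stop keys
gluster volume start keys

# Or restart the gluster daemons on one of the servers
# (service name assumed from the Ubuntu Precise packages)
service glusterfs-server restart

# Then check which bricks came back
gluster volume status keys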

Additional info:

I have roughly 350 volumes. 

Here's a brick log on a failing brick:

[2013-02-04 15:59:14.010995] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.3.1
[2013-02-04 15:59:14.013328] W [socket.c:410:__socket_keepalive] 0-socket: failed to set keep idle on socket 8
[2013-02-04 15:59:14.013426] W [socket.c:1876:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2013-02-04 16:00:17.146181] E [socket.c:1715:socket_connect_finish] 0-glusterfs: connection to  failed (Connection timed out)
[2013-02-04 16:00:17.146232] E [glusterfsd-mgmt.c:1787:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: Transport endpoint is not connected
[2013-02-04 16:00:17.146255] I [glusterfsd-mgmt.c:1790:mgmt_rpc_notify] 0-glusterfsd-mgmt: -1 connect attempts left
[2013-02-04 16:00:17.146422] W [glusterfsd.c:831:cleanup_and_exit] (-->/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x28) [0x7fb8c98d18b8] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xc0) [0x7fb8c98d6090] (-->/usr/sbin/glusterfsd(+0xd3b6) [0x7fb8c9f863b6]))) 0-: received signum (1), shutting down
[2013-02-04 16:00:17.146477] W [rpc-clnt.c:1496:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x1x Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
[2013-02-04 16:00:17.147392] E [rpcsvc.c:1155:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap

Comment 1 JMW 2013-02-14 21:42:12 UTC
Additional info: 

13:33 < johnmark> ok
13:33 < Ryan_Lane> also, gluster volume start/stop/create take forever, eat tons of memory and 
                   cpu, and cause glusterd to become completely unresponsive for 20-30 seconds
13:33 < johnmark> Ryan_Lane: did you report that bug?
13:34 < johnmark> that's... interesting
13:34 < Ryan_Lane> I've had 3 outages in the past two weeks


I'm guessing that glusterd's single-threadedness is hurting here: with this sheer number of volumes, it results in slow responsiveness from glusterd.

Comment 2 JMW 2013-02-14 21:43:02 UTC
Will the new multi-threaded glusterd help for this type of use case?

Comment 3 Niels de Vos 2014-11-27 14:54:07 UTC
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify if this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will get automatically closed.

