Bug 1269536 - Glusterd cannot be started with a large number of volumes
Status: CLOSED WORKSFORME
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: 3.1
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Assigned To: Anand Nekkunti
storage-qa-internal@redhat.com
glusterd
Keywords: ZStream
Depends On:
Blocks:
 
Reported: 2015-10-07 10:08 EDT by Elad
Modified: 2016-01-03 23:50 EST
5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-08 08:35:06 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Elad 2015-10-07 10:08:34 EDT
Description of problem:
We have a Gluster setup with 3 servers in the cluster. There are around 115 volumes in the setup.
An attempt to start glusterd fails because it takes too long to initialize; systemctl times out before glusterd finishes initializing.

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.7.1-11.el7rhgs.x86_64
glusterfs-libs-3.7.1-11.el7rhgs.x86_64
glusterfs-3.7.1-11.el7rhgs.x86_64
glusterfs-api-3.7.1-11.el7rhgs.x86_64
glusterfs-server-3.7.1-11.el7rhgs.x86_64
glusterfs-client-xlators-3.7.1-11.el7rhgs.x86_64
glusterfs-cli-3.7.1-11.el7rhgs.x86_64
Red Hat Enterprise Linux Server release 7.0 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Set up a Gluster cluster with 3 servers and a large number of volumes (~115).
2. Try to start glusterd using systemctl.
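A rough sketch of how such a setup could be created for reproduction (the hostnames server1/server2/server3, the /bricks path, and the exact volume count are illustrative assumptions; it assumes the 3-node trusted storage pool is already formed and that /bricks exists on each node):

# Create ~115 small replica-3 volumes to inflate glusterd's start-up work
for i in $(seq 1 115); do
    gluster volume create vol$i replica 3 \
        server1:/bricks/vol$i server2:/bricks/vol$i server3:/bricks/vol$i force
    gluster volume start vol$i
done

# Restart glusterd on one node through systemd and watch it time out
systemctl restart glusterd
systemctl status glusterd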

Actual results:
systemctl fails before glusterd finishes initializing.

/var/log/glusterfs/etc-glusterfs-glusterd.vol.log:

[2015-10-06 15:23:08.530596] E [socket.c:3018:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
The message "I [MSGID: 106004] [glusterd-handler.c:5051:__glusterd_peer_rpc_notify] 0-management: Peer <gluster-storage-01.scl.lab.tlv.redhat.com> (<2d3991a3-2aa7-41d6-95be-e924a63533e4>), in state <Peer in Cluste
r>, has disconnected from glusterd." repeated 38 times between [2015-10-06 15:21:12.942223] and [2015-10-06 15:23:08.529766]
The message "I [MSGID: 106004] [glusterd-handler.c:5051:__glusterd_peer_rpc_notify] 0-management: Peer <10.35.160.203> (<69288d5b-866f-49ba-8508-5f7083ec6c5d>), in state <Peer in Cluster>, has disconnected from gl
usterd." repeated 38 times between [2015-10-06 15:21:12.943252] and [2015-10-06 15:23:08.531186]
[2015-10-06 15:23:11.539358] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 171, Invalid argument
[2015-10-06 15:23:11.539416] E [socket.c:3018:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
[2015-10-06 15:23:11.542984] I [MSGID: 106004] [glusterd-handler.c:5051:__glusterd_peer_rpc_notify] 0-management: Peer <gluster-storage-01.scl.lab.tlv.redhat.com> (<2d3991a3-2aa7-41d6-95be-e924a63533e4>), in state <Peer in Cluster>, has disconnected from glusterd.
[2015-10-06 15:23:11.545285] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 171, Invalid argument
[2015-10-06 15:23:11.545313] E [socket.c:3018:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
[2015-10-06 15:23:11.545611] I [MSGID: 106004] [glusterd-handler.c:5051:__glusterd_peer_rpc_notify] 0-management: Peer <10.35.160.203> (<69288d5b-866f-49ba-8508-5f7083ec6c5d>), in state <Peer in Cluster>, has disconnected from glusterd.
[2015-10-06 15:23:13.643520] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.7.1 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid)
[2015-10-06 15:23:13.692426] I [MSGID: 106478] [glusterd.c:1376:init] 0-management: Maximum allowed open file descriptors set to 65536
[2015-10-06 15:23:13.692550] I [MSGID: 106479] [glusterd.c:1425:init] 0-management: Using /var/lib/glusterd as working directory
[2015-10-06 15:23:13.705844] E [socket.c:823:__socket_server_bind] 0-socket.management: binding to  failed: Address already in use
[2015-10-06 15:23:13.705884] E [socket.c:826:__socket_server_bind] 0-socket.management: Port is already in use
[2015-10-06 15:23:13.705912] W [rpcsvc.c:1602:rpcsvc_transport_create] 0-rpc-service: listening on transport failed
[2015-10-06 15:23:13.705935] E [MSGID: 106243] [glusterd.c:1642:init] 0-management: creation of listener failed
[2015-10-06 15:23:13.705955] E [MSGID: 101019] [xlator.c:428:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2015-10-06 15:23:13.705972] E [MSGID: 101066] [graph.c:326:glusterfs_graph_init] 0-management: initializing translator failed
[2015-10-06 15:23:13.705991] E [MSGID: 101176] [graph.c:672:glusterfs_graph_activate] 0-graph: init failed
[2015-10-06 15:23:13.706705] W [glusterfsd.c:1219:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x7fbb5dc9e17d] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x126) [0x7fbb5dc9e026] -->/usr/sbin/glusterd(cleanup_and_exit+0x69) [0x7fbb5dc9d609] ) 0-: received signum (0), shutting down


Expected results:
systemctl should wait for glusterd to finish its initialization.

Additional info:
/var/log/ from all servers:
http://file.tlv.redhat.com/ebenahar/bug2.tar.gz
Comment 2 Anand Nekkunti 2015-10-07 13:35:14 EDT
This is happening because systemctl times out before glusterd finishes initializing (glusterd takes more than 90 sec to start because of the large number of volumes, while the systemctl default timeout is 90 sec).

This can be fixed by disabling the systemctl timeout.

Workaround:
Add the line below to the /usr/lib/systemd/system/glusterd.service unit file after ExecStart:
TimeoutSec=0

For more info:
http://www.freedesktop.org/software/systemd/man/systemd.service.html
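
For reference, a minimal sketch of applying the same setting through a systemd drop-in instead of editing the packaged unit file (this is a standard systemd mechanism, not the exact steps from the comment above; the drop-in file name is arbitrary):

# Create a drop-in so the packaged unit file stays untouched
mkdir -p /etc/systemd/system/glusterd.service.d
cat > /etc/systemd/system/glusterd.service.d/90-timeout.conf <<'EOF'
[Service]
# 0 disables the start/stop timeout so glusterd can take as long as it needs
TimeoutSec=0
EOF

# Reload unit definitions and start glusterd
systemctl daemon-reload
systemctl start glusterd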
Comment 3 SATHEESARAN 2015-10-07 13:54:00 EDT
(In reply to Anand Nekkunti from comment #2)
> This is happening because systemctl times out before glusterd finishes
> initializing (glusterd takes more than 90 sec to start because of the large
> number of volumes, while the systemctl default timeout is 90 sec).
> 
> This can be fixed by disabling the systemctl timeout.
> 
> Workaround:
> Add the line below to the /usr/lib/systemd/system/glusterd.service unit file
> after ExecStart:
> TimeoutSec=0
> 
> For more info:
> http://www.freedesktop.org/software/systemd/man/systemd.service.html

Thanks Anand for the information.
But again, why does glusterd take a long time with a large number of volumes?
Comment 4 Anand Nekkunti 2015-10-08 08:35:06 EDT
(In reply to SATHEESARAN from comment #3)
> (In reply to Anand Nekkunti from comment #2)
> > This is happening because systemctl times out before glusterd finishes
> > initializing (glusterd takes more than 90 sec to start because of the large
> > number of volumes, while the systemctl default timeout is 90 sec).
> > 
> > This can be fixed by disabling the systemctl timeout.
> > 
> > Workaround:
> > Add the line below to the /usr/lib/systemd/system/glusterd.service unit file
> > after ExecStart:
> > TimeoutSec=0
> > 
> > For more info:
> > http://www.freedesktop.org/software/systemd/man/systemd.service.html
> 
> Thanks Anand for the information.
> But again, why does glusterd take a long time with a large number of volumes?



I got access to the setup where the issue occurred and found that the node is very slow due to insufficient hardware. The node has 1 core (we recommend at least 4 cores and 8 GB RAM) and it hosts 155 volumes (each volume has 2-4 bricks); because of this, glusterd takes more time to start.

I have tested on my system with 225 volumes; I didn't see any issue and it took ~9 sec to start glusterd.

Please re-open the bug if this happens on the recommended hardware.
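
For anyone hitting this, a quick way to measure how long glusterd actually takes to start on a given node (plain systemd/shell tooling; nothing gluster-specific is assumed):

# Time a fresh start of the unit (systemctl start blocks until the start job completes)
systemctl stop glusterd
time systemctl start glusterd

# After a reboot, per-unit start-up times can also be compared with:
systemd-analyze blame | grep glusterd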
