Bug 1455840 - glusterd: glusterd failed to start on one of the nodes
Summary: glusterd: glusterd failed to start on one of the nodes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Samikshan Bairagya
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Depends On: 1408431
Blocks:
 
Reported: 2017-05-26 09:32 UTC by Anil Shah
Modified: 2019-03-04 06:18 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-30 02:09:25 UTC
Embargoed:



Description Anil Shah 2017-05-26 09:32:21 UTC
Description of problem:

While executing USS enable/disable in a loop on one node and restarting glusterd on another node, glusterd failed to start.

Version-Release number of selected component (if applicable):

glusterfs-3.8.4-25.el7rhgs.x86_64

[root@dhcp46-132 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 Beta (Maipo)

How reproducible:

1/1

Steps to Reproduce:
1. Create a 2*2 distributed-replicate volume
2. Mount it on a client and start IO
3. Enable USS
4. Take two snapshots and activate them
5. Enable and disable USS in a loop on one node and restart glusterd on the other node (see the sketch below)
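
For illustration, the toggle loop and the concurrent restart could look something like this (the volume name vol0 is taken from the glusterd logs below; treat this as a sketch of the reproducer, not the exact commands used):

# on node A: toggle USS in a loop
while true; do
    gluster volume set vol0 features.uss enable
    gluster volume set vol0 features.uss disable
done

# on node B: restart glusterd while the loop runs
systemctl restart glusterd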


Actual results:

glusterd failed to start on the node where the glusterd restart command was executing.

Seeing AVC denied messages too:

type=AVC msg=audit(1495788895.100:403): avc:  denied  { unlink } for  pid=27165 comm="glusterd" name="glusterd.socket" dev="tmpfs" ino=204648 scontext=system_u:system_r:glusterd_t:s0 tcontext=unconfined_u:object_r:var_run_t:s0 tclass=sock_file
type=AVC msg=audit(1495788927.092:405): avc:  denied  { unlink } for  pid=27192 comm="glusterd" name="glusterd.socket" dev="tmpfs" ino=204648 scontext=system_u:system_r:glusterd_t:s0 tcontext=unconfined_u:object_r:var_run_t:s0 tclass=sock_file
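
For anyone triaging this, the denials can be pulled from the audit log with the standard audit tooling (nothing bug-specific here, just a convenience):

# list recent AVC denials from glusterd
ausearch -m avc -c glusterd -ts recent

# summarize why they were denied
ausearch -m avc -c glusterd | audit2allow -w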





Expected results:

glusterd shouldn't fail

Additional info:

=======================

glusterd logs


[2017-05-26 09:16:18.525243] E [rpc-transport.c:283:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.8.4/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2017-05-26 09:16:18.525289] W [rpc-transport.c:287:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2017-05-26 09:16:18.525305] W [rpcsvc.c:1646:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2017-05-26 09:16:18.525330] E [MSGID: 106243] [glusterd.c:1722:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2017-05-26 09:16:22.515271] I [MSGID: 106513] [glusterd-store.c:2101:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 31100
[2017-05-26 09:16:22.515867] E [MSGID: 101032] [store.c:433:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/vol0/info. [No such file or directory]
[2017-05-26 09:16:22.515896] E [MSGID: 106200] [glusterd-store.c:2653:glusterd_store_update_volinfo] 0-management: volinfo handle is NULL
[2017-05-26 09:16:22.515908] E [MSGID: 106207] [glusterd-store.c:2953:glusterd_store_retrieve_volume] 0-management: Failed to update volinfo for vol0 volume
[2017-05-26 09:16:22.515969] E [MSGID: 106201] [glusterd-store.c:3169:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: vol0
[2017-05-26 09:16:22.516007] E [MSGID: 101019] [xlator.c:486:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2017-05-26 09:16:22.516024] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-management: initializing translator failed
[2017-05-26 09:16:22.516037] E [MSGID: 101176] [graph.c:680:glusterfs_graph_activate] 0-graph: init failed
[2017-05-26 09:16:22.516829] W [glusterfsd.c:1291:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x55c7fc85f9dd] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x1b1) [0x55c7fc85f881] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55c7fc85ed6b] ) 0-: received signum (1), shutting down

Comment 6 Sam McLeod 2017-09-14 04:37:42 UTC
We have the same problem with GlusterFS 3.12 running under CentOS 7.4:

root@int-gluster-01:~  # cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

root@int-gluster-01:~  # rpm -qa | grep gluster
glusterfs-client-xlators-3.12.0-1.el7.x86_64
glusterfs-cli-3.12.0-1.el7.x86_64
glusterfs-3.12.0-1.el7.x86_64
glusterfs-fuse-3.12.0-1.el7.x86_64
glusterfs-api-3.12.0-1.el7.x86_64
glusterfs-server-3.12.0-1.el7.x86_64
gluster-nagios-addons-1.1.0-0.el7.centos.x86_64
gluster-nagios-common-1.1.0-0.el7.centos.noarch
glusterfs-libs-3.12.0-1.el7.x86_6


root@int-gluster-03:~ # systemctl start glusterd
Job for glusterd.service failed because the control process exited with error code. See "systemctl status glusterd.service" and "journalctl -xe" for details.

root@int-gluster-03:~ # tail /var/log/glusterfs/glusterd.log
[2017-09-14 04:30:33.223274] E [MSGID: 101176] [graph.c:698:glusterfs_graph_activate] 0-graph: init failed
[2017-09-14 04:30:33.224212] W [glusterfsd.c:1347:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x561c65678dbd] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x163) [0x561c65678c63] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x561c6567818b] ) 0-: received signum (-1), shutting down
[2017-09-14 04:30:35.363754] E [MSGID: 101032] [store.c:437:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/askizzy_docker_dev_static/info. [No such file or directory]
[2017-09-14 04:30:35.363862] E [MSGID: 106200] [glusterd-store.c:2837:glusterd_store_update_volinfo] 0-management: volinfo handle is NULL
[2017-09-14 04:30:35.363890] E [MSGID: 106207] [glusterd-store.c:3140:glusterd_store_retrieve_volume] 0-management: Failed to update volinfo for askizzy_docker_dev_static volume
[2017-09-14 04:30:35.363925] E [MSGID: 106201] [glusterd-store.c:3357:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: askizzy_docker_dev_static
[2017-09-14 04:30:35.364026] E [MSGID: 101019] [xlator.c:503:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2017-09-14 04:30:35.364055] E [MSGID: 101066] [graph.c:327:glusterfs_graph_init] 0-management: initializing translator failed
[2017-09-14 04:30:35.364078] E [MSGID: 101176] [graph.c:698:glusterfs_graph_activate] 0-graph: init failed
[2017-09-14 04:30:35.365404] W [glusterfsd.c:1347:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x5564a8909dbd] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x163) [0x5564a8909c63] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x5564a890918b] ) 0-: received signum (-1), shutting down



I'm then noticing a lot of SELinux denials:


----
time->Thu Sep 14 14:30:19 2017
type=PROCTITLE msg=audit(1505363419.360:13201): proctitle=2F7573722F7362696E2F676C757374657264002D70002F7661722F72756E2F676C7573746572642E706964002D2D6C6F672D6C6576656C005741524E494E47
type=SYSCALL msg=audit(1505363419.360:13201): arch=c000003e syscall=49 success=no exit=-13 a0=f a1=7ffc8241dfa0 a2=10 a3=22 items=0 ppid=24983 pid=24984 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="glusterd" exe="/usr/sbin/glusterfsd" subj=system_u:system_r:glusterd_t:s0 key=(null)
type=AVC msg=audit(1505363419.360:13201): avc:  denied  { name_bind } for  pid=24984 comm="glusterd" src=61000 scontext=system_u:system_r:glusterd_t:s0 tcontext=system_u:object_r:ephemeral_port_t:s0 tclass=tcp_socket permissive=0
----
time->Thu Sep 14 14:30:32 2017
type=PROCTITLE msg=audit(1505363432.477:13204): proctitle=2F7573722F7362696E2F676C757374657264002D70002F7661722F72756E2F676C7573746572642E706964002D2D6C6F672D6C6576656C005741524E494E47
type=SYSCALL msg=audit(1505363432.477:13204): arch=c000003e syscall=49 success=no exit=-13 a0=f a1=7ffddddb3720 a2=10 a3=22 items=0 ppid=25019 pid=25020 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="glusterd" exe="/usr/sbin/glusterfsd" subj=system_u:system_r:glusterd_t:s0 key=(null)
type=AVC msg=audit(1505363432.477:13204): avc:  denied  { name_bind } for  pid=25020 comm="glusterd" src=61000 scontext=system_u:system_r:glusterd_t:s0 tcontext=system_u:object_r:ephemeral_port_t:s0 tclass=tcp_socket permissive=0
----
time->Thu Sep 14 14:30:34 2017
type=PROCTITLE msg=audit(1505363434.531:13206): proctitle=2F7573722F7362696E2F676C757374657264002D70002F7661722F72756E2F676C7573746572642E706964002D2D6C6F672D6C6576656C005741524E494E47
type=SYSCALL msg=audit(1505363434.531:13206): arch=c000003e syscall=49 success=no exit=-13 a0=f a1=7ffc18923a80 a2=10 a3=22 items=0 ppid=25034 pid=25035 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="glusterd" exe="/usr/sbin/glusterfsd" subj=system_u:system_r:glusterd_t:s0 key=(null)
type=AVC msg=audit(1505363434.531:13206): avc:  denied  { name_bind } for  pid=25035 comm="glusterd" src=61000 scontext=system_u:system_r:glusterd_t:s0 tcontext=system_u:object_r:ephemeral_port_t:s0 tclass=tcp_socket permissive=0
----

Comment 7 Atin Mukherjee 2017-09-15 09:52:07 UTC
(In reply to Sam McLeod from comment #6)
> We have the same problem with GlusterFS 3.12 running under CentOS 7.4:
> 
> root@int-gluster-01:~  # cat /etc/redhat-release
> CentOS Linux release 7.4.1708 (Core)
> 
> root@int-gluster-01:~  # rpm -qa | grep gluster
> glusterfs-client-xlators-3.12.0-1.el7.x86_64
> glusterfs-cli-3.12.0-1.el7.x86_64
> glusterfs-3.12.0-1.el7.x86_64
> glusterfs-fuse-3.12.0-1.el7.x86_64
> glusterfs-api-3.12.0-1.el7.x86_64
> glusterfs-server-3.12.0-1.el7.x86_64
> gluster-nagios-addons-1.1.0-0.el7.centos.x86_64
> gluster-nagios-common-1.1.0-0.el7.centos.noarch
> glusterfs-libs-3.12.0-1.el7.x86_6
> 
> 
> root@int-gluster-03:~ # systemctl start glusterd
> Job for glusterd.service failed because the control process exited with
> error code. See "systemctl status glusterd.service" and "journalctl -xe" for
> details.
> 
> root@int-gluster-03:~ # tail /var/log/glusterfs/glusterd.log
> [2017-09-14 04:30:33.223274] E [MSGID: 101176]
> [graph.c:698:glusterfs_graph_activate] 0-graph: init failed
> [2017-09-14 04:30:33.224212] W [glusterfsd.c:1347:cleanup_and_exit]
> (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x561c65678dbd]
> -->/usr/sbin/glusterd(glusterfs_process_volfp+0x163) [0x561c65678c63]
> -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x561c6567818b] ) 0-: received
> signum (-1), shutting down
> [2017-09-14 04:30:35.363754] E [MSGID: 101032]
> [store.c:437:gf_store_handle_retrieve] 0-: Path corresponding to
> /var/lib/glusterd/vols/askizzy_docker_dev_static/info. [No such file or
> directory]

The above log is the culprit. Apparently the info file of the volume is missing here. Have you deleted this file by mistake? Technically, this file should never be nulled out by glusterd. As a workaround, you can copy the same file from another peer and restart the glusterd instance.
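
A minimal sketch of that workaround, assuming the volume name from the logs in comment #6, the default /var/lib/glusterd paths, and a placeholder <healthy-peer> hostname:

# on the node whose glusterd fails to start
systemctl stop glusterd
scp <healthy-peer>:/var/lib/glusterd/vols/askizzy_docker_dev_static/info \
    /var/lib/glusterd/vols/askizzy_docker_dev_static/info
systemctl start glusterd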

> [2017-09-14 04:30:35.363862] E [MSGID: 106200]
> [glusterd-store.c:2837:glusterd_store_update_volinfo] 0-management: volinfo
> handle is NULL
> [2017-09-14 04:30:35.363890] E [MSGID: 106207]
> [glusterd-store.c:3140:glusterd_store_retrieve_volume] 0-management: Failed
> to update volinfo for askizzy_docker_dev_static volume
> [2017-09-14 04:30:35.363925] E [MSGID: 106201]
> [glusterd-store.c:3357:glusterd_store_retrieve_volumes] 0-management: Unable
> to restore volume: askizzy_docker_dev_static
> [2017-09-14 04:30:35.364026] E [MSGID: 101019] [xlator.c:503:xlator_init]
> 0-management: Initialization of volume 'management' failed, review your
> volfile again
> [2017-09-14 04:30:35.364055] E [MSGID: 101066]
> [graph.c:327:glusterfs_graph_init] 0-management: initializing translator
> failed
> [2017-09-14 04:30:35.364078] E [MSGID: 101176]
> [graph.c:698:glusterfs_graph_activate] 0-graph: init failed
> [2017-09-14 04:30:35.365404] W [glusterfsd.c:1347:cleanup_and_exit]
> (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x5564a8909dbd]
> -->/usr/sbin/glusterd(glusterfs_process_volfp+0x163) [0x5564a8909c63]
> -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x5564a890918b] ) 0-: received
> signum (-1), shutting down
> 
> 
> 
> I'm then noticing a lot of SELinux denials:
> 
> 
> ----
> time->Thu Sep 14 14:30:19 2017
> type=PROCTITLE msg=audit(1505363419.360:13201):
> proctitle=2F7573722F7362696E2F676C757374657264002D70002F7661722F72756E2F676C7
> 573746572642E706964002D2D6C6F672D6C6576656C005741524E494E47
> type=SYSCALL msg=audit(1505363419.360:13201): arch=c000003e syscall=49
> success=no exit=-13 a0=f a1=7ffc8241dfa0 a2=10 a3=22 items=0 ppid=24983
> pid=24984 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
> fsgid=0 tty=(none) ses=4294967295 comm="glusterd" exe="/usr/sbin/glusterfsd"
> subj=system_u:system_r:glusterd_t:s0 key=(null)
> type=AVC msg=audit(1505363419.360:13201): avc:  denied  { name_bind } for 
> pid=24984 comm="glusterd" src=61000 scontext=system_u:system_r:glusterd_t:s0
> tcontext=system_u:object_r:ephemeral_port_t:s0 tclass=tcp_socket permissive=0

This is known. The issue appears to be caused by the following entry in /proc/sys/net/ipv4/ip_local_port_range:

32768   60999

Here the upper cap of the local port range is 60999. However, gluster maintains its own local port range for its portmap logic, which goes up to 65535, so when the portmap table is rebuilt through pmap_registry_new() on a glusterd restart, bind() fails for port 61000.

The good news is that in glusterfs-3.12 we have introduced a way to tackle this. If you open the glusterd.vol file, you will see a configuration option called max-port; you just need to uncomment it, set it to 60999 in your case, and restart glusterd.
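
Roughly, assuming the stock file layout where max-port ships commented out (the exact surrounding options will differ per install):

# confirm the kernel's ephemeral port range
cat /proc/sys/net/ipv4/ip_local_port_range
32768   60999

# in /etc/glusterfs/glusterd.vol, uncomment and adjust max-port
# inside the management volume block:
volume management
    type mgmt/glusterd
    ...
    option max-port 60999
end-volume

systemctl restart glusterd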

> ----
> time->Thu Sep 14 14:30:32 2017
> type=PROCTITLE msg=audit(1505363432.477:13204):
> proctitle=2F7573722F7362696E2F676C757374657264002D70002F7661722F72756E2F676C7
> 573746572642E706964002D2D6C6F672D6C6576656C005741524E494E47
> type=SYSCALL msg=audit(1505363432.477:13204): arch=c000003e syscall=49
> success=no exit=-13 a0=f a1=7ffddddb3720 a2=10 a3=22 items=0 ppid=25019
> pid=25020 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
> fsgid=0 tty=(none) ses=4294967295 comm="glusterd" exe="/usr/sbin/glusterfsd"
> subj=system_u:system_r:glusterd_t:s0 key=(null)
> type=AVC msg=audit(1505363432.477:13204): avc:  denied  { name_bind } for 
> pid=25020 comm="glusterd" src=61000 scontext=system_u:system_r:glusterd_t:s0
> tcontext=system_u:object_r:ephemeral_port_t:s0 tclass=tcp_socket permissive=0
> ----
> time->Thu Sep 14 14:30:34 2017
> type=PROCTITLE msg=audit(1505363434.531:13206):
> proctitle=2F7573722F7362696E2F676C757374657264002D70002F7661722F72756E2F676C7
> 573746572642E706964002D2D6C6F672D6C6576656C005741524E494E47
> type=SYSCALL msg=audit(1505363434.531:13206): arch=c000003e syscall=49
> success=no exit=-13 a0=f a1=7ffc18923a80 a2=10 a3=22 items=0 ppid=25034
> pid=25035 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
> fsgid=0 tty=(none) ses=4294967295 comm="glusterd" exe="/usr/sbin/glusterfsd"
> subj=system_u:system_r:glusterd_t:s0 key=(null)
> type=AVC msg=audit(1505363434.531:13206): avc:  denied  { name_bind } for 
> pid=25035 comm="glusterd" src=61000 scontext=system_u:system_r:glusterd_t:s0
> tcontext=system_u:object_r:ephemeral_port_t:s0 tclass=tcp_socket permissive=0
> ----

Comment 8 Sam McLeod 2017-09-18 22:45:19 UTC
(In reply to Atin Mukherjee from comment #7)

Thanks for your response Atin,

> The above log is the culprit. Apparently the info file of the volume is
> missing here. Have you deleted this file by mistake? Technically, this
> file should never be nulled out by glusterd. As a workaround, you can
> copy the same file from another peer and restart the glusterd instance.

What we've experienced is that if Gluster is interrupted while applying volume settings, whether by leaking memory and being OOM-killed, by a user cancelling the operation, or by a node being restarted while a background process (e.g. automation) is applying settings, then Gluster appears to leave info files in a partial or corrupted state, which in turn seems to cause this error.

Once the info file is incorrect or corrupted, what is the easiest way to regenerate it? Ideally I'd like to make that an automated process, as there are a number of ways Gluster could be interrupted when managing the service at scale.

At present we only have around 120 volumes on the cluster, but another site will soon be commissioned with ~300 volumes, so the risk of corruption will likely increase if Gluster can leave these info files partially written.
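
A naive detection loop one could run from cron to at least catch truncated info files early (assuming the default /var/lib/glusterd layout; this only spots empty or missing files, not semantically corrupt ones):

for v in /var/lib/glusterd/vols/*/; do
    [ -s "${v}info" ] || echo "WARNING: empty or missing info file for $(basename "$v")"
done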

> 
> > [2017-09-14 04:30:35.363862] E [MSGID: 106200]
> > [glusterd-store.c:2837:glusterd_store_update_volinfo] 0-management: volinfo
> > handle is NULL
> > [2017-09-14 04:30:35.363890] E [MSGID: 106207]
> > [glusterd-store.c:3140:glusterd_store_retrieve_volume] 0-management: Failed
> > to update volinfo for askizzy_docker_dev_static volume
> > [2017-09-14 04:30:35.363925] E [MSGID: 106201]
> > [glusterd-store.c:3357:glusterd_store_retrieve_volumes] 0-management: Unable
> > to restore volume: askizzy_docker_dev_static
> > [2017-09-14 04:30:35.364026] E [MSGID: 101019] [xlator.c:503:xlator_init]
> > 0-management: Initialization of volume 'management' failed, review your
> > volfile again
> > [2017-09-14 04:30:35.364055] E [MSGID: 101066]
> > [graph.c:327:glusterfs_graph_init] 0-management: initializing translator
> > failed
> > [2017-09-14 04:30:35.364078] E [MSGID: 101176]
> > [graph.c:698:glusterfs_graph_activate] 0-graph: init failed
> > [2017-09-14 04:30:35.365404] W [glusterfsd.c:1347:cleanup_and_exit]
> > (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x5564a8909dbd]
> > -->/usr/sbin/glusterd(glusterfs_process_volfp+0x163) [0x5564a8909c63]
> > -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x5564a890918b] ) 0-: received
> > signum (-1), shutting down
> > 
> > 
> > 
> > I'm then noticing a lot of SELinux denials:
> > 
> > 
> > ----
> > time->Thu Sep 14 14:30:19 2017
> > type=PROCTITLE msg=audit(1505363419.360:13201):
> > proctitle=2F7573722F7362696E2F676C757374657264002D70002F7661722F72756E2F676C7
> > 573746572642E706964002D2D6C6F672D6C6576656C005741524E494E47
> > type=SYSCALL msg=audit(1505363419.360:13201): arch=c000003e syscall=49
> > success=no exit=-13 a0=f a1=7ffc8241dfa0 a2=10 a3=22 items=0 ppid=24983
> > pid=24984 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
> > fsgid=0 tty=(none) ses=4294967295 comm="glusterd" exe="/usr/sbin/glusterfsd"
> > subj=system_u:system_r:glusterd_t:s0 key=(null)
> > type=AVC msg=audit(1505363419.360:13201): avc:  denied  { name_bind } for 
> > pid=24984 comm="glusterd" src=61000 scontext=system_u:system_r:glusterd_t:s0
> > tcontext=system_u:object_r:ephemeral_port_t:s0 tclass=tcp_socket permissive=0
> 
> This is known. The issue appears to be caused by the following entry in
> /proc/sys/net/ipv4/ip_local_port_range:
> 
> 32768   60999
> 
> Here the upper cap of the local port range is 60999. However, gluster
> maintains its own local port range for its portmap logic, which goes up
> to 65535, so when the portmap table is rebuilt through
> pmap_registry_new() on a glusterd restart, bind() fails for port 61000.

I suspected it was that, so I watched port usage while Gluster was running and it never used more than a few ports; perhaps I wasn't monitoring them properly.
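
(For reference, one way to snapshot the ports gluster processes are bound to, assuming iproute2's ss is available:

ss -tlnp | grep gluster

Note this only shows sockets at the moment of the snapshot, so a transient bind() failure during glusterd startup would be easy to miss, which could explain the discrepancy.)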


> The good news is that in glusterfs-3.12 we have introduced a way to
> tackle this. If you open the glusterd.vol file, you will see a
> configuration option called max-port; you just need to uncomment it, set
> it to 60999 in your case, and restart glusterd.

That's really useful, thanks for pointing that out!

