Bug 1242803 - Quota list on a volume hangs after glusterd restart on a node.
Summary: Quota list on a volume hangs after glusterd restart on a node.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: quota
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.1
Assignee: Vijaikumar Mallikarjuna
QA Contact: Triveni Rao
URL:
Whiteboard:
Depends On:
Blocks: 1216951 1242819 1251815 1257881
TreeView+ depends on / blocked
 
Reported: 2015-07-14 08:28 UTC by Triveni Rao
Modified: 2016-09-17 12:36 UTC (History)
17 users

Fixed In Version: glusterfs-3.7.1-14
Doc Type: Bug Fix
Doc Text:
Previously, executing the 'gluster volume quota list' command used to hang if quotad was not running. With this fix, the "Connection failed. Please check if quota daemon is operational." error message is displayed instead.
Clone Of:
: 1242819
Environment:
Last Closed: 2015-10-05 07:20:07 UTC
Embargoed:




Links
System ID: Red Hat Product Errata RHSA-2015:1845
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Gluster Storage 3.1 update
Last Updated: 2015-10-05 11:06:22 UTC

Description Triveni Rao 2015-07-14 08:28:36 UTC
Description of problem:

Quota list on a volume hangs after glusterd restart on a node.

Version-Release number of selected component (if applicable):

[root@rhsqa14-vm1 ~]# rpm -qa | grep gluster
gluster-nagios-addons-0.2.3-1.el6rhs.x86_64
glusterfs-3.7.1-8.el6rhs.x86_64
glusterfs-cli-3.7.1-8.el6rhs.x86_64
gluster-nagios-common-0.2.0-1.el6rhs.noarch
glusterfs-libs-3.7.1-8.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-8.el6rhs.x86_64
glusterfs-api-3.7.1-8.el6rhs.x86_64
glusterfs-server-3.7.1-8.el6rhs.x86_64
glusterfs-rdma-3.7.1-8.el6rhs.x86_64
vdsm-gluster-4.16.20-1.1.el6rhs.noarch
glusterfs-fuse-3.7.1-8.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-8.el6rhs.x86_64
[root@rhsqa14-vm1 ~]# 


How reproducible:
Easily

Steps to Reproduce:
1. Create a dist-rep volume.
2. Enable quota on it and set a quota limit.
3. Check the configured quota limit.
4. Restart glusterd on the node.
5. Check the quota list immediately.
The command hangs indefinitely; after stopping it with Ctrl+C and checking the quota limit again, it displays properly. (A consolidated sketch of these commands follows below.)
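
A minimal consolidated sketch of the reproduction, assuming the two-brick volume and 20GB limit shown in the transcript below; the reporter's options.sh script is not attached to the report, so the quota enable/limit-usage commands are an inference from its output:

gluster volume create good 10.70.47.165:/rhs/brick1/g0 10.70.47.163:/rhs/brick1/g0
gluster volume start good
gluster volume quota good enable
gluster volume quota good limit-usage / 20GB
gluster volume quota good list        # limit is listed correctly
service glusterd restart
gluster volume quota good list        # hangs until interrupted with Ctrl+C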


Additional info:


[root@rhsqa14-vm1 ~]# gluster v create good 10.70.47.165:/rhs/brick1/g0 10.70.47.163:/rhs/brick1/g0
volume create: good: success: please start the volume to access data
[root@rhsqa14-vm1 ~]# gluster v start good
volume start: good: success
[root@rhsqa14-vm1 ~]# gluster v info good
 
Volume Name: good
Type: Distribute
Volume ID: 03bf7a59-aa79-4ff2-80ec-57e33ff37d88
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.47.165:/rhs/brick1/g0
Brick2: 10.70.47.163:/rhs/brick1/g0
Options Reconfigured:
performance.readdir-ahead: on
[root@rhsqa14-vm1 ~]# gluster v status good
Status of volume: good
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.165:/rhs/brick1/g0           49156     0          Y       10746
Brick 10.70.47.163:/rhs/brick1/g0           49156     0          Y       8773 
NFS Server on localhost                     2049      0          Y       10765
NFS Server on 10.70.47.163                  2049      0          Y       8792 
 
Task Status of Volume good
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@rhsqa14-vm1 ~]# ./options.sh good
volume set: success
volume quota : success
volume set: success
volume quota : success
volume set: success
[root@rhsqa14-vm1 ~]# service glusterd status
glusterd (pid  10465) is running...
[root@rhsqa14-vm1 ~]# gluster v quota good list
                  Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
---------------------------------------------------------------------------------------------------------------------------
/                                         20.0GB       80%      0Bytes  20.0GB              No                   No
[root@rhsqa14-vm1 ~]# service glusterd restart
Starting glusterd:                                         [  OK  ]
[root@rhsqa14-vm1 ~]# gluster v quota good list
                  Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
---------------------------------------------------------------------------------------------------------------------------




^C
[root@rhsqa14-vm1 ~]# 
[root@rhsqa14-vm1 ~]# gluster v quota good list
                  Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
---------------------------------------------------------------------------------------------------------------------------
/                                         20.0GB       80%      0Bytes  20.0GB              No                   No
[root@rhsqa14-vm1 ~]# service glusterd restart
Starting glusterd:                                         [  OK  ]
[root@rhsqa14-vm1 ~]# gluster v quota good list
                  Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
---------------------------------------------------------------------------------------------------------------------------



^C
[root@rhsqa14-vm1 ~]# gluster v quota good list
quota command failed : Locking failed on 10.70.47.163. Please check log file for details.
[root@rhsqa14-vm1 ~]# gluster v quota good list
quota command failed : Locking failed on 10.70.47.163. Please check log file for details.
[root@rhsqa14-vm1 ~]# 


[root@rhsqa14-vm2 ~]# gluster v quota good list
                  Path                   Hard-limit Soft-limit   Used  Available  Soft-limit exceeded? Hard-limit exceeded?
---------------------------------------------------------------------------------------------------------------------------
/                                         20.0GB       80%      0Bytes  20.0GB              No                   No
[root@rhsqa14-vm2 ~]# service glusterd restart
Starting glusterd:                                         [  OK  ]
[root@rhsqa14-vm2 ~]# 
[root@rhsqa14-vm2 ~]# gluster v quota good list
^C
[root@rhsqa14-vm2 ~]# gluster v quota good list
quota command failed : Another transaction is in progress for good. Please try again after sometime.
[root@rhsqa14-vm2 ~]# 


Log messages.

[2015-07-13 18:56:46.187969] W [glusterd-locks.c:677:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f0b2896a52
0] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x551)[0x7f0b1d40b441] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/
glusterd.so(__glusterd_peer_rpc_notify+0x2f0)[0x7f0b1d371290] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)
[0x7f0b1d354a00] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a3)[0x7f0b28736953] ))))) 0-management: Lock for vol good not held
[2015-07-13 18:56:46.188046] W [MSGID: 106118] [glusterd-handler.c:5059:__glusterd_peer_rpc_notify] 0-management: Lock not released for good
[2015-07-13 18:56:46.188901] W [glusterd-locks.c:677:glusterd_mgmt_v3_unlock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f0b2896a520] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_unlock+0x551)[0x7f0b1d40b441] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x2f0)[0x7f0b1d371290] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7f0b1d354a00] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a3)[0x7f0b28736953] ))))) 0-management: Lock for vol test not held
[2015-07-13 18:56:46.188966] W [MSGID: 106118] [glusterd-handler.c:5059:__glusterd_peer_rpc_notify] 0-management: Lock not released for test
[2015-07-13 18:56:51.877542] I [MSGID: 106163] [glusterd-handshake.c:1198:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-07-13 18:56:51.988913] I [MSGID: 106490] [glusterd-handler.c:2530:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 053f0a9c-84df-4e69-9825-e480ea7f51fa
[2015-07-13 18:56:52.032972] I [MSGID: 106490] [glusterd-handler.c:2530:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 053f0a9c-84df-4e69-9825-e480ea7f51fa
[2015-07-13 18:56:56.571192] I [MSGID: 106493] [glusterd-handler.c:3757:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.47.163 (0), ret: 0
[2015-07-13 18:56:56.822277] W [glusterd-locks.c:575:glusterd_mgmt_v3_lock] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7f0b2896a520] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_mgmt_v3_lock+0x1d4)[0x7f0b1d40bf84] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(+0x5b9b5)[0x7f0b1d3779b5] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_op_sm+0x231)[0x7f0b1d37b8e1] (--> /usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(+0x1115b9)[0x7f0b1d42d5b9] ))))) 0-management: Lock for good held by 053f0a9c-84df-4e69-9825-e480ea7f51fa
[2015-07-13 18:56:56.693505] I [MSGID: 106493] [glusterd-handler.c:3757:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.47.163 (0), ret: 0
[2015-07-13 18:56:56.822334] E [MSGID: 106119] [glusterd-op-sm.c:3556:glusterd_op_ac_lock] 0-management: Unable to acquire lock for good
[2015-07-13 18:56:56.822557] E [MSGID: 106376] [glusterd-op-sm.c:7289:glusterd_op_sm] 0-management: handler returned: -1
[2015-07-13 18:56:58.786170] E [glusterd-op-sm.c:253:glusterd_get_txn_opinfo] (--> 0-management: Unable to get transaction opinfo for transaction ID : c79c04fd-e35a-4bd2-8670-295679f2196e
[2015-07-13 18:56:58.786219] E [MSGID: 106402] [glusterd-op-sm.c:7270:glusterd_op_sm] 0-management: Unable to get transaction's opinfo
[2015-07-13 18:56:59.004321] I [MSGID: 106493] [glusterd-rpc-ops.c:478:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 053f0a9c-84df-4e69-9825-e480ea7f51fa, host: 10.70.47.163, port: 0
[2015-07-13 18:56:59.114350] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 053f0a9c-84df-4e69-9825-e480ea7f51fa
[2015-07-13 18:56:59.114454] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-13 18:56:59.114485] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-13 18:56:59.115160] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 053f0a9c-84df-4e69-9825-e480ea7f51fa
[2015-07-13 18:56:59.115238] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-13 18:56:51.919906] I [MSGID: 106163] [glusterd-handshake.c:1198:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30703
[2015-07-13 18:56:59.115289] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-13 18:58:31.155153] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Locking failed on 10.70.47.163. Please check log file for details.
[2015-07-13 18:58:31.155257] E [MSGID: 106151] [glusterd-syncop.c:1870:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2015-07-13 18:58:31.160289] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.70.47.163. Please check log file for details.
[2015-07-13 18:58:31.160534] E [MSGID: 106152] [glusterd-syncop.c:1574:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)
[2015-07-13 18:59:00.579592] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Locking failed on 10.70.47.163. Please check log file for details.
[2015-07-13 18:59:00.579681] E [MSGID: 106151] [glusterd-syncop.c:1870:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2015-07-13 18:59:00.584048] E [MSGID: 106116] [glusterd-mgmt.c:134:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on 10.70.47.163. Please check log file for details.
[2015-07-13 18:59:00.584102] E [MSGID: 106152] [glusterd-syncop.c:1574:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)

Comment 2 Gaurav Kumar Garg 2015-07-14 11:25:59 UTC
Triveni,

How many volumes do you have?

Can you paste the output of "# gluster volume status" after restarting glusterd?

Comment 3 Anand Nekkunti 2015-07-14 12:27:40 UTC
[Switching to thread 3 (Thread 0x7ffff01ea700 (LWP 6210))]#0 0x0000003bc760e264 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003bc760e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003bc7609508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003bc76093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040b25d in cli_cmd_broadcast_response (status=-1) at cli-cmd.c:310
#4  0x0000000000424e1f in cli_quotad_getlimit_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7fffe000181c) at cli-rpc-ops.c:3359
#5  0x00007ffff7911ca6 in rpc_clnt_submit (rpc=0x6ac2d0, prog=<value optimized out>, procnum=<value optimized out>, cbkfn=0x424d90 <cli_quotad_getlimit_cbk>, proghdr=<value optimized out>, proghdrcount=<value optimized out>,
    progpayload=0x0, progpayloadcount=0, iobref=<value optimized out>, frame=0x7fffe000181c, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at rpc-clnt.c:1600
#6  0x000000000040a74b in cli_submit_request (rpc=0x6ac2d0, req=<value optimized out>, frame=0x7fffe000181c, prog=<value optimized out>, procnum=2, iobref=0x7fffe0002ba0, this=0x7ffff7ffc220,
    cbkfn=0x424d90 <cli_quotad_getlimit_cbk>, xdrproc=0x409388 <xdr_gf_cli_req@plt>) at cli.c:237
#7  0x000000000040b4fa in cli_cmd_submit (rpc=0x6ac2d0, req=0x7ffff01e8c00, frame=0x7fffe000181c, prog=0x659380, procnum=<value optimized out>, iobref=<value optimized out>, this=0x7ffff7ffc220,
    cbkfn=0x424d90 <cli_quotad_getlimit_cbk>, xdrproc=0x409388 <xdr_gf_cli_req@plt>) at cli-cmd.c:380
#8  0x0000000000415a2f in cli_quotad_getlimit (frame=0x7fffe000181c, this=0x7ffff7ffc220, data=<value optimized out>) at cli-rpc-ops.c:3389
#9  0x000000000040f7cf in cli_cmd_quota_handle_list_all (words=<value optimized out>, options=<value optimized out>) at cli-cmd-volume.c:1419
#10 0x0000000000410c41 in cli_cmd_quota_cbk (state=0x7fffffffe220, word=<value optimized out>, words=0x7fffffffe420, wordcount=4) at cli-cmd-volume.c:1571
#11 0x000000000040b70b in cli_cmd_process (state=0x7fffffffe220, argc=4, argv=0x7fffffffe420) at cli-cmd.c:135
#12 0x000000000040b170 in cli_batch (d=<value optimized out>) at input.c:29
#13 0x0000003bc76079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003bc72e8b6d in clone () from /lib64/libc.so.6
(gdb) p cond_mutex
$2 = {__data = {__lock = 2, __count = 0, __owner = 6210, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000B\030\000\000\001", '\000' <repeats 26 times>,
  __align = 2}


From the above backtrace, the callback is invoked by the same thread that submitted the request, and it tries to acquire a lock already held by that thread, which leads to a deadlock.
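
For illustration only, here is a minimal standalone C sketch of that pattern (a default, non-recursive pthread mutex being re-locked by its owning thread); the function names are stand-ins, not the actual gluster CLI code:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cond_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the request-submission path: it holds cond_mutex while
 * waiting for the reply to be delivered. */
static void submit_request(void (*cbk)(void))
{
        pthread_mutex_lock(&cond_mutex);
        /* In the bug, the RPC submit fails immediately (quotad is not
         * running) and invokes the callback synchronously on this same
         * thread instead of from a separate reply thread ... */
        cbk();
        pthread_mutex_unlock(&cond_mutex);
}

/* Stand-in for the response callback / broadcast path. */
static void response_cbk(void)
{
        pthread_mutex_lock(&cond_mutex);  /* ... so this second lock by the
                                           * owning thread never returns on a
                                           * non-recursive mutex: deadlock. */
        pthread_mutex_unlock(&cond_mutex);
}

int main(void)
{
        submit_request(response_cbk);     /* hangs here, like 'quota list' */
        printf("not reached\n");
        return 0;
}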

Comment 4 Anand Nekkunti 2015-07-14 12:33:15 UTC
The CLI list command hangs only if quotad is not running. This is a negative scenario, so this bug is not a blocker for 3.1.

Comment 10 Gaurav Kumar Garg 2015-08-04 09:30:36 UTC
An upstream patch is already available for this bug: http://review.gluster.org/#/c/11658/

Comment 11 Gaurav Kumar Garg 2015-08-04 09:49:30 UTC
The patch http://review.gluster.org/11658, which addresses this problem, is already merged, but the problem might be hit again. Assigning to a quota team member.

Comment 15 Vijaikumar Mallikarjuna 2015-08-28 11:47:23 UTC
Patch merged: https://code.engineering.redhat.com/gerrit/#/c/56573/

Comment 16 Anil Shah 2015-09-02 11:06:47 UTC
[root@darkknight glusterfs]# systemctl restart glusterd 
gluster v quota testvol list
[root@darkknight glusterfs]# gluster v quota testvol list
Connection failed. Please check if quota daemon is operational.

[root@darkknight glusterfs]# ps aux | grep quotad
root     19330  0.0  0.4 461684 33248 ?        Ssl  21:33   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/quotad -p /var/lib/glusterd/quotad/run/quotad.pid -l /var/log/glusterfs/quotad.log -S /var/run/gluster/f077dff85992befedc232c1fc52788c8.socket --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off
root     19453  0.0  0.0 112644   928 pts/0    S+   21:36   0:00 grep --color=auto quotad
[root@darkknight glusterfs]# kill 19330
[root@darkknight glusterfs]# gluster v quota testvol list
Connection failed. Please check if quota daemon is operational.


Bug verified on build glusterfs-3.7.1-14.el7rhgs.x86_64

Comment 18 errata-xmlrpc 2015-10-05 07:20:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html

