Bug 986953

Summary: quota: glusterd crash
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Saurabh <saujain>
Component: glusterd
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED WORKSFORME
QA Contact: Sudhir D <sdharane>
Severity: high
Priority: high
Version: 2.1
CC: kdhananj, mzywusko, rhs-bugs, vbellur, vshastry
Hardware: x86_64
OS: Linux
Type: Bug
Doc Type: Bug Fix
Last Closed: 2013-09-06 10:53:16 UTC

Description Saurabh 2013-07-22 13:44:10 UTC
Description of problem:
Had a 6x2 distributed-replicate volume spread across four RHS nodes [1, 2, 3, 4].
Set quota limits on the volume root and on several directories.
While I/O was going on, brought down two of the nodes.

After some time, used "gluster volume start <vol> force" to bring the bricks back online.

Stopped the I/O and then started self-heal.
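
For reference, the command sequence for the steps above would have looked roughly like the following. This is only a sketch: the volume name and the limit values are inferred from the quota list output below, not copied from the actual command history.

# enable quota and set the limits (values taken from the list output below)
gluster volume quota quota-dist-rep enable
gluster volume quota quota-dist-rep limit-usage / 30GB
gluster volume quota quota-dist-rep limit-usage /dir1 2GB
gluster volume quota quota-dist-rep limit-usage /dir2 1GB
# ... likewise for /dir3 through /dir10 (1GB each) and for /foo and /bar (10MB each)

# after the downed nodes' bricks were brought back and I/O was stopped:
gluster volume start quota-dist-rep force
gluster volume heal quota-dist-rep full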


[root@nfs1 ~]# gluster volume quota quota-dist-rep list
                  Path                   Hard-limit Soft-limit   Used  Available
--------------------------------------------------------------------------------
/                                           30GB       90%       7.2GB  22.8GB
/dir2                                        1GB       90%    1023.9MB  64.0KB
/dir3                                        1GB       90%    1022.9MB   1.1MB
/dir4                                        1GB       90%    1023.9MB  64.0KB
/dir5                                        1GB       90%    1022.9MB   1.1MB
/dir6                                        1GB       90%    1022.9MB   1.1MB
/dir7                                        1GB       90%       1.0GB  0Bytes
/dir8                                        1GB       90%     104.0MB 920.0MB
/dir9                                        1GB       90%      0Bytes   1.0GB
/dir10                                       1GB       90%      0Bytes   1.0GB
/dir1                                        2GB       90%    1023.9MB   1.0GB
/bar                                        10MB       90%         N/A     N/A
/foo                                        10MB       90%      95.4MB  0Bytes


Version-Release number of selected component (if applicable):
[root@nfs1 ~]# rpm -qa | grep glusterfs
glusterfs-3.4.0.12rhs.beta4-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.12rhs.beta4-1.el6rhs.x86_64

glusterfs-server-3.4.0.12rhs.beta4-1.el6rhs.x86_64


How reproducible:
Happened once so far.

Actual results:

Found glusterd core dumps on both node2 and node3.


Status of volume: quota-dist-rep
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.180:/rhs/bricks/quota-d1r1               49172   Y       23303
Brick 10.70.37.139:/rhs/bricks/quota-d2r2               49172   Y       17100
Brick 10.70.37.180:/rhs/bricks/quota-d3r1               49173   Y       23314
Brick 10.70.37.139:/rhs/bricks/quota-d4r2               49173   Y       17111
Brick 10.70.37.180:/rhs/bricks/quota-d5r1               49174   Y       23325
Brick 10.70.37.139:/rhs/bricks/quota-d6r2               49174   Y       17122
NFS Server on localhost                                 2049    Y       25673
Self-heal Daemon on localhost                           N/A     Y       25680
NFS Server on 10.70.37.139                              2049    Y       18714
Self-heal Daemon on 10.70.37.139                        N/A     Y       18721
 
           Task                                      ID         Status
           ----                                      --         ------
      Rebalance    9e281276-6e32-43d6-8028-d06c80dc3b18              3


(gdb) bt
#0  0x000000396ba328a5 in raise () from /lib64/libc.so.6
#1  0x000000396ba34085 in abort () from /lib64/libc.so.6
#2  0x000000396ba707b7 in __libc_message () from /lib64/libc.so.6
#3  0x000000396ba760e6 in malloc_printerr () from /lib64/libc.so.6
#4  0x000000348f415715 in data_destroy (data=0x7f041c24f200) at dict.c:147
#5  0x000000348f416309 in _dict_set (this=<value optimized out>, key=0x7f041a21b8ff "features.limit-usage", value=0x7f041c25088c, replace=_gf_true) at dict.c:262
#6  0x000000348f41654a in dict_set (this=0x7f041c431144, key=0x7f041a21b8ff "features.limit-usage", value=0x7f041c25088c) at dict.c:334
#7  0x00007f041a1f9ff7 in glusterd_quota_limit_usage (volinfo=0x19ad930, dict=0x7f041c4327b0, op_errstr=0x1c4f538) at glusterd-quota.c:717
#8  0x00007f041a1faf78 in glusterd_op_quota (dict=0x7f041c4327b0, op_errstr=0x1c4f538, rsp_dict=0x7f041c432468) at glusterd-quota.c:1019
#9  0x00007f041a1c6046 in glusterd_op_commit_perform (op=GD_OP_QUOTA, dict=0x7f041c4327b0, op_errstr=0x1c4f538, rsp_dict=0x7f041c432468) at glusterd-op-sm.c:3899
#10 0x00007f041a1c7843 in glusterd_op_ac_commit_op (event=<value optimized out>, ctx=0x7f0410000c70) at glusterd-op-sm.c:3645
#11 0x00007f041a1c3281 in glusterd_op_sm () at glusterd-op-sm.c:5309
#12 0x00007f041a1b137d in __glusterd_handle_commit_op (req=0x7f041a12602c) at glusterd-handler.c:750
#13 0x00007f041a1ae53f in glusterd_big_locked_handler (req=0x7f041a12602c, actor_fn=0x7f041a1b1280 <__glusterd_handle_commit_op>) at glusterd-handler.c:75
#14 0x000000348f447292 in synctask_wrap (old_task=<value optimized out>) at syncop.c:131
#15 0x000000396ba43b70 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb) 


Expected results:
glusterd should not crash.

Additional info:
Script used for creating data:

#!/bin/bash
set -x

# Keep writing 1MB files into dir1..dir10 until each directory reports
# "Disk quota exceeded".
create-data()
{
    for i in `seq 1 10`
    do
        while true
        do
            # Capture dd's stderr so the quota error can be detected below.
            cmd=`dd if=/dev/urandom of=dir$i/$(date +%s) bs=1024 count=1024 2>&1`
            echo "$cmd"
            if [ "$(echo "$cmd" | awk '/Disk quota exceeded/')" ]
            then
                echo "quota limit reached"
                break
            fi
        done
    done
    return 1
}

create-data

Comment 4 Krutika Dhananjay 2013-07-23 05:11:18 UTC
Looking at the backtrace, it seems to me that the cause of this crash is the same as the cause of the crash in https://bugzilla.redhat.com/show_bug.cgi?id=983544.

CAUSE:

This happens because, in the earlier code of glusterd_quota_limit_usage(), the pointer @quota_limits pointed to the same memory as the 'value' stored against the key 'features.limit-usage' in volinfo->dict. At some point the function does a GF_FREE on quota_limits, which frees the dict's 'value' as well and leaves it as a dangling pointer. Some time later, the same function calls dict_set_str on the key 'features.limit-usage'; before installing the new value, dict_set_str GF_FREEs the object that the old 'value' points to, freeing the same memory a second time. This double free is what crashes the process.
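
To make the pattern concrete, here is a minimal, standalone C sketch of the same double-free sequence. The names (dict_entry, dict_set_str) are simplified stand-ins for illustration only, not the actual glusterd data structures. When compiled and run, it aborts inside glibc with a double-free error, much like the backtrace above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a dict entry: the dict owns its value buffer. */
struct dict_entry {
        char *value;
};

/* Simplified stand-in for dict_set_str(): frees the old value before
 * installing the new one, analogous to _dict_set() destroying the old data. */
static void dict_set_str(struct dict_entry *e, const char *new_val)
{
        free(e->value);
        e->value = strdup(new_val);
}

int main(void)
{
        struct dict_entry limit_usage = { strdup("/dir1:1GB") };

        /* BUG: quota_limits is only an alias for the dict's value ... */
        char *quota_limits = limit_usage.value;

        /* ... yet it is freed here (corresponding to the GF_FREE on
         * quota_limits), leaving the dict with a dangling pointer. */
        free(quota_limits);

        /* Setting the key again frees the same buffer a second time;
         * glibc detects the double free and aborts the process. */
        dict_set_str(&limit_usage, "/dir1:1GB,/dir2:1GB");

        printf("%s\n", limit_usage.value);
        free(limit_usage.value);
        return 0;
}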




The fix for 983544 is available in glusterfs-3.4.0.12rhs.beta5. Could you please check if this bug is valid in the latest version, i.e., glusterfs-3.4.0.12rhs.beta5?

Comment 5 Krutika Dhananjay 2013-09-02 06:59:14 UTC
As per the root cause analysis in comment #4, the bug was fixed as part of the build glusterfs-3.4.0.12rhs.beta5. This holds true with respect to the new design as well. Hence, moving the bug to ON_QA.