Bug 1110730

Summary:	brick process crashed when rebalance and rename were in progress
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Rachana Patel <racpatel>
Component:	distribute	Assignee:	Nithya Balachandran <nbalacha>
Status:	CLOSED ERRATA	QA Contact:	amainkar
Severity:	high	Docs Contact:
Priority:	high
Version:	rhgs-3.0	CC:	achauras, amainkar, asrivast, nbalacha, nlevinki, nsathyan, rcyriac, sauchter, sdharane, shmohan, smohan, ssaha, ssamanta, vagarwal, vbellur
Target Milestone:	---	Keywords:	ZStream
Target Release:	RHGS 3.0.4
Hardware:	x86_64
OS:	Linux
Fixed In Version:	glusterfs-3.6.0.46-1	Doc Type:	Bug Fix
Clones:	1113960 1165897		Environment:
Last Closed:	2015-03-26 06:34:21 UTC	Type:	Bug
Bug Blocks:	1113960, 1182947

Description Rachana Patel 2014-06-18 10:49:24 UTC

Description of problem:
=======================
A brick process crashed while rebalance was in progress and renames were being performed from the mount point.



Version-Release number of selected component (if applicable):
=============================================================
3.6.0.18-1.el6rhs.x86_64

How reproducible:
=================
Intermittent

Steps to Reproduce:
====================
1. Create and mount (NFS) a distributed volume (3 bricks).
2. Add 2 more bricks.
3. Start creating files and directories on the mount point.
(cd /mnt/nfs; for i in {1..100} ; do mv new$i $i ; cd $i; mv etcn$i etc$i; for j in {1..100}; do mv  newf$j file$j; done ; done)
4. Execute remove-brick start for one brick.
5. Stop the remove-brick operation.
6. Start rebalance from one node.
7. While rebalance is in progress, start rename operations from the mount point and keep renaming the files and directories that were created earlier.
8. Rename failed with 'Input/output error' for a few entries.
Verified on the backend: the brick process had crashed and rebalance had failed.

[root@OVM3 ~]# gluster volume status 
Status of volume: test1
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.35.198:/brick0				N/A	N	19201
Brick 10.70.35.172:/brick0				49152	Y	21103
Brick 10.70.35.240:/brick0				49152	Y	26751
Brick 10.70.35.240:/brick1				49153	Y	26839
Brick 10.70.35.198:/brick1				49153	Y	20252
NFS Server on localhost					2049	Y	20855
NFS Server on 10.70.35.240				2049	Y	27460
NFS Server on 10.70.35.172				2049	Y	21116
 
Task Status of Volume test1
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 28ee029b-36c9-46e0-a31f-86e6184b0bfa
Status               : failed              
 

[root@OVM3 ~]# gluster volume rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1249        24.4MB          5727            12             0               failed             195.00
                            10.70.35.172               73        31.7KB         10198            63             0               failed             196.00
                            10.70.35.240             1527         8.2MB          5118            10             2               failed             196.00


Actual results:
===============
The brick process crashed.




Additional info:
=================

(gdb) bt
#0  0x000000365860c380 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007fbe02f98af6 in dict_get (this=0x662f333677656e2f, key=<value optimized out>) at dict.c:374
#2  0x00007fbdf8924f1b in posix_lookup_xattr_fill (this=0x9846c0, 
    real_path=0x7fbdf2cf4460 "/brick0/.glusterfs/e2/f5/e2f52f09-eefe-45dc-b4ad-3d9de740317e/new9/new9/new9/new9/new9/new9/new9/new9/new9/new10/new11/new12/new13/new14/new15/new16/new17/new18/new19/new20/new21/new22/new23/new24/new"..., loc=0x7fbdf2cf4610, 
    xattr_req=0x662f333677656e2f, buf=0x323677656e2f3136) at posix-helpers.c:633
#3  0x00007fbdf891221e in posix_entry_xattr_fill (this=0x9846c0, inode=<value optimized out>, fd=0x9d2c78, 
    name=0x7fbde4397158 "file51", dict=0x662f333677656e2f, stbuf=0x323677656e2f3136) at posix.c:4801
#4  0x00007fbdf89210bf in posix_readdirp_fill (this=0x9846c0, fd=0x9d2c78, entries=0x7fbdf2cf4a90, dict=0x7fbe017d71a8)
    at posix.c:4853
#5  0x00007fbdf89215b3 in posix_do_readdir (frame=0x7fbe01ddb3c8, this=0x9846c0, fd=0x9d2c78, size=<value optimized out>, off=58, 
    whichop=40, dict=0x7fbe017d71a8) at posix.c:4935
#6  0x00007fbdf892232e in posix_readdirp (frame=0x7fbe01ddb3c8, this=0x9846c0, fd=0x9d2c78, size=130944, off=0, dict=0x7fbe017d71a8)
    at posix.c:4985
#7  0x00007fbe02fa3f63 in default_readdirp (frame=0x7fbe01ddb3c8, this=0x985f90, fd=0x9d2c78, size=130944, off=<value optimized out>, 
    xdata=<value optimized out>) at defaults.c:2078
#8  0x00007fbdf84eb4bd in posix_acl_readdirp (frame=0x7fbe01dda250, this=0x988e30, fd=0x9d2c78, size=130944, offset=0, 
    dict=<value optimized out>) at posix-acl.c:1614
#9  0x00007fbdf82d3ef4 in pl_readdirp (frame=0x7fbe01ddabb8, this=0x98a030, fd=0x9d2c78, size=130944, offset=0, dict=0x7fbe017d71a8)
    at posix.c:2150
#10 0x00007fbe02fa6832 in default_readdirp_resume (frame=0x7fbe01dda04c, this=0x98b290, fd=0x9d2c78, size=130944, off=0, 
    xdata=0x7fbe017d71a8) at defaults.c:1645
#11 0x00007fbe02fc0631 in call_resume_wind (stub=0x7fbe0186204c) at call-stub.c:2492
#12 call_resume (stub=0x7fbe0186204c) at call-stub.c:2841
#13 0x00007fbdf80c8348 in iot_worker (data=0x9bcf80) at io-threads.c:214
#14 0x00000036586079d1 in start_thread () from /lib64/libpthread.so.0
#15 0x00000036582e8b7d in clone () from /lib64/libc.so.6
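
A note on the arguments in frames #1 to #3 above: the dict pointer 0x662f333677656e2f and the buf/stbuf pointer 0x323677656e2f3136 do not look like valid addresses. Read as little-endian bytes (this is an x86_64 host), they are printable ASCII that resembles fragments of the entry path, which would be consistent with a path string having overwritten nearby memory. This is only an inference from the trace, not a confirmed root cause. The sketch below shows the decoding; the helper name pointer_as_ascii is hypothetical and is not part of gdb or glusterfs.

```python
def pointer_as_ascii(value: int, width: int = 8) -> str:
    """Render a pointer-sized integer as the ASCII text of its bytes.

    Assumes a little-endian machine (true for x86_64), so the least
    significant byte is the first byte in memory. Non-printable bytes
    are shown as '.'.
    """
    raw = value.to_bytes(width, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Values copied from frames #1-#3 of the backtrace.
    suspects = {
        "dict / xattr_req": 0x662F333677656E2F,
        "stbuf / buf": 0x323677656E2F3136,
        "tmp_loc.path": 0x3135656C69,
    }
    for name, value in suspects.items():
        print(f"{name:18} {value:#018x} -> {pointer_as_ascii(value)!r}")
```

With the values from this trace, the three pointers decode to "/new63/f", "61/new62" and "ile51..." respectively, matching the newNN directory names and the "file51" entry name seen in the same frames.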


(gdb) bt full
#0  0x000000365860c380 in pthread_spin_lock () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fbe02f98af6 in dict_get (this=0x662f333677656e2f, key=<value optimized out>) at dict.c:374
        pair = <value optimized out>
        __FUNCTION__ = "dict_get"
#2  0x00007fbdf8924f1b in posix_lookup_xattr_fill (this=0x9846c0, 
    real_path=0x7fbdf2cf4460 "/brick0/.glusterfs/e2/f5/e2f52f09-eefe-45dc-b4ad-3d9de740317e/new9/new9/new9/new9/new9/new9/new9/new9/new9/new10/new11/new12/new13/new14/new15/new16/new17/new18/new19/new20/new21/new22/new23/new24/new"..., loc=0x7fbdf2cf4610, 
    xattr_req=0x662f333677656e2f, buf=0x323677656e2f3136) at posix-helpers.c:633
        xattr = 0x0
        filler = {this = 0x0, real_path = 0x0, xattr = 0x0, stbuf = 0x0, loc = 0x0, inode = 0x0, fd = 0, flags = 0, op_errno = 0}
        list = _gf_false
#3  0x00007fbdf891221e in posix_entry_xattr_fill (this=0x9846c0, inode=<value optimized out>, fd=0x9d2c78, 
    name=0x7fbde4397158 "file51", dict=0x662f333677656e2f, stbuf=0x323677656e2f3136) at posix.c:4801
        tmp_loc = {path = 0x3135656c69 <Address 0x3135656c69 out of bounds>, name = 0x0, inode = 0x7fbdf0ca4cb8, parent = 0x0, 
          gfid = '\000' <repeats 15 times>, pargfid = '\000' <repeats 15 times>}
        entry_path = 0x7fbdf2cf4460 "/brick0/.glusterfs/e2/f5/e2f52f09-eefe-45dc-b4ad-3d9de740317e/new9/new9/new9/new9/new9/new9/new9/new9/new9/new10/new11/new12/new13/new14/new15/new16/new17/new18/new19/new20/new21/new22/new23/new24/new"...
#4  0x00007fbdf89210bf in posix_readdirp_fill (this=0x9846c0, fd=0x9d2c78, entries=0x7fbdf2cf4a90, dict=0x7fbe017d71a8)
    at posix.c:4853
        entry = 0x7fbde43970b0
        itable = 0x9d0170
        inode = 0x7fbdf0ca4cb8
        hpath = 0x7fbdf2cf46a0 "/brick0/.glusterfs/bf/2e/bf2e666d-939f-4f05-b415-c695f99d8111/new8/new9/new10/new11/new12/new13/new14/new15/new16/new17/new18/new19/new20/new21/new22/new23/new24/new25/new26/new27/new28/new29/new30/ne"...
        len = <value optimized out>
        stbuf = {ia_ino = 11215285721967525407, ia_gfid = "a\275ŃY\345C\246\233\244\262\361hb~\037", ia_dev = 64774, 
          ia_type = IA_IFREG, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', 
              write = 1 '\001', exec = 0 '\000'}, group = {read = 1 '\001', write = 0 '\000', exec = 0 '\000'}, other = {
              read = 1 '\001', write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 1, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, 
          ia_blksize = 4096, ia_blocks = 0, ia_atime = 1402994133, ia_atime_nsec = 0, ia_mtime = 1402994133, ia_mtime_nsec = 0, 
          ia_ctime = 1402994133, ia_ctime_nsec = 837004085}
        gfid = '\000' <repeats 15 times>
        ret = <value optimized out>
#5  0x00007fbdf89215b3 in posix_do_readdir (frame=0x7fbe01ddb3c8, this=0x9846c0, fd=0x9d2c78, size=<value optimized out>, off=58, 
    whichop=40, dict=0x7fbe017d71a8) at posix.c:4935
        pfd = 0x7fbdd41d8eb0
        dir = <value optimized out>
        ret = <value optimized out>
        count = 58
        op_ret = 58
        op_errno = 2
        entries = {{list = {next = 0x7fbde402f890, prev = 0x7fbde43f9460}, {next = 0x7fbde402f890, prev = 0x7fbde43f9460}}, 
          d_ino = 4294967295, d_off = 140453799218192, d_len = 2629385827, d_type = 2888593132, d_stat = {ia_ino = 233410450496, 
            ia_gfid = "\000\000\000\000\000\000\000\000\030\006\000\000\000\000\000", ia_dev = 140454070839116, ia_type = 9997872, 
            ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', 
                exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', 
                write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 2, ia_uid = 0, ia_gid = 39, ia_rdev = 140454070839116, ia_size = 42, 
            ia_blksize = 0, ia_blocks = 140454070421914, ia_atime = 0, ia_atime_nsec = 257004085, ia_mtime = 50329162, 
            ia_mtime_nsec = 32702, ia_ctime = 23019252, ia_ctime_nsec = 32702}, dict = 0x7fbe017d71a8, inode = 0x7fbe017d71d8, 
          d_name = 0x7fbdf2cf4a90 "\220\370\002\344\275\177"}
        skip_dirs = 0
        __FUNCTION__ = "posix_do_readdir"
#6  0x00007fbdf892232e in posix_readdirp (frame=0x7fbe01ddb3c8, this=0x9846c0, fd=0x9d2c78, size=130944, off=0, dict=0x7fbe017d71a8)
    at posix.c:4985
        entries = {{list = {next = 0x41, prev = 0x7fbe02ff88bf}, {next = 0x41, prev = 0x7fbe02ff88bf}}, d_ino = 1, 
          d_off = 140454070635415, d_len = 1, d_type = 16843520, d_stat = {ia_ino = 140453554085168, 
            ia_gfid = "\fMl\001\276\177\000\000\376\177\371\002\276\177\000", ia_dev = 1, ia_type = 23874828, ia_prot = {
              suid = 0 '\000', sgid = 1 '\001', sticky = 1 '\001', owner = {read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, 
              group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', 
                exec = 0 '\000'}}, ia_nlink = 24998312, ia_uid = 32702, ia_gid = 49915109, ia_rdev = 140453765923276, 
            ia_size = 140454070565365, ia_blksize = 4073671752, ia_blocks = 140453891477175, ia_atime = 3966643372, 
            ia_atime_nsec = 1665055132, ia_mtime = 24998312, ia_mtime_nsec = 32702, ia_ctime = 24998360, ia_ctime_nsec = 32702}, 
          dict = 0x7fbdf84f0eb7, inode = 0x7fbe015f3eec, d_name = 0x7fbdf2cf4ba0 "A"}
        op_ret = -1
        op_errno = 0
        entry = 0x0
        __FUNCTION__ = "posix_readdirp"
#7  0x00007fbe02fa3f63 in default_readdirp (frame=0x7fbe01ddb3c8, this=0x985f90, fd=0x9d2c78, size=130944, off=<value optimized out>, 
    xdata=<value optimized out>) at defaults.c:2078
        old_THIS = 0x985f90
#8  0x00007fbdf84eb4bd in posix_acl_readdirp (frame=0x7fbe01dda250, this=0x988e30, fd=0x9d2c78, size=130944, offset=0, 
    dict=<value optimized out>) at posix-acl.c:1614
        _new = 0x7fbe01ddb3c8
        old_THIS = 0x988e30
        tmp_cbk = 0x7fbdf84ee8e0 <posix_acl_readdirp_cbk>
        ret = <value optimized out>
        alloc_dict = <value optimized out>
        __FUNCTION__ = "posix_acl_readdirp"
#9  0x00007fbdf82d3ef4 in pl_readdirp (frame=0x7fbe01ddabb8, this=0x98a030, fd=0x9d2c78, size=130944, offset=0, dict=0x7fbe017d71a8)
    at posix.c:2150
        _new = 0x7fbe01dda250
        old_THIS = 0x98a030
        tmp_cbk = 0x7fbdf82d4cd0 <pl_readdirp_cbk>
        local = <value optimized out>
        __FUNCTION__ = "pl_readdirp"
#10 0x00007fbe02fa6832 in default_readdirp_resume (frame=0x7fbe01dda04c, this=0x98b290, fd=0x9d2c78, size=130944, off=0, 
    xdata=0x7fbe017d71a8) at defaults.c:1645
        _new = 0x7fbe01ddabb8
        old_THIS = 0x98b290
        tmp_cbk = 0x7fbe02faaeb0 <default_readdirp_cbk>
        __FUNCTION__ = "default_readdirp_resume"
#11 0x00007fbe02fc0631 in call_resume_wind (stub=0x7fbe0186204c) at call-stub.c:2492
No locals.
#12 call_resume (stub=0x7fbe0186204c) at call-stub.c:2841
        old_THIS = 0x98b290
        __FUNCTION__ = "call_resume"
#13 0x00007fbdf80c8348 in iot_worker (data=0x9bcf80) at io-threads.c:214
        conf = 0x9bcf80
        this = <value optimized out>
        stub = <value optimized out>
        sleep_till = {tv_sec = 1403078869, tv_nsec = 0}
        ret = <value optimized out>
        pri = 3
        timeout = 0 '\000'
        bye = 0 '\000'
        sleep = {tv_sec = 0, tv_nsec = 0}
        __FUNCTION__ = "iot_worker"
#14 0x00000036586079d1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#15 0x00000036582e8b7d in clone () from /lib64/libc.so.6
No symbol table info available.

Comment 7 Rachana Patel 2014-06-23 09:22:17 UTC

Able to reproduce even without the remove-brick step.
- Create, start and mount the volume, add bricks. Do I/O. Start rebalance and, while rebalance is in progress, do renames.

The brick process crashed.

(gdb) bt
#0  0x000000376740c380 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f2669124af6 in dict_get () from /usr/lib64/libglusterfs.so.0
#2  0x00007f265a9eff1b in posix_lookup_xattr_fill () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#3  0x00007f265a9dd21e in posix_entry_xattr_fill () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#4  0x00007f265a9ec0bf in posix_readdirp_fill () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#5  0x00007f265a9ec5b3 in posix_do_readdir () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#6  0x00007f265a9ed32e in posix_readdirp () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#7  0x00007f266912ff63 in default_readdirp () from /usr/lib64/libglusterfs.so.0
#8  0x00007f265a5b64bd in posix_acl_readdirp () from /usr/lib64/glusterfs/3.6.0.19/xlator/features/access-control.so
#9  0x00007f265a39eef4 in pl_readdirp () from /usr/lib64/glusterfs/3.6.0.19/xlator/features/locks.so
#10 0x00007f2669132832 in default_readdirp_resume () from /usr/lib64/libglusterfs.so.0
#11 0x00007f266914c631 in call_resume () from /usr/lib64/libglusterfs.so.0
#12 0x00007f265a193348 in iot_worker () from /usr/lib64/glusterfs/3.6.0.19/xlator/performance/io-threads.so
#13 0x00000037674079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003766ce8b7d in clone () from /lib64/libc.so.6

Comment 10 Sayan Saha 2014-07-02 17:08:07 UTC

I think we should consider this a blocker for Denali if it is frequently reproducible, or else document that users should not execute rename operations while rebalance is in progress and fix it in U1.

Comment 11 Nithya Balachandran 2014-07-03 03:46:39 UTC

We have not managed to reproduce the bug after it was reported. I'm not sure if documenting it will help as, IIUC, rebalance can take days to complete and placing restrictions on the operations allowed might not be possible for such a long time.

Comment 12 Vivek Agarwal 2014-07-03 05:55:54 UTC

Based on the comment above, removing the blocker flag. To be targeted for U1.

Comment 14 shylesh 2015-01-08 08:47:27 UTC

This issue is also reproducible on 3.6.0.41-1.el6rhs.x86_64.

Comment 22 Amit Chaurasia 2015-03-01 18:21:14 UTC

Tried on NFS as well as FUSE mount points while looking in progress.
No crash was seen while performing renames, even while rebalance was running.
Marking it verified.

Comment 24 errata-xmlrpc 2015-03-26 06:34:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0682.html