Description of problem:
=======================
The brick process crashed while rebalance was in progress and renames were being performed from the mount point.

Version-Release number of selected component (if applicable):
=============================================================
3.6.0.18-1.el6rhs.x86_64

How reproducible:
=================
Intermittent

Steps to Reproduce:
====================
1. Create and mount (NFS) a distributed volume (3 bricks).
2. Add 2 more bricks.
3. Start creating files and directories on the mount point:
   (cd /mnt/nfs; for i in {1..100} ; do mv new$i $i ; cd $i; mv etcn$i etc$i; for j in {1..100}; do mv newf$j file$j; done ; done)
4. Execute remove-brick start for one brick.
5. Stop the remove-brick operation.
6. Start rebalance from one node.
7. While rebalance is in progress, start rename operations from the mount point and keep renaming the files and directories that were created earlier.
8. Rename fails with 'Input/output error' for a few entries.

Verified on the back end: the brick process had crashed and the rebalance had failed. (A sketch of the CLI steps is given below.)
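For reference, a minimal sketch of the volume-management commands behind steps 1-6, assuming the volume name test1, the brick paths and hosts shown in the status output below, and an NFS client mounting at /mnt/nfs (which brick was removed is not recorded, so 10.70.35.198:/brick0 is illustrative):

  # Step 1: create, start, and NFS-mount a 3-brick distributed volume
  gluster volume create test1 10.70.35.198:/brick0 10.70.35.172:/brick0 10.70.35.240:/brick0
  gluster volume start test1
  mount -t nfs -o vers=3,nolock 10.70.35.198:/test1 /mnt/nfs

  # Step 2: add two more bricks
  gluster volume add-brick test1 10.70.35.240:/brick1 10.70.35.198:/brick1

  # Steps 4-5: start, then stop, a remove-brick data migration
  gluster volume remove-brick test1 10.70.35.198:/brick0 start
  gluster volume remove-brick test1 10.70.35.198:/brick0 stop

  # Step 6: start rebalance and watch its progress
  gluster volume rebalance test1 start
  gluster volume rebalance test1 status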
[root@OVM3 ~]# gluster volume status
Status of volume: test1
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.198:/brick0                      N/A     N       19201
Brick 10.70.35.172:/brick0                      49152   Y       21103
Brick 10.70.35.240:/brick0                      49152   Y       26751
Brick 10.70.35.240:/brick1                      49153   Y       26839
Brick 10.70.35.198:/brick1                      49153   Y       20252
NFS Server on localhost                         2049    Y       20855
NFS Server on 10.70.35.240                      2049    Y       27460
NFS Server on 10.70.35.172                      2049    Y       21116

Task Status of Volume test1
------------------------------------------------------------------------------
Task     : Rebalance
ID       : 28ee029b-36c9-46e0-a31f-86e6184b0bfa
Status   : failed

[root@OVM3 ~]# gluster volume rebalance test1 status
        Node   Rebalanced-files      size   scanned   failures   skipped   status   run time in secs
   localhost               1249    24.4MB      5727         12         0   failed             195.00
 10.70.35.172                 73    31.7KB     10198         63         0   failed             196.00
 10.70.35.240               1527     8.2MB      5118         10         2   failed             196.00

Actual results:
===============
The brick process crashed.

Additional info:
=================
(gdb) bt
#0  0x000000365860c380 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007fbe02f98af6 in dict_get (this=0x662f333677656e2f, key=<value optimized out>) at dict.c:374
#2  0x00007fbdf8924f1b in posix_lookup_xattr_fill (this=0x9846c0, real_path=0x7fbdf2cf4460 "/brick0/.glusterfs/e2/f5/e2f52f09-eefe-45dc-b4ad-3d9de740317e/new9/new9/new9/new9/new9/new9/new9/new9/new9/new10/new11/new12/new13/new14/new15/new16/new17/new18/new19/new20/new21/new22/new23/new24/new"..., loc=0x7fbdf2cf4610, xattr_req=0x662f333677656e2f, buf=0x323677656e2f3136) at posix-helpers.c:633
#3  0x00007fbdf891221e in posix_entry_xattr_fill (this=0x9846c0, inode=<value optimized out>, fd=0x9d2c78, name=0x7fbde4397158 "file51", dict=0x662f333677656e2f, stbuf=0x323677656e2f3136) at posix.c:4801
#4  0x00007fbdf89210bf in posix_readdirp_fill (this=0x9846c0, fd=0x9d2c78, entries=0x7fbdf2cf4a90, dict=0x7fbe017d71a8) at posix.c:4853
#5  0x00007fbdf89215b3 in posix_do_readdir (frame=0x7fbe01ddb3c8, this=0x9846c0, fd=0x9d2c78, size=<value optimized out>, off=58, whichop=40, dict=0x7fbe017d71a8) at posix.c:4935
#6  0x00007fbdf892232e in posix_readdirp (frame=0x7fbe01ddb3c8, this=0x9846c0, fd=0x9d2c78, size=130944, off=0, dict=0x7fbe017d71a8) at posix.c:4985
#7  0x00007fbe02fa3f63 in default_readdirp (frame=0x7fbe01ddb3c8, this=0x985f90, fd=0x9d2c78, size=130944, off=<value optimized out>, xdata=<value optimized out>) at defaults.c:2078
#8  0x00007fbdf84eb4bd in posix_acl_readdirp (frame=0x7fbe01dda250, this=0x988e30, fd=0x9d2c78, size=130944, offset=0, dict=<value optimized out>) at posix-acl.c:1614
#9  0x00007fbdf82d3ef4 in pl_readdirp (frame=0x7fbe01ddabb8, this=0x98a030, fd=0x9d2c78, size=130944, offset=0, dict=0x7fbe017d71a8) at posix.c:2150
#10 0x00007fbe02fa6832 in default_readdirp_resume (frame=0x7fbe01dda04c, this=0x98b290, fd=0x9d2c78, size=130944, off=0, xdata=0x7fbe017d71a8) at defaults.c:1645
#11 0x00007fbe02fc0631 in call_resume_wind (stub=0x7fbe0186204c) at call-stub.c:2492
#12 call_resume (stub=0x7fbe0186204c) at call-stub.c:2841
#13 0x00007fbdf80c8348 in iot_worker (data=0x9bcf80) at io-threads.c:214
#14 0x00000036586079d1 in start_thread () from /lib64/libpthread.so.0
#15 0x00000036582e8b7d in clone () from /lib64/libc.so.6

(gdb) bt full
#0  0x000000365860c380 in pthread_spin_lock () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fbe02f98af6 in dict_get (this=0x662f333677656e2f, key=<value optimized out>) at dict.c:374
        pair = <value optimized out>
        __FUNCTION__ = "dict_get"
#2  0x00007fbdf8924f1b in posix_lookup_xattr_fill (this=0x9846c0, real_path=0x7fbdf2cf4460 "/brick0/.glusterfs/e2/f5/e2f52f09-eefe-45dc-b4ad-3d9de740317e/new9/new9/new9/new9/new9/new9/new9/new9/new9/new10/new11/new12/new13/new14/new15/new16/new17/new18/new19/new20/new21/new22/new23/new24/new"..., loc=0x7fbdf2cf4610, xattr_req=0x662f333677656e2f, buf=0x323677656e2f3136) at posix-helpers.c:633
        xattr = 0x0
        filler = {this = 0x0, real_path = 0x0, xattr = 0x0, stbuf = 0x0, loc = 0x0, inode = 0x0, fd = 0, flags = 0, op_errno = 0}
        list = _gf_false
#3  0x00007fbdf891221e in posix_entry_xattr_fill (this=0x9846c0, inode=<value optimized out>, fd=0x9d2c78, name=0x7fbde4397158 "file51", dict=0x662f333677656e2f, stbuf=0x323677656e2f3136) at posix.c:4801
        tmp_loc = {path = 0x3135656c69 <Address 0x3135656c69 out of bounds>, name = 0x0, inode = 0x7fbdf0ca4cb8, parent = 0x0, gfid = '\000' <repeats 15 times>, pargfid = '\000' <repeats 15 times>}
        entry_path = 0x7fbdf2cf4460 "/brick0/.glusterfs/e2/f5/e2f52f09-eefe-45dc-b4ad-3d9de740317e/new9/new9/new9/new9/new9/new9/new9/new9/new9/new10/new11/new12/new13/new14/new15/new16/new17/new18/new19/new20/new21/new22/new23/new24/new"...
#4  0x00007fbdf89210bf in posix_readdirp_fill (this=0x9846c0, fd=0x9d2c78, entries=0x7fbdf2cf4a90, dict=0x7fbe017d71a8) at posix.c:4853
        entry = 0x7fbde43970b0
        itable = 0x9d0170
        inode = 0x7fbdf0ca4cb8
        hpath = 0x7fbdf2cf46a0 "/brick0/.glusterfs/bf/2e/bf2e666d-939f-4f05-b415-c695f99d8111/new8/new9/new10/new11/new12/new13/new14/new15/new16/new17/new18/new19/new20/new21/new22/new23/new24/new25/new26/new27/new28/new29/new30/ne"...
        len = <value optimized out>
        stbuf = {ia_ino = 11215285721967525407, ia_gfid = "a\275ŃY\345C\246\233\244\262\361hb~\037", ia_dev = 64774, ia_type = IA_IFREG, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 0 '\000'}, group = {read = 1 '\001', write = 0 '\000', exec = 0 '\000'}, other = {read = 1 '\001', write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 1, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 4096, ia_blocks = 0, ia_atime = 1402994133, ia_atime_nsec = 0, ia_mtime = 1402994133, ia_mtime_nsec = 0, ia_ctime = 1402994133, ia_ctime_nsec = 837004085}
        gfid = '\000' <repeats 15 times>
        ret = <value optimized out>
#5  0x00007fbdf89215b3 in posix_do_readdir (frame=0x7fbe01ddb3c8, this=0x9846c0, fd=0x9d2c78, size=<value optimized out>, off=58, whichop=40, dict=0x7fbe017d71a8) at posix.c:4935
        pfd = 0x7fbdd41d8eb0
        dir = <value optimized out>
        ret = <value optimized out>
        count = 58
        op_ret = 58
        op_errno = 2
        entries = {{list = {next = 0x7fbde402f890, prev = 0x7fbde43f9460}, {next = 0x7fbde402f890, prev = 0x7fbde43f9460}}, d_ino = 4294967295, d_off = 140453799218192, d_len = 2629385827, d_type = 2888593132, d_stat = {ia_ino = 233410450496, ia_gfid = "\000\000\000\000\000\000\000\000\030\006\000\000\000\000\000", ia_dev = 140454070839116, ia_type = 9997872, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 2, ia_uid = 0, ia_gid = 39, ia_rdev = 140454070839116, ia_size = 42, ia_blksize = 0, ia_blocks = 140454070421914, ia_atime = 0, ia_atime_nsec = 257004085, ia_mtime = 50329162, ia_mtime_nsec = 32702, ia_ctime = 23019252, ia_ctime_nsec = 32702}, dict = 0x7fbe017d71a8, inode = 0x7fbe017d71d8, d_name = 0x7fbdf2cf4a90 "\220\370\002\344\275\177"}
        skip_dirs = 0
        __FUNCTION__ = "posix_do_readdir"
#6  0x00007fbdf892232e in posix_readdirp (frame=0x7fbe01ddb3c8, this=0x9846c0, fd=0x9d2c78, size=130944, off=0, dict=0x7fbe017d71a8) at posix.c:4985
        entries = {{list = {next = 0x41, prev = 0x7fbe02ff88bf}, {next = 0x41, prev = 0x7fbe02ff88bf}}, d_ino = 1, d_off = 140454070635415, d_len = 1, d_type = 16843520, d_stat = {ia_ino = 140453554085168, ia_gfid = "\fMl\001\276\177\000\000\376\177\371\002\276\177\000", ia_dev = 1, ia_type = 23874828, ia_prot = {suid = 0 '\000', sgid = 1 '\001', sticky = 1 '\001', owner = {read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}, ia_nlink = 24998312, ia_uid = 32702, ia_gid = 49915109, ia_rdev = 140453765923276, ia_size = 140454070565365, ia_blksize = 4073671752, ia_blocks = 140453891477175, ia_atime = 3966643372, ia_atime_nsec = 1665055132, ia_mtime = 24998312, ia_mtime_nsec = 32702, ia_ctime = 24998360, ia_ctime_nsec = 32702}, dict = 0x7fbdf84f0eb7, inode = 0x7fbe015f3eec, d_name = 0x7fbdf2cf4ba0 "A"}
        op_ret = -1
        op_errno = 0
        entry = 0x0
        __FUNCTION__ = "posix_readdirp"
#7  0x00007fbe02fa3f63 in default_readdirp (frame=0x7fbe01ddb3c8, this=0x985f90, fd=0x9d2c78, size=130944, off=<value optimized out>, xdata=<value optimized out>) at defaults.c:2078
        old_THIS = 0x985f90
#8  0x00007fbdf84eb4bd in posix_acl_readdirp (frame=0x7fbe01dda250, this=0x988e30, fd=0x9d2c78, size=130944, offset=0, dict=<value optimized out>) at posix-acl.c:1614
        _new = 0x7fbe01ddb3c8
        old_THIS = 0x988e30
        tmp_cbk = 0x7fbdf84ee8e0 <posix_acl_readdirp_cbk>
        ret = <value optimized out>
        alloc_dict = <value optimized out>
        __FUNCTION__ = "posix_acl_readdirp"
#9  0x00007fbdf82d3ef4 in pl_readdirp (frame=0x7fbe01ddabb8, this=0x98a030, fd=0x9d2c78, size=130944, offset=0, dict=0x7fbe017d71a8) at posix.c:2150
        _new = 0x7fbe01dda250
        old_THIS = 0x98a030
        tmp_cbk = 0x7fbdf82d4cd0 <pl_readdirp_cbk>
        local = <value optimized out>
        __FUNCTION__ = "pl_readdirp"
#10 0x00007fbe02fa6832 in default_readdirp_resume (frame=0x7fbe01dda04c, this=0x98b290, fd=0x9d2c78, size=130944, off=0, xdata=0x7fbe017d71a8) at defaults.c:1645
        _new = 0x7fbe01ddabb8
        old_THIS = 0x98b290
        tmp_cbk = 0x7fbe02faaeb0 <default_readdirp_cbk>
        __FUNCTION__ = "default_readdirp_resume"
#11 0x00007fbe02fc0631 in call_resume_wind (stub=0x7fbe0186204c) at call-stub.c:2492
No locals.
#12 call_resume (stub=0x7fbe0186204c) at call-stub.c:2841
        old_THIS = 0x98b290
        __FUNCTION__ = "call_resume"
#13 0x00007fbdf80c8348 in iot_worker (data=0x9bcf80) at io-threads.c:214
        conf = 0x9bcf80
        this = <value optimized out>
        stub = <value optimized out>
        sleep_till = {tv_sec = 1403078869, tv_nsec = 0}
        ret = <value optimized out>
        pri = 3
        timeout = 0 '\000'
        bye = 0 '\000'
        sleep = {tv_sec = 0, tv_nsec = 0}
        __FUNCTION__ = "iot_worker"
#14 0x00000036586079d1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#15 0x00000036582e8b7d in clone () from /lib64/libc.so.6
No symbol table info available.
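A note on the corrupted arguments in the trace above: the dict/xattr_req value in frames #1-#3 (0x662f333677656e2f) and the stbuf value (0x323677656e2f3136) are not plausible pointers; their bytes are all printable ASCII. Since x86_64 is little-endian, reversing the printed bytes recovers the in-memory string, and the results are fragments of the deeply nested rename path ("/new63/f" and "61/new62"; likewise tmp_loc.path 0x3135656c69 decodes to "ile51", matching the entry name "file51"). This suggests the long entry path overran a buffer and clobbered these arguments before dict_get dereferenced the garbage pointer and crashed in pthread_spin_lock. A quick way to check such a value, using the standard xxd and rev utilities:

  echo 662f333677656e2f | xxd -r -p | rev    # prints "/new63/f"
  echo 323677656e2f3136 | xxd -r -p | rev    # prints "61/new62"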
Able to reproduce even without the remove-brick step: create, start, and mount the volume; add bricks; do I/O; start rebalance; and while rebalance is in progress, do renames. The brick process crashed:

(gdb) bt
#0  0x000000376740c380 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f2669124af6 in dict_get () from /usr/lib64/libglusterfs.so.0
#2  0x00007f265a9eff1b in posix_lookup_xattr_fill () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#3  0x00007f265a9dd21e in posix_entry_xattr_fill () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#4  0x00007f265a9ec0bf in posix_readdirp_fill () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#5  0x00007f265a9ec5b3 in posix_do_readdir () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#6  0x00007f265a9ed32e in posix_readdirp () from /usr/lib64/glusterfs/3.6.0.19/xlator/storage/posix.so
#7  0x00007f266912ff63 in default_readdirp () from /usr/lib64/libglusterfs.so.0
#8  0x00007f265a5b64bd in posix_acl_readdirp () from /usr/lib64/glusterfs/3.6.0.19/xlator/features/access-control.so
#9  0x00007f265a39eef4 in pl_readdirp () from /usr/lib64/glusterfs/3.6.0.19/xlator/features/locks.so
#10 0x00007f2669132832 in default_readdirp_resume () from /usr/lib64/libglusterfs.so.0
#11 0x00007f266914c631 in call_resume () from /usr/lib64/libglusterfs.so.0
#12 0x00007f265a193348 in iot_worker () from /usr/lib64/glusterfs/3.6.0.19/xlator/performance/io-threads.so
#13 0x00000037674079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003766ce8b7d in clone () from /lib64/libc.so.6
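This second backtrace has no file/line information because debug symbols for the 3.6.0.19 build were not loaded. A sketch of how such a brick core can be symbolized on RHEL 6, assuming yum-utils and the debuginfo repository are available (the core file name below is hypothetical):

  # Install debug symbols matching the installed glusterfs build
  debuginfo-install glusterfs

  # Open the core against the brick daemon binary
  gdb /usr/sbin/glusterfsd /core.19201
  (gdb) bt full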
I think we should consider this a blocker for Denali if it is frequently reproducible; otherwise, we should document that users must not execute rename operations while rebalance is in progress, and fix it in U1.
We have not managed to reproduce the bug since it was reported. I'm not sure documenting it will help: IIUC, rebalance can take days to complete, and placing restrictions on the allowed operations for such a long time might not be feasible.
Based on the comment above, removing the blocker flag. To be targeted for U1.
This issue is also reproducible on 3.6.0.41-1.el6rhs.x86_64.
Tried on NFS as well as FUSE mount points while lookups were in progress. No crash seen while performing renames, even while rebalance was running. Marking it verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0682.html