Description of problem:
-----------------------
On a 6x3 volume, some bricks were brought down while rebalance was in progress. This caused the mount to become read-only (client quorum was enabled). While rebalance was still running, the bricks were brought back up. While checking the self-heal info output, one brick was found to be not connected. This was not one of the bricks that had been brought down.

The following is seen in the brick logs:

pending frames:
frame : type(0) op(18)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-04-22 11:28:30
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7dev
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3d140221c6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3d1403de2f]
/lib64/libc.so.6[0x3d120326a0]
/usr/lib64/glusterfs/3.7dev/xlator/storage/posix.so(posix_getxattr+0xbd3)[0x7f558e324c03]
/usr/lib64/libglusterfs.so.0(default_getxattr+0x7b)[0x3d14027bab]
/usr/lib64/libglusterfs.so.0(default_getxattr+0x7b)[0x3d14027bab]
/usr/lib64/libglusterfs.so.0(default_getxattr+0x7b)[0x3d14027bab]
/usr/lib64/glusterfs/3.7dev/xlator/features/bitrot-stub.so(br_stub_getxattr+0x1e9)[0x7f558d6923a9]
/usr/lib64/glusterfs/3.7dev/xlator/features/access-control.so(posix_acl_getxattr+0x173)[0x7f558d48b9f3]
/usr/lib64/glusterfs/3.7dev/xlator/features/locks.so(pl_getxattr+0x1bb)[0x7f558d275d8b]
/usr/lib64/libglusterfs.so.0(default_getxattr+0x7b)[0x3d14027bab]
/usr/lib64/libglusterfs.so.0(default_getxattr_resume+0x13a)[0x3d1402b38a]
/usr/lib64/libglusterfs.so.0(call_resume+0x80)[0x3d14046470]
/usr/lib64/glusterfs/3.7dev/xlator/performance/io-threads.so(iot_worker+0x158)[0x7f558ce5a388]
/lib64/libpthread.so.0[0x3d124079d1]
/lib64/libc.so.6(clone+0x6d)[0x3d120e88fd]
---------

The following is the volume configuration:

# gluster volume info vol

Volume Name: vol
Type: Distributed-Replicate
Volume ID: 133fe4f3-987c-474d-9904-c28475d4812f
Status: Started
Number of Bricks: 6 x 3 = 18
Transport-type: tcp
Bricks:
Brick1: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick2: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick3: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick4: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick1/b1
Brick5: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick6: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick7: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick8: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick2/b1
Brick9: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick10: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick11: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick12: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick3/b1
Brick13: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick14: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick15: vm5-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick16: vm6-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick17: vm3-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick18: vm4-rhsqa13.lab.eng.blr.redhat.com:/rhs/brick5/b1
Options Reconfigured:
cluster.quorum-type: auto
client.event-threads: 4
server.event-threads: 5
features.uss: enable
features.quota: on
cluster.consistent-metadata: on

Note that the client was on a different version of glusterfs than the server.
Version-Release number of selected component (if applicable):
---------------------------------------------------------------
On the server: glusterfs-3.7dev-0.965.git2788ddd.el6.x86_64
On the client: glusterfs-3.7dev-0.1009.git8b987be.el6.x86_64

How reproducible:
------------------
Saw this issue once.

Steps to Reproduce:
--------------------
1. On a 6x3 volume, started a remove-brick operation on one replica set.
2. After data migration for the remove-brick operation completed, executed remove-brick stop.
3. Started a rebalance operation on the volume.
4. While rebalance was in progress, killed two bricks each in 3 replica sets.
5. After a while, with rebalance still running, started the volume using force.
6. While monitoring the volume heal info output, noticed that one of the bricks was not connected.

Actual results:
----------------
The brick process crashed.

Expected results:
------------------
The brick process is not expected to crash.

Additional info:
On the same volume, another brick process crashed after some time. From the brick logs:

[2015-04-22 12:31:23.720702] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/netprio_cgroup.h
frame : type(0) op(0)
frame : type(0) op(20)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(40)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-04-22 12:31:23
configuration details:
[2015-04-22 12:31:23.731428] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/raw.h
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7dev
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3ad30221c6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3ad303de2f]
/lib64/libc.so.6[0x3ad14326a0]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x3ad180c380]
/usr/lib64/glusterfs/3.7dev/xlator/features/quota.so(+0x589b)[0x7f484d94689b]
/usr/lib64/glusterfs/3.7dev/xlator/features/quota.so(quota_fill_inodectx+0x1fa)[0x7f484d94f6aa]
/usr/lib64/glusterfs/3.7dev/xlator/features/quota.so(quota_readdirp_cbk+0x13e)[0x7f484d94fa9e]
/usr/lib64/glusterfs/3.7dev/xlator/features/marker.so(marker_readdirp_cbk+0x13e)[0x7f484db71bbe]
/usr/lib64/libglusterfs.so.0(default_readdirp_cbk+0xc2)[0x3ad302e622]
/usr/lib64/glusterfs/3.7dev/xlator/features/locks.so(pl_readdirp_cbk+0x18b)[0x7f484e5b6cfb]
/usr/lib64/glusterfs/3.7dev/xlator/features/access-control.so(posix_acl_readdirp_cbk+0x27a)[0x7f484e7d0b7a]
/usr/lib64/glusterfs/3.7dev/xlator/features/bitrot-stub.so(br_stub_readdirp_cbk+0x214)[0x7f484e9db304]
/usr/lib64/glusterfs/3.7dev/xlator/storage/posix.so(posix_do_readdir+0x1b8)[0x7f484f871498]
/usr/lib64/glusterfs/3.7dev/xlator/storage/posix.so(posix_readdirp+0x1ee)[0x7f484f872fde]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/glusterfs/3.7dev/xlator/features/bitrot-stub.so(br_stub_readdirp+0x259)[0x7f484e9d8e29]
/usr/lib64/glusterfs/3.7dev/xlator/features/access-control.so(posix_acl_readdirp+0x19d)[0x7f484e7cd4bd]
/usr/lib64/glusterfs/3.7dev/xlator/features/locks.so(pl_readdirp+0x204)[0x7f484e5b5d94]
/usr/lib64/libglusterfs.so.0(default_readdirp+0x83)[0x3ad3027333]
/usr/lib64/libglusterfs.so.0(default_readdirp_resume+0x142)[0x3ad3029db2]
/usr/lib64/libglusterfs.so.0(call_resume+0x80)[0x3ad3046470]
/usr/lib64/glusterfs/3.7dev/xlator/performance/io-threads.so(iot_worker+0x158)[0x7f484e1a1388]
/lib64/libpthread.so.0[0x3ad18079d1]
/lib64/libc.so.6(clone+0x6d)[0x3ad14e88fd]
---------

While self-heal was running, I attempted to remove a directory on the mount point.
# rm -fr linux-3.19.4
rm: cannot remove `linux-3.19.4/include/crypto': Directory not empty
rm: cannot remove `linux-3.19.4/include/drm': Directory not empty
rm: cannot remove `linux-3.19.4/include/media': Directory not empty
rm: cannot remove `linux-3.19.4/include/net/netfilter': Directory not empty
rm: cannot remove `linux-3.19.4/include/net/bluetooth': Directory not empty
rm: cannot remove `linux-3.19.4/include/net/caif': Directory not empty
rm: cannot remove `linux-3.19.4/include/net/tc_act': Directory not empty

I also see a lot of the following messages in the brick logs:

<snip>
[2015-04-22 12:31:14.461836] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/netns
[2015-04-22 12:31:18.132176] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/caif/caif_layer.h
[2015-04-22 12:31:23.675448] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/ip6_fib.h
[2015-04-22 12:31:23.691089] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/ip_fib.h
[2015-04-22 12:31:23.699589] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/lib80211.h
[2015-04-22 12:31:23.711344] E [posix.c:4433:posix_removexattr] 0-vol-posix: null gfid for path /linux-3.19.4/include/net/neighbour.h
</snip>
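The "null gfid" message above comes from a validation in the posix translator: it refuses to act on an object whose gfid is all zeroes, which can plausibly happen transiently while entries are being deleted and self-healed at the same time. Below is a minimal, self-contained sketch of that kind of guard; this is illustrative C, not the actual posix.c code (uuid_is_null and removexattr_guarded are invented names modelled on the real helpers).

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* modelled on gf_uuid_is_null(): a gfid of all-zero bytes means the
 * object carries no identity (e.g. mid-deletion or mid-heal) */
static int uuid_is_null(const uint8_t gfid[16])
{
    static const uint8_t zero[16];
    return memcmp(gfid, zero, sizeof(zero)) == 0;
}

/* sketch of the guard: log and refuse the operation, the way the
 * brick does at posix.c:4433 */
static int removexattr_guarded(const uint8_t gfid[16], const char *path)
{
    if (uuid_is_null(gfid)) {
        fprintf(stderr, "E [posix_removexattr] null gfid for path %s\n", path);
        return -1;   /* never touch an object without an identity */
    }
    /* the real code would call sys_lremovexattr(path, name) here */
    return 0;
}

int main(void)
{
    uint8_t gfid[16] = { 0 };
    removexattr_guarded(gfid, "/linux-3.19.4/include/net/netns");
    return 0;
}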
From the core for the first crash (core.28981):

(gdb) t 12
[Switching to thread 12 (Thread 0x7f558ef38700 (LWP 28986))]
#0  0x0000003d13839230 in ?? () from /usr/lib64/libsqlite3.so.0
(gdb) bt
#0  0x0000003d13839230 in ?? () from /usr/lib64/libsqlite3.so.0
#1  0x0000003d1381f9a1 in sqlite3_initialize () from /usr/lib64/libsqlite3.so.0
#2  0x0000003d13841949 in ?? () from /usr/lib64/libsqlite3.so.0
#3  0x00007f558dcd0cba in gf_open_sqlite3_conn (args=0x7f559821cbe8, db_conn=0x7f55880795f0) at gfdb_sqlite3.c:190
#4  gf_sqlite3_init (args=0x7f559821cbe8, db_conn=0x7f55880795f0) at gfdb_sqlite3.c:428
#5  0x00007f558dccbe1a in init_db (args=0x7f559821cbe8, gfdb_db_type=GFDB_SQLITE3) at gfdb_data_store.c:251
#6  0x00007f558dee2546 in init (this=0x7f558800a530) at changetimerecorder.c:1046
#7  0x0000003d1401f662 in __xlator_init (xl=0x7f558800a530) at xlator.c:397
#8  xlator_init (xl=0x7f558800a530) at xlator.c:420
#9  0x0000003d1405ec01 in glusterfs_graph_init (graph=<value optimized out>) at graph.c:319
#10 0x0000003d1405ed37 in glusterfs_graph_activate (graph=<value optimized out>, ctx=0xc77010) at graph.c:659
#11 0x0000000000405a9b in glusterfs_process_volfp (ctx=0xc77010, fp=0x7f55880008e0) at glusterfsd.c:2160
#12 0x000000000040d0c0 in mgmt_getspec_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f55986556d4) at glusterfsd-mgmt.c:1552
#13 0x0000003d1480ec65 in rpc_clnt_handle_reply (clnt=0xcdce60, pollin=0x7f55880016a0) at rpc-clnt.c:766
#14 0x0000003d14810102 in rpc_clnt_notify (trans=<value optimized out>, mydata=0xcdce90, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:894
#15 0x0000003d1480b7f8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:543
#16 0x00007f558f799bbd in socket_event_poll_in (this=0xcde9b0) at socket.c:2290
#17 0x00007f558f79b6ed in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0xcde9b0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2403
#18 0x0000003d1407e150 in event_dispatch_epoll_handler (data=0xcdfb70) at event-epoll.c:572
#19 event_dispatch_epoll_worker (data=0xcdfb70) at event-epoll.c:674
#20 0x0000003d124079d1 in start_thread () from /lib64/libpthread.so.0
#21 0x0000003d120e88fd in iopl () from /lib64/libc.so.6
#22 0x0000000000000000 in ?? ()

So it looks like a request arrived and was accepted before all the xlators in the graph had been initialised.
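To make the failure mode concrete, here is a minimal, self-contained sketch of the race in C. This is illustrative code, not GlusterFS source: names such as graph_ready, init_xlators and handle_fop are invented for the example. One thread stands in for glusterfs_graph_init(), still calling each xlator's init() (parked the way changetimerecorder is parked inside sqlite3_initialize() above), while the thread standing in for the epoll/RPC path is already willing to dispatch an incoming request into the half-built graph.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool graph_ready = false;   /* becomes true once every init() has returned */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* stand-in for glusterfs_graph_init(): one slow xlator init() keeps the
 * whole graph unusable for a while */
static void *init_xlators(void *arg)
{
    (void)arg;
    sleep(1);                      /* e.g. changetimerecorder waiting on sqlite3 */
    pthread_mutex_lock(&lock);
    graph_ready = true;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* stand-in for the RPC/event path handing a fop to the graph */
static void handle_fop(const char *when)
{
    pthread_mutex_lock(&lock);
    bool ready = graph_ready;
    pthread_mutex_unlock(&lock);

    if (!ready)
        /* without a readiness check the fop would be dispatched into
         * xlators whose private state init() has not set up yet; the
         * posix_getxattr crash above is what that looks like */
        printf("%s: graph not ready, fop must be rejected\n", when);
    else
        printf("%s: fop serviced normally\n", when);
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, init_xlators, NULL);
    handle_fop("early request");   /* arrives while init is still running */
    pthread_join(t, NULL);
    handle_fop("late request");    /* arrives after init completed */
    return 0;
}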
Created attachment 1018374 [details]
sosreport of Node1
Please file a separate BZ for the second crash as it is a different issue.
BZ #1215550 filed for the second crash.
Assigning it to Nithya, who seems to have been actively debugging the issue. Please assign it back if that's not the case. Thanks!
Another crash in master that seems similar to this issue:
http://build.gluster.org/job/rackspace-regression-2GB-triggered/9585/consoleFull

Here too, glusterfs_graph_init() was still in progress when notifications started being received.
REVIEW: http://review.gluster.org/11463 (protocol/server: start listening only after all xlators in graph are initialised.) posted (#2) for review on release-3.7 by Raghavendra G (rgowdapp)
REVIEW: http://review.gluster.org/11504 (protocol/server: fail setvolume if any of xlators is not initialized yet) posted (#1) for review on release-3.7 by Raghavendra G (rgowdapp)
COMMIT: http://review.gluster.org/11504 committed in release-3.7 by Raghavendra G (rgowdapp)
------
commit 12c854b7a28a8d764f0446d2e0133c447c2537c2
Author: Raghavendra G <rgowdapp>
Date:   Wed Jul 1 16:54:55 2015 +0530

    protocol/server: fail setvolume if any of xlators is not initialized yet

    We can only start recieving fops only when all xlators in graph are
    initialized.

    Change-Id: Id79100bab5878bb2518ed133c1118554fbb35229
    BUG: 1214169
    Signed-off-by: Raghavendra G <rgowdapp>
    Reviewed-on: http://review.gluster.org/11504
    Tested-by: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
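For illustration, here is a hedged, self-contained sketch of the idea behind this fix. It is not the literal patch: the structs below are pared-down stand-ins for GlusterFS's xlator_t and xlator_list_t, and graph_is_ready is an invented helper name. What it shows is the shape of the guard: the SETVOLUME handshake, through which a client attaches to a brick, is failed unless every translator in the graph has completed init().

#include <stdio.h>

/* pared-down stand-ins for GlusterFS's xlator types, for illustration */
typedef struct xlator xlator_t;
typedef struct xlator_list {
    xlator_t           *xlator;
    struct xlator_list *next;
} xlator_list_t;

struct xlator {
    const char    *name;
    int            init_succeeded;   /* set when this xlator's init() returns ok */
    xlator_list_t *children;
};

/* invented helper: 1 only if this xlator and everything below it has
 * finished initialising */
static int graph_is_ready(xlator_t *xl)
{
    if (!xl->init_succeeded)
        return 0;
    for (xlator_list_t *t = xl->children; t; t = t->next)
        if (!graph_is_ready(t->xlator))
            return 0;
    return 1;
}

int main(void)
{
    xlator_t      posix  = { "posix",  0, NULL };   /* still inside init() */
    xlator_list_t kids   = { &posix, NULL };
    xlator_t      server = { "server", 1, &kids };

    /* what server_setvolume() would do on an early handshake */
    if (!graph_is_ready(&server))
        printf("SETVOLUME rejected: graph not fully initialised yet\n");

    posix.init_succeeded = 1;                       /* init() finished */
    if (graph_is_ready(&server))
        printf("SETVOLUME accepted: all xlators initialised\n");
    return 0;
}

With a guard of this shape, clients cannot get fops into the brick until the whole graph is up; the earlier patch (review 11463 above) attacked the same window from the other side, by starting the listener only after graph initialisation.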
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.7.3, please open a new bug report.

glusterfs-3.7.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/12078
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user