Description of problem:

A 2-replica volume with 3 FUSE and 3 NFS clients, each running different tests such as ping_pong, rdd, fs-perf-test, sanity, and million-file creation. Graph changes were happening in parallel. Bounced a brick and issued a volume heal command. Added 2 more bricks, making it a 2x2 distributed-replicate volume, and started rebalance. Quota, geo-replication, and lock-heal were enabled. While rebalance was running, brought down 2 bricks, one from a replica pair, and after a while brought them back up (volume start force). The rebalance process crashed with the following backtrace.

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id mirror --xlator-option *dht'.
Program terminated with signal 6, Aborted.
#0  0x0000003cdba32885 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x0000003cdba32885 in raise () from /lib64/libc.so.6
#1  0x0000003cdba34065 in abort () from /lib64/libc.so.6
#2  0x0000003cdba2b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003cdba2bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fbbbdb5bd6e in __inode_path (inode=0x7fbbb0ecd174, name=0x0, bufp=0x7fff3d995ab8) at ../../../libglusterfs/src/inode.c:1090
#5  0x00007fbbbdb5c156 in inode_path (inode=0x7fbbb0ecd174, name=0x0, bufp=0x7fff3d995ab8) at ../../../libglusterfs/src/inode.c:1191
#6  0x00007fbbb95abfdb in protocol_client_reopen (this=0x2581920, fdctx=0x2646020) at ../../../../../xlators/protocol/client/src/client-handshake.c:1175
#7  0x00007fbbb95ac495 in client_post_handshake (frame=0x7fbbbc775d5c, this=0x2581920) at ../../../../../xlators/protocol/client/src/client-handshake.c:1283
#8  0x00007fbbb95accc0 in client_setvolume_cbk (req=0x7fbbb80d71c4, iov=0x7fbbb80d7204, count=1, myframe=0x7fbbbc775d5c) at ../../../../../xlators/protocol/client/src/client-handshake.c:1439
#9  0x00007fbbbd91da48 in rpc_clnt_handle_reply (clnt=0x25e97a0, pollin=0x32afdd0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:788
#10 0x00007fbbbd91dde5 in rpc_clnt_notify (trans=0x25f9330, mydata=0x25e97d0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x32afdd0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:907
#11 0x00007fbbbd919ec8 in rpc_transport_notify (this=0x25f9330, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x32afdd0) at ../../../../rpc/rpc-lib/src/rpc-transport.c:489
#12 0x00007fbbba3e7280 in socket_event_poll_in (this=0x25f9330) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1677
#13 0x00007fbbba3e7804 in socket_event_handler (fd=26, idx=16, data=0x25f9330, poll_in=1, poll_out=0, poll_err=0) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1792
#14 0x00007fbbbdb74d9c in event_dispatch_epoll_handler (event_pool=0x255f500, events=0x2579090, i=5) at ../../../libglusterfs/src/event.c:785
#15 0x00007fbbbdb74fbf in event_dispatch_epoll (event_pool=0x255f500) at ../../../libglusterfs/src/event.c:847
#16 0x00007fbbbdb7534a in event_dispatch (event_pool=0x255f500) at ../../../libglusterfs/src/event.c:947
#17 0x00000000004084c1 in main (argc=27, argv=0x7fff3d996178) at ../../../glusterfsd/src/glusterfsd.c:1674
(gdb) f 5
#5  0x00007fbbbdb5c156 in inode_path (inode=0x7fbbb0ecd174, name=0x0, bufp=0x7fff3d995ab8) at ../../../libglusterfs/src/inode.c:1191
1191            ret = __inode_path (inode, name, bufp);
(gdb) p *inode
$1 = {table = 0x7fbbac000d80, gfid = '\000' <repeats 15 times>, lock = 1, nlookup = 0, ref = 6, ia_type = IA_IFREG,
     fd_list = {next = 0x7fbbb0ecd1a4, prev = 0x7fbbb0ecd1a4}, dentry_list = {next = 0x7fbbb0ecd1b4, prev = 0x7fbbb0ecd1b4},
     hash = {next = 0x7fbbb0ecd1c4, prev = 0x7fbbb0ecd1c4}, list = {next = 0x7fbbb0ecd0ac, prev = 0x7fbbb0ecd708},
     _ctx = 0x7fbba4000e60}
(gdb) f 3
#3  0x0000003cdba2bac0 in __assert_fail () from /lib64/libc.so.6
(gdb) f 4
#4  0x00007fbbbdb5bd6e in __inode_path (inode=0x7fbbb0ecd174, name=0x0, bufp=0x7fff3d995ab8) at ../../../libglusterfs/src/inode.c:1090
1090            GF_ASSERT (0);
(gdb) l
1085            int64_t ret = 0;
1086            int len = 0;
1087            char *buf = NULL;
1088
1089            if (!inode || uuid_is_null (inode->gfid)) {
1090                    GF_ASSERT (0);
1091                    gf_log_callingfn (THIS->name, GF_LOG_WARNING, "invalid inode");
1092                    return -1;
1093            }
1094
(gdb) p inode->gfid
$2 = '\000' <repeats 15 times>
(gdb) info thr
  13 Thread 0x7fbbb3fff700 (LWP 30367)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12 Thread 0x7fbb830bc700 (LWP 30886)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  11 Thread 0x7fbba182c700 (LWP 30579)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10 Thread 0x7fbbb35fe700 (LWP 30368)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  9  Thread 0x7fbba3fff700 (LWP 30400)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8  Thread 0x7fbba35fe700 (LWP 30401)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  7  Thread 0x7fbbbc3f5700 (LWP 30363)  0x0000003cdc20f245 in sigwait () from /lib64/libpthread.so.0
  6  Thread 0x7fbbba1bc700 (LWP 30366)  0x0000003cdc20eccd in nanosleep () from /lib64/libpthread.so.0
  5  Thread 0x7fbb826bb700 (LWP 30887)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4  Thread 0x7fbba0e2b700 (LWP 30580)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3  Thread 0x7fbbbaff3700 (LWP 30365)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2  Thread 0x7fbbbb9f4700 (LWP 30364)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1  Thread 0x7fbbbd6da700 (LWP 30362)  0x0000003cdba32885 in raise () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Created a 2-replica replicate volume, started it, and mounted it via 3 FUSE and 3 NFS clients.
2. Enabled geo-replication, quota, lock-heal, and profiling on the volume.
3. Ran different tests on the mount points, such as ping_pong, fs-perf-test, rdd, the sanity script, and million-file creation.
4. Bounced a brick and started self-heal.
5. After some time, added 2 more bricks, making it 2x2 distributed-replicate, and started rebalance.
6. While the above tasks were running, bounced 2 bricks, one from a replica pair.

Actual results:
Rebalance process crashed.

Expected results:
Rebalance process should not crash.

Additional info:

gluster volume info:

Volume Name: mirror
Type: Distributed-Replicate
Volume ID: 2f7a3469-369f-4176-82cc-6afd744d1e37
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.16.156.9:/export/mirror
Brick2: 10.16.156.12:/export/mirror
Brick3: 10.16.156.15:/export/mirror
Brick4: 10.16.156.18:/export/mirror
Options Reconfigured:
performance.client-io-threads: on
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
features.quota: on
features.limit-usage: /:1TB
features.lock-heal: on
network.ping-timeout: 222
performance.stat-prefetch: off
geo-replication.indexing: on

Log excerpt leading up to the crash:

[2012-05-29 02:45:39.660648] I [client-handshake.c:1437:client_setvolume_cbk] 3-mirror-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-05-29 02:45:39.663753] I [client-handshake.c:453:client_set_lk_version_cbk] 3-mirror-client-3: Server lk version = 1
[2012-05-29 02:45:39.663797] I [client-handshake.c:453:client_set_lk_version_cbk] 3-mirror-client-1: Server lk version = 1
[2012-05-29 02:45:39.673566] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 3-mirror-replicate-0: added root inode
[2012-05-29 02:45:39.678303] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 3-mirror-replicate-1: added root inode
[2012-05-29 02:45:39.679909] I [dht-common.c:2337:dht_setxattr] 3-mirror-dht: fixing the layout of /
[2012-05-29 02:45:39.699981] I [dht-rebalance.c:1058:gf_defrag_migrate_data] 0-mirror-dht: migrate data called on /
[2012-05-29 02:45:39.737738] I [dht-rebalance.c:639:dht_migrate_file] 3-mirror-dht: /out: attempting to move from mirror-replicate-0 to mirror-replicate-1
[2012-05-29 02:45:46.706462] W [client.c:103:client_grace_timeout] 3-mirror-client-2: client grace timer expired, updating the lk-version to 2
[2012-05-29 02:46:49.093765] I [client-handshake.c:1628:select_server_supported_programs] 3-mirror-client-2: Using Program GlusterFS 3.3.0qa43, Num (1298437), Version (330)
[2012-05-29 02:46:49.094564] I [client-handshake.c:1425:client_setvolume_cbk] 3-mirror-client-2: Connected to 10.16.156.15:24009, attached to remote volume '/export/mirror'.
[2012-05-29 02:46:49.094612] I [client-handshake.c:1437:client_setvolume_cbk] 3-mirror-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-05-29 02:46:49.096629] I [client-handshake.c:453:client_set_lk_version_cbk] 3-mirror-client-2: Server lk version = 2
[2012-05-29 02:46:49.103579] I [client-handshake.c:1628:select_server_supported_programs] 0-mirror-client-2: Using Program GlusterFS 3.3.0qa43, Num (1298437), Version (330)
[2012-05-29 02:46:49.104543] I [client-handshake.c:1628:select_server_supported_programs] 1-mirror-client-2: Using Program GlusterFS 3.3.0qa43, Num (1298437), Version (330)
[2012-05-29 02:46:49.104704] I [client-handshake.c:1425:client_setvolume_cbk] 0-mirror-client-2: Connected to 10.16.156.15:24009, attached to remote volume '/export/mirror'.
[2012-05-29 02:46:49.104731] I [client-handshake.c:1437:client_setvolume_cbk] 0-mirror-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-05-29 02:46:49.104766] I [client-handshake.c:1274:client_post_handshake] 0-mirror-client-2: 6 fds open - Delaying child_up until they are re-opened

pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2012-05-29 02:46:49
configuration details:
argp 1
*** Bug 859387 has been marked as a duplicate of this bug. ***
*** Bug 821148 has been marked as a duplicate of this bug. ***
*** Bug 824533 has been marked as a duplicate of this bug. ***
*** Bug 849136 has been marked as a duplicate of this bug. ***
*** Bug 858456 has been marked as a duplicate of this bug. ***
CHANGE: http://review.gluster.org/4192 (protocol/client: Remember the gfid of opened fd) merged in release-3.3 by Vijay Bellur (vbellur)
*** Bug 832396 has been marked as a duplicate of this bug. ***