Bug 826080 - [glusterfs- 3.3.0qa43]: rebalance process asserted due to null gfid
Summary: [glusterfs- 3.3.0qa43]: rebalance process asserted due to null gfid
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: protocol
Version: pre-release
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Duplicates: 821148 824533 832396 849136 858456
Depends On:
Blocks: 849136 852564 858456 858481
Reported: 2012-05-29 14:41 UTC by Raghavendra Bhat
Modified: 2013-07-24 17:18 UTC (History)

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-24 17:18:27 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Raghavendra Bhat 2012-05-29 14:41:46 UTC
Description of problem:
A 2-replica volume with 3 FUSE and 3 NFS clients, each running different tests such as ping_pong, rdd, fs-perf-test, sanity, million-file creation, etc. Graph changes were happening in parallel. Bounced a brick and issued the volume heal command. Added 2 more bricks, making it a 2x2 distributed-replicate volume, then started rebalance. Quota, geo-replication, and lock-heal were enabled.

While rebalance was running, brought down 2 bricks, one from a replica pair, and after a while brought them back up (volume start force).

The rebalance process crashed with the following backtrace.

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id mirror --xlator-option *dht'.
Program terminated with signal 6, Aborted.
#0  0x0000003cdba32885 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x0000003cdba32885 in raise () from /lib64/libc.so.6
#1  0x0000003cdba34065 in abort () from /lib64/libc.so.6
#2  0x0000003cdba2b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003cdba2bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fbbbdb5bd6e in __inode_path (inode=0x7fbbb0ecd174, name=0x0, bufp=0x7fff3d995ab8)
    at ../../../libglusterfs/src/inode.c:1090
#5  0x00007fbbbdb5c156 in inode_path (inode=0x7fbbb0ecd174, name=0x0, bufp=0x7fff3d995ab8) at ../../../libglusterfs/src/inode.c:1191
#6  0x00007fbbb95abfdb in protocol_client_reopen (this=0x2581920, fdctx=0x2646020)
    at ../../../../../xlators/protocol/client/src/client-handshake.c:1175
#7  0x00007fbbb95ac495 in client_post_handshake (frame=0x7fbbbc775d5c, this=0x2581920)
    at ../../../../../xlators/protocol/client/src/client-handshake.c:1283
#8  0x00007fbbb95accc0 in client_setvolume_cbk (req=0x7fbbb80d71c4, iov=0x7fbbb80d7204, count=1, myframe=0x7fbbbc775d5c)
    at ../../../../../xlators/protocol/client/src/client-handshake.c:1439
#9  0x00007fbbbd91da48 in rpc_clnt_handle_reply (clnt=0x25e97a0, pollin=0x32afdd0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:788
#10 0x00007fbbbd91dde5 in rpc_clnt_notify (trans=0x25f9330, mydata=0x25e97d0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x32afdd0)
    at ../../../../rpc/rpc-lib/src/rpc-clnt.c:907
#11 0x00007fbbbd919ec8 in rpc_transport_notify (this=0x25f9330, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x32afdd0)
    at ../../../../rpc/rpc-lib/src/rpc-transport.c:489
#12 0x00007fbbba3e7280 in socket_event_poll_in (this=0x25f9330) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1677
#13 0x00007fbbba3e7804 in socket_event_handler (fd=26, idx=16, data=0x25f9330, poll_in=1, poll_out=0, poll_err=0)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:1792
#14 0x00007fbbbdb74d9c in event_dispatch_epoll_handler (event_pool=0x255f500, events=0x2579090, i=5)
    at ../../../libglusterfs/src/event.c:785
#15 0x00007fbbbdb74fbf in event_dispatch_epoll (event_pool=0x255f500) at ../../../libglusterfs/src/event.c:847
#16 0x00007fbbbdb7534a in event_dispatch (event_pool=0x255f500) at ../../../libglusterfs/src/event.c:947
#17 0x00000000004084c1 in main (argc=27, argv=0x7fff3d996178) at ../../../glusterfsd/src/glusterfsd.c:1674
(gdb) f 5
#5  0x00007fbbbdb5c156 in inode_path (inode=0x7fbbb0ecd174, name=0x0, bufp=0x7fff3d995ab8) at ../../../libglusterfs/src/inode.c:1191
1191                    ret = __inode_path (inode, name, bufp);
(gdb) p *inode
$1 = {table = 0x7fbbac000d80, gfid = '\000' <repeats 15 times>, lock = 1, nlookup = 0, ref = 6, ia_type = IA_IFREG, fd_list = {
    next = 0x7fbbb0ecd1a4, prev = 0x7fbbb0ecd1a4}, dentry_list = {next = 0x7fbbb0ecd1b4, prev = 0x7fbbb0ecd1b4}, hash = {
    next = 0x7fbbb0ecd1c4, prev = 0x7fbbb0ecd1c4}, list = {next = 0x7fbbb0ecd0ac, prev = 0x7fbbb0ecd708}, _ctx = 0x7fbba4000e60}
(gdb) f 3
#3  0x0000003cdba2bac0 in __assert_fail () from /lib64/libc.so.6
(gdb) f 4
#4  0x00007fbbbdb5bd6e in __inode_path (inode=0x7fbbb0ecd174, name=0x0, bufp=0x7fff3d995ab8)
    at ../../../libglusterfs/src/inode.c:1090
1090                    GF_ASSERT (0);
(gdb) l
1085            int64_t        ret   = 0;
1086            int            len   = 0;
1087            char          *buf   = NULL;
1088
1089            if (!inode || uuid_is_null (inode->gfid)) {
1090                    GF_ASSERT (0);
1091                    gf_log_callingfn (THIS->name, GF_LOG_WARNING, "invalid inode");
1092                    return -1;
1093            }
1094
(gdb) p inode->gfid
$2 = '\000' <repeats 15 times>
(gdb) info thr
  13 Thread 0x7fbbb3fff700 (LWP 30367)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12 Thread 0x7fbb830bc700 (LWP 30886)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  11 Thread 0x7fbba182c700 (LWP 30579)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  10 Thread 0x7fbbb35fe700 (LWP 30368)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  9 Thread 0x7fbba3fff700 (LWP 30400)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8 Thread 0x7fbba35fe700 (LWP 30401)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  7 Thread 0x7fbbbc3f5700 (LWP 30363)  0x0000003cdc20f245 in sigwait () from /lib64/libpthread.so.0
  6 Thread 0x7fbbba1bc700 (LWP 30366)  0x0000003cdc20eccd in nanosleep () from /lib64/libpthread.so.0
  5 Thread 0x7fbb826bb700 (LWP 30887)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4 Thread 0x7fbba0e2b700 (LWP 30580)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 0x7fbbbaff3700 (LWP 30365)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 0x7fbbbb9f4700 (LWP 30364)  0x0000003cdc20b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1 Thread 0x7fbbbd6da700 (LWP 30362)  0x0000003cdba32885 in raise () from /lib64/libc.so.6
(gdb) 
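
The gdb session above shows the inode carrying an all-zero gfid, which is exactly the condition line 1089 of inode.c rejects. As a rough illustration only, the sketch below mimics that check outside GlusterFS. The names gfid_t, gfid_is_null and build_gfid_path are hypothetical and are not part of libglusterfs; the hex path format is also an assumption made for the example. Unlike the code in the backtrace, the sketch returns -1 rather than asserting.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the 16-byte gfid held in the inode. */
typedef unsigned char gfid_t[16];

/* Returns 1 when every byte of the gfid is zero (the state seen in
 * "p inode->gfid" above), 0 otherwise. */
static int gfid_is_null(const gfid_t gfid)
{
    static const gfid_t zero = {0};
    return memcmp(gfid, zero, sizeof(gfid_t)) == 0;
}

/* Hypothetical path builder: writes the gfid as 32 hex digits into buf.
 * Returns the number of characters written, or -1 when the gfid is
 * missing or null, or when buf is too small. */
static int build_gfid_path(const gfid_t gfid, char *buf, size_t len)
{
    size_t i;
    size_t off = 0;

    if (gfid == NULL || gfid_is_null(gfid))
        return -1; /* invalid inode: no gfid to build a path from */
    if (buf == NULL || len < 2 * sizeof(gfid_t) + 1)
        return -1;

    for (i = 0; i < sizeof(gfid_t); i++)
        off += (size_t)snprintf(buf + off, len - off, "%02x", gfid[i]);

    return (int)off;
}
```

Calling build_gfid_path with a zero-filled gfid_t returns -1 immediately, which corresponds to the branch taken in frame 4 before GF_ASSERT (0) fired.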





Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Created a 2-replica replicate volume, started it, and mounted it via 3 FUSE and 3 NFS clients.
2. Enabled geo-replication, quota, lock-heal, and profiling on the volume.
3. Ran different tests on the mount points, such as ping_pong, fs-perf-test, rdd, the sanity script, million-file creation, etc.
4. Bounced a brick and started self-heal.
5. After some time, added 2 more bricks, making it 2x2 distributed-replicate, and started rebalance.
6. While the above tasks were running, bounced 2 bricks, one from a replica pair.

Actual results:
rebalance process crashed

Expected results:

rebalance process should not crash
Additional info:
gluster volume info
 
Volume Name: mirror
Type: Distributed-Replicate
Volume ID: 2f7a3469-369f-4176-82cc-6afd744d1e37
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.16.156.9:/export/mirror
Brick2: 10.16.156.12:/export/mirror
Brick3: 10.16.156.15:/export/mirror
Brick4: 10.16.156.18:/export/mirror
Options Reconfigured:
performance.client-io-threads: on
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
features.quota: on
features.limit-usage: /:1TB
features.lock-heal: on
network.ping-timeout: 222
performance.stat-prefetch: off
geo-replication.indexing: on

[2012-05-29 02:45:39.660648] I [client-handshake.c:1437:client_setvolume_cbk] 3-mirror-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-05-29 02:45:39.663753] I [client-handshake.c:453:client_set_lk_version_cbk] 3-mirror-client-3: Server lk version = 1
[2012-05-29 02:45:39.663797] I [client-handshake.c:453:client_set_lk_version_cbk] 3-mirror-client-1: Server lk version = 1
[2012-05-29 02:45:39.673566] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 3-mirror-replicate-0: added root inode
[2012-05-29 02:45:39.678303] I [afr-common.c:1965:afr_set_root_inode_on_first_lookup] 3-mirror-replicate-1: added root inode
[2012-05-29 02:45:39.679909] I [dht-common.c:2337:dht_setxattr] 3-mirror-dht: fixing the layout of /
[2012-05-29 02:45:39.699981] I [dht-rebalance.c:1058:gf_defrag_migrate_data] 0-mirror-dht: migrate data called on /
[2012-05-29 02:45:39.737738] I [dht-rebalance.c:639:dht_migrate_file] 3-mirror-dht: /out: attempting to move from mirror-replicate-0 to mirror-replicate-1
[2012-05-29 02:45:46.706462] W [client.c:103:client_grace_timeout] 3-mirror-client-2: client grace timer expired, updating the lk-version to 2
[2012-05-29 02:46:49.093765] I [client-handshake.c:1628:select_server_supported_programs] 3-mirror-client-2: Using Program GlusterFS 3.3.0qa43, Num (1298437), Version (330)
[2012-05-29 02:46:49.094564] I [client-handshake.c:1425:client_setvolume_cbk] 3-mirror-client-2: Connected to 10.16.156.15:24009, attached to remote volume '/export/mirror'.
[2012-05-29 02:46:49.094612] I [client-handshake.c:1437:client_setvolume_cbk] 3-mirror-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-05-29 02:46:49.096629] I [client-handshake.c:453:client_set_lk_version_cbk] 3-mirror-client-2: Server lk version = 2
[2012-05-29 02:46:49.103579] I [client-handshake.c:1628:select_server_supported_programs] 0-mirror-client-2: Using Program GlusterFS 3.3.0qa43, Num (1298437), Version (330)
[2012-05-29 02:46:49.104543] I [client-handshake.c:1628:select_server_supported_programs] 1-mirror-client-2: Using Program GlusterFS 3.3.0qa43, Num (1298437), Version (330)
[2012-05-29 02:46:49.104704] I [client-handshake.c:1425:client_setvolume_cbk] 0-mirror-client-2: Connected to 10.16.156.15:24009, attached to remote volume '/export/mirror'.
[2012-05-29 02:46:49.104731] I [client-handshake.c:1437:client_setvolume_cbk] 0-mirror-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-05-29 02:46:49.104766] I [client-handshake.c:1274:client_post_handshake] 0-mirror-client-2: 6 fds open - Delaying child_up until they are re-opened
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2012-05-29 02:46:49
configuration details:
argp 1

Comment 1 shishir gowda 2012-09-25 07:39:22 UTC
*** Bug 859387 has been marked as a duplicate of this bug. ***

Comment 2 shishir gowda 2012-09-26 05:39:53 UTC
*** Bug 821148 has been marked as a duplicate of this bug. ***

Comment 3 Raghavendra G 2012-11-16 06:13:56 UTC
*** Bug 824533 has been marked as a duplicate of this bug. ***

Comment 4 Raghavendra G 2012-11-16 06:14:56 UTC
*** Bug 849136 has been marked as a duplicate of this bug. ***

Comment 5 Raghavendra G 2012-11-16 06:16:08 UTC
*** Bug 858456 has been marked as a duplicate of this bug. ***

Comment 6 Vijay Bellur 2012-11-20 07:30:43 UTC
CHANGE: http://review.gluster.org/4192 (protocol/client: Remember the gfid of opened fd) merged in release-3.3 by Vijay Bellur (vbellur)
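
Going only by the change title, the idea appears to be that the client keeps its own copy of the gfid when an fd is opened, so that a later reopen does not depend on the inode still holding a valid gfid. The sketch below is one possible reading of that idea and is not the merged patch. All names in it (sketch_inode, sketch_fdctx, fdctx_on_open, fdctx_reopen_gfid) are hypothetical and do not match the real protocol/client structures.

```c
#include <string.h>

typedef unsigned char gfid_t[16];

/* Hypothetical, trimmed-down inode: only the field this sketch needs. */
struct sketch_inode {
    gfid_t gfid;
};

/* Hypothetical fd context kept by the client for each open fd. */
struct sketch_fdctx {
    struct sketch_inode *inode;
    gfid_t gfid;      /* copy of the gfid taken when the open succeeded */
    long remote_fd;   /* server-side fd number */
};

/* Record the gfid at open time, while the inode is known to be valid. */
static void fdctx_on_open(struct sketch_fdctx *ctx,
                          struct sketch_inode *inode, long remote_fd)
{
    ctx->inode = inode;
    ctx->remote_fd = remote_fd;
    memcpy(ctx->gfid, inode->gfid, sizeof(gfid_t));
}

/* Produce the gfid a reopen request would carry, using the remembered
 * copy instead of reading the inode again. Returns 0 on success and -1
 * if no usable gfid was ever recorded. */
static int fdctx_reopen_gfid(const struct sketch_fdctx *ctx, gfid_t out)
{
    static const gfid_t zero = {0};

    if (memcmp(ctx->gfid, zero, sizeof(gfid_t)) == 0)
        return -1;

    memcpy(out, ctx->gfid, sizeof(gfid_t));
    return 0;
}
```

With this arrangement, a reopen after reconnect still has a gfid to send even if the inode's own gfid reads as all zeros at that moment, which is the state captured in the core dump.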

Comment 7 Vijay Bellur 2012-11-20 07:41:24 UTC
*** Bug 832396 has been marked as a duplicate of this bug. ***

