Bug 826584

Summary: [glusterfs-3.3.0qa45]: rebalance process crashed
Product: [Community] GlusterFS Reporter: Raghavendra Bhat <rabhat>
Component: distributeAssignee: shishir gowda <sgowda>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: pre-releaseCC: amarts, bhubbard, fharshav, gluster-bugs, nsathyan, shmohan
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.4.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-07-24 17:33:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 856013, 894237    

Description Raghavendra Bhat 2012-05-30 14:49:40 UTC
Description of problem:

Same setup as 826080. After adding 2 bricks to a 2 replica volume (making it 2x2 distributed replicate) did rebalance. Rebalance process crashed with the following backtrace.


Loaded symbols for /usr/local/lib/glusterfs/3.3.0qa45/xlator/performance/md-cache.so
Cannot find new threads: generic error
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.3.x86_64 valgrind-3.6.0-4.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x0000003cdc20eccd in nanosleep () from /lib64/libpthread.so.0
#1  0x0000000004c3e22b in gf_timer_proc (ctx=0x52ea040) at ../../../libglusterfs/src/timer.c:173
#2  0x0000003cdc2077f1 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003cdbae592d in clone () from /lib64/libc.so.6
(gdb) t 2
[Switching to thread 2 (Thread 28542)]#0  0x0000003cdc20e48d in write () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003cdc20e48d in write () from /lib64/libpthread.so.0
#1  0x0000000004c3a269 in gf_print_trace (signum=6) at ../../../libglusterfs/src/common-utils.c:373
#2  <signal handler called>
#3  0x0000003cdba32885 in raise () from /lib64/libc.so.6
#4  0x0000003cdba34065 in abort () from /lib64/libc.so.6
#5  0x0000003cdba2b9fe in __assert_fail_base () from /lib64/libc.so.6
#6  0x0000003cdba2bac0 in __assert_fail () from /lib64/libc.so.6
#7  0x0000000004c5ae1f in __gf_free (free_ptr=0x569f400) at ../../../libglusterfs/src/mem-pool.c:269
#8  0x000000000986e38a in gf_defrag_start_crawl (data=0x5689dd0) at ../../../../../xlators/cluster/dht/src/dht-rebalance.c:1491
#9  0x0000000004c6a882 in synctask_wrap (old_task=0xc6d0f20) at ../../../libglusterfs/src/syncop.c:120
#10 0x0000003cdba43610 in ?? () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb)  f 7
#7  0x0000000004c5ae1f in __gf_free (free_ptr=0x569f400) at ../../../libglusterfs/src/mem-pool.c:269
269                     GF_ASSERT (0);
(gdb) f 8
#8  0x000000000986e38a in gf_defrag_start_crawl (data=0x5689dd0) at ../../../../../xlators/cluster/dht/src/dht-rebalance.c:1491
1491                    GF_FREE (defrag);
(gdb) p defrag
$1 = (gf_defrag_info_t *) 0x569f400
(gdb) p *defrag
$2 = {total_files = 0, total_data = 0, num_files_lookedup = 0, total_failures = 0, lock = 1, cmd = 1, th = 356079360, 
  defrag_status = GF_DEFRAG_STATUS_COMPLETE, rpc = 0x0, connected = 0, is_exiting = 1, pid = -3, root_inode = 0x1579607c, 
  node_uuid = "/>\"N\310\fK\031\223\370\260\250\230\204_\035"}
(gdb) t 3
[Switching to thread 3 (Thread 28541)]#0  0x0000003cdba32885 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003cdba32885 in raise () from /lib64/libc.so.6
#1  0x0000003cdba34065 in abort () from /lib64/libc.so.6
#2  0x0000003cdba2b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003cdba2bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x0000000004c5ae1f in __gf_free (free_ptr=0x569f400) at ../../../libglusterfs/src/mem-pool.c:269
#5  0x000000000986e38a in gf_defrag_start_crawl (data=0x5689dd0) at ../../../../../xlators/cluster/dht/src/dht-rebalance.c:1491
#6  0x0000000004c6a882 in synctask_wrap (old_task=0x66e8240) at ../../../libglusterfs/src/syncop.c:120
#7  0x0000003cdba43610 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) f 5
#5  0x000000000986e38a in gf_defrag_start_crawl (data=0x5689dd0) at ../../../../../xlators/cluster/dht/src/dht-rebalance.c:1491
1491                    GF_FREE (defrag);
(gdb) p defrag
$3 = (gf_defrag_info_t *) 0x569f400
(gdb) p *defrag
$4 = {total_files = 0, total_data = 0, num_files_lookedup = 0, total_failures = 0, lock = 1, cmd = 1, th = 356079360, 
  defrag_status = GF_DEFRAG_STATUS_COMPLETE, rpc = 0x0, connected = 0, is_exiting = 1, pid = -3, root_inode = 0x1579607c, 
  node_uuid = "/>\"N\310\fK\031\223\370\260\250\230\204_\035"}
(gdb) t 4
[Switching to thread 4 (Thread 28540)]#0  0x0000003cdbad97f7 in unlink () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003cdbad97f7 in unlink () from /lib64/libc.so.6
#1  0x0000000000407807 in glusterfs_pidfile_cleanup (ctx=0x52ea040) at ../../../glusterfsd/src/glusterfsd.c:1333
#2  0x000000000040632c in cleanup_and_exit (signum=15) at ../../../glusterfsd/src/glusterfsd.c:851
#3  0x0000000000407c18 in glusterfs_sigwaiter (arg=0x7fefffa70) at ../../../glusterfsd/src/glusterfsd.c:1421
#4  0x0000003cdc2077f1 in start_thread () from /lib64/libpthread.so.0
#5  0x0000003cdbae592d in clone () from /lib64/libc.so.6
(gdb) t 5
[Switching to thread 5 (Thread 28539)]#0  0x0000003cdbad862d in read () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003cdbad862d in read () from /lib64/libc.so.6
#1  0x0000003cdba70f58 in _IO_new_file_underflow () from /lib64/libc.so.6
#2  0x0000003cdba72a5e in _IO_default_uflow_internal () from /lib64/libc.so.6
#3  0x0000003cdba6724a in _IO_getline_info_internal () from /lib64/libc.so.6
#4  0x0000003cdba6ff71 in fgets_unlocked () from /lib64/libc.so.6
#5  0x0000000008778656 in _nss_files_getservbyport_r () from /lib64/libnss_files.so.2
#6  0x0000003cdbb02cfa in getservbyport_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#7  0x0000003cdbb093f6 in getnameinfo () from /lib64/libc.so.6
#8  0x0000000004c39aa7 in gf_resolve_ip6 (hostname=0x6674750 "10.16.156.9", port=24007, family=0, dnscache=0x56e9ba8, 
    addr_info=0x7feffdf00) at ../../../libglusterfs/src/common-utils.c:138
#9  0x000000000855a2ed in af_inet_client_get_remote_sockaddr (this=0x56e9b30, sockaddr=0x7feffdfd0, sockaddr_len=0x7feffe054)
    at ../../../../../rpc/rpc-transport/socket/src/name.c:241
#10 0x000000000855ae61 in socket_client_get_remote_sockaddr (this=0x56e9b30, sockaddr=0x7feffdfd0, sockaddr_len=0x7feffe054, 
    sa_family=0x7feffe052) at ../../../../../rpc/rpc-transport/socket/src/name.c:499
#11 0x00000000085573cf in socket_connect (this=0x56e9b30, port=24009) at ../../../../../rpc/rpc-transport/socket/src/socket.c:2009
#12 0x0000000004eab9a8 in rpc_transport_connect (this=0x56e9b30, port=24009) at ../../../../rpc/rpc-lib/src/rpc-transport.c:380
#13 0x0000000004eb1394 in rpc_clnt_submit (rpc=0x56e96d0, prog=0x95cc340, procnum=27, cbkfn=0x93a8d10 <client3_1_lookup_cbk>, 
    proghdr=0x7feffeb10, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=0xee96a90, frame=0x5369a1c, rsphdr=0x0, 
    rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:1482
#14 0x000000000938dd4d in client_submit_request (this=0x5685000, req=0x7feffece0, frame=0x5369a1c, prog=0x95cc340, procnum=27, 
    cbkfn=0x93a8d10 <client3_1_lookup_cbk>, iobref=0x0, rsphdr=0x0, rsphdr_count=0, rsp_payload=0x0, rsp_payload_count=0, 
    rsp_iobref=0x0, xdrproc=0x50c236c <xdr_gfs3_lookup_req>) at ../../../../../xlators/protocol/client/src/client.c:229
#15 0x00000000093aa6b5 in client3_1_lookup (frame=0x5369a1c, this=0x5685000, data=0x7feffedf0)
    at ../../../../../xlators/protocol/client/src/client3_1-fops.c:2983
#16 0x000000000938e35d in client_lookup (frame=0x5369a1c, this=0x5685000, loc=0xb079d10, xdata=0x561ea70)
    at ../../../../../xlators/protocol/client/src/client.c:368
#17 0x0000000009636ccb in afr_lookup (frame=0x5369970, this=0x6677180, loc=0xa7447a0, xattr_req=0x561eb18)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:2182
#18 0x000000000988143d in dht_lookup_directory (frame=0x536976c, this=0x5689dd0, loc=0xa7447a0)
    at ../../../../../xlators/cluster/dht/src/dht-common.c:1132
#19 0x0000000009881b3d in dht_lookup_cbk (frame=0x536976c, cookie=0x5369818, this=0x5689dd0, op_ret=-1, op_errno=107, inode=0x0, 
    stbuf=0xb07ccac, xattr=0x0, postparent=0xb07cd1c) at ../../../../../xlators/cluster/dht/src/dht-common.c:1210
#20 0x0000000009635dbc in afr_lookup_done (frame=0x5369818, this=0x6677180)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:1880
#21 0x00000000096364be in afr_lookup_cbk (frame=0x5369818, cookie=0x0, this=0x6677180, op_ret=-1, op_errno=107, inode=0x1579607c, 
    buf=0x7fefff3f0, xattr=0x0, postparent=0x7fefff380) at ../../../../../xlators/cluster/afr/src/afr-common.c:2043
#22 0x00000000093a943c in client3_1_lookup_cbk (req=0xb6bb498, iov=0x7fefff600, count=1, myframe=0x53698c4)
at ../../../../../xlators/protocol/client/src/client3_1-fops.c:2636
#23 0x0000000004eaeb04 in saved_frames_unwind (saved_frames=0xb93b1f0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:378
#24 0x0000000004eaebb3 in saved_frames_destroy (frames=0xb93b1f0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:396
#25 0x0000000004eaf16d in rpc_clnt_connection_cleanup (conn=0x56e9700) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:558
#26 0x0000000004eafc4e in rpc_clnt_notify (trans=0x56e9b30, mydata=0x56e9700, event=RPC_TRANSPORT_DISCONNECT, data=0x56e9b30)
    at ../../../../rpc/rpc-lib/src/rpc-clnt.c:861
#27 0x0000000004eabec8 in rpc_transport_notify (this=0x56e9b30, event=RPC_TRANSPORT_DISCONNECT, data=0x56e9b30)
    at ../../../../rpc/rpc-lib/src/rpc-transport.c:489
#28 0x00000000085521d7 in socket_event_poll_err (this=0x56e9b30) at ../../../../../rpc/rpc-transport/socket/src/socket.c:685
#29 0x0000000008556890 in socket_event_handler (fd=10, idx=6, data=0x56e9b30, poll_in=1, poll_out=0, poll_err=16)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:1799
#30 0x0000000004c59d9c in event_dispatch_epoll_handler (event_pool=0x5307e60, events=0x666dc70, i=0)
    at ../../../libglusterfs/src/event.c:785
#31 0x0000000004c59fbf in event_dispatch_epoll (event_pool=0x5307e60) at ../../../libglusterfs/src/event.c:847
#32 0x0000000004c5a34a in event_dispatch (event_pool=0x5307e60) at ../../../libglusterfs/src/event.c:947
#33 0x0000000000408567 in main (argc=27, argv=0x7fefffc78) at ../../../glusterfsd/src/glusterfsd.c:1689
(gdb) 







Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

gluster volume info
 
Volume Name: mirror
Type: Distributed-Replicate
Volume ID: b99f6e32-ff75-4d87-9ed0-267cd8d6c66c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.16.156.9:/export/mirror
Brick2: 10.16.156.12:/export/mirror
Brick3: 10.16.156.15:/export/mirror
Brick4: 10.16.156.18:/export/mirror
Options Reconfigured:
performance.stat-prefetch: off
network.ping-timeout: 222
features.lock-heal: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
features.limit-usage: /:1TB
features.quota: on
geo-replication.indexing: on

valgrind log:

==28539== Thread 3:
==28539== Invalid read of size 4
==28539==    at 0x986E2DA: gf_defrag_start_crawl (dht-rebalance.c:1471)
==28539==  Address 0x569f430 is 80 bytes inside a block of size 144 free'd
==28539==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==28539== 
==28539== Invalid read of size 4
==28539==    at 0x986E2E6: gf_defrag_start_crawl (dht-rebalance.c:1472)
==28539==  Address 0x569f430 is 80 bytes inside a block of size 144 free'd
==28539==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==28539== 
==28539== Invalid write of size 4
==28539==    at 0x986E2F2: gf_defrag_start_crawl (dht-rebalance.c:1473)
==28539==  Address 0x569f430 is 80 bytes inside a block of size 144 free'd
==28539==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==28539== 
==28539== Invalid read of size 4
==28539==    at 0x3CDC20C110: pthread_spin_lock (in /lib64/libpthread-2.12.so)
==28539==  Address 0x569f420 is 64 bytes inside a block of size 144 free'd
==28539==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==28539== 
==28539== Invalid read of size 4
==28539==    at 0x986E5AE: gf_defrag_status_get (dht-rebalance.c:1560)
==28539==  Address 0x569f430 is 80 bytes inside a block of size 144 free'd
==28539==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==28539== 
==28539== Invalid read of size 8
==28539==    at 0x986E5BD: gf_defrag_status_get (dht-rebalance.c:1563)
==28539==  Address 0x569f400 is 32 bytes inside a block of size 144 free'd
==28539==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==28539== 
==28539== Invalid read of size 8
==28539==    at 0x986E5C8: gf_defrag_status_get (dht-rebalance.c:1564)
==28539==  Address 0x569f408 is 40 bytes inside a block of size 144 free'd
==28539==    at 0x4A0595D: free (vg_replace_malloc.c:366)
==28539== 
==28539== Invalid read of size 8
==28539==    at 0x986E5D4: gf_defrag_status_get (dht-rebalance.c:1565)
==28539==  Address 0x569f410 is 48 bytes inside a block of size 144 free'd
==28539==    at 0x4A0595D: free (vg_replace_malloc.c:366)

Comment 1 shishir gowda 2012-06-07 06:41:25 UTC
The patch http://review.gluster.com/#change,3495 has been successfully merged into master branch.

Comment 2 shishir gowda 2012-06-07 09:02:02 UTC
*** Bug 823256 has been marked as a duplicate of this bug. ***

Comment 3 Amar Tumballi 2012-06-08 06:13:12 UTC
only in upstream for now.

Comment 4 Harshavardhana 2012-10-01 21:43:35 UTC
*** Bug 856013 has been marked as a duplicate of this bug. ***