Bug 764955 (GLUSTER-3223)

Summary: rebalance fails
Product: [Community] GlusterFS
Reporter: Saurabh <saurabh>
Component: core
Assignee: Amar Tumballi <amarts>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: mainline
CC: amarts, gluster-bugs, vraman
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Saurabh 2011-07-22 06:20:40 UTC
The setup is:

[root@centos-qa-client-2 ~]# /opt/glusterfs/git_inst/sbin/gluster volume info dr3
 
Volume Name: dr3
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.1.12.134:/mnt/d3
Brick2: 10.1.12.135:/mnt/r3
Brick3: 10.1.12.134:/mnt/d-d3
Brick4: 10.1.12.135:/mnt/r-r3
Brick5: 10.1.12.134:/mnt/dd-d3
Brick6: 10.1.12.135:/mnt/rr-r3
[root@centos-qa-client-2 ~]# 

In the above setup I added these bricks:

Brick5: 10.1.12.134:/mnt/dd-d3
Brick6: 10.1.12.135:/mnt/rr-r3

The bricks were added after creating data over the FUSE mount; the data consisted of just a few files of size 1024.

After the addition, the rebalance fails.

This issue has been seen twice already.
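For reference, the sequence was presumably along these lines (the FUSE mount point /mnt/fuse-dr3 and the number of files are assumptions; only the volume name and brick paths come from the output above):

# create a few 1024-byte files on the FUSE mount
for i in $(seq 1 10); do dd if=/dev/urandom of=/mnt/fuse-dr3/f.$i bs=1024 count=1; done
# add the new replica pair and trigger rebalance
gluster volume add-brick dr3 10.1.12.134:/mnt/dd-d3 10.1.12.135:/mnt/rr-r3
gluster volume rebalance dr3 start
gluster volume rebalance dr3 status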

A glimpse from the glusterd log file:

[2011-07-22 01:24:20.769730] I [glusterd-rebalance.c:280:glusterd_defrag_start] 0-rebalance: rebalance on /etc/glusterd/mount/dr3 complete
[2011-07-22 01:24:20.770901] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:962)
[2011-07-22 01:24:20.840145] I [glusterd-rebalance.c:422:glusterd_rebalance_cmd_attempted_log] 0-glusterd: Received rebalance volume 3 on dr3
[2011-07-22 01:24:20.840777] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:946)


A glimpse from the brick logs (10.1.12.135:/mnt/rr-r3):

From the newly added replicate brick:
[2011-07-22 01:24:20.669310] I [server-handshake.c:534:server_setvolume] 0-dr3-server: accepted client from 10.1.12.135:964
[2011-07-22 01:24:20.770924] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.dr3-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.12.135:964)
[2011-07-22 01:24:20.770980] I [server-helpers.c:485:do_fd_cleanup] 0-dr3-server: fd cleanup on /
[2011-07-22 01:24:20.771004] I [server.c:438:server_rpc_notify] 0-dr3-server: disconnected connection from 10.1.12.135:964
[2011-07-22 01:24:20.771023] I [server-helpers.c:783:server_connection_destroy] 0-dr3-server: destroyed connection of centos-qa-client-3-29997-2011/07/22-01:24:16:640972-dr3-client-5
[2011-07-22 01:24:55.432259] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.dr3-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.12.134:989)
[2011-07-22 01:24:55.432349] I [server.c:438:server_rpc_notify] 0-dr3-server: disconnected connection from 10.1.12.134:989
[2011-07-22 01:24:55.432370] I [server-helpers.c:783:server_connection_destroy] 0-dr3-server: destroyed connection of centos-qa-client-2-23514-2011/07/22-01:17:07:634652-dr3-client-5



From the newly added distribute brick (10.1.12.134:/mnt/dd-d3):

[2011-07-22 01:17:41.698905] W [socket.c:204:__socket_rwv] 0-tcp.dr3-server: readv failed (Connection reset by peer)
[2011-07-22 01:17:41.698919] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.dr3-server: reading from socket failed. Error (Connection reset by peer), peer (10.1.12.135:946)
[2011-07-22 01:17:41.698943] I [server-helpers.c:485:do_fd_cleanup] 0-dr3-server: fd cleanup on /
[2011-07-22 01:17:41.698960] I [server.c:438:server_rpc_notify] 0-dr3-server: disconnected connection from 10.1.12.135:946
[2011-07-22 01:17:41.698976] I [server-helpers.c:783:server_connection_destroy] 0-dr3-server: destroyed connection of centos-qa-client-3-29997-2011/07/22-01:24:16:640972-dr3-client-4
[2011-07-22 01:18:16.358620] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.dr3-server: reading from socket failed. Error (Transport endpoint is not connected), peer (10.1.12.134:1022)
[2011-07-22 01:18:16.358699] I [server.c:438:server_rpc_notify] 0-dr3-server: disconnected connection from 10.1.12.134:1022
[2011-07-22 01:18:16.358737] I [server-helpers.c:783:server_connection_destroy] 0-dr3-server: destroyed connection of centos-qa-client-2-23514-2011/07/22-01:17:07:634652-dr3-client-4


However, peer status shows that the peer is still connected:
[root@centos-qa-client-3 glusterfs]# /opt/glusterfs/git_inst/sbin/gluster peer status
Number of Peers: 1

Hostname: 10.1.12.134
Uuid: 017e0724-58ae-427a-b7f8-928850b3dba6
State: Peer in Cluster (Connected)
[root@centos-qa-client-3 glusterfs]#

Comment 1 Saurabh 2011-07-22 06:21:20 UTC
The git repo is synced to this commit:


commit 37b4b7ccad5a479ed2ce96598e902a12fd457bd5
Author: Raghavendra Bhat <raghavendrabhat>
Date:   Wed Jul 20 12:21:12 2011 +0530

    debug/io-stats: allow log-level to be set to INFO
    
    Change-Id: Ieba6591b0641dcb7dad724c1d8199dc5e91b6bfd

Comment 2 Saurabh 2011-07-22 09:09:34 UTC

Comment 3 Amar Tumballi 2011-07-26 01:09:09 UTC
Hi Saurabh,

Can you check if rebalance works fine now? After http://review.gluster.com/90, most of these issues should be resolved.

Regards,
Amar

Comment 4 Saurabh 2011-07-26 08:36:03 UTC
Yeah Amar, the issue is still happening.

The latest sync points to this change:
Change-Id: I85cf4afc1f534f5f51018449d5d84baef18fce23

Comment 5 Saurabh 2011-08-19 09:52:48 UTC
Rebalance fails after disabling quota and adding new bricks:

[root@Centos1 ~]# /opt/glusterfs/3.3.0//sbin/gluster volume quota drep disable 
Disabling quota will delete all the quota configuration. Do you want to continue? (y/n) y
Disabling quota has been successful
[root@Centos1 ~]# /opt/glusterfs/3.3.0//sbin/gluster volume add-brick drep 10.1.12.134:/mnt/d-d3 10.1.12.135:/mnt/r-r3
Add Brick successful
[root@Centos1 ~]# /opt/glusterfs/3.3.0//sbin/gluster volume rebalance drep start
starting rebalance on volume drep has been successful
[root@Centos1 ~]# /opt/glusterfs/3.3.0//sbin/gluster volume rebalance drep status
rebalance failed



[2011-08-19 08:42:27.649555] I [dht-rebalance.c:317:dht_migrate_file] 0-drep-dht: /d0/f.26: attempting to move from drep-replicate-0 to drep-replicate-4
[2011-08-19 08:42:27.650134] W [dict.c:356:dict_del] (-->/opt/glusterfs/3.3.0//lib/glusterfs/3.3.0qa3/xlator/cluster/distribute.so(dht_migrate_file+0x54e) [0x2aaaada2e7be] (-->/opt/glusterfs/3.3.0//lib/libglusterfs.so.0(syncop_lookup+0x148) [0x2b540f63f508] (-->/opt/glusterfs/3.3.0//lib/glusterfs/3.3.0qa3/xlator/cluster/replicate.so(afr_lookup+0x403) [0x2aaaad8105f3]))) 0-dict: !this || key=gfid-req
[2011-08-19 08:42:27.652039] W [dht-rebalance.c:244:__dht_check_free_space] 0-drep-dht: data movement attempted from node (drep-replicate-0) with higher disk space to a node (drep-replicate-4) with lesser disk space (/d0/f.26)
[2011-08-19 08:42:27.652098] C [fuse-bridge.c:876:fuse_setxattr_cbk] 0-glusterfs-fuse: extended attribute not supported by the backend storage
[2011-08-19 08:42:27.652233] W [fuse-bridge.c:847:fuse_err_cbk] 0-glusterfs-fuse: 142: SETXATTR() /d0/f.26 => -1 (Operation not supported)
pending frames:
frame : type(1) op(SETXATTR)
frame : type(1) op(SETXATTR)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2011-08-19 08:42:27
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0qa3
/lib64/libc.so.6[0x3470c302d0]
/opt/glusterfs/3.3.0//lib/libglusterfs.so.0(synctask_wrap+0x4)[0x2b540f63c834]
/lib64/libc.so.6[0x3470c419c0]
---------
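Given the "extended attribute not supported by the backend storage" message above, one quick sanity check is whether the backend filesystems of the new bricks accept trusted.* xattrs at all. A minimal sketch (the test key name is arbitrary; the brick path is taken from the add-brick command above):

# set, read back, and remove a throwaway trusted xattr on the brick directory
setfattr -n trusted.glusterfs.test -v working /mnt/d-d3
getfattr -n trusted.glusterfs.test /mnt/d-d3
setfattr -x trusted.glusterfs.test /mnt/d-d3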


Comment 6 Amar Tumballi 2011-08-19 14:17:03 UTC
I tried to reproduce this in my setup; I get the same log:


[2011-08-19 22:41:41.595461] W [dict.c:356:dict_del] (-->/usr/local/lib/glusterfs/3git/xlator/cluster/distribute.so(+0xbab2) [0x7f06d70abab2] (-->/usr/local/lib/libglusterfs.so.0(syncop_lookup+0x350) [0x7f06dce6fbc9] (-->/usr/local/lib/glusterfs/3git/xlator/cluster/replicate.so(afr_lookup+0x524) [0x7f06d73444a5]))) 0-dict: !this || key=gfid-req
[2011-08-19 22:41:41.596551] W [dht-rebalance.c:244:__dht_check_free_space] 0-test-dht: data movement attempted from node (test-replicate-2) with higher disk space to a node (test-replicate-1) with lesser disk space (/testdir/16)
[2011-08-19 22:41:41.596588] W [fuse-bridge.c:847:fuse_err_cbk] 0-glusterfs-fuse: 135: SETXATTR() /testdir/16 => -1 (Operation not supported)

But the crash doesn't happen.

Comment 7 Amar Tumballi 2011-08-19 14:28:03 UTC
Saurabh, can you share login details of the machine on which the rebalance failed? I want to have a look at the core.
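If the core file is still around on that machine, a minimal sketch for pulling a backtrace out of it (the binary and core paths are assumptions based on the install prefix used above):

gdb /opt/glusterfs/3.3.0/sbin/glusterfs /core.<pid>
(gdb) bt
(gdb) thread apply all bt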

Comment 8 Amar Tumballi 2011-10-02 22:58:33 UTC
With bug 765368 fixed, this should be fixed as well. Please reopen if seen again.