Bug 851068 - replace-brick reports "Migration complete" while data are not migrated
Status: CLOSED DEFERRED
Product: GlusterFS
Classification: Community
Component: unclassified
Version: 3.3.0
Hardware: x86_64 Linux
Priority: high
Severity: unspecified
Assigned To: krishnan parthasarathi
Reported: 2012-08-23 04:00 EDT by Christos Triantafyllidis
Modified: 2015-11-03 18:04 EST

Doc Type: Bug Fix
Last Closed: 2014-12-14 14:40:29 EST
Type: Bug


Attachments
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log at destination server (23.84 KB, application/octet-stream), 2012-08-23 04:00 EDT, Christos Triantafyllidis

Description Christos Triantafyllidis 2012-08-23 04:00:51 EDT
Created attachment 606471 [details]
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log at destination server

Description of problem:
The replace-brick command reports "Migration complete" after migrating for some time, even though the migration has not actually finished. I'm attaching part of the log file from the destination gluster server/brick:
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log

In the brick log on the source brick I see many lines like the following (possibly as many as one per file missing on the destination). The full log is quite big; let me know if you want me to check something specific or attach part of it:
[2012-08-22 20:51:25.620159] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-cpg-pump: path /backup on subvolume cpg-replace-brick => -1 (No such file or directory)
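
A rough way to check whether these errors really track the missing files is to count them in the source brick log and compare with the file-count gap between the bricks. A minimal sketch; the log path here is only a placeholder, not the actual path from this setup:

# Placeholder for the source brick's log file
SOURCE_BRICK_LOG=/var/log/glusterfs/bricks/source-brick.log
# Count the per-file self-heal lookup failures logged during the migration
grep -c 'afr_sh_common_lookup_resp_handler' "$SOURCE_BRICK_LOG"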

I've verified that at least some (randomly selected) files that were not migrated are (spot check sketched below):
a) visible and accessible on brick's FS
b) visible and accessible from the FUSE client on another node
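
A spot check along those lines, with hypothetical paths (/export/old-brick as the source brick, /export/new-brick as the replacement brick, /mnt/cpg as a FUSE mount of the volume on another node):

F=backup/some/file                # a path taken from the lookup errors above
ls -l /export/old-brick/"$F"      # present on the source brick
ls -l /export/new-brick/"$F"      # missing on the destination brick
ls -l /mnt/cpg/"$F"               # still visible through the FUSE client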

Version-Release number of selected component (if applicable):
3.3.0

How reproducible:
Not sure what caused it, but I can now reproduce it on every brick I try to migrate, from and to any gluster server.


Steps to Reproduce:
1. Start a replace-brick process
2. Wait a few hours
3. After the "Migration complete" message appears, compare the "df" or "find . | wc -l" output of the source and destination bricks (a command sketch follows below)
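
The steps above spelled out as commands; the volume, host, and brick names are only illustrative, using the GlusterFS 3.3 replace-brick syntax:

# Start the migration from stor1:/brick/e to stor3:/brick/b
gluster volume replace-brick vol01 stor1:/brick/e stor3:/brick/b start

# Poll until the status line reports "Migration complete"
gluster volume replace-brick vol01 stor1:/brick/e stor3:/brick/b status

# Compare usage and file counts on the two bricks
df -h /brick/e                    # on stor1
df -h /brick/b                    # on stor3
find /brick/e -type f | wc -l     # on stor1
find /brick/b -type f | wc -l     # on stor3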
  
Actual results:
The brick is only partly migrated

Expected results:
The brick should be fully migrated

Additional info:
As said above, logs are available, but I'm not sure what to attach; let me know if anything specific is needed.
Comment 1 Christos Triantafyllidis 2012-08-24 03:42:36 EDT
Hi, any update?

Cheers,
Christos
Comment 2 Christos Triantafyllidis 2012-08-24 05:32:30 EDT
Some update from my end.

I tried a new migration on another volume. After a while I got this in the logs on the source server:
[2012-08-24 12:01:35.615266] W [socket.c:195:__socket_rwv] 0-software-replace-brick: readv failed (Connection reset by peer)
[2012-08-24 12:01:35.615330] W [socket.c:1512:__socket_proto_state_machine] 0-software-replace-brick: reading from socket failed. Error (Connection reset by peer), peer (10.250.121.162:24026)
[2012-08-24 12:01:35.621058] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x3a25a0f7e8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x3a25a0f4a0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3a25a0ef0e]))) 0-software-replace-brick: forced unwinding frame type(GlusterFS 3.1) op(LINK(9)) called at 2012-08-24 12:01:28.717036 (xid=0x1847026x)
[2012-08-24 12:01:35.621098] W [client3_1-fops.c:2457:client3_1_link_cbk] 0-software-replace-brick: remote operation failed: Transport endpoint is not connected (00000000-0000-0000-0000-000000000000 -> /centos/6/os/i386/Packages/libcmpiCppImpl0-2.0.1-5.el6.i686.rpm)

and after that a huge list of "operation failed: Transport endpoint is not connected" errors.

and at about the same time, on the destination server:
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-08-24 12:01:04
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0
/lib64/libc.so.6[0x3208a32920]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x320920c170]
/usr/lib64/libglusterfs.so.0(iobuf_unref+0x27)[0x320a242387]
/usr/lib64/libglusterfs.so.0(iobref_destroy+0x28)[0x320a242468]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_submit_reply+0x22c)[0x7f39dd90e18c]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup_cbk+0x3c9)[0x7f39dd929f99]
/usr/lib64/glusterfs/3.3.0/xlator/features/locks.so(pl_lookup_cbk+0x122)[0x7f39ddb45302]
/usr/lib64/glusterfs/3.3.0/xlator/storage/posix.so(posix_lookup+0x415)[0x7f39ddd626a5]
/usr/lib64/glusterfs/3.3.0/xlator/features/locks.so(pl_lookup+0x1e4)[0x7f39ddb44b44]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup_resume+0x125)[0x7f39dd92a935]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_done+0x33)[0x7f39dd910833]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0xc5)[0x7f39dd9110b5]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve+0x8d)[0x7f39dd910f4d]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0xbe)[0x7f39dd9110ae]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_inode+0x35)[0x7f39dd910e75]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve+0xa8)[0x7f39dd910f68]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0x9e)[0x7f39dd91108e]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(resolve_and_resume+0x14)[0x7f39dd911154]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0x18f)[0x7f39dd92a62f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293)[0x320b20a443]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93)[0x320b20a5b3]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x320b20b018]
/usr/lib64/glusterfs/3.3.0/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f39dcfda954]
/usr/lib64/glusterfs/3.3.0/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7f39dcfdaa37]
/usr/lib64/libglusterfs.so.0[0x320a23ed44]
/usr/sbin/glusterfs(main+0x58a)[0x4073ca]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3208a1ecdd]
/usr/sbin/glusterfs[0x404379]


Could this be related to https://bugzilla.redhat.com/show_bug.cgi?id=848859 ? If so, we now see it with single-volume transfers too :(.
Comment 3 hans 2013-06-07 04:47:52 EDT
This reproduces on 3.3.2qa3:

stor1:~/ gluster volume replace-brick vol01 stor1:/brick/e stor3:/brick/b status
Number of files migrated = 3385012        Migration complete

stor1:~/ df -h /brick/e
Filesystem            Size  Used Avail Use% Mounted on
/dev/sde1             1.8T  1.5T  372G  80% /brick/e

stor3:~/ df -h /brick/b
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             1.8T  122G  1.7T   7% /brick/b

Clearly over a TiB of data is missing.
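
Comparing per-directory usage on the two bricks would show where the missing data sits; a quick check along those lines, using the brick paths above on the respective servers:

stor1:~/ du -sh /brick/e/* | sort -k2
stor3:~/ du -sh /brick/b/* | sort -k2

Sorting by the path column lines the two listings up, so the directories that were skipped stand out.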
Comment 4 Niels de Vos 2014-11-27 09:53:48 EST
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will be closed automatically.
