Bug 851068 - replace-brick reports "Migration complete" while data are not migrated
Summary: replace-brick reports "Migration complete" while data are not migrated
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: GlusterFS
Classification: Community
Component: unclassified
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: unspecified
Target Milestone: ---
Assignee: krishnan parthasarathi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-08-23 08:00 UTC by Christos Triantafyllidis
Modified: 2015-11-03 23:04 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-14 19:40:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log at destination server (23.84 KB, application/octet-stream)
2012-08-23 08:00 UTC, Christos Triantafyllidis

Description Christos Triantafyllidis 2012-08-23 08:00:51 UTC
Created attachment 606471 [details]
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log at destination server

Description of problem:
The replace-brick command reports "Migration complete" after migrating for some time, even though the migration has not actually finished. I'm attaching part of the log file from the destination gluster server/brick:
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log

In the brick log on the source brick I see many lines like the one below (possibly as many as one per missing file on the destination). The full log is quite big; let me know if you want me to check something specific or attach part of it:
[2012-08-22 20:51:25.620159] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-cpg-pump: path /backup on subvolume cpg-replace-brick => -1 (No such file or directory)

I've verified that at least some (randomly selected) files that were not migrated are:
a) visible and accessible on brick's FS
b) visible and accessible from the FUSE client on another node
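A quick sandbox sketch of this inconsistency (the paths and the `check_path` helper are mine, not from the report; temp directories stand in for the source brick and the destination brick):

```shell
# Temp directories stand in for the two bricks.
src_brick=$(mktemp -d); dst_brick=$(mktemp -d)
mkdir "$src_brick/backup"        # path exists on the source brick only

check_path() {
    # Report whether a path is visible under a given view of the volume.
    if [ -e "$1/$2" ]; then echo "$2: present"; else echo "$2: MISSING"; fi
}

check_path "$src_brick" backup   # prints "backup: present"
check_path "$dst_brick" backup   # prints "backup: MISSING"
rm -rf "$src_brick" "$dst_brick"
```

On the real setup the same kind of check (e.g. `stat` on the brick path, on a FUSE mount, and on the new brick) shows the file everywhere except the replace-brick destination.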

Version-Release number of selected component (if applicable):
3.3.0

How reproducible:
Not sure what caused it, but now I can reproduce it on every brick I try to migrate from, and to any gluster server.


Steps to Reproduce:
1. Start a replace-brick process
2. Wait a few hours
3. Get the "Migration complete" message and compare the "df" or "find . | wc -l" output of the source and destination bricks
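The comparison in step 3 can be sketched as shell helpers (these function names are mine, not GlusterFS commands; the `.glusterfs` exclusion is an assumption to keep gluster's internal metadata out of the count):

```shell
count_entries() {
    # Count filesystem entries under a brick, skipping the internal
    # .glusterfs metadata directory.
    find "$1" -path "$1/.glusterfs" -prune -o -print | wc -l
}

verify_migration() {
    # Compare source vs destination entry counts; after a genuine
    # "Migration complete" the two should match.
    if [ "$1" -eq "$2" ]; then
        echo "counts match: migration plausible"
    else
        echo "MISMATCH: $1 source entries vs $2 destination entries"
    fi
}

# Usage (run count_entries on each server, e.g. the bricks from comment 3):
#   src=$(count_entries /brick/e)   # on stor1
#   dst=$(count_entries /brick/b)   # on stor3
#   verify_migration "$src" "$dst"
```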
  
Actual results:
The brick is only partly migrated

Expected results:
The brick should be fully migrated

Additional info:
As mentioned above, logs are available, but I'm not sure what to attach; let me know if anything specific is needed.

Comment 1 Christos Triantafyllidis 2012-08-24 07:42:36 UTC
Hi, any update?

Cheers,
Christos

Comment 2 Christos Triantafyllidis 2012-08-24 09:32:30 UTC
Some update from my end.

I tried a new migration on another volume. After a while I got the following in the logs on the source server:
[2012-08-24 12:01:35.615266] W [socket.c:195:__socket_rwv] 0-software-replace-brick: readv failed (Connection reset by peer)
[2012-08-24 12:01:35.615330] W [socket.c:1512:__socket_proto_state_machine] 0-software-replace-brick: reading from socket failed. Error (Connection reset by peer), peer (10.250.121.162:24026)
[2012-08-24 12:01:35.621058] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x3a25a0f7e8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x3a25a0f4a0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3a25a0ef0e]))) 0-software-replace-brick: forced unwinding frame type(GlusterFS 3.1) op(LINK(9)) called at 2012-08-24 12:01:28.717036 (xid=0x1847026x)
[2012-08-24 12:01:35.621098] W [client3_1-fops.c:2457:client3_1_link_cbk] 0-software-replace-brick: remote operation failed: Transport endpoint is not connected (00000000-0000-0000-0000-000000000000 -> /centos/6/os/i386/Packages/libcmpiCppImpl0-2.0.1-5.el6.i686.rpm)

and after that a huge list of "operation failed: Transport endpoint is not connected" errors.

and at about the same time, on the destination server:
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-08-24 12:01:04
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0
/lib64/libc.so.6[0x3208a32920]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x320920c170]
/usr/lib64/libglusterfs.so.0(iobuf_unref+0x27)[0x320a242387]
/usr/lib64/libglusterfs.so.0(iobref_destroy+0x28)[0x320a242468]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_submit_reply+0x22c)[0x7f39dd90e18c]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup_cbk+0x3c9)[0x7f39dd929f99]
/usr/lib64/glusterfs/3.3.0/xlator/features/locks.so(pl_lookup_cbk+0x122)[0x7f39ddb45302]
/usr/lib64/glusterfs/3.3.0/xlator/storage/posix.so(posix_lookup+0x415)[0x7f39ddd626a5]
/usr/lib64/glusterfs/3.3.0/xlator/features/locks.so(pl_lookup+0x1e4)[0x7f39ddb44b44]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup_resume+0x125)[0x7f39dd92a935]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_done+0x33)[0x7f39dd910833]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0xc5)[0x7f39dd9110b5]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve+0x8d)[0x7f39dd910f4d]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0xbe)[0x7f39dd9110ae]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_inode+0x35)[0x7f39dd910e75]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve+0xa8)[0x7f39dd910f68]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0x9e)[0x7f39dd91108e]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(resolve_and_resume+0x14)[0x7f39dd911154]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0x18f)[0x7f39dd92a62f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293)[0x320b20a443]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93)[0x320b20a5b3]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x320b20b018]
/usr/lib64/glusterfs/3.3.0/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f39dcfda954]
/usr/lib64/glusterfs/3.3.0/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7f39dcfdaa37]
/usr/lib64/libglusterfs.so.0[0x320a23ed44]
/usr/sbin/glusterfs(main+0x58a)[0x4073ca]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3208a1ecdd]
/usr/sbin/glusterfs[0x404379]


Could this be related to https://bugzilla.redhat.com/show_bug.cgi?id=848859 ? If so, we now see it with single-volume transfers too :(.

Comment 3 hans 2013-06-07 08:47:52 UTC
This reproduces on 3.3.2qa3:

stor1:~/ gluster volume replace-brick vol01 stor1:/brick/e stor3:/brick/b status
Number of files migrated = 3385012        Migration complete

stor1:~/ df -h /brick/e
Filesystem            Size  Used Avail Use% Mounted on
/dev/sde1             1.8T  1.5T  372G  80% /brick/e

stor3:~/ df -h /brick/b
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             1.8T  122G  1.7T   7% /brick/b

Clearly over a TiB of data is missing.

Comment 4 Niels de Vos 2014-11-27 14:53:48 UTC
The version that this bug has been reported against does not receive updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5, or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will be closed automatically.

