Bug 851068 - replace-brick reports "Migration complete" while data are not migrated
Product: GlusterFS
Classification: Community
Component: unclassified
x86_64 Linux
high Severity unspecified
: ---
: ---
Assigned To: krishnan parthasarathi
Depends On:
Reported: 2012-08-23 04:00 EDT by Christos Triantafyllidis
Modified: 2015-11-03 18:04 EST
CC: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2014-12-14 14:40:29 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log at destination server (23.84 KB, application/octet-stream)
2012-08-23 04:00 EDT, Christos Triantafyllidis

Description Christos Triantafyllidis 2012-08-23 04:00:51 EDT
Created attachment 606471 [details]
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log at destination server

Description of problem:
The replace-brick command reports "Migration complete" after migrating for some time, even though the migration has not actually finished. I'm attaching part of the log file from the destination gluster server/brick.

In the brick log on the source brick I see many lines like the one below (possibly as many as one per missing file on the destination; the full log is quite big, so let me know if you want me to check anything specific or attach part of it):
[2012-08-22 20:51:25.620159] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-cpg-pump: path /backup on subvolume cpg-replace-brick => -1 (No such file or directory)

I've verified that at least some (randomly selected) files that were not migrated are:
a) visible and accessible on brick's FS
b) visible and accessible from the FUSE client on another node

Version-Release number of selected component (if applicable):

How reproducible:
Not sure what originally caused it, but I can now reproduce it with every brick I try to migrate from, and with any gluster server as the destination.

Steps to Reproduce:
1. Start a replace-brick process
2. Wait a few hours
3. Get the "Migration complete" message, then compare the `df` or `find . | wc -l` output of the source and destination bricks
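The comparison in step 3 can be sketched as below. This is a minimal illustration, not part of the original report: the brick paths `/brick/src` and `/brick/dst` are hypothetical placeholders, and in practice each count would be taken on its respective server.

```shell
#!/bin/sh
# Sketch: compare file counts between source and destination bricks
# after replace-brick reports "Migration complete".
# SRC and DST are hypothetical paths; adjust to your brick layout.
SRC=${SRC:-/brick/src}
DST=${DST:-/brick/dst}

# Count regular files on each brick (run on the respective servers).
src_count=$(find "$SRC" -type f 2>/dev/null | wc -l)
dst_count=$(find "$DST" -type f 2>/dev/null | wc -l)

echo "source: $src_count files, destination: $dst_count files"
if [ "$src_count" -ne "$dst_count" ]; then
    echo "WARNING: file counts differ; migration may be incomplete"
fi
```

A matching `df -h` on both bricks (as in comment 3 below) gives a coarser but quicker signal, since a large used-space gap also indicates missing data.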
Actual results:
The brick is only partly migrated

Expected results:
The brick should be fully migrated

Additional info:
As said above, logs are available, but I'm not sure what to attach; let me know if anything specific is needed.
Comment 1 Christos Triantafyllidis 2012-08-24 03:42:36 EDT
Hi, any update?

Comment 2 Christos Triantafyllidis 2012-08-24 05:32:30 EDT
Some update from my end.

I tried a new migration on another volume. After a while I got the following in the logs of the source server:
[2012-08-24 12:01:35.615266] W [socket.c:195:__socket_rwv] 0-software-replace-brick: readv failed (Connection reset by peer)
[2012-08-24 12:01:35.615330] W [socket.c:1512:__socket_proto_state_machine] 0-software-replace-brick: reading from socket failed. Error (Connection reset by peer), peer (
[2012-08-24 12:01:35.621058] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x3a25a0f7e8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x3a25a0f4a0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3a25a0ef0e]))) 0-software-replace-brick: forced unwinding frame type(GlusterFS 3.1) op(LINK(9)) called at 2012-08-24 12:01:28.717036 (xid=0x1847026x)
[2012-08-24 12:01:35.621098] W [client3_1-fops.c:2457:client3_1_link_cbk] 0-software-replace-brick: remote operation failed: Transport endpoint is not connected (00000000-0000-0000-0000-000000000000 -> /centos/6/os/i386/Packages/libcmpiCppImpl0-2.0.1-5.el6.i686.rpm)

and after that, a huge list of "operation failed: Transport endpoint is not connected" errors.

and at about the same time, on the destination server:
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-08-24 12:01:04
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0

Could this be related to https://bugzilla.redhat.com/show_bug.cgi?id=848859 ? If so, we now see it with single-volume transfers too :(.
Comment 3 hans 2013-06-07 04:47:52 EDT
This reproduces on 3.3.2qa3:

stor1:~/ gluster volume replace-brick vol01 stor1:/brick/e stor3:/brick/b status
Number of files migrated = 3385012        Migration complete

stor1:~/ df -h /brick/e
Filesystem            Size  Used Avail Use% Mounted on
/dev/sde1             1.8T  1.5T  372G  80% /brick/e

stor3:~/ df -h /brick/b
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             1.8T  122G  1.7T   7% /brick/b

Clearly over a TiB of data is missing.
Comment 4 Niels de Vos 2014-11-27 09:53:48 EST
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will get automatically closed.
