Bug 851068

Summary: replace-brick reports "Migration complete" while data are not migrated

Product: [Community] GlusterFS
Component: unclassified
Version: 3.3.0
Hardware: x86_64
OS: Linux
Status: CLOSED DEFERRED
Severity: unspecified
Priority: high
Type: Bug
Doc Type: Bug Fix
Reporter: Christos Triantafyllidis <christos.triantafyllidis>
Assignee: krishnan parthasarathi <kparthas>
CC: bugs, ctrianta, gluster-bugs, hans, nsathyan, vbellur
Last Closed: 2014-12-14 19:40:29 UTC

Attachments:
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log at destination server

Description Christos Triantafyllidis 2012-08-23 08:00:51 UTC
Created attachment 606471
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log at destination server

Description of problem:
The replace-brick command reports "Migration complete" after migrating for some time, even though the migration has not actually finished. I'm attaching part of the log file from the destination gluster server/brick:
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log

In the brick log on the source brick I see many lines like the following (possibly as many as one per file missing on the destination). The full log is quite big, so let me know if you want me to check something specific or attach part of it:
[2012-08-22 20:51:25.620159] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-cpg-pump: path /backup on subvolume cpg-replace-brick => -1 (No such file or directory)

I've verified that at least some (randomly selected) files that are not migrated are (see the sketch below):
a) visible and accessible on the brick's FS
b) visible and accessible from the FUSE client on another node
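
For reference, the spot check I ran looks roughly like this; paths here are illustrative, assuming the source brick sits at /brick/e and the volume is FUSE-mounted at /mnt/cpg on another node:

# on the source server: the file is present and readable directly on the brick's FS
ls -l /brick/e/backup/somefile
md5sum /brick/e/backup/somefile
# on another node, through the FUSE client: also present and readable
ls -l /mnt/cpg/backup/somefile
md5sum /mnt/cpg/backup/somefile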

Version-Release number of selected component (if applicable):
3.3.0

How reproducible:
Not sure what caused it, but I can now reproduce it on every brick I try to migrate, from and to any gluster server.


Steps to Reproduce:
1. Start a replace-brick process
2. Wait a few hours
3. Get the "Migration complete" message, then compare the "df" and "find . | wc -l" output of the source and destination bricks (see the sketch below)
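
For reference, the whole sequence looks roughly like this; the volume and brick names are illustrative, not my real ones:

# hypothetical names: volume "vol01", replacing srv1:/brick/a with srv2:/brick/b
gluster volume replace-brick vol01 srv1:/brick/a srv2:/brick/b start
gluster volume replace-brick vol01 srv1:/brick/a srv2:/brick/b status   # poll until it reports "Migration complete"
# then compare the bricks directly on each server:
df -h /brick/a                  # on srv1 (source)
df -h /brick/b                  # on srv2 (destination)
find /brick/a | wc -l           # entry count on the source brick
find /brick/b | wc -l           # entry count on the destination brick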
  
Actual results:
The brick is only partly migrated

Expected results:
The brick should be fully migrated

Additional info:
As said above, logs are available, but I'm not sure what to attach; let me know if anything specific is needed.

Comment 1 Christos Triantafyllidis 2012-08-24 07:42:36 UTC
Hi, any update?

Cheers,
Christos

Comment 2 Christos Triantafyllidis 2012-08-24 09:32:30 UTC
Some update from my end.

I tried a new migration on another volume. After a while I got the following in the logs of the source server:
[2012-08-24 12:01:35.615266] W [socket.c:195:__socket_rwv] 0-software-replace-brick: readv failed (Connection reset by peer)
[2012-08-24 12:01:35.615330] W [socket.c:1512:__socket_proto_state_machine] 0-software-replace-brick: reading from socket failed. Error (Connection reset by peer), peer (10.250.121.162:24026)
[2012-08-24 12:01:35.621058] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x3a25a0f7e8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x3a25a0f4a0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3a25a0ef0e]))) 0-software-replace-brick: forced unwinding frame type(GlusterFS 3.1) op(LINK(9)) called at 2012-08-24 12:01:28.717036 (xid=0x1847026x)
[2012-08-24 12:01:35.621098] W [client3_1-fops.c:2457:client3_1_link_cbk] 0-software-replace-brick: remote operation failed: Transport endpoint is not connected (00000000-0000-0000-0000-000000000000 -> /centos/6/os/i386/Packages/libcmpiCppImpl0-2.0.1-5.el6.i686.rpm)

and after that, a huge list of "remote operation failed: Transport endpoint is not connected" errors.

and at about the same time, on the destination server:
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-08-24 12:01:04
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0
/lib64/libc.so.6[0x3208a32920]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x320920c170]
/usr/lib64/libglusterfs.so.0(iobuf_unref+0x27)[0x320a242387]
/usr/lib64/libglusterfs.so.0(iobref_destroy+0x28)[0x320a242468]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_submit_reply+0x22c)[0x7f39dd90e18c]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup_cbk+0x3c9)[0x7f39dd929f99]
/usr/lib64/glusterfs/3.3.0/xlator/features/locks.so(pl_lookup_cbk+0x122)[0x7f39ddb45302]
/usr/lib64/glusterfs/3.3.0/xlator/storage/posix.so(posix_lookup+0x415)[0x7f39ddd626a5]
/usr/lib64/glusterfs/3.3.0/xlator/features/locks.so(pl_lookup+0x1e4)[0x7f39ddb44b44]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup_resume+0x125)[0x7f39dd92a935]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_done+0x33)[0x7f39dd910833]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0xc5)[0x7f39dd9110b5]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve+0x8d)[0x7f39dd910f4d]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0xbe)[0x7f39dd9110ae]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_inode+0x35)[0x7f39dd910e75]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve+0xa8)[0x7f39dd910f68]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0x9e)[0x7f39dd91108e]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(resolve_and_resume+0x14)[0x7f39dd911154]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0x18f)[0x7f39dd92a62f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293)[0x320b20a443]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93)[0x320b20a5b3]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x320b20b018]
/usr/lib64/glusterfs/3.3.0/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f39dcfda954]
/usr/lib64/glusterfs/3.3.0/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7f39dcfdaa37]
/usr/lib64/libglusterfs.so.0[0x320a23ed44]
/usr/sbin/glusterfs(main+0x58a)[0x4073ca]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3208a1ecdd]
/usr/sbin/glusterfs[0x404379]


Could this be related to https://bugzilla.redhat.com/show_bug.cgi?id=848859? If so, we now hit it with single-volume transfers too :(.

Comment 3 hans 2013-06-07 08:47:52 UTC
This reproduces on 3.3.2qa3:

stor1:~/ gluster volume replace-brick vol01 stor1:/brick/e stor3:/brick/b status
Number of files migrated = 3385012        Migration complete

stor1:~/ df -h /brick/e
Filesystem            Size  Used Avail Use% Mounted on
/dev/sde1             1.8T  1.5T  372G  80% /brick/e

stor3:~/ df -h /brick/b
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             1.8T  122G  1.7T   7% /brick/b

Clearly over a TiB of data is missing.
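
The same gap shows up in plain file counts; a quick cross-check along these lines (run on each server, no output shown here) confirms it:

stor1:~/ find /brick/e | wc -l    # source brick
stor3:~/ find /brick/b | wc -l    # destination brick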

Comment 4 Niels de Vos 2014-11-27 14:53:48 UTC
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5, or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will be closed automatically.