Bug 851068
| Summary: | replace-brick reports "Migration complete" while data are not migrated | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Christos Triantafyllidis <christos.triantafyllidis> |
| Component: | unclassified | Assignee: | krishnan parthasarathi <kparthas> |
| Status: | CLOSED DEFERRED | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.3.0 | CC: | bugs, ctrianta, gluster-bugs, hans, nsathyan, vbellur |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-12-14 19:40:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Hi, any update?

Cheers,
Christos

Some update from my end. I tried a new migration on another volume. After a while I got the following in the logs on the source server:

```
[2012-08-24 12:01:35.615266] W [socket.c:195:__socket_rwv] 0-software-replace-brick: readv failed (Connection reset by peer)
[2012-08-24 12:01:35.615330] W [socket.c:1512:__socket_proto_state_machine] 0-software-replace-brick: reading from socket failed. Error (Connection reset by peer), peer (10.250.121.162:24026)
[2012-08-24 12:01:35.621058] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x3a25a0f7e8] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x3a25a0f4a0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3a25a0ef0e]))) 0-software-replace-brick: forced unwinding frame type(GlusterFS 3.1) op(LINK(9)) called at 2012-08-24 12:01:28.717036 (xid=0x1847026x)
[2012-08-24 12:01:35.621098] W [client3_1-fops.c:2457:client3_1_link_cbk] 0-software-replace-brick: remote operation failed: Transport endpoint is not connected (00000000-0000-0000-0000-000000000000 -> /centos/6/os/i386/Packages/libcmpiCppImpl0-2.0.1-5.el6.i686.rpm)
```

followed by a huge list of "operation failed: Transport endpoint is not connected" errors.
At about the same time, the destination server logged a crash:

```
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-08-24 12:01:04
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0
/lib64/libc.so.6[0x3208a32920]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x320920c170]
/usr/lib64/libglusterfs.so.0(iobuf_unref+0x27)[0x320a242387]
/usr/lib64/libglusterfs.so.0(iobref_destroy+0x28)[0x320a242468]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_submit_reply+0x22c)[0x7f39dd90e18c]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup_cbk+0x3c9)[0x7f39dd929f99]
/usr/lib64/glusterfs/3.3.0/xlator/features/locks.so(pl_lookup_cbk+0x122)[0x7f39ddb45302]
/usr/lib64/glusterfs/3.3.0/xlator/storage/posix.so(posix_lookup+0x415)[0x7f39ddd626a5]
/usr/lib64/glusterfs/3.3.0/xlator/features/locks.so(pl_lookup+0x1e4)[0x7f39ddb44b44]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup_resume+0x125)[0x7f39dd92a935]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_done+0x33)[0x7f39dd910833]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0xc5)[0x7f39dd9110b5]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve+0x8d)[0x7f39dd910f4d]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0xbe)[0x7f39dd9110ae]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_inode+0x35)[0x7f39dd910e75]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve+0xa8)[0x7f39dd910f68]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_resolve_all+0x9e)[0x7f39dd91108e]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(resolve_and_resume+0x14)[0x7f39dd911154]
/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0x18f)[0x7f39dd92a62f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293)[0x320b20a443]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93)[0x320b20a5b3]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x320b20b018]
/usr/lib64/glusterfs/3.3.0/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7f39dcfda954]
/usr/lib64/glusterfs/3.3.0/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7f39dcfdaa37]
/usr/lib64/libglusterfs.so.0[0x320a23ed44]
/usr/sbin/glusterfs(main+0x58a)[0x4073ca]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3208a1ecdd]
/usr/sbin/glusterfs[0x404379]
```

Could this be related to https://bugzilla.redhat.com/show_bug.cgi?id=848859? If so, we now have it with single-volume transfers too :(.

This reproduces on 3.3.2qa3:

```
stor1:~/ gluster volume replace-brick vol01 stor1:/brick/e stor3:/brick/b status
Number of files migrated = 3385012
Migration complete

stor1:~/ df -h /brick/e
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde1       1.8T  1.5T  372G  80% /brick/e

stor3:~/ df -h /brick/b
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       1.8T  122G  1.7T   7% /brick/b
```

Clearly over a TiB of data is missing.

The version this bug was reported against no longer receives updates from the Gluster Community. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will be closed automatically.
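Since the "Number of files migrated" reported by `replace-brick status` can evidently disagree with what actually landed on the destination, it is worth cross-checking the reported count against an on-disk count before committing. A minimal sketch of such a check (the embedded status text and the brick path are placeholders; in practice the text would come from `gluster volume replace-brick ... status` run on the real volume):

```shell
# Hypothetical captured status; in practice something like:
#   status=$(gluster volume replace-brick vol01 stor1:/brick/e stor3:/brick/b status)
status="Number of files migrated = 3385012
Migration complete"

# Extract the reported file count from the status text
reported=$(printf '%s\n' "$status" | sed -n 's/.*Number of files migrated = \([0-9]*\).*/\1/p')

# Count regular files actually present on the destination brick,
# skipping GlusterFS's internal .glusterfs metadata tree
dst_brick=/brick/b   # placeholder path
actual=$(find "$dst_brick" -path "$dst_brick/.glusterfs" -prune -o -type f -print 2>/dev/null | wc -l)

if [ "$reported" -ne "$actual" ]; then
    echo "MISMATCH: status reports $reported files, brick holds $actual"
else
    echo "OK: counts agree ($reported files)"
fi
```

Note that file counts are a coarser signal than `df` usage, but unlike `df` they are not skewed by sparse files or filesystem overhead.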
Created attachment 606471 [details]
/var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log at destination server

Description of problem:
The replace-brick command reports "Migration complete" after some time of migrating, while it has not actually finished the migration. I'm attaching part of the log file from the destination gluster server/brick: /var/log/glusterfs/var-lib-glusterd-vols-cpg-rb_dst_brick.vol.log

In the brick log on the source brick I see many lines like the following (possibly as many as one per file missing on the destination; the full log is quite big, so let me know if you want me to check something specific or attach part of it):

```
[2012-08-22 20:51:25.620159] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-cpg-pump: path /backup on subvolume cpg-replace-brick => -1 (No such file or directory)
```

I've verified that at least some (randomly selected) files that were not migrated are:
a) visible and accessible on the brick's FS
b) visible and accessible from the FUSE client on another node

Version-Release number of selected component (if applicable):
3.3.0

How reproducible:
Not sure what caused it, but I can now reproduce it on every brick I try to migrate, from and to any gluster server.

Steps to Reproduce:
1. Start a replace-brick process.
2. Wait a few hours.
3. After the "Migration complete" message appears, compare the `df` or `find . | wc -l` output of the source and destination bricks.

Actual results:
The brick is only partly migrated.

Expected results:
The brick should be fully migrated.

Additional info:
As noted above, logs are available, but I'm not sure what to attach; let me know if something specific is needed.
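Beyond comparing raw counts, the missing files themselves can be enumerated by diffing sorted file lists from the two bricks. A sketch under stated assumptions: the two `mktemp` directories stand in for the real brick roots (e.g. `/brick/e` on the source host and `/brick/b` on the destination; run each listing on its own host or over ssh), and the `touch` lines merely simulate a partial migration so the example is runnable anywhere:

```shell
# Stand-in directories for the two bricks (placeholders for real brick roots)
src=$(mktemp -d)
dst=$(mktemp -d)

# Simulate a partial migration: three files on the source, only two copied
touch "$src/a.rpm" "$src/b.rpm" "$src/c.rpm"
touch "$dst/a.rpm" "$dst/b.rpm"

# Emit relative file paths, sorted, excluding the .glusterfs metadata tree
list_brick() {
    (cd "$1" && find . -path ./.glusterfs -prune -o -type f -print | sort)
}

list_brick "$src" > "$src.list"
list_brick "$dst" > "$dst.list"

# comm -23 prints lines unique to the first file: files present on the
# source brick but missing on the destination
missing=$(comm -23 "$src.list" "$dst.list")
echo "Missing on destination:"
echo "$missing"
```

This is slower than `df` over millions of files but gives an actionable list of unmigrated paths rather than just a size discrepancy.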