Description of problem:
=======================
Triggered remove-brick start and, while migration was in progress, continuously sent lookups and renames on a directory. The remove-brick status turns to "failed" during this process.

Version-Release number of selected component (if applicable):
3.7.9-10.el7rhgs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Create a distributed EC volume, say 3 x (8+3), and start it.
2. FUSE mount the volume on a client.
3. From the mount point, start untarring the Linux kernel package.
4. Run a continuous loop of "ls -lRt" from the mount point.
5. Remove a few bricks from the volume, e.g. remove 11 bricks to make it 2 x (8+3).
6. While migration is in progress, keep renaming a directory, e.g.:
   mv dir1 dir2
   mv dir2 dir3
   mv dir3 dir4
   ...
7. Check the remove-brick status.

(A rough shell sketch of steps 1, 2, 4 and 6 is included after the volume status output below.)

Actual results:
===============
remove-brick fails.

Expected results:
=================
The remove-brick start operation should not fail.

Additional info:
================
Here are the output snippets:

[root@dhcp43-57 glusterfs]# gluster v status
Status of volume: dist_disperse_vol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.43.57:/bricks/brick0/b0         49221     0          Y       3278
Brick 10.70.43.185:/bricks/brick0/b0        49212     0          Y       29196
Brick 10.70.41.176:/bricks/brick0/b0        49213     0          Y       28199
Brick 10.70.43.95:/bricks/brick0/b0         49213     0          Y       26036
Brick 10.70.43.57:/bricks/brick1/b1         49222     0          Y       3297
Brick 10.70.43.185:/bricks/brick1/b1        49213     0          Y       29215
Brick 10.70.41.176:/bricks/brick1/b1        49214     0          Y       28218
Brick 10.70.43.95:/bricks/brick1/b1         49214     0          Y       26055
Brick 10.70.43.57:/bricks/brick2/b2         49223     0          Y       3316
Brick 10.70.43.185:/bricks/brick2/b2        49214     0          Y       29234
Brick 10.70.41.176:/bricks/brick2/b2        49215     0          Y       28237
Brick 10.70.43.95:/bricks/brick2/b2         49215     0          Y       26074
Brick 10.70.43.57:/bricks/brick3/b3         49224     0          Y       3335
Brick 10.70.43.185:/bricks/brick3/b3        49215     0          Y       29253
Brick 10.70.41.176:/bricks/brick3/b3        49216     0          Y       28256
Brick 10.70.43.95:/bricks/brick3/b3         49216     0          Y       26093
Brick 10.70.43.57:/bricks/brick4/b4         49225     0          Y       3354
Brick 10.70.43.185:/bricks/brick4/b4        49216     0          Y       29272
Brick 10.70.41.176:/bricks/brick4/b4        49217     0          Y       28275
Brick 10.70.43.95:/bricks/brick4/b4         49217     0          Y       26112
Brick 10.70.43.57:/bricks/brick5/b5         49226     0          Y       3373
Brick 10.70.43.185:/bricks/brick5/b5        49217     0          Y       29291
Brick 10.70.41.176:/bricks/brick5/b5        49218     0          Y       28294
Brick 10.70.43.95:/bricks/brick5/b5         49218     0          Y       26131
Brick 10.70.43.57:/bricks/brick6/b6         49227     0          Y       3392
Brick 10.70.43.185:/bricks/brick6/b6        49218     0          Y       29310
Brick 10.70.41.176:/bricks/brick6/b6        49219     0          Y       28313
Brick 10.70.43.95:/bricks/brick6/b6         49219     0          Y       26150
Brick 10.70.43.57:/bricks/brick7/b7         49228     0          Y       3411
Brick 10.70.43.185:/bricks/brick7/b7        49219     0          Y       29329
Brick 10.70.41.176:/bricks/brick7/b7        49220     0          Y       28332
Brick 10.70.43.95:/bricks/brick7/b7         49220     0          Y       26169
Brick 10.70.43.57:/bricks/brick8/b8         49229     0          Y       3430
NFS Server on localhost                     2049      0          Y       3451
Self-heal Daemon on localhost               N/A       N/A        Y       3457
NFS Server on 10.70.43.185                  2049      0          Y       29350
Self-heal Daemon on 10.70.43.185            N/A       N/A        Y       29356
NFS Server on 10.70.43.95                   2049      0          Y       26189
Self-heal Daemon on 10.70.43.95             N/A       N/A        Y       26197
NFS Server on 10.70.41.176                  2049      0          Y       28353
Self-heal Daemon on 10.70.41.176            N/A       N/A        Y       28360

Task Status of Volume dist_disperse_vol
------------------------------------------------------------------------------
There are no active volume tasks
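For reference, a rough shell sketch of how steps 1, 2, 4 and 6 above can be driven follows. This is only illustrative: the hostnames, brick paths, mount point and directory names are placeholders and do not match the exact 4-node layout shown in the status output above.

# Step 1: create and start a 3 x (8+3) distributed disperse volume (33 bricks).
BRICKS=""
for h in host1 host2 host3; do
    for n in $(seq 0 10); do
        BRICKS="$BRICKS $h:/bricks/brick$n/b$n"
    done
done
gluster volume create dist_disperse_vol disperse 11 redundancy 3 $BRICKS force
gluster volume start dist_disperse_vol

# Step 2: FUSE mount on the client.
mkdir -p /mnt/dist_disperse_vol
mount -t glusterfs host1:/dist_disperse_vol /mnt/dist_disperse_vol

# Step 4: continuous recursive listing from the mount point.
while true; do ls -lRt /mnt/dist_disperse_vol > /dev/null; done &

# Step 6: keep renaming a directory while the remove-brick migration is running.
mkdir /mnt/dist_disperse_vol/dir1
i=1
while true; do
    mv /mnt/dist_disperse_vol/dir$i /mnt/dist_disperse_vol/dir$((i+1)) || break
    i=$((i+1))
done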
[root@dhcp43-57 glusterfs]# gluster v remove-brick dist_disperse_vol 10.70.41.176:/bricks/brick5/b5 10.70.43.95:/bricks/brick5/b5 10.70.43.57:/bricks/brick6/b6 10.70.43.185:/bricks/brick6/b6 10.70.41.176:/bricks/brick6/b6 10.70.43.95:/bricks/brick6/b6 10.70.43.57:/bricks/brick7/b7 10.70.43.185:/bricks/brick7/b7 10.70.41.176:/bricks/brick7/b7 10.70.43.95:/bricks/brick7/b7 10.70.43.57:/bricks/brick8/b8 start
volume remove-brick start: success
ID: 0d88c44b-a3a5-458a-93c6-6fd80199fa28

[root@dhcp43-57 glusterfs]# gluster v remove-brick dist_disperse_vol 10.70.41.176:/bricks/brick5/b5 10.70.43.95:/bricks/brick5/b5 10.70.43.57:/bricks/brick6/b6 10.70.43.185:/bricks/brick6/b6 10.70.41.176:/bricks/brick6/b6 10.70.43.95:/bricks/brick6/b6 10.70.43.57:/bricks/brick7/b7 10.70.43.185:/bricks/brick7/b7 10.70.41.176:/bricks/brick7/b7 10.70.43.95:/bricks/brick7/b7 10.70.43.57:/bricks/brick8/b8 status
Node            Rebalanced-files        size     scanned    failures     skipped      status   run time in h:m:s
---------            -----------  ----------  ----------  ----------  ----------  ----------   -----------------
localhost                      0      0Bytes           0           0           0      failed               0:0:2
10.70.43.185                   0      0Bytes           0           0           0      failed               0:0:1
10.70.41.176                   0      0Bytes           0           0           0      failed               0:0:2
10.70.43.95                    0      0Bytes           0           0           0      failed               0:0:1

[root@dhcp43-57 glusterfs]# gluster volume status
Status of volume: dist_disperse_vol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.43.57:/bricks/brick0/b0         49221     0          Y       3278
Brick 10.70.43.185:/bricks/brick0/b0        49212     0          Y       29196
Brick 10.70.41.176:/bricks/brick0/b0        49213     0          Y       28199
Brick 10.70.43.95:/bricks/brick0/b0         49213     0          Y       26036
Brick 10.70.43.57:/bricks/brick1/b1         49222     0          Y       3297
Brick 10.70.43.185:/bricks/brick1/b1        49213     0          Y       29215
Brick 10.70.41.176:/bricks/brick1/b1        49214     0          Y       28218
Brick 10.70.43.95:/bricks/brick1/b1         49214     0          Y       26055
Brick 10.70.43.57:/bricks/brick2/b2         49223     0          Y       3316
Brick 10.70.43.185:/bricks/brick2/b2        49214     0          Y       29234
Brick 10.70.41.176:/bricks/brick2/b2        49215     0          Y       28237
Brick 10.70.43.95:/bricks/brick2/b2         49215     0          Y       26074
Brick 10.70.43.57:/bricks/brick3/b3         49224     0          Y       3335
Brick 10.70.43.185:/bricks/brick3/b3        49215     0          Y       29253
Brick 10.70.41.176:/bricks/brick3/b3        49216     0          Y       28256
Brick 10.70.43.95:/bricks/brick3/b3         49216     0          Y       26093
Brick 10.70.43.57:/bricks/brick4/b4         49225     0          Y       3354
Brick 10.70.43.185:/bricks/brick4/b4        49216     0          Y       29272
Brick 10.70.41.176:/bricks/brick4/b4        49217     0          Y       28275
Brick 10.70.43.95:/bricks/brick4/b4         49217     0          Y       26112
Brick 10.70.43.57:/bricks/brick5/b5         49226     0          Y       3373
Brick 10.70.43.185:/bricks/brick5/b5        49217     0          Y       29291
Brick 10.70.41.176:/bricks/brick5/b5        49218     0          Y       28294
Brick 10.70.43.95:/bricks/brick5/b5         49218     0          Y       26131
Brick 10.70.43.57:/bricks/brick6/b6         49227     0          Y       3392
Brick 10.70.43.185:/bricks/brick6/b6        49218     0          Y       29310
Brick 10.70.41.176:/bricks/brick6/b6        49219     0          Y       28313
Brick 10.70.43.95:/bricks/brick6/b6         49219     0          Y       26150
Brick 10.70.43.57:/bricks/brick7/b7         49228     0          Y       3411
Brick 10.70.43.185:/bricks/brick7/b7        49219     0          Y       29329
Brick 10.70.41.176:/bricks/brick7/b7        49220     0          Y       28332
Brick 10.70.43.95:/bricks/brick7/b7         49220     0          Y       26169
Brick 10.70.43.57:/bricks/brick8/b8         49229     0          Y       3430
NFS Server on localhost                     2049      0          Y       3451
Self-heal Daemon on localhost               N/A       N/A        Y       3457
NFS Server on 10.70.43.185                  2049      0          Y       29350
Self-heal Daemon on 10.70.43.185            N/A       N/A        Y       29356
NFS Server on 10.70.43.95                   2049      0          Y       26189
Self-heal Daemon on 10.70.43.95             N/A       N/A        Y       26197
NFS Server on 10.70.41.176                  2049      0          Y       28353
Self-heal Daemon on 10.70.41.176            N/A       N/A        Y       28360

Task Status of Volume dist_disperse_vol
------------------------------------------------------------------------------
Task                 : Remove brick
ID                   : 0d88c44b-a3a5-458a-93c6-6fd80199fa28
Removed bricks:
                       10.70.41.176:/bricks/brick5/b5
                       10.70.43.95:/bricks/brick5/b5
                       10.70.43.57:/bricks/brick6/b6
                       10.70.43.185:/bricks/brick6/b6
                       10.70.41.176:/bricks/brick6/b6
                       10.70.43.95:/bricks/brick6/b6
                       10.70.43.57:/bricks/brick7/b7
                       10.70.43.185:/bricks/brick7/b7
                       10.70.41.176:/bricks/brick7/b7
                       10.70.43.95:/bricks/brick7/b7
                       10.70.43.57:/bricks/brick8/b8
Status               : failed

Rebalance log:
==============
[2016-08-18 09:00:43.002468] N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-dist_disperse_vol-disperse-0: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-dist_disperse_vol-disperse-0: Mismatching dictionary in answers of 'GF_FOP_XATTROP'" repeated 8 times between [2016-08-18 09:00:43.002468] and [2016-08-18 09:00:43.003907]
[2016-08-18 09:00:43.004120] W [MSGID: 122040] [ec-common.c:919:ec_prepare_update_cbk] 0-dist_disperse_vol-disperse-0: Failed to get size and version [Input/output error]
[2016-08-18 09:00:43.004213] E [MSGID: 109039] [dht-common.c:2777:dht_find_local_subvol_cbk] 0-dist_disperse_vol-dht: getxattr err for dir [Input/output error]
[2016-08-18 09:00:43.004722] E [MSGID: 0] [dht-rebalance.c:3544:gf_defrag_start_crawl] 0-dist_disperse_vol-dht: local subvolume determination failed with error: 5
[2016-08-18 09:00:43.004820] I [MSGID: 109028] [dht-rebalance.c:3872:gf_defrag_status_get] 0-dist_disperse_vol-dht: Rebalance is failed. Time taken is 2.00 secs
[2016-08-18 09:00:43.004856] I [MSGID: 109028] [dht-rebalance.c:3876:gf_defrag_status_get] 0-dist_disperse_vol-dht: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2016-08-18 09:00:43.005569] W [glusterfsd.c:1251:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f05847c2dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f0585e3c915] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f0585e3c78b] ) 0-: received signum (15), shutting down

Will be attaching SOS reports.
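To spot this failure signature quickly when re-running the scenario, the rebalance log can be grepped for the messages above. A minimal sketch, assuming the default log location /var/log/glusterfs/dist_disperse_vol-rebalance.log on the node where the migration process ran:

grep -E "ec_combine_xattrop|Failed to get size and version|dht_find_local_subvol_cbk|gf_defrag_start_crawl" \
    /var/log/glusterfs/dist_disperse_vol-rebalance.log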
This issue is also seen with a distributed replicate volume.
Tested the same scenario with a pure distribute volume. Rebalance started and none of the error messages noted in this BZ were seen. The issue is seen with a distributed disperse volume, hence changing the component to disperse.
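For comparison, the pure distribute volume used for this check can be created along these lines (hostnames and brick paths are placeholders, not the actual ones used in the test):

gluster volume create dist_only_vol host1:/bricks/d1 host2:/bricks/d2 host3:/bricks/d3
gluster volume start dist_only_vol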
I am able to reproduce this issue consistently on my system, following the same steps mentioned in this BZ. As soon as we trigger the remove-brick command, we see the following error on the mount point (sometimes remove-brick on the first set was successful, but on the second set it gave the same error):

[root@apandey vol]# /home/apandey/test.sh
mv: cannot move ‘dir-841’ to ‘dir-842’: Transport endpoint is not connected
mv: cannot stat ‘dir-842’: No such file or directory

We saw this error only for the directory rename. All the other I/O, such as the kernel untar and "ls -lRt", did not show any issue. I talked to Raghavendra Gowdappa about this issue, and he suspects it is related to glusterd's handling of the volume-file change while I/O is in progress.

Second issue - although we saw ENOTCONN on the mount point for the directory rename, rebalance did start; after some time it failed and hit an assertion:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id rebalance/vol --xlator-opti'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fb1798628d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-14.fc21.x86_64 elfutils-libelf-0.163-4.fc21.x86_64 elfutils-libs-0.163-4.fc21.x86_64 glibc-2.20-8.fc21.x86_64 keyutils-libs-1.5.9-4.fc21.x86_64 krb5-libs-1.12.2-19.fc21.x86_64 libcom_err-1.42.12-4.fc21.x86_64 libgcc-4.9.2-6.fc21.x86_64 libselinux-2.3-10.fc21.x86_64 libuuid-2.25.2-3.fc21.x86_64 nss-mdns-0.10-15.fc21.x86_64 openssl-libs-1.0.1k-12.fc21.x86_64 pcre-8.35-14.fc21.x86_64 sssd-client-1.12.5-5.fc21.x86_64 systemd-libs-216-25.fc21.x86_64 xz-libs-5.1.2-14alpha.fc21.x86_64 zlib-1.2.8-7.fc21.x86_64
(gdb) bt
#0  0x00007fb1798628d7 in raise () from /lib64/libc.so.6
#1  0x00007fb17986453a in abort () from /lib64/libc.so.6
#2  0x00007fb17985b47d in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007fb17985b532 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fb16cf68d1b in ec_manager_setattr (fop=0x7fb1680b965c, state=4) at ec-inode-write.c:394
#5  0x00007fb16cf4a1da in __ec_manager (fop=0x7fb1680b965c, error=0) at ec-common.c:2283
#6  0x00007fb16cf45bbf in ec_resume (fop=0x7fb1680b965c, error=0) at ec-common.c:289
#7  0x00007fb16cf45de7 in ec_complete (fop=0x7fb1680b965c) at ec-common.c:362
#8  0x00007fb16cf674fb in ec_inode_write_cbk (frame=0x7fb1682a188c, this=0x7fb16801e750, cookie=0x3, op_ret=0, op_errno=0, prestat=0x7fb167efd970, poststat=0x7fb167efd900, xdata=0x7fb15c16051c) at ec-inode-write.c:65
#9  0x00007fb16cf68816 in ec_setattr_cbk (frame=0x7fb1682a188c, cookie=0x3, this=0x7fb16801e750, op_ret=0, op_errno=0, prestat=0x7fb167efd970, poststat=0x7fb167efd900, xdata=0x7fb15c16051c) at ec-inode-write.c:349
#10 0x00007fb16d20168f in client3_3_setattr_cbk (req=0x7fb15c2ab06c, iov=0x7fb15c2ab0ac, count=1, myframe=0x7fb15c1f4c0c) at client-rpc-fops.c:2264
#11 0x00007fb17af583aa in rpc_clnt_handle_reply (clnt=0x7fb168067ed0, pollin=0x7fb15c017b40) at rpc-clnt.c:790
#12 0x00007fb17af58903 in rpc_clnt_notify (trans=0x7fb168068330, mydata=0x7fb168067f00, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7fb15c017b40) at rpc-clnt.c:961
#13 0x00007fb17af54b7b in rpc_transport_notify (this=0x7fb168068330, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7fb15c017b40) at rpc-transport.c:541
#14 0x00007fb1705a4d0d in socket_event_poll_in (this=0x7fb168068330) at socket.c:2265
#15 0x00007fb1705a525c in socket_event_handler (fd=17, idx=8, data=0x7fb168068330, poll_in=1, poll_out=0, poll_err=0) at socket.c:2395
#16 0x00007fb17b1fa579 in event_dispatch_epoll_handler (event_pool=0x1c48ff0, event=0x7fb167efdea0) at event-epoll.c:571
#17 0x00007fb17b1fa959 in event_dispatch_epoll_worker (data=0x7fb16803ded0) at event-epoll.c:674
#18 0x00007fb179fdf52a in start_thread () from /lib64/libpthread.so.0
#19 0x00007fb17992e22d in clone () from /lib64/libc.so.6

Although the assertion comes from EC, I think it is also related to the issue described above.
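For reference, a backtrace like the one above can be regenerated from the rebalance process core with standard gdb commands. The core file name below is illustrative; the actual path depends on the system's core_pattern setting:

gdb /usr/local/sbin/glusterfs /core.12345
(gdb) bt
(gdb) frame 4        # the frame that hit the assert in ec_manager_setattr (ec-inode-write.c:394)
(gdb) info locals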
An upstream patch has been posted at http://review.gluster.org/#/c/15846, so moving this bug to POST.
Clearing Needinfo.