Created attachment 583989
Description of problem:
While a rebalance was running, I triggered self-heal and restarted glusterd a couple of times. After the rebalance completed, running arequal on the mount point reported "Structure needs cleaning" and "short read".
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Created a 2x2 replicate volume.
2. Created the following data set on the mount point:
a. 100 files of 1MB each
b. a directory tree of depth 100, each level containing a 1MB file
c. a directory tree of depth 500 with no files at any level
3. Brought down one brick of a replica pair and started a rebalance.
4. After some time, ran start force on the volume.
5. While the rebalance was still running, restarted glusterd.
6. Once the rebalance completed, ran arequal.
[root@gqac022 mnt]# /opt/qa/tools/arequal-checksum .
md5sum: ./1000/1001/1002/1003/1004/1005/1006/1007/1008/1009/1010/1011/1012/1013/1014/1015/1016/1017/1018/1019/1020/1021/1022/1023/1024/1025/1025: Structure needs cleaning
./1000/1001/1002/1003/1004/1005/1006/1007/1008/1009/1010/1011/1012/1013/1014/1015/1016/1017/1018/1019/1020/1021/1022/1023/1024/1025/1025: short read
ftw (.) returned -1 (Success), terminating
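The data set from step 2 can be sketched in Python as follows. This is only an illustration, not the script used for the original run: the function name, the `file*`, `with_files` and `empty` names, and the choice to name each nested file after its directory (guessed from the `.../1025/1025` path in the output above) are all assumptions.

```python
import os


def create_dataset(root, n_files=100, file_depth=100, empty_depth=500,
                   size=1024 * 1024):
    """Create the three-part data set under root; return the number of files written.

    All names used here are hypothetical; only the counts, depths and the
    1MB file size come from the steps to reproduce.
    """
    payload = b"\0" * size
    written = 0

    # (a) flat files directly under root
    for i in range(n_files):
        with open(os.path.join(root, f"file{i}"), "wb") as f:
            f.write(payload)
        written += 1

    # (b) nested directory chain with one file at every level
    path = os.path.join(root, "with_files")
    for level in range(file_depth):
        name = str(1000 + level)
        path = os.path.join(path, name)
        os.makedirs(path)
        with open(os.path.join(path, name), "wb") as f:
            f.write(payload)
        written += 1

    # (c) nested directory chain without any files
    path = os.path.join(root, "empty")
    for level in range(empty_depth):
        path = os.path.join(path, str(1000 + level))
    os.makedirs(path)

    return written
```

Calling `create_dataset("/mnt")` with the defaults would write 200 files of 1MB each. With four-character directory names the 500-level chain stays below the usual 4096-byte Linux path limit.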
Attached the sosreport. The volume name is test; check the logs under var/log/glusterfs.
Information missed in step 3: an add-brick was also done.
Can you confirm that the initial volume was indeed a 2x2 and not a non-distribute volume?
It was created as a 2x2 distribute-replicate volume.
Shylesh, can you confirm whether applying the patch at http://review.gluster.com/2919 before restarting 'glusterd' in your step 5 changes the behavior?
[root@gqac022 ~]# grep "17 12:08:54" /usr/local/var/log/glusterfs/*
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.504399] W [client3_1-fops.c:1127:client3_1_fgetxattr_cbk] 1-repl-client-2: remote operation failed: No data available
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.505838] W [client3_1-fops.c:418:client3_1_open_cbk] 1-repl-client-0: remote operation failed: No such file or directory. Path: /36 (e298b48c-66be-4a90-b4d4-d6ef9f40f83a)
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506055] W [client3_1-fops.c:418:client3_1_open_cbk] 1-repl-client-1: remote operation failed: No such file or directory. Path: /36 (e298b48c-66be-4a90-b4d4-d6ef9f40f83a)
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506125] E [dht-helper.c:760:dht_migration_complete_check_task] 1-repl-dht: (null): failed to send open() on target file at repl-replicate-0
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506217] W [page.c:984:__ioc_page_error] 1-repl-io-cache: page error for page = 0x7f2974045610 & waitq = 0x7f29740506e0
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506314] W [page.c:984:__ioc_page_error] 1-repl-io-cache: page error for page = 0x7f297403e1d0 & waitq = 0x7f2974048810
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506912] W [fuse-bridge.c:512:fuse_attr_cbk] 0-glusterfs-fuse: 137412: FSTAT() /36 => -1 (Invalid argument)
[root@gqac022 ~]# grep "17 12:08:54" /usr/local/var/log/glusterfs/bricks/*
/usr/local/var/log/glusterfs/bricks/home-bricks-r1.log:[2012-05-17 12:08:54.505764] I [server3_1-fops.c:1533:server_open_cbk] 0-repl-server: 105124: OPEN (null) (--) ==> -1 (No such file or directory)
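To pick the error-level entries out of grep output like the above, a small parser can help. The sketch below assumes every entry follows the layout visible in these excerpts (`[timestamp] LEVEL [file:line:function] translator: message`); the regular expression and the function names are illustrative and not part of any GlusterFS tool.

```python
import re

# Layout assumed from the excerpts above:
#   [timestamp] LEVEL [source.c:line:function] translator-name: message
LOG_LINE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>\S+): "
    r"(?P<msg>.*)"
)


def parse_log_line(text):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_LINE.search(text)
    return match.groupdict() if match else None


def errors_only(lines):
    """Keep only the entries logged at level E."""
    parsed = (parse_log_line(line) for line in lines)
    return [rec for rec in parsed if rec and rec["level"] == "E"]
```

Applied to the client log lines above, `errors_only` would keep only the `dht_migration_complete_check_task` entry, which is the one that reports the failed open() on the migration target.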
While trying to reproduce the issue (on 638a4740cc553c96bc01d1dfe4a2b7acf0b406e6), I got the error below.
[2012-05-23 16:41:42.380946] I [afr-lk-common.c:1454:afr_nonblocking_inodelk] 0-b-replicate-0: unable to get fd ctx for fd=0xf2d260
[2012-05-23 16:41:42.381895] W [fuse-bridge.c:2024:fuse_writev_cbk] 0-glusterfs-fuse: 10181: WRITE => -1 (Invalid argument)
* I had a file open, and the file had a dht link file (on a 2x2 volume, create a file named '1', move it to '3', then open '3').
* The directory also held more files. Now start a 'rebalance' operation (the idea is that the open file will get rebalanced because it has a dht link file).
* After starting the rebalance, do 'rm 3'.
* After the rebalance is complete, issue another 'write()' call on the fd; I got the above error.
 - used extras/test/open-fd-tests.c on master.
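The file-side sequence in the bullets above can be sketched in Python as follows. This is not the open-fd-tests.c program, only a hypothetical stand-in: the function name and payload are assumptions, and the rebalance itself has to be started separately at the marked point. On a local POSIX filesystem the final write is expected to succeed, since an unlinked file stays writable through an open fd; the report is that on the volume it failed with "Invalid argument" instead.

```python
import os


def write_after_unlink(directory, data=b"x" * 4096):
    """Create '1', rename it to '3', open it, remove it, then write on the open fd.

    Returns the number of bytes written by the final write().
    """
    first = os.path.join(directory, "1")
    renamed = os.path.join(directory, "3")

    # Create an empty file named '1' and move it to '3'. On the volume in the
    # report, this rename is what left the file with a dht link file.
    with open(first, "wb"):
        pass
    os.rename(first, renamed)

    fd = os.open(renamed, os.O_RDWR)
    try:
        # Point at which the rebalance would be started on the volume
        # (done outside this sketch).
        os.unlink(renamed)  # the 'rm 3' step
        # After the rebalance completes, write on the fd that is still open.
        return os.write(fd, data)
    finally:
        os.close(fd)
```

Run against a directory on the mount point, for example `write_after_unlink("/mnt")`, a raised `OSError` with `errno.EINVAL` would correspond to the `WRITE => -1 (Invalid argument)` line in the log above.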
Taking it off the 3.3.0 release blocker list, considering how this error came about.
The issue happened because of an 'open' failure after the file was migrated. There is no proper fix for this situation at the moment; the issue can be triggered when an 'rm' operation happens on the entry while the file is being migrated.
A similar issue is described in more detail in https://bugzilla.redhat.com/show_bug.cgi?id=823181#c1
Will keep the bug open, but will work on this issue after the 3.3 release.
This seems to be the issue with linkfile permissions, for which we recently submitted a patch: http://review.gluster.org/4304
This should be fixed now that the dht permission issues are fixed. Shishir, can you use this bug to send the last of the fixes, which handles permission changes of linkfiles in the setattr() fop too?
Linkfile creation with correct ownership, and opening files under migration as root:root to prevent crashes, have been fixed as part of bug 884597.
*** This bug has been marked as a duplicate of bug 884597 ***