Bug 821139 - Running arequal after rebalance says "short read"
Summary: Running arequal after rebalance says "short read"
Status: CLOSED DUPLICATE of bug 884597
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: pre-release
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: shishir gowda
QA Contact: shylesh
Depends On:
Reported: 2012-05-12 12:30 UTC by shylesh
Modified: 2013-12-09 01:31 UTC

Clone Of:
Last Closed: 2013-03-12 10:02:52 UTC

Attachments
sosreport (3.94 MB, application/x-xz)
2012-05-12 12:30 UTC, shylesh

Description shylesh 2012-05-12 12:30:23 UTC
Created attachment 583989

Description of problem:
While a rebalance was running, I triggered self-heal and restarted glusterd a couple of times. After the rebalance completed, running arequal on the mount point reported "Structure needs cleaning" and "short read".

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Created a 2x2 replicate volume.
2. Created the following data set on the mount point:
   a. 100 files of 1 MB each
   b. a directory tree 100 levels deep, each level containing a 1 MB file
   c. a directory tree 500 levels deep, with no files at any level
3. Brought down one brick of a replica pair and started a rebalance.
4. After some time, force-started the volume (start force).
5. While the rebalance was still running, restarted glusterd.
6. Once the rebalance completed, ran arequal.
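
The data set in step 2 can be approximated with a small script. This is a minimal sketch, not the script used for the original run: the function name create_dataset, the top-level names "file<N>", "with_files" and "empty", and the zero-filled file contents are all assumptions made for illustration. The numeric directory names starting at 1000, with a file named after its own level, are guessed from the paths in the arequal output.

```python
import os


def create_dataset(root, n_files=100, file_depth=100, empty_depth=500,
                   file_size=1024 * 1024):
    """Create the three-part data set from step 2 under `root`.

    All names used here are illustrative assumptions, not taken from
    the original test run.
    """
    os.makedirs(root, exist_ok=True)
    payload = b"\0" * file_size

    # a. n_files flat files of file_size bytes each
    for i in range(n_files):
        with open(os.path.join(root, "file%d" % i), "wb") as f:
            f.write(payload)

    # b. nested directories, one file of file_size bytes per level
    path = os.path.join(root, "with_files")
    for level in range(file_depth):
        name = str(1000 + level)
        path = os.path.join(path, name)
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, name), "wb") as f:
            f.write(payload)

    # c. nested directories with no files at any level
    path = os.path.join(root, "empty")
    for level in range(empty_depth):
        path = os.path.join(path, str(1000 + level))
    os.makedirs(path, exist_ok=True)
```

Calling create_dataset() with the path of the GlusterFS mount point and the default arguments produces the sizes and depths listed in step 2. With four-digit directory names, the 500-level tree stays under the usual Linux path length limit as long as the mount point path is short.
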
Actual results:
[root@gqac022 mnt]# /opt/qa/tools/arequal-checksum .  
md5sum: ./1000/1001/1002/1003/1004/1005/1006/1007/1008/1009/1010/1011/1012/1013/1014/1015/1016/1017/1018/1019/1020/1021/1022/1023/1024/1025/1025: Structure needs cleaning
./1000/1001/1002/1003/1004/1005/1006/1007/1008/1009/1010/1011/1012/1013/1014/1015/1016/1017/1018/1019/1020/1021/1022/1023/1024/1025/1025: short read
ftw (.) returned -1 (Success), terminating

The sosreport is attached. The volume name is "test"; the logs are under var/log/glusterfs.

Comment 1 shylesh 2012-05-12 12:31:22 UTC
I missed a detail: an add-brick was also done at step 3.

Comment 2 shishir gowda 2012-05-15 04:59:25 UTC
Can you confirm that the initial volume was indeed a 2x2 and not a non-distribute volume?

Comment 3 shylesh 2012-05-15 05:49:28 UTC
It was a 2x2 distribute-replicate volume when it was created.

Comment 4 Amar Tumballi 2012-05-15 07:01:04 UTC
Shylesh, can you confirm whether applying the patch at http://review.gluster.com/2919 before restarting 'glusterd' in your step 5 changes the behavior?

Comment 5 shylesh 2012-05-18 06:59:55 UTC
[root@gqac022 ~]# grep "17 12:08:54" /usr/local/var/log/glusterfs/*

/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.504399] W [client3_1-fops.c:1127:client3_1_fgetxattr_cbk] 1-repl-client-2: remote operation failed: No data available
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.505838] W [client3_1-fops.c:418:client3_1_open_cbk] 1-repl-client-0: remote operation failed: No such file or directory. Path: /36 (e298b48c-66be-4a90-b4d4-d6ef9f40f83a)
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506055] W [client3_1-fops.c:418:client3_1_open_cbk] 1-repl-client-1: remote operation failed: No such file or directory. Path: /36 (e298b48c-66be-4a90-b4d4-d6ef9f40f83a)
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506125] E [dht-helper.c:760:dht_migration_complete_check_task] 1-repl-dht: (null): failed to send open() on target file at repl-replicate-0
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506217] W [page.c:984:__ioc_page_error] 1-repl-io-cache: page error for page = 0x7f2974045610 & waitq = 0x7f29740506e0
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506314] W [page.c:984:__ioc_page_error] 1-repl-io-cache: page error for page = 0x7f297403e1d0 & waitq = 0x7f2974048810
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506912] W [fuse-bridge.c:512:fuse_attr_cbk] 0-glusterfs-fuse: 137412: FSTAT() /36 => -1 (Invalid argument)

[root@gqac022 ~]# grep "17 12:08:54" /usr/local/var/log/glusterfs/bricks/*

/usr/local/var/log/glusterfs/bricks/home-bricks-r1.log:[2012-05-17 12:08:54.505764] I [server3_1-fops.c:1533:server_open_cbk] 0-repl-server: 105124: OPEN (null) (--) ==> -1 (No such file or directory)

Comment 6 Amar Tumballi 2012-05-23 11:28:05 UTC
While trying to reproduce the issue (on 638a4740cc553c96bc01d1dfe4a2b7acf0b406e6), I got the error below.

[2012-05-23 16:41:42.380946] I [afr-lk-common.c:1454:afr_nonblocking_inodelk] 0-b-replicate-0: unable to get fd ctx for fd=0xf2d260
[2012-05-23 16:41:42.381895] W [fuse-bridge.c:2024:fuse_writev_cbk] 0-glusterfs-fuse: 10181: WRITE => -1 (Invalid argument)

* I had a file open, and the file had a dht link file (on a 2x2 volume, create a file named '1', move it to '3', and open '3').

* The directory also had more files. Now start the 'rebalance' operation (the idea is that the open file will get rebalanced because it has a dht link file).

* After starting the rebalance process, do 'rm 3'.

* After the rebalance is complete, issue another 'write()' call on the fd. I got the above error.

[1] - used extras/test/open-fd-tests.c on master.
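
The sequence in this comment can be sketched as follows. This is only an illustration of the steps, not the contents of extras/test/open-fd-tests.c: the function name write_after_unlink and the two optional hooks (while_open, before_write) are hypothetical, and are where starting the rebalance and waiting for it to finish would go.

```python
import os


def write_after_unlink(directory, while_open=None, before_write=None):
    """Create '1', rename it to '3', open '3', unlink it, then write on the fd.

    `while_open` and `before_write` are hypothetical hooks: the first is
    where the rebalance would be started, the second is where one would
    wait for it to complete.
    """
    src = os.path.join(directory, "1")
    dst = os.path.join(directory, "3")

    with open(src, "wb") as f:
        f.write(b"initial data\n")
    # On a distribute volume, the comment above says this rename leaves
    # the file with a dht link file.
    os.rename(src, dst)

    fd = os.open(dst, os.O_WRONLY | os.O_APPEND)
    try:
        if while_open is not None:
            while_open()       # e.g. start the rebalance here
        os.unlink(dst)         # the 'rm 3' step
        if before_write is not None:
            before_write()     # e.g. wait for the rebalance to finish
        # Returns the number of bytes written; raises OSError on failure.
        return os.write(fd, b"written after unlink\n")
    finally:
        os.close(fd)
```

On a local Linux filesystem the final write succeeds, because an unlinked file stays writable through an open descriptor. On the build tested in this comment, the equivalent write() is what returned "Invalid argument".
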

Comment 7 Amar Tumballi 2012-05-25 09:20:20 UTC
Taking it off the 3.3.0 release blocker list, considering how this error happened.

The issue happened because of an 'open' failure after the file got migrated. This situation doesn't have a proper fix at the moment; the issue can be triggered when an 'rm' operation happens on the entry while the file is being migrated.

A more detailed description of a similar issue is in https://bugzilla.redhat.com/show_bug.cgi?id=823181#c1

Will keep the bug open, but will work on this issue after the 3.3 release.

Comment 8 Amar Tumballi 2012-12-21 10:17:12 UTC
This seems to be the issue with the permissions of linkfiles, for which we recently submitted a patch: http://review.gluster.org/4304

Comment 9 Amar Tumballi 2013-02-15 11:53:36 UTC
This should now be fixed, since the dht permission issues are fixed. Shishir, can you use this bug to send the last of the fixes, to handle permission changes of the linkfile in the setattr() fop too?

Comment 10 shishir gowda 2013-03-12 10:02:52 UTC
Linkfile creation with correct ownership, and opening files in migration as root:root to prevent crashes, have been fixed as part of bug 884597.

*** This bug has been marked as a duplicate of bug 884597 ***
