Bug 821139 - Running arequal after rebalance says "short read"
Status: CLOSED DUPLICATE of bug 884597
Product: GlusterFS
Classification: Community
Component: core
Hardware: x86_64 Linux
Priority: high
Severity: high
Assigned To: shishir gowda
Reported: 2012-05-12 08:30 EDT by shylesh
Modified: 2013-12-08 20:31 EST
Fixed In Version: glusterfs-3.4.0qa6
Doc Type: Bug Fix
Last Closed: 2013-03-12 06:02:52 EDT
Type: Bug

Attachments:
sosreport (3.94 MB, application/x-xz)
2012-05-12 08:30 EDT, shylesh

Description shylesh 2012-05-12 08:30:23 EDT
Created attachment 583989

Description of problem:
While rebalance was running, I triggered self-heal and restarted glusterd a couple of times. After rebalance completed, running arequal on the mount point reported "Structure needs cleaning" and "short read".

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Created a 2x2 replicate volume.
2. Created the following data set on the mount point:
   a. 100 files of 1 MB each
   b. a directory tree 100 levels deep, each level containing a 1 MB file
   c. a directory tree 500 levels deep, with no files at any level
3. Brought down one brick of a replica pair and started rebalance.
4. After some time, ran "start force" on the volume.
5. While rebalance was still running, restarted glusterd.
6. Once rebalance completed, ran arequal.
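
For anyone trying to reproduce this, the data set from step 2 can be generated with a short script. This is only a sketch, not the tool the reporter used: the function name make_dataset, the subdirectory names (flat, deep_files, deep_empty), the file names, and the random payload are all assumptions made for illustration. The numeric directory names starting at 1000 loosely follow the paths seen in the arequal output.

```python
import os


def make_dataset(root, n_files=100, file_depth=100, empty_depth=500,
                 size=1024 * 1024):
    """Create the three-part data set under root.

    Returns the number of regular files written.
    All directory and file names here are hypothetical.
    """
    payload = os.urandom(size)  # same random content reused for every file
    written = 0

    # (a) n_files flat files of `size` bytes each
    flat = os.path.join(root, "flat")
    os.makedirs(flat, exist_ok=True)
    for i in range(n_files):
        with open(os.path.join(flat, "file%d" % i), "wb") as f:
            f.write(payload)
        written += 1

    # (b) a chain of nested directories, one file at every level
    level = os.path.join(root, "deep_files")
    for d in range(file_depth):
        level = os.path.join(level, str(1000 + d))
        os.makedirs(level, exist_ok=True)
        with open(os.path.join(level, "data"), "wb") as f:
            f.write(payload)
        written += 1

    # (c) a chain of nested directories with no files at all
    level = os.path.join(root, "deep_empty")
    for d in range(empty_depth):
        level = os.path.join(level, str(d))
        os.makedirs(level, exist_ok=True)

    return written
```

Calling make_dataset on the mount point (for example make_dataset("/mnt"), assuming the volume is mounted there) with the default arguments gives 200 files of 1 MB plus the 500-level empty tree. The deepest path stays well under the usual 4096-byte Linux path limit as long as the mount point path is short.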
Actual results:
[root@gqac022 mnt]# /opt/qa/tools/arequal-checksum .  
md5sum: ./1000/1001/1002/1003/1004/1005/1006/1007/1008/1009/1010/1011/1012/1013/1014/1015/1016/1017/1018/1019/1020/1021/1022/1023/1024/1025/1025: Structure needs cleaning
./1000/1001/1002/1003/1004/1005/1006/1007/1008/1009/1010/1011/1012/1013/1014/1015/1016/1017/1018/1019/1020/1021/1022/1023/1024/1025/1025: short read
ftw (.) returned -1 (Success), terminating

Attached the sosreport. The volume name is "test"; the logs are under var/log/glusterfs.
Comment 1 shylesh 2012-05-12 08:31:22 EDT
Missed info: at step 3 I also did an add-brick.
Comment 2 shishir gowda 2012-05-15 00:59:25 EDT
Can you confirm that the initial volume was indeed a 2x2 and not a non-distribute volume?
Comment 3 shylesh 2012-05-15 01:49:28 EDT
It was created as a 2x2 distribute-replicate volume.
Comment 4 Amar Tumballi 2012-05-15 03:01:04 EDT
Shylesh, can you confirm whether applying the patch at http://review.gluster.com/2919 before restarting 'glusterd' in your step 5 changes the behavior?
Comment 5 shylesh 2012-05-18 02:59:55 EDT
[root@gqac022 ~]# grep "17 12:08:54" /usr/local/var/log/glusterfs/*

/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.504399] W [client3_1-fops.c:1127:client3_1_fgetxattr_cbk] 1-repl-client-2: remote operation failed: No data available
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.505838] W [client3_1-fops.c:418:client3_1_open_cbk] 1-repl-client-0: remote operation failed: No such file or directory. Path: /36 (e298b48c-66be-4a90-b4d4-d6ef9f40f83a)
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506055] W [client3_1-fops.c:418:client3_1_open_cbk] 1-repl-client-1: remote operation failed: No such file or directory. Path: /36 (e298b48c-66be-4a90-b4d4-d6ef9f40f83a)
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506125] E [dht-helper.c:760:dht_migration_complete_check_task] 1-repl-dht: (null): failed to send open() on target file at repl-replicate-0
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506217] W [page.c:984:__ioc_page_error] 1-repl-io-cache: page error for page = 0x7f2974045610 & waitq = 0x7f29740506e0
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506314] W [page.c:984:__ioc_page_error] 1-repl-io-cache: page error for page = 0x7f297403e1d0 & waitq = 0x7f2974048810
/usr/local/var/log/glusterfs/mnt.log:[2012-05-17 12:08:54.506912] W [fuse-bridge.c:512:fuse_attr_cbk] 0-glusterfs-fuse: 137412: FSTAT() /36 => -1 (Invalid argument)

[root@gqac022 ~]# grep "17 12:08:54" /usr/local/var/log/glusterfs/bricks/*

/usr/local/var/log/glusterfs/bricks/home-bricks-r1.log:[2012-05-17 12:08:54.505764] I [server3_1-fops.c:1533:server_open_cbk] 0-repl-server: 105124: OPEN (null) (--) ==> -1 (No such file or directory)
Comment 6 Amar Tumballi 2012-05-23 07:28:05 EDT
While trying to reproduce the issue (on 638a4740cc553c96bc01d1dfe4a2b7acf0b406e6), I got the error below.

[2012-05-23 16:41:42.380946] I [afr-lk-common.c:1454:afr_nonblocking_inodelk] 0-b-replicate-0: unable to get fd ctx for fd=0xf2d260
[2012-05-23 16:41:42.381895] W [fuse-bridge.c:2024:fuse_writev_cbk] 0-glusterfs-fuse: 10181: WRITE => -1 (Invalid argument)

* I had a file open, and the file had a dht link file (on a 2x2 volume, create a file named '1', move it to '3', then open '3').

* The directory also had more files. Now start the 'rebalance' operation (the idea is that the open file will get rebalanced because it has a dht link file).

* After starting the rebalance process, do 'rm 3'.

* After rebalance is complete, issue another 'write()' call on the fd; I got the above error.

[1] - used extras/test/open-fd-tests.c on master.
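
The file-level part of the sequence above can be sketched in Python. This is not the contents of extras/test/open-fd-tests.c; the function name write_after_unlink is hypothetical, and starting and waiting for rebalance is left as a comment because it happens outside the script. On an ordinary local POSIX filesystem the final write succeeds, since an open descriptor stays valid after unlink. On the affected volume the same write would be expected to raise OSError with EINVAL, matching the "Invalid argument" in the log.

```python
import os


def write_after_unlink(directory):
    """Create '1', rename it to '3', open it, unlink it, then write on the fd.

    Returns the number of bytes written by the final write().
    """
    src = os.path.join(directory, "1")
    dst = os.path.join(directory, "3")

    with open(src, "wb") as f:
        f.write(b"initial\n")

    # On a dht volume, renaming can leave a link file behind for the new name.
    os.rename(src, dst)

    fd = os.open(dst, os.O_RDWR | os.O_APPEND)
    try:
        # On a real volume: start rebalance here, from another shell.
        os.unlink(dst)
        # On a real volume: wait for rebalance to complete before writing.
        return os.write(fd, b"after unlink\n")
    finally:
        os.close(fd)
```

Running it against a directory on the mounted volume, with the rebalance steps done by hand at the marked points, exercises the same open-fd path as the steps above.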
Comment 7 Amar Tumballi 2012-05-25 05:20:20 EDT
Taking it off the 3.3.0 release blocker list, considering how this error came about.

The issue happened because 'open' failed after the file was migrated. There is no proper fix for this situation at the moment; it can be triggered when an 'rm' operation runs on the entry while the file is being migrated.

A similar issue is described in more detail in https://bugzilla.redhat.com/show_bug.cgi?id=823181#c1

Will keep the bug open, but will work on this issue after the 3.3 release.
Comment 8 Amar Tumballi 2012-12-21 05:17:12 EST
This seems to be the issue with linkfile permissions, for which we recently submitted a patch: http://review.gluster.org/4304
Comment 9 Amar Tumballi 2013-02-15 06:53:36 EST
This should now be fixed, since the dht permission issues have been fixed. Shishir, can you use this bug to send the last of the fixes, to handle permission changes of linkfiles in the setattr() fop too?
Comment 10 shishir gowda 2013-03-12 06:02:52 EDT
Creating linkfiles with the correct ownership, and opening files under migration as root:root to prevent crashes, have both been fixed as part of bug 884597.

*** This bug has been marked as a duplicate of bug 884597 ***
