Bug 786146

Summary: stat on a directory fails with ENOENT when rebalance is issued.
Product: [Community] GlusterFS Reporter: M S Vishwanath Bhat <vbhat>
Component: access-controlAssignee: shishir gowda <sgowda>
Status: CLOSED CURRENTRELEASE QA Contact: M S Vishwanath Bhat <vbhat>
Severity: high Docs Contact:
Priority: high    
Version: pre-releaseCC: amarts, gluster-bugs, mzywusko, nsathyan, rfortier
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.4.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-07-24 18:02:42 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 817967    

Description M S Vishwanath Bhat 2012-01-31 15:31:00 UTC
Description of problem:
stat on a dir fails with ENOENT when rebalance is issued. ls -l gives following output.
[root@RHEL6 mnt]# ls -lrt
ls: cannot access hosa_dir: No such file or directory
total 66816
???????????  ? ?    ?            ?            ? hosa_dir
-rw-r--r--.  1 root root  96694860 Aug 15 21:45 linux-3.0.2.tar.gz
-rw-r--r--.  1 root root   2540052 Jan 24 03:02 glusterfs-3.3.0qa20.tar.gz
drwxrwxr-x. 23 root root       669 Jan 31 10:08 linux-3.0.2
drwxrwxr-x. 11 1031 wheel      853 Jan 31 10:10 glusterfs-3.3.0qa20
[root@RHEL6 mnt]#


Version-Release number of selected component (if applicable):
glusterfs-3.3.0qa20

How reproducible:
1/1

Steps to Reproduce:
1. Created and started a stripe-replicated volume.
2. Untarred glusterfs and linux kernel source on the mountpoint.
3. created a dir 'hosa_dir' on mountpoint and used the following command to rename a file 10000 times.
   touch 1 && for i in `seq 1 10000`; do j=$(( i + 1)); mv $i $j; ls; done
4. From mountpoint started removing all the files from linux source with following command.
   find linux-3.0.2 -type f | xargs rm -f
5. Now added 4 more bricks to make it dist-stripe-replicate.
6. started rebalance.
  
Actual results:
find failed with following error. 
[root@RHEL6 mnt]# find linux-3.0.2 -type f | xargs rm -f
find: `linux-3.0.2/arch/arm/mach-mxs/devices/Kconfig': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/Makefile': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/amba-duart.c': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/platform-auart.c': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/platform-dma.c': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/platform-fec.c': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/platform-flexcan.c': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/platform-mxs-i2c.c': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/platform-mxs-mmc.c': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/platform-mxs-pwm.c': No such file or directory
find: `linux-3.0.2/arch/arm/mach-mxs/devices/platform-mxsfb.c': No such file or directory

and ls -lrt on the mountpoint gave following output.
[root@RHEL6 mnt]# ls -lrt
ls: cannot access hosa_dir: No such file or directory
total 66816
???????????  ? ?    ?            ?            ? hosa_dir
-rw-r--r--.  1 root root  96694860 Aug 15 21:45 linux-3.0.2.tar.gz
-rw-r--r--.  1 root root   2540052 Jan 24 03:02 glusterfs-3.3.0qa20.tar.gz
drwxrwxr-x. 23 root root       669 Jan 31 10:08 linux-3.0.2
drwxrwxr-x. 11 1031 wheel      853 Jan 31 10:10 glusterfs-3.3.0qa20

The last dir glusterfs-3.3.0qa20 has uid set to 1031, which doesn't exist.


Additional info:

log entries in fuse client.


[2012-01-31 10:08:27.343373] W [client3_1-fops.c:1312:client3_1_entrylk_cbk] 1-hosdu-client-7: remote operation failed: No such file or directory
[2012-01-31 10:08:27.349250] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 1-hosdu-client-7: remote operation failed: No such file or directory
[2012-01-31 10:08:27.351229] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 1-hosdu-client-7: remote operation failed: No such file or directory
[2012-01-31 10:08:27.373082] W [client3_1-fops.c:1312:client3_1_entrylk_cbk] 1-hosdu-client-7: remote operation failed: No such file or directory
[2012-01-31 10:08:27.375642] W [client3_1-fops.c:1312:client3_1_entrylk_cbk] 1-hosdu-client-7: remote operation failed: No such file or directory
[2012-01-31 10:08:27.377702] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 1-hosdu-client-7: remote operation failed: No such file or directory
[2012-01-31 10:08:27.379300] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 1-hosdu-client-7: remote operation failed: No such file or directory
[2012-01-31 10:08:27.796676] I [afr-common.c:1302:afr_launch_self_heal] 1-hosdu-replicate-2: background  entry self-heal triggered. path: /, reason: lookup detected pending operations
[2012-01-31 10:08:27.800562] I [afr-self-heal-entry.c:2145:afr_sh_entry_fix] 1-hosdu-replicate-2: /: Performing conservative merge
[2012-01-31 10:08:27.808168] I [afr-self-heal-common.c:2014:afr_self_heal_completion_cbk] 1-hosdu-replicate-2: background  entry self-heal completed on /
[2012-01-31 10:08:27.951829] I [afr-common.c:1302:afr_launch_self_heal] 1-hosdu-replicate-2: background  meta-data entry self-heal triggered. path: /linux-3.0.2, reason: lookup detected pending operations
[2012-01-31 10:08:27.960169] I [afr-self-heal-entry.c:2145:afr_sh_entry_fix] 1-hosdu-replicate-2: /linux-3.0.2: Performing conservative merge
[2012-01-31 10:08:27.966739] I [afr-self-heal-common.c:2014:afr_self_heal_completion_cbk] 1-hosdu-replicate-2: background  meta-data entry self-heal completed on /linux-3.0.2
[2012-01-31 10:08:27.975335] I [afr-common.c:1302:afr_launch_self_heal] 1-hosdu-replicate-2: background  meta-data entry self-heal triggered. path: /linux-3.0.2/arch, reason: lookup detected pending operations
[2012-01-31 10:08:27.983807] I [afr-self-heal-entry.c:2145:afr_sh_entry_fix] 1-hosdu-replicate-2: /linux-3.0.2/arch: Performing conservative merge
[2012-01-31 10:08:27.991216] I [afr-self-heal-common.c:2014:afr_self_heal_completion_cbk] 1-hosdu-replicate-2: background  meta-data entry self-heal completed on /linux-3.0.2/arch
[2012-01-31 10:08:28.012626] I [afr-common.c:1302:afr_launch_self_heal] 1-hosdu-replicate-2: background  meta-data entry self-heal triggered. path: /linux-3.0.2/arch/arm, reason: lookup detected pending operations
[2012-01-31 10:08:28.034134] E [afr-self-heal-common.c:998:afr_sh_common_lookup_resp_handler] 1-hosdu-replicate-2: path /linux-3.0.2/arch/arm/mach-ixp2000 on subvolume hosdu-client-5 => -1 (No such file or directory)
[2012-01-31 10:08:28.035383] E [afr-self-heal-common.c:998:afr_sh_common_lookup_resp_handler] 1-hosdu-replicate-2: path /linux-3.0.2/arch/arm/mach-ixp23xx on subvolume hosdu-client-5 => -1 (No such file or directory)
[2012-01-31 10:08:28.046615] I [afr-self-heal-common.c:2014:afr_self_heal_completion_cbk] 1-hosdu-replicate-2: background  meta-data entry self-heal completed on /linux-3.0.2/arch/arm
[2012-01-31 10:08:55.494328] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:08:55.496142] I [dht-layout.c:701:dht_layout_dir_mismatch] 1-hosdu-dht: subvol: hosdu-stripe-0; inode layout - 0 - 4294967295; disk layout - 0 - 2147483646
[2012-01-31 10:08:55.496170] I [dht-common.c:600:dht_revalidate_cbk] 1-hosdu-dht: mismatching layouts for /linux-3.0.2
[2012-01-31 10:08:55.496235] I [dht-layout.c:701:dht_layout_dir_mismatch] 1-hosdu-dht: subvol: hosdu-stripe-1; inode layout - 0 - 0; disk layout - 2147483647 - 4294967295
[2012-01-31 10:08:55.496250] I [dht-common.c:600:dht_revalidate_cbk] 1-hosdu-dht: mismatching layouts for /linux-3.0.2
[2012-01-31 10:09:01.599031] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:09:26.036434] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:09:48.860556] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:09:51.396528] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:09:58.027263] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:12:12.951120] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:13:14.337162] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:13:17.244777] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:17:05.670415] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:20:45.825349] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:24:43.926470] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL


I'm going to leave the set-up unchanged till someone takes a look at this.

Comment 1 M S Vishwanath Bhat 2012-01-31 15:33:30 UTC
profiling was enabled and stripe-block-size was set to 64MB. Output from volume info. 


[root@RHEL6 /]# gluster v i
 
Volume Name: hosdu
Type: Distributed-Striped-Replicate
Volume ID: 34b3f3bd-53d0-4795-bcf0-e5fb81bddeea
Status: Started
Number of Bricks: 2 x 2 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.1.11.113:/data/brick/hosdu_brick1
Brick2: 10.1.11.114:/data/brick/hosdu_brick2
Brick3: 10.1.11.136:/data/brick/hosdu_brick3
Brick4: 10.1.11.137:/data/brick/hosdu_brick4
Brick5: 10.1.11.113:/data/brick/hosdu_brick5
Brick6: 10.1.11.114:/data/brick/hosdu_brick6
Brick7: 10.1.11.136:/data/brick/hosdu_brick7
Brick8: 10.1.11.137:/data/brick/hosdu_brick8
Options Reconfigured:
cluster.stripe-block-size: 64MB
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on

Comment 2 M S Vishwanath Bhat 2012-02-01 09:16:07 UTC
last entries in the log file when stat is issued on 'hosa_dir'


tail /usr/local/var/log/glusterfs/mnt-.log 
[2012-01-31 10:17:05.670415] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:20:45.825349] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-01-31 10:24:43.926470] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-02-01 04:08:22.649230] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-02-01 04:08:24.843136] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-02-01 04:08:32.447879] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-02-01 04:08:33.219337] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-02-01 04:09:31.144304] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-02-01 04:09:32.049503] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL
[2012-02-01 04:10:30.911489] E [fuse-resolve.c:192:fuse_resolve_gfid] 0-glusterfs-fuse: xl is NULL

Comment 3 Amar Tumballi 2012-02-01 11:55:27 UTC
looks like the patch to implement proper resolve in fuse fixes this. (http://review.gluster.com/2703)

Comment 4 M S Vishwanath Bhat 2012-05-11 10:01:29 UTC
Can't verify the bug as it's verification is blocked by bug https://bugzilla.redhat.com/show_bug.cgi?id=820355

Because of the above bug I can't go beyond step 2 while verifying.

Comment 5 M S Vishwanath Bhat 2012-05-31 09:46:38 UTC
Followed the setps and issue is not happening. Moving to verified.