Bug 764191 (GLUSTER-2459)

Summary: Some files are inaccessible until root reads them
Product: [Community] GlusterFS Reporter: Need Real Name <landman>
Component: nfs Assignee: Shehjar Tikoo <shehjart>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: low    
Version: 3.1.2 CC: gluster-bugs, landman
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: RTP Mount Type: nfs
Documentation: DP CRM:
Verified Versions: Category: ---
Attachments:
Description                                              Flags
nfs-client log with the problem                          none
gluster logs from affected brick with trace level debug  none

Description Need Real Name 2011-02-23 20:46:26 UTC
I have since tested with updated kernels.  I have the same exact failure.


[landman@manager ~]$ cat /gevol/assets/.config/PacketLevel.taskrule
cat: /gevol/assets/.config/PacketLevel.taskrule: Input/output error

It looks like attributes are not being distributed correctly from the actual storage brick to the brick providing the file service.  

This is fairly easy to reproduce at this point.

Comment 1 Need Real Name 2011-02-23 22:46:37 UTC
On a RHEL5.4 client


[landman@blackbird ~]$ cat /etc/redhat-release 
CentOS release 5.4 (Final)


[landman@blackbird ~]$ cat /gevol/assets/.config/volumes.xml
cat: /gevol/assets/.config/volumes.xml: Operation not permitted

Then as root on the same machine


[root@blackbird ~]# cat /gevol/assets/.config/volumes.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<VolumeDefList>

  <volumedefs>
    <item key="assetroot">
      <netpath>/gevol/assets</netpath>
      <host>xxx</host>
      <localpath>/gevol/assets</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
    <item key="opt">
      <netpath>/opt/google/share/tutorials</netpath>
      <host>xxx</host>
      <localpath>/opt/google/share/tutorials</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
    <item key="src">
      <netpath>/gevol/src</netpath>
      <host>xxx</host>
      <localpath>/gevol/src</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
  </volumedefs>

</VolumeDefList>

Then on the same client, in the same window that failed moments before:

[landman@blackbird ~]$ cat /gevol/assets/.config/volumes.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<VolumeDefList>

  <volumedefs>
    <item key="assetroot">
      <netpath>/gevol/assets</netpath>
      <host>xxx</host>
      <localpath>/gevol/assets</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
    <item key="opt">
      <netpath>/opt/google/share/tutorials</netpath>
      <host>xxx</host>
      <localpath>/opt/google/share/tutorials</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
    <item key="src">
      <netpath>/gevol/src</netpath>
      <host>xxx</host>
      <localpath>/gevol/src</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
  </volumedefs>
</VolumeDefList>

The volume is as follows:


[root@manager ~]# gluster volume info fusion

Volume Name: fusion
Type: Distribute
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: dv4-1:/data/brick-md1/fusion
Brick2: dv4-2:/data/brick-md1/fusion
Brick3: dv4-3:/data/brick-md1/fusion
Brick4: dv4-1:/data/brick-md2/fusion
Brick5: dv4-2:/data/brick-md2/fusion
Brick6: dv4-3:/data/brick-md2/fusion
Options Reconfigured:
performance.cache-refresh-timeout: 0
performance.stat-prefetch: 0
auth.allow: *

mount options on the clients are

xxx:/fusion/blackbird  /gevol nfs rw,nosuid,nodev,intr,hard,noacl,nolock,noac 0 0

So this appears to be an attribute caching problem, since I can recreate this user on the manager node, mount the directory, and have no issues whatsoever accessing any of the files.

This may or may not be related to a kernel client side bug in the NFS client.  Given the age of the kernel we aren't sure.  We will test with a newer kernel.

Is there anything we can do in terms of a near-term workaround?  They need to use NFS.

Comment 2 Shehjar Tikoo 2011-02-24 03:52:09 UTC
(In reply to comment #0)
> xxx:/fusion/blackbird  /gevol nfs rw,nosuid,nodev,intr,hard,noacl,nolock,noac 0
> 0
> 
> So this appears to be an attribute caching problem.  Since I can recreate this
> user on the manager node, mount the directory, and have no issues whatsoever in
> accessing any of the files.


Why do you think it is related to attribute caching? The noac mount option disables attribute caching.
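
One way to confirm that noac is actually in effect on the client is to look the mount point up in the kernel's mount table rather than in /etc/fstab. The following is only a sketch: the helper name has_mount_opt is made up for illustration, and it assumes the client kernel lists noac among the options in /proc/mounts.

```shell
# has_mount_opt MOUNTPOINT OPTION [MOUNTS_FILE]
# Hypothetical helper: exits 0 if OPTION appears in the option list of
# MOUNTPOINT. MOUNTS_FILE defaults to /proc/mounts (fstab-style columns).
has_mount_opt() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            n = split($4, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${3:-/proc/mounts}"
}
```

For example, `has_mount_opt /gevol noac && echo "noac active"` on the affected client.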

Comment 3 Shehjar Tikoo 2011-02-24 03:57:10 UTC
I am trying to reproduce it. Please provide the ls -l output for this file.

Comment 4 Need Real Name 2011-02-24 11:17:10 UTC
on the affected machines:


[landman@blackbird ~]$ ls -l /gevol/assets/.config/*
-rwxr-xr-x 1 pollreid       52030 249 Feb  4 04:51 /gevol/assets/.config/CombinedTerrain.taskrule
-rw-rw-r-- 1 landman      wedge     0 Feb 23 13:30 /gevol/assets/.config/garbage
-rwxr-xr-x 1 pollreid       52030 247 Feb  4 04:51 /gevol/assets/.config/MapLayerLevel.taskrule
-rw-r--r-- 1 gefusionuser gegroup 290 Dec 21 13:44 /gevol/assets/.config/misc.xml
-rwxr-xr-x 1 pollreid       52030 220 Feb  4 04:51 /gevol/assets/.config/PacketLevel.taskrule
-rw-r--r-- 1 gefusionuser gegroup 828 Dec 21 13:44 /gevol/assets/.config/volumes.xml
[landman@blackbird ~]$ cat /gevol/assets/.config/garbage
cat: /gevol/assets/.config/garbage: Input/output error


yet from a mount that doesn't exhibit this problem


[landman@manager ~]$  ls -l /gevol/assets/.config/*
-rwxr-xr-x 1   52030 52030 249 Feb  4 04:51 /gevol/assets/.config/CombinedTerrain.taskrule
-rw-rw-r-- 1 landman wedge   0 Feb 23 13:30 /gevol/assets/.config/garbage
-rwxr-xr-x 1   52030 52030 247 Feb  4 04:51 /gevol/assets/.config/MapLayerLevel.taskrule
-rw-r--r-- 1     312   315 290 Dec 21 13:44 /gevol/assets/.config/misc.xml
-rwxr-xr-x 1   52030 52030 220 Feb  4 04:51 /gevol/assets/.config/PacketLevel.taskrule
-rw-r--r-- 1     312   315 828 Dec 21 13:44 /gevol/assets/.config/volumes.xml
[landman@manager ~]$ cat /gevol/assets/.config/garbage
[landman@manager ~]$

Comment 5 Need Real Name 2011-02-24 11:19:13 UTC
I agree that noac should disable attribute caching on the client.  This appears to be a server-side attribute caching issue.  The customer has noted that it occurs more often when the files in question aren't on the same computer as the NFS export being mounted.

Comment 6 Shehjar Tikoo 2011-02-25 04:02:42 UTC
Thanks. Here is what I need now:

1. Before doing the cat again on the affected system, set the log-level for the NFS server to TRACE.

2. Run:

dmesg -c >/dev/null;

3. Run:

echo 65535 > /proc/sys/sunrpc/nfs_debug

4. Run the cat command.

5. Run:

dmesg > /tmp/nfs-client.log

6. If it fails again with an I/O error, please attach here the nfs.log file from the glusterd logs directory and /tmp/nfs-client.log
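
The client-side steps (2 to 5) can be collected into one script, roughly as sketched below. This is an illustration, not a tested procedure: the function names, the TARGET and OUT defaults, and the DRY_RUN switch are all made up here, and the final step that turns nfs_debug back off is an addition that is not in the list above. Step 1 (TRACE logging on the NFS server) still has to be done separately on the server.

```shell
#!/bin/sh
# Sketch of the client-side capture steps. Must run as root on the NFS client.
# TARGET, OUT and DRY_RUN are illustrative names, not part of any tool.
TARGET="${TARGET:-/gevol/assets/.config/garbage}"   # file that fails to read
OUT="${OUT:-/tmp/nfs-client.log}"                   # where the kernel log goes
DRY_RUN="${DRY_RUN:-0}"                             # 1 = only print the commands

# run CMD: print CMD when DRY_RUN=1, otherwise execute it through sh.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        sh -c "$*"
    fi
}

capture_nfs_debug() {
    run "dmesg -c > /dev/null"                     # step 2: clear kernel ring buffer
    run "echo 65535 > /proc/sys/sunrpc/nfs_debug"  # step 3: enable NFS client debug
    run "cat '$TARGET' > /dev/null"                # step 4: the read expected to fail
    run "dmesg > '$OUT'"                           # step 5: save the client-side trace
    run "echo 0 > /proc/sys/sunrpc/nfs_debug"      # added: switch debugging back off
}
```

Calling `capture_nfs_debug` with DRY_RUN=1 first prints the five commands for review; calling it with DRY_RUN unset runs them and leaves the trace in $OUT.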

Comment 7 Need Real Name 2011-02-25 15:51:42 UTC
Created attachment 440

nfs-client.log.gz :  uncompress with "gzip -d nfs-client.log.gz"

Comment 8 Need Real Name 2011-02-25 15:52:52 UTC
Created attachment 441


logs from /var/log/gluster on the server

Comment 9 Need Real Name 2011-02-25 15:53:54 UTC
Files attached as per instructions.  Output from cat was this:


[landman@compute-0-2 ~]$ cat /gevol/assets/.config/*
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<TaskRule>
	<taskname>CombinedTerrain</taskname>
	<inputConstraints/>
	<outputConstraints/>
	<cpuConstraint>
		<minNumCPU>4</minNumCPU>
		<maxNumCPU>4</maxNumCPU>
	</cpuConstraint>
</TaskRule>
cat: /gevol/assets/.config/garbage: Input/output error
cat: /gevol/assets/.config/MapLayerLevel.taskrule: Operation not permitted
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<MiscConfigStorage>

  <NFSVisibilityDelay>300</NFSVisibilityDelay>

  <AssetCacheSize>32000</AssetCacheSize>

  <VersionCacheSize>32000</VersionCacheSize>

  <GenerateProductPreviews>1</GenerateProductPreviews>

</MiscConfigStorage>
cat: /gevol/assets/.config/PacketLevel.taskrule: Operation not permitted
cat: /gevol/assets/.config/volumes.xml: Operation not permitted

Comment 10 Need Real Name 2011-02-25 15:54:32 UTC
adding me into cc list

Comment 11 Shehjar Tikoo 2011-03-01 02:43:37 UTC
Hi

The problem operation is:

[landman@blackbird ~]$ cat /gevol/assets/.config/garbage
cat: /gevol/assets/.config/garbage: Input/output error

It shows up in the nfs.log as:

[2011-02-25 11:41:17.863139] D [nfs3-helpers.c:2389:nfs3_log_rw_call] nfs-nfsv3: XID: c2523400, READ: args: FH: hashcount 4, exportid 6522e543-ea10-4684-8312-1fda37dbdb2f, gfid 7e25e922-6869-4336-9600-27d6354b25bd, offset: 0,  count: 4096
[2011-02-25 11:41:17.863150] T [nfs3.c:1791:nfs3_read] nfs-nfsv3: FH to Volume: fusion
[2011-02-25 11:41:17.863159] T [nfs3-helpers.c:3098:nfs3_fh_resolve_inode] nfs-nfsv3: FH needs inode resolution
[2011-02-25 11:41:17.863167] T [nfs3-helpers.c:2523:nfs3_fh_resolve_inode_done] nfs-nfsv3: FH inode resolved
[2011-02-25 11:41:17.863177] T [nfs3-helpers.c:2238:nfs3_file_open_and_resume] nfs-nfsv3: Opening: /blackbird/assets/.config/garbage
[2011-02-25 11:41:17.863185] T [nfs3-helpers.c:2218:nfs3_fdcache_getfd] nfs-nfsv3: fd found in state: 2
[2011-02-25 11:41:17.863193] T [nfs3-helpers.c:1926:__nfs3_fdcache_update_entry] nfs-nfsv3: Updating fd: 0x7f9b602db024
[2011-02-25 11:41:17.863209] T [nfs.c:412:nfs_user_create] nfs: uid: 52033, gid 311, gids: 1
[2011-02-25 11:41:17.863218] T [nfs.c:420:nfs_user_create] nfs: gid: 311
[2011-02-25 11:41:17.863225] T [nfs-fops.c:133:nfs_create_frame] nfs: uid: 52033, gid 311, gids: 1
[2011-02-25 11:41:17.863233] T [nfs-fops.c:135:nfs_create_frame] nfs: gid: 311
[2011-02-25 11:41:17.863246] T [write-behind.c:442:wb_sync] fusion-write-behind: no vectors are to besynced
[2011-02-25 11:41:17.863262] T [rpc-clnt.c:1295:rpc_clnt_record] : Auth Info: pid: 0, uid: 52033, gid: 311, owner: 260
[2011-02-25 11:41:17.863272] T [rpc-clnt.c:1195:rpc_clnt_record_build_header] rpc-clnt: Request fraglen 152, payload: 24, rpc hdr: 128
[2011-02-25 11:41:17.863301] T [rpc-clnt.c:1499:rpc_clnt_submit] rpc-clnt: submitted request (XID: 0x296x Program: GlusterFS 3.1, ProgVers: 310, Proc: 25) to rpc-transport (fusion-client-1)
[2011-02-25 11:41:17.863631] T [rpc-clnt.c:631:rpc_clnt_reply_init] rpc-clnt: recieved rpc message (RPC XID: 0x296x Program: GlusterFS 3.1, ProgVers: 310, Proc: 25) from rpc-transport (fusion-client-1)
[2011-02-25 11:41:17.863666] T [write-behind.c:442:wb_sync] fusion-write-behind: no vectors are to besynced
[2011-02-25 11:41:17.863695] D [nfs3-helpers.c:2431:nfs3_log_read_res] nfs-nfsv3: XID: c2523400, READ: NFS: 0(Call completed successfully.), POSIX: -1(Unknown error 18446744073709551615), count: 0, is_eof: 0

This means a zero-length read is returned without the EOF flag set.
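
To look for other occurrences of the same pattern in an nfs.log, something like the following could be used. It is a rough sketch that simply matches the text of the nfs3_log_read_res line quoted above; the function name find_empty_reads is made up, and the match depends on that exact log wording.

```shell
# find_empty_reads LOGFILE
# Hypothetical helper: print READ replies that returned 0 bytes with is_eof
# unset, i.e. lines shaped like the nfs3_log_read_res entry in this comment.
find_empty_reads() {
    grep 'nfs3_log_read_res' "$1" | grep 'count: 0, is_eof: 0'
}
```

For example, `find_empty_reads nfs.log` run against a copy of the attached server log.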

For a potential work-around, please try disabling io-cache and quick-read. We fixed a bug in each of them after 3.1.2.
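
As a sketch of what that could look like from the gluster CLI: the option keys performance.quick-read and performance.io-cache are assumptions, modelled on the performance.* names in the volume info output earlier in this bug, and should be checked against the documentation for the installed release. The helper name workaround_cmds is made up; it only prints the commands so they can be reviewed before anything is changed.

```shell
# workaround_cmds VOLUME
# Hypothetical helper: print (not run) the commands that would switch off
# the two caching translators. The option keys are assumed, not verified
# against 3.1.2.
workaround_cmds() {
    vol="$1"
    echo "gluster volume set $vol performance.quick-read off"
    echo "gluster volume set $vol performance.io-cache off"
}
```

`workaround_cmds fusion` shows the two commands for the volume in this report; piping the output to `sh` as root on one of the servers would apply them.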

Comment 12 Shehjar Tikoo 2011-03-03 06:41:27 UTC
(In reply to comment #11)
> For a potential work-around, please try disabling io-cache and quick-read. We
> fixed a bug each both post-3.1.2.

Joe, please try with the work-around above and let us know, thanks.

Comment 13 Shehjar Tikoo 2011-03-08 07:34:33 UTC
Closing. Please re-open if the work-around didn't work. Thanks.