Bug 764191 - (GLUSTER-2459) Some files are inaccessible until root reads them
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: nfs
Version: 3.1.2
Hardware: x86_64 Linux
Priority: low
Severity: medium
Assigned To: Shehjar Tikoo
Depends On:
Blocks:
Reported: 2011-02-23 17:46 EST by Need Real Name
Modified: 2011-03-08 05:34 EST (History)
CC List: 2 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: RTP
Mount Type: nfs
Documentation: DP
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
nfs-client log with the problem (1.39 KB, application/gzip)
2011-02-25 10:51 EST, Need Real Name
gluster logs from affected brick with trace level debug (788.07 KB, application/gzip)
2011-02-25 10:52 EST, Need Real Name

Description Need Real Name 2011-02-23 15:46:26 EST
I have since tested with updated kernels.  I have the same exact failure.


[landman@manager ~]$ cat /gevol/assets/.config/PacketLevel.taskrule
cat: /gevol/assets/.config/PacketLevel.taskrule: Input/output error

It looks like attributes are not being distributed correctly from the actual storage brick to the brick providing the file service.  

This is fairly easy to reproduce at this point.
Comment 1 Need Real Name 2011-02-23 17:46:37 EST
On a RHEL5.4 client


[landman@blackbird ~]$ cat /etc/redhat-release 
CentOS release 5.4 (Final)


[landman@blackbird ~]$ cat /gevol/assets/.config/volumes.xml
cat: /gevol/assets/.config/volumes.xml: Operation not permitted

Then as root on the same machine


[root@blackbird ~]# cat /gevol/assets/.config/volumes.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<VolumeDefList>

  <volumedefs>
    <item key="assetroot">
      <netpath>/gevol/assets</netpath>
      <host>xxx</host>
      <localpath>/gevol/assets</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
    <item key="opt">
      <netpath>/opt/google/share/tutorials</netpath>
      <host>xxx</host>
      <localpath>/opt/google/share/tutorials</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
    <item key="src">
      <netpath>/gevol/src</netpath>
      <host>xxx</host>
      <localpath>/gevol/src</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
  </volumedefs>

</VolumeDefList>

then the same client in the same window that failed moments before

[landman@blackbird ~]$ cat /gevol/assets/.config/volumes.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<VolumeDefList>

  <volumedefs>
    <item key="assetroot">
      <netpath>/gevol/assets</netpath>
      <host>xxx</host>
      <localpath>/gevol/assets</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
    <item key="opt">
      <netpath>/opt/google/share/tutorials</netpath>
      <host>xxx</host>
      <localpath>/opt/google/share/tutorials</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
    <item key="src">
      <netpath>/gevol/src</netpath>
      <host>xxx</host>
      <localpath>/gevol/src</localpath>
      <reserveSpace>100000000</reserveSpace>
      <isTmp>0</isTmp>
    </item>
  </volumedefs>
</VolumeDefList>

The volume is as follows:


[root@manager ~]# gluster volume info fusion

Volume Name: fusion
Type: Distribute
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: dv4-1:/data/brick-md1/fusion
Brick2: dv4-2:/data/brick-md1/fusion
Brick3: dv4-3:/data/brick-md1/fusion
Brick4: dv4-1:/data/brick-md2/fusion
Brick5: dv4-2:/data/brick-md2/fusion
Brick6: dv4-3:/data/brick-md2/fusion
Options Reconfigured:
performance.cache-refresh-timeout: 0
performance.stat-prefetch: 0
auth.allow: *

mount options on the clients are

xxx:/fusion/blackbird  /gevol nfs rw,nosuid,nodev,intr,hard,noacl,nolock,noac 0 0

So this appears to be an attribute caching problem, since I can recreate this user on the manager node, mount the directory, and have no issues whatsoever in accessing any of the files.

This may or may not be related to a kernel client side bug in the NFS client.  Given the age of the kernel we aren't sure.  We will test with a newer kernel.

Is there anything we can do in terms of a near-term workaround?  They need to use NFS.
Comment 2 Shehjar Tikoo 2011-02-23 22:52:09 EST
(In reply to comment #0)
> xxx:/fusion/blackbird  /gevol nfs rw,nosuid,nodev,intr,hard,noacl,nolock,noac 0
> 0
> 
> So this appears to be an attribute caching problem.  Since I can recreate this
> user on the manager node, mount the directory, and have no issues whatsoever in
> accessing any of the files.


Why do you think it is related to attribute caching? The noac mount option disables attribute caching.
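
To confirm that noac is really in effect on a client, the option string can be checked one option at a time. The following is only a minimal sketch: has_mount_option is a hypothetical helper name, not something from this report. It matches whole comma-separated options so that noacl is not mistaken for noac.

```shell
#!/bin/sh
# has_mount_option OPTIONS NAME
# Hypothetical helper: succeeds if NAME appears as a whole entry in the
# comma-separated OPTIONS string (so "noacl" does not count as "noac").
has_mount_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example against the fstab options quoted in this report:
#   has_mount_option "rw,nosuid,nodev,intr,hard,noacl,nolock,noac" noac && echo "noac set"
# Example against the live mount (options are the 4th field of /proc/mounts):
#   has_mount_option "$(awk '$2 == "/gevol" {print $4}' /proc/mounts)" noac
```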
Comment 3 Shehjar Tikoo 2011-02-23 22:57:10 EST
I am trying to reproduce it. Please provide the ls -l output for this file.
Comment 4 Need Real Name 2011-02-24 06:17:10 EST
on the affected machines:


[landman@blackbird ~]$ ls -l /gevol/assets/.config/*
-rwxr-xr-x 1 pollreid       52030 249 Feb  4 04:51 /gevol/assets/.config/CombinedTerrain.taskrule
-rw-rw-r-- 1 landman      wedge     0 Feb 23 13:30 /gevol/assets/.config/garbage
-rwxr-xr-x 1 pollreid       52030 247 Feb  4 04:51 /gevol/assets/.config/MapLayerLevel.taskrule
-rw-r--r-- 1 gefusionuser gegroup 290 Dec 21 13:44 /gevol/assets/.config/misc.xml
-rwxr-xr-x 1 pollreid       52030 220 Feb  4 04:51 /gevol/assets/.config/PacketLevel.taskrule
-rw-r--r-- 1 gefusionuser gegroup 828 Dec 21 13:44 /gevol/assets/.config/volumes.xml
[landman@blackbird ~]$ cat /gevol/assets/.config/garbage
cat: /gevol/assets/.config/garbage: Input/output error


yet from a mount that doesn't exhibit this problem


[landman@manager ~]$  ls -l /gevol/assets/.config/*
-rwxr-xr-x 1   52030 52030 249 Feb  4 04:51 /gevol/assets/.config/CombinedTerrain.taskrule
-rw-rw-r-- 1 landman wedge   0 Feb 23 13:30 /gevol/assets/.config/garbage
-rwxr-xr-x 1   52030 52030 247 Feb  4 04:51 /gevol/assets/.config/MapLayerLevel.taskrule
-rw-r--r-- 1     312   315 290 Dec 21 13:44 /gevol/assets/.config/misc.xml
-rwxr-xr-x 1   52030 52030 220 Feb  4 04:51 /gevol/assets/.config/PacketLevel.taskrule
-rw-r--r-- 1     312   315 828 Dec 21 13:44 /gevol/assets/.config/volumes.xml
[landman@manager ~]$ cat /gevol/assets/.config/garbage
[landman@manager ~]$
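
The two listings differ only in how owners are displayed, because each client resolves numeric IDs through its own passwd and group databases. Comparing the raw numbers removes that difference. This is a minimal sketch, assuming GNU coreutils stat; numeric_owner is a hypothetical helper name, and this report does not establish whether ID mapping is related to the failure.

```shell
#!/bin/sh
# numeric_owner FILE
# Hypothetical helper: print the owner of FILE as "uid:gid" without
# consulting the local passwd/group databases (GNU stat assumed).
numeric_owner() {
    stat -c '%u:%g' "$1"
}

# Example: run on both clients and compare the output line by line.
#   for f in /gevol/assets/.config/*; do
#       printf '%s %s\n' "$(numeric_owner "$f")" "$f"
#   done
```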
Comment 5 Need Real Name 2011-02-24 06:19:13 EST
I agree that the noac should disable attribute caching on the client.  This appears to be a server side attribute caching issue.  Customer has noted that it more often occurs when the files in question aren't on the same computer as the NFS export being mounted.
Comment 6 Shehjar Tikoo 2011-02-24 23:02:42 EST
Thanks. Here is what I need now:

1. Before doing the cat again on the affected system, set the log-level for the NFS server to TRACE.

2. Run:

dmesg -c >/dev/null;

3. Run:

echo 65535 > /proc/sys/sunrpc/nfs_debug

4. Run the cat command.

5. Run:

dmesg > /tmp/nfs-client.log

6. If it fails again with IO error, please attach here the nfs.log file from the glusterd logs directory and /tmp/nfs-client.log
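
The client-side part of the procedure above (clearing the kernel log, enabling NFS debugging, reproducing the failure, saving the kernel log) can be collected into one script. This is a sketch under stated assumptions, not the developer's own tooling: capture_nfs_debug, run and the DRY_RUN switch are hypothetical names, the final step that writes 0 back to nfs_debug is an addition not requested in the thread, and raising the NFS server log level to TRACE still has to be done separately on the server.

```shell
#!/bin/sh
# Hypothetical wrapper around the client-side capture steps.
# Must be run as root on the affected NFS client.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        sh -c "$*"
    fi
}

capture_nfs_debug() {
    cap_target=$1                        # file whose read fails
    cap_log=${2:-/tmp/nfs-client.log}    # where to save the kernel log

    run "dmesg -c > /dev/null"                       # clear the kernel ring buffer
    run "echo 65535 > /proc/sys/sunrpc/nfs_debug"    # enable NFS client debug output
    run "cat '$cap_target' > /dev/null" || true      # reproduce; an error is expected
    run "dmesg > '$cap_log'"                         # save the client-side trace
    run "echo 0 > /proc/sys/sunrpc/nfs_debug"        # assumption: switch debugging off again
}

# Example:
#   capture_nfs_debug /gevol/assets/.config/garbage
#   DRY_RUN=1 capture_nfs_debug /gevol/assets/.config/garbage   # only print the commands
```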
Comment 7 Need Real Name 2011-02-25 10:51:42 EST
Created attachment 440
nfs-client log with the problem

nfs-client.log.gz :  uncompress with "gzip -d nfs-client.log.gz"
Comment 8 Need Real Name 2011-02-25 10:52:52 EST
Created attachment 441


logs from /var/log/gluster on the server
Comment 9 Need Real Name 2011-02-25 10:53:54 EST
Files attached as per instructions.  Output from cat was this:


[landman@compute-0-2 ~]$ cat /gevol/assets/.config/*
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<TaskRule>
	<taskname>CombinedTerrain</taskname>
	<inputConstraints/>
	<outputConstraints/>
	<cpuConstraint>
		<minNumCPU>4</minNumCPU>
		<maxNumCPU>4</maxNumCPU>
	</cpuConstraint>
</TaskRule>
cat: /gevol/assets/.config/garbage: Input/output error
cat: /gevol/assets/.config/MapLayerLevel.taskrule: Operation not permitted
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<MiscConfigStorage>

  <NFSVisibilityDelay>300</NFSVisibilityDelay>

  <AssetCacheSize>32000</AssetCacheSize>

  <VersionCacheSize>32000</VersionCacheSize>

  <GenerateProductPreviews>1</GenerateProductPreviews>

</MiscConfigStorage>
cat: /gevol/assets/.config/PacketLevel.taskrule: Operation not permitted
cat: /gevol/assets/.config/volumes.xml: Operation not permitted
Comment 10 Need Real Name 2011-02-25 10:54:32 EST
adding me into cc list
Comment 11 Shehjar Tikoo 2011-02-28 21:43:37 EST
Hi

The problem operation is:

[landman@blackbird ~]$ cat /gevol/assets/.config/garbage
cat: /gevol/assets/.config/garbage: Input/output error

It shows up in the nfs.log as:

[2011-02-25 11:41:17.863139] D [nfs3-helpers.c:2389:nfs3_log_rw_call] nfs-nfsv3: XID: c2523400, READ: args: FH: hashcount 4, exportid 6522e543-ea10-4684-8312-1fda37dbdb2f, gfid 7e25e922-6869-4336-9600-27d6354b25bd, offset: 0,  count: 4096
[2011-02-25 11:41:17.863150] T [nfs3.c:1791:nfs3_read] nfs-nfsv3: FH to Volume: fusion
[2011-02-25 11:41:17.863159] T [nfs3-helpers.c:3098:nfs3_fh_resolve_inode] nfs-nfsv3: FH needs inode resolution
[2011-02-25 11:41:17.863167] T [nfs3-helpers.c:2523:nfs3_fh_resolve_inode_done] nfs-nfsv3: FH inode resolved
[2011-02-25 11:41:17.863177] T [nfs3-helpers.c:2238:nfs3_file_open_and_resume] nfs-nfsv3: Opening: /blackbird/assets/.config/garbage
[2011-02-25 11:41:17.863185] T [nfs3-helpers.c:2218:nfs3_fdcache_getfd] nfs-nfsv3: fd found in state: 2
[2011-02-25 11:41:17.863193] T [nfs3-helpers.c:1926:__nfs3_fdcache_update_entry] nfs-nfsv3: Updating fd: 0x7f9b602db024
[2011-02-25 11:41:17.863209] T [nfs.c:412:nfs_user_create] nfs: uid: 52033, gid 311, gids: 1
[2011-02-25 11:41:17.863218] T [nfs.c:420:nfs_user_create] nfs: gid: 311
[2011-02-25 11:41:17.863225] T [nfs-fops.c:133:nfs_create_frame] nfs: uid: 52033, gid 311, gids: 1
[2011-02-25 11:41:17.863233] T [nfs-fops.c:135:nfs_create_frame] nfs: gid: 311
[2011-02-25 11:41:17.863246] T [write-behind.c:442:wb_sync] fusion-write-behind: no vectors are to besynced
[2011-02-25 11:41:17.863262] T [rpc-clnt.c:1295:rpc_clnt_record] : Auth Info: pid: 0, uid: 52033, gid: 311, owner: 260
[2011-02-25 11:41:17.863272] T [rpc-clnt.c:1195:rpc_clnt_record_build_header] rpc-clnt: Request fraglen 152, payload: 24, rpc hdr: 128
[2011-02-25 11:41:17.863301] T [rpc-clnt.c:1499:rpc_clnt_submit] rpc-clnt: submitted request (XID: 0x296x Program: GlusterFS 3.1, ProgVers: 310, Proc: 25) to rpc-transport (fusion-client-1)
[2011-02-25 11:41:17.863631] T [rpc-clnt.c:631:rpc_clnt_reply_init] rpc-clnt: recieved rpc message (RPC XID: 0x296x Program: GlusterFS 3.1, ProgVers: 310, Proc: 25) from rpc-transport (fusion-client-1)
[2011-02-25 11:41:17.863666] T [write-behind.c:442:wb_sync] fusion-write-behind: no vectors are to besynced
[2011-02-25 11:41:17.863695] D [nfs3-helpers.c:2431:nfs3_log_read_res] nfs-nfsv3: XID: c2523400, READ: NFS: 0(Call completed successfully.), POSIX: -1(Unknown error 18446744073709551615), count: 0, is_eof: 0

This means a zero-length read is returned without the EOF flag set.
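
To look for other occurrences of the same pattern, the READ result lines in nfs.log can be filtered for a zero count combined with a cleared EOF flag. This is a minimal sketch that assumes the log line layout shown in the excerpt above; find_suspect_reads is a hypothetical helper name.

```shell
#!/bin/sh
# find_suspect_reads LOGFILE
# Hypothetical helper: print READ result lines that report zero bytes
# ("count: 0,") while the EOF flag is not set ("is_eof: 0").
find_suspect_reads() {
    awk '/nfs3_log_read_res/ && /count: 0,/ && /is_eof: 0/' "$1"
}

# Example (the path to nfs.log depends on the installation):
#   find_suspect_reads /path/to/nfs.log
```

The XID on each matching line can then be used to find the corresponding READ call earlier in the log.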

For a potential work-around, please try disabling io-cache and quick-read. We fixed a bug in each of them after 3.1.2.
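
The work-around could be applied with gluster volume set. The sketch below only prints the commands for review and does not run them. It assumes that the option keys performance.io-cache and performance.quick-read are available in the installed release, which this thread does not confirm for 3.1.2; workaround_commands is a hypothetical helper name.

```shell
#!/bin/sh
# workaround_commands VOLUME
# Hypothetical helper: print (do not execute) the commands that would
# switch off the io-cache and quick-read translators for VOLUME.
# Assumption: both option keys exist in the installed GlusterFS release.
workaround_commands() {
    for opt in performance.io-cache performance.quick-read; do
        echo "gluster volume set $1 $opt off"
    done
}

# Example for the volume in this report:
#   workaround_commands fusion
# After reviewing the output, the commands can be run as root on a server node.
```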
Comment 12 Shehjar Tikoo 2011-03-03 01:41:27 EST
(In reply to comment #11)
> For a potential work-around, please try disabling io-cache and quick-read. We
> fixed a bug each both post-3.1.2.

Joe, please try with the work-around above and let us know, thanks.
Comment 13 Shehjar Tikoo 2011-03-08 02:34:33 EST
Closing. Please re-open if the work-around didn't work. Thanks.
