Bug 222299

Summary: gfs knows of directories which it chooses not to display
Product: Red Hat Enterprise Linux 5 Reporter: Robert Peterson <rpeterso>
Component: gfs-utils Assignee: Robert Peterson <rpeterso>
Status: CLOSED ERRATA QA Contact: GFS Bugs <gfs-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.0 CC: mceci, nobody+wcheng, rkenna, rohara
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0577 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-11-07 20:06:05 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Bug Depends On: 190756, 222673    
Bug Blocks:    
Attachments:
Description Flags
RHEL5 patch to fix the problem none

Description Robert Peterson 2007-01-11 15:37:54 UTC
+++ This bug was initially created as a clone of Bug #190756 +++
Crosswritten from RHEL4 to RHEL5 for GFS.

Description of problem:
I saw this after setting up to do NFS/GFS/Rgmanager testing.  I have a few gfs
filesystems which are part of a couple of nfs rgmanager services and are being
exported to nfs clients.

[root@taft-04 taft0]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  taft-01                                  Online, rgmanager
  taft-02                                  Online, rgmanager
  taft-03                                  Online, rgmanager
  taft-04                                  Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  nfs1                 taft-04                        started
  nfs2                 taft-04                        started

From the serving node (taft-04) there exist directories that are not being displayed:
[root@taft-04 taft0]# df -h /mnt/taft0
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/TAFT_CLUSTER-TAFT_CLUSTER0
                       91G   22M   91G   1% /mnt/taft0

[root@taft-04 taft0]# pwd
/mnt/taft0
[root@taft-04 taft0]# ls
flea-02
[root@taft-04 taft0]# ls flea-01
[root@taft-04 taft0]# ls flea-03
[root@taft-04 taft0]# ls flea-04
[root@taft-04 taft0]# ls flea-05
[root@taft-04 taft0]# ls flea-99
ls: flea-99: No such file or directory
[root@taft-04 taft0]# cd flea-01
[root@taft-04 flea-01]# ls
[root@taft-04 flea-01]# cd ..

From an nfs client (flea-05) I don't see them either:
[root@flea-05 taft0]# pwd
/mnt/taft0
[root@flea-05 taft0]# ls
flea-02
[root@flea-05 taft0]# touch flea-01
touch: cannot touch `flea-01': Permission denied
[root@flea-05 taft0]# touch flea-99
[root@flea-05 taft0]# ls
flea-02  flea-99
[root@flea-05 taft0]# ls -a
.  ..  flea-02  flea-99
[root@flea-05 taft0]# ls -ld
drwxrwxrwx  3 root root 3864 May  4 10:57 .

Version-Release number of selected component (if applicable):
[root@taft-04 taft0]# rpm -q GFS
GFS-6.1.5-0

-- Additional comment from rpeterso on 2006-05-15 11:12 EST --
Corey recreated this problem for me on 11 May 2006.  
(Extraneous data cut)
That shows the mkdir is hung in do_dlm_lock, so there must also be some
locking issues related to this problem.  Unfortunately, I couldn't
do any more analysis because the machine wouldn't let me touch the fs.
Therefore, I asked Corey to reboot it.  I think we need to add some kind
of debug code, possibly to vfs's dcache code, before recreating it again.


-- Additional comment from swhiteho on 2006-06-05 06:56 EST --
Do you know if it's possible to reproduce this without NFS? It looks a bit like
things I've seen when GFS is confused over the hash value of certain files, but
it could also be caused by NFS filehandle lookup problems.

In the dir code, there are two different sets of lookup code, one via lookup and
one via readdir (getdents64), and what you see here looks very much like the two
having different ideas of what's in the directory.

Also, do you know if at any stage the directory contained a particularly large
number of files (say, >1000), which can trigger this kind of thing?
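
To make the two lookup paths concrete, here is a small illustrative userspace
probe (not part of the bug report): it lists a directory via readdir() and then
looks up each returned name via stat(), flagging names that readdir reports but
a by-name lookup cannot find -- one direction of the disagreement described above.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* Usage: ./dircheck <directory>
 * Prints any entry that readdir() returns but a by-name stat() cannot find. */
int main(int argc, char **argv)
{
        DIR *d;
        struct dirent *de;
        struct stat st;
        char path[4096];

        if (argc != 2) {
                fprintf(stderr, "usage: %s <directory>\n", argv[0]);
                return 1;
        }
        d = opendir(argv[1]);
        if (!d) {
                perror(argv[1]);
                return 1;
        }
        while ((de = readdir(d)) != NULL) {
                snprintf(path, sizeof(path), "%s/%s", argv[1], de->d_name);
                if (stat(path, &st) != 0)
                        printf("readdir sees '%s' but stat() fails: %s\n",
                               de->d_name, strerror(errno));
        }
        closedir(d);
        return 0;
}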

-- Additional comment from rpeterso on 2006-06-05 10:03 EST --
Sorry, I should have mentioned the following in my previous comment.
Maybe I did and it got truncated by accident due to size.

When I first asked Corey to reproduce the problem for me, he accidentally
hit bz 178057 instead.  (That might indicate it's related to that bz.)
At any rate, the way to recreate the failure is by doing things similar
to that bz, using nfs.  So I doubt it can be recreated with anything else.

The difference between this bz and that one is this: 
With 178057, the NFS clients do not see the files through NFS.  
With this bz, 190756, the NFS server cannot see the files through GFS.

Corey recreated the problem on his second attempt at my request, and I
looked into it.  I determined that the on-disk data was correct;
I verified the hash value in the on-disk dirent.  One of the other nodes
in the cluster could see the files just fine, but the node in question
could not.  At the time I suspected it might be a problem with VFS's
dirent cache getting messed up somehow, which could be the fault of
either NFS or GFS, I suppose.

Then I ran into the locking problem mentioned in comment #1.


-- Additional comment from kanderso on 2006-09-22 15:04 EST --
Devel ACK for 4.5, need to figure this one out.

-- Additional comment from rkenna on 2006-09-22 16:07 EST --
Only 2 'r's in directories.  Couldn't stand looking at this for a whole release
cycle :)

-- Additional comment from rpeterso on 2006-09-25 12:31 EST --
I've been looking at this one again and trying to recreate it
using the nfs_try test case that was originally used.
Therefore I'm reassigning it to myself.


-- Additional comment from rpeterso on 2006-09-27 12:23 EST --
I finally recreated this problem by running the nfs_try test case
using three servers (roth-01,02,03) and four nfs clients 
(trin-13,14,15,16).  This was recreated at RHEL4U4.  Here's what I
observed:

The "incorrect" node was roth-01 and it did not see two files
/mnt/roth0/trin-16/lock1_small and /mnt/roth0/trin-16/rwranbuflarge.
The other two server nodes could see these files correctly.

Once again, I verified that the inodes and directory entries 
were okay in the gfs file system on disk.

The "incorrect" node reported this error from strace ls:
stat("/mnt/roth0/trin-16/lock1_small", 0x5164e8) = -1 ENOENT (No such file or
directory)
lstat("/mnt/roth0/trin-16/lock1_small", 0x5164e8) = -1 ENOENT (No such file or
directory)

An attempt to open the file on the "incorrect" node produced:
open("/mnt/roth0/trin-16/lock1_small", O_WRONLY|O_NONBLOCK|O_CREAT|O_NOCTTY,
0666) = -1 EPERM (Operation not permitted)

I did a gfs_tool lockdump on the fs from all nodes in the cluster
and discovered two glock entries for the "missing" file's inode
that had the same values on every node:

Glock (2, 90145100)
  gl_flags = 
  gl_count = 2
  gl_state = 0
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Inode:
    num = 90145100/90145100
    type = 1
    i_count = 1
    i_flags = 
    vnode = yes

and

Glock (5, 90145100)
  gl_flags = 
  gl_count = 2
  gl_state = 3
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = no
  ail_bufs = no
  Holder
    owner = -1
    gh_state = 3
    gh_flags = 5 7 
    error = 0
    gh_iflags = 1 6 7 

The glocks for the subdirectory in question were also identical
on all nodes:

Glock (2, 90047399)
  gl_flags = 
  gl_count = 3
  gl_state = 3
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 1
  ail_bufs = no
  Inode:
    num = 90047399/90047399
    type = 2
    i_count = 1
    i_flags = 
    vnode = yes

and

Glock (5, 90047399)
  gl_flags = 
  gl_count = 2
  gl_state = 3
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = no
  ail_bufs = no
  Holder
    owner = -1
    gh_state = 3
    gh_flags = 5 7 
    error = 0
    gh_iflags = 1 6 7 

I had earlier thought that this might be locking related, but
I would have expected different glock states if that were the case.
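
For reference when reading the dumps above, and assuming GFS uses the
conventional lock-mode numbering (glock type 2 being the inode glock and
type 5 the iopen glock), the gl_state values would map as follows:

/* Assumed mapping of gl_state to lock modes (not quoted from the bug;
 * based on the usual LM_ST_* convention): */
enum {
        LM_ST_UNLOCKED  = 0,   /* Glock (2, 90145100) above: gl_state = 0 */
        LM_ST_EXCLUSIVE = 1,
        LM_ST_DEFERRED  = 2,
        LM_ST_SHARED    = 3    /* the other three glocks above: gl_state = 3 */
};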

An ls -lad on the directory that contains the "missing" files showed
an odd discrepancy between the nodes.  The link count was zero on the
incorrect node:

roth-01 (incorrect):
drwxr-xr-x    0 root root 2048 Sep 26 09:14 trin-16

roth-02 (correct):
drwxr-xr-x    2 root root 2048 Sep 26 09:17 trin-16

I did stat on the "missing" file from both good and bad nodes:
Incorrect:
[root@roth-01 /mnt/roth0/trin-16]# stat /mnt/roth0/trin-16/lock1_small
stat: cannot stat `/mnt/roth0/trin-16/lock1_small': No such file or directory

Correct:
[root@roth-02 /mnt/roth0/trin-16]# stat /mnt/roth0/trin-16/lock1_small
  File: `/mnt/roth0/trin-16/lock1_small'
  Size: 512000          Blocks: 1008       IO Block: 4096   regular file
Device: fc00h/64512d    Inode: 90145100    Links: 1
Access: (0666/-rw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2006-09-26 09:17:10.000000000 -0500
Modify: 2006-09-26 09:17:16.000000000 -0500
Change: 2006-09-26 09:17:16.000000000 -0500

Next I ran the stat command on the directory from each node.
Incorrect:

[root@roth-01 /mnt/roth0/trin-16]# stat /mnt/roth0/trin-16
  File: `/mnt/roth0/trin-16'
  Size: 2048            Blocks: 24         IO Block: 4096   directory
Device: fc00h/64512d    Inode: 104775835   Links: 0
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2006-09-27 09:21:06.000000000 -0500
Modify: 2006-09-26 09:14:23.000000000 -0500
Change: 2006-09-26 09:14:23.000000000 -0500

Note that the inode 104775835 listed for the directory is wrong.
That inode (0x63ec09b) is a directory inode, but not a directory
off the root.  I suspect it may be a deleted directory, but
I'm still investigating.

The correct information from stat of the directory from a good 
node shows:
[root@roth-02 /mnt/roth0/trin-16]# stat /mnt/roth0/trin-16
  File: `/mnt/roth0/trin-16'
  Size: 2048            Blocks: 24         IO Block: 4096   directory
Device: fc00h/64512d    Inode: 90047399    Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2006-09-27 10:25:39.000000000 -0500
Modify: 2006-09-26 09:17:10.000000000 -0500
Change: 2006-09-26 09:17:10.000000000 -0500

This shows the correct directory inode, 90047399 (0x055e03a7),
as shown on disk.

I suspect I'll have to temporarily add debug code to the gfs lookup
code and recreate this again to figure out what's going on.
It took many runs of the test to recreate this problem.


-- Additional comment from rpeterso on 2007-01-03 13:03 EST --
I recreated this problem again today after trying for many weeks.
This is exactly what I did to recreate it:

1. I've got a three-node NFS server cluster (roth-01/2/3), all 
   serving the same GFS file system, mounted as /mnt/roth0.
2. On the same cluster, there is a virtual IP, xx.xx.xx.250.
   I occasionally move the virtual IP between two of the three
   nfs servers by running the following script on roth-01:

#!/bin/bash
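# Bounce the virtual IP xx.xx.xx.250 between roth-02 and roth-01,
# pausing 20 seconds after each move; the loop runs until killed.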
for ((a=1; a>0; a++))
do
        clusvcadm -r xx.xx.xx.250 -m roth-02
        sleep 20
        clusvcadm -r xx.xx.xx.250 -m roth-01
        sleep 20
done
exit 0

3. I've got a five-node NFS client cluster (trin-12,13,14,15,16), 
   all of which are mounting the xx.xx.xx.250:/mnt/roth0 mount point.
   This mounting happens automatically when I run the nfs_try test
   case from trin-12.  I'm using the following command to run the test:

./nfs_try -R /usr/tests/sts/var/share/resource_files/roth.xml -n nfs1 -C
/usr/tests/sts/var/share/resource_files/roth.xml -S TRIVIAL

The "TRIVIAL" test case boils down to a script that runs a bunch
of the test suite programs for 1 second, but the genesis program is
run with -d 100 to create 100 directories.

When this iteration of the test finished, roth-01 exhibited the 
symptoms and couldn't see anything inside directories /mnt/roth0/trin-12
and /mnt/roth0/trin-14.  This was despite the test case ending
successfully.

A stat on the "incorrect" directory indicated the wrong inode, as before:

[root@roth-01 ~]# stat /mnt/roth0/trin-12
  File: `/mnt/roth0/trin-12'
  Size: 2048            Blocks: 24         IO Block: 4096   directory
Device: fc00h/64512d    Inode: 179012524   Links: 0
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-01-03 10:50:23.000000000 -0600
Modify: 2007-01-03 11:00:50.000000000 -0600
Change: 2007-01-03 11:00:50.000000000 -0600

179012524 in hex is 0xaab83ac, which corresponds to a directory that 
was deleted by the gfs kernel code running on roth-02.  This is output
from some debug code I added to roth-02 for this problem:

GFS mkdir(trin-15) inum=aac9a57 rc=0
GFS rmdir(trin-16) inum=aab83ab rc=0
GFS rmdir(trin-14) inum=aab83a9 rc=0
GFS mkdir(trin-16) inum=aac9a58 rc=0
GFS mkdir(trin-14) inum=aac9a59 rc=0
GFS rmdir(trin-12) inum=aab83ac rc=0
GFS mkdir(trin-12) inum=aac9a5a rc=0

The last rmdir shown is for the directory that roth-01 still thinks
is valid.  The trin-12 directory created immediately thereafter on
roth-02, 0xaac9a5a (179083866 decimal), shows up as the good inode on
the other nodes:

[root@roth-02 ~]# stat /mnt/roth0/trin-12
  File: `/mnt/roth0/trin-12'
  Size: 2048            Blocks: 40         IO Block: 4096   directory
Device: fc00h/64512d    Inode: 179083866   Links: 101
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-01-03 11:00:50.000000000 -0600
Modify: 2007-01-03 11:02:00.000000000 -0600
Change: 2007-01-03 11:02:00.000000000 -0600

So we've definitely got a problem with the GFS kernel rmdir (when done
through NFS?) not being seen by the other nodes when a mkdir is done
afterward with the same name in its place.
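
As an aside (this program is not from the bug report), the stale-dentry
symptom above can be watched for from userspace with a small poller that
reports whenever the inode number or link count it sees for a path changes;
on the "incorrect" node it would show the stale inode number and the zero
link count seen in the stat output above.

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

/* Usage: ./inowatch <path>
 * Polls the path once a second and prints a line whenever the inode number
 * or link count changes (or the path disappears). */
int main(int argc, char **argv)
{
        struct stat prev, cur;
        int have_prev = 0;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <path>\n", argv[0]);
                return 1;
        }
        for (;;) {
                if (stat(argv[1], &cur) != 0) {
                        perror(argv[1]);
                        have_prev = 0;
                } else if (!have_prev || cur.st_ino != prev.st_ino ||
                           cur.st_nlink != prev.st_nlink) {
                        printf("%s: inode=%lu links=%lu\n", argv[1],
                               (unsigned long)cur.st_ino,
                               (unsigned long)cur.st_nlink);
                        prev = cur;
                        have_prev = 1;
                }
                sleep(1);
        }
}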


-- Additional comment from rpeterso on 2007-01-05 17:11 EST --
Adding Wendy Cheng to the cc list because I'd like her to review my
findings and code changes.

I was able to determine through debug code that the dentry
revalidate code was not getting called for the "missing" directory
by VFS.  Since revalidate wasn't getting called, VFS never asked 
GFS if the dentry was still valid.  Instead, VFS thought the dentry
was okay and used the inode value it found for the dentry in memory.

In fact, the inode had been changed on disk by a process on a 
different node (the genesis program from the test case), which had 
done rmdir/mkdir through nfs.  So a simple revalidate would have
corrected the problem.

The first big question in my mind was: how did GFS end up with a
VFS directory entry (dentry) for which revalidate was never called?
The revalidate callback is set whenever a GFS lookup is done
through gfs_lookup, which is pretty much all the time.  To be more
specific, it happens whenever an inode is looked up by name (and
thus the directory path is searched).

The obvious case where lookups don't need to be done by name is in 
ops_export.c, which has the hooks for exporting GFS to NFS.
Because of its nature, NFS can look up inodes by a "cookie" 
identifier rather than a file name.

Inside ops_export.c, there are two places where it can create a
dentry by calling d_alloc_anon.  So my theory is that the missing
revalidate callback was causing the problem.  It also explained 
why the error only showed up when exporting GFS through NFS.  
The problem is, it's a hard theory to test.
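
To make the theory concrete, here is a minimal sketch (against the 2.6-era
export API, and not the committed patch) of the kind of change it calls for:
after ops_export.c obtains an anonymous dentry for an NFS filehandle, attach
the same dentry_operations table that gfs_lookup installs (assumed here to be
called gfs_dops and to contain GFS's d_revalidate callback), so VFS will
revalidate these dentries too.

/* Sketch only -- the function name is illustrative, and gfs_dops is assumed
 * to be GFS's dentry_operations table with its d_revalidate callback. */
static struct dentry *gfs_export_dentry_sketch(struct inode *inode)
{
        struct dentry *dentry;

        dentry = d_alloc_anon(inode);   /* dentry obtained without a by-name lookup */
        if (!dentry) {
                iput(inode);
                return ERR_PTR(-ENOMEM);
        }

        /* The missing step: without this, VFS never calls back into GFS to
         * revalidate, so a cached dentry can keep pointing at an inode that
         * another node has already deleted and replaced. */
        dentry->d_op = &gfs_dops;

        return dentry;
}

The same assignment would be needed at both d_alloc_anon call sites
mentioned above.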

The only way I could think of to test the theory was to make a
code change and see if the problem went away, but that assumed I
had a baseline for the failure in the first place.

So armed with my recreation method documented in comment #8, I
tried to figure out how often I could recreate the failure.  It
turns out I could recreate the problem almost half the time while
I had debugging code in GFS:

Try Result            Time the test ran
=== ================= =================
 1. No failure        1m30s
 2. Failure recreated 8m
 3. No failure        8m
 4. roth-02 kernel panic in NFS (unrelated to GFS)
 5. No failure        
 6. Failure recreated 1m43s
 7. No failure        12m30s
 8. Failure recreated 10m0s
 9. No failure        7m20s
10. Failure recreated 1m39s

The kernel panic I saw in try #4 seemed unrelated to GFS.
So this established a baseline of recreating the failure nearly
50% of the time.

Next, I patched my kernel to set the correct callback from both
locations in ops_export.c.  Then I reran my test thirty more times:

Try Result            Time the test ran
=== ================= =================
 1. No failure        12m
 2. No failure        6m40s
 3. No failure        1m9s
 4. roth-02 kernel panic in NFS (unrelated to GFS)
 5. No failure        6m48s
 6. No failure        9m2s
 7. roth-02 kernel panic in NFS (unrelated to GFS)
 8. No failure        4m12s
 9. No failure        5m4s
10. No failure        5m42s
11. roth-02 kernel panic in NFS (unrelated to GFS)
12. No failure        4m16s
13. No failure        9m15s
14. No failure        4m50s
15. No failure        3m50s
16. No failure        5m7s
17. No failure        4m46s
18. No failure        5m11s
19. No failure        5m46s
20. roth-02 kernel panic in NFS (unrelated to GFS)
21. No failure        7m8s
22. No failure        6m34s
23. No failure        8m0s
24. roth-01 kernel panic in NFS (unrelated to GFS)
25. No failure        5m32s
26. No failure        8m52s
27. roth-02 kernel panic in NFS (unrelated to GFS)
28. No failure        8m29s
29. roth-01 kernel panic in NFS (unrelated to GFS)
30. No failure        6m36s

I got the panic 7 times out of 30, which is 23 percent of the time.
Just for fun, I took the lines of code back out and reran it a
dozen more times without any kind of debug code:

Try Result            Time the test ran
=== ================= =================
 1. No failure        7m20s
 2. No failure        6m14s
 3. No failure        7m49s
 4. No failure        6m20s
 5. No failure        12m3s
 6. Failed on roth-01
 7. No failure
 8. Failed on roth-01
 9. Same kernel panic on roth-02 (without my patch)
10. No failure
11. Failed on roth-01
12. No failure

Conclusions:

1. The same kernel panic in NFS happens with or without my patch,
   so my patch doesn't seem to change that.  I'll be opening another
   bugzilla for that problem against NFS.
2. With my patch, I got no failures other than the unrelated NFS bug.
   The problem seemed to be fixed by the patch.
3. Without my patch and debug code, I still had 3 out of 12 
   attempts fail (25% failure rate).

I'd like Wendy to look over these results and conclusions, and if she
doesn't see a problem, I'll commit it to CVS.


-- Additional comment from rpeterso on 2007-01-05 17:14 EST --
Created an attachment (id=144940)
Proposed patch to fix the problem


-- Additional comment from rpeterso on 2007-01-05 17:37 EST --
I opened bz #221666 regarding the NFS kernel panic above.

Comment 2 Robert Peterson 2007-01-15 16:45:09 UTC
I tried to test a RHEL5 version of the fix, but I ran into issues
unrelated to the patch.  Periodically, when the service_relocate script
moves the virtual IP from node to node, it causes a weird state where
openais goes to 100% cpu utilization.  It can sometimes take 2 full
minutes before a command is registered on the command line.
If I kill all the tests, eventually it straightens itself out and
things go back to normal.

I tried to use gdb to figure out what was happening, but didn't get 
very far.  By the time I was able to enter commands into gdb, the
problem had unstuck itself and gone back to normal.

Steve Dake created a new instrumented version of openais to
debug this problem, and it will need his attention.
Therefore I will open a bugzilla against openais for this new problem.

This makes testing this fix impossible.  The nfs tests that took 2-10
minutes on RHEL4 now never complete because of this problem.


Comment 3 Robert Peterson 2007-01-22 20:56:21 UTC
Created attachment 146227 [details]
RHEL5 patch to fix the problem

Comment 4 Robert Peterson 2007-01-24 14:56:02 UTC
The openais problems I was having earlier in comment #2 mysteriously 
disappeared when I upgraded my kernel, so I closed the bugzilla I had
opened against openais.

I tested this fix as best I could despite the frequent run-ins with
bz 223263.  With the patch installed, the code did not exhibit the
symptoms of the original problem, and I had no kernel failures or
kernel panics, unlike on RHEL4.  So I believe this fix is good and have
committed it to CVS at HEAD and RHEL5.  I'm setting the status to Modified.


Comment 5 Kiersten (Kerri) Anderson 2007-04-23 17:38:15 UTC
Fixing product name. Cluster Suite components were integrated into Enterprise
Linux for version 5.0.

Comment 8 errata-xmlrpc 2007-11-07 20:06:05 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0577.html