Bug 1003546

Summary: NFS + large XFS fs sometimes fails uncached lookups for client when inode number >2^32 on 32-bit computers
Product: Fedora
Component: nfs-utils
Version: 19
Hardware: i686
OS: Linux
Status: CLOSED UPSTREAM
Severity: unspecified
Priority: unspecified
Reporter: Trevor Cordes <trevor>
Assignee: J. Bruce Fields <bfields>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: bfields, dchinner, esandeen, lav, steved, trevor
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-01-09 08:10:10 UTC
Attachments:
  pcap #1 broken
  pcap #2 broken
  pcap #1 fixing
  pcap #4 fixed
  output from ls showing bug hits only high inode #'s
  exportfs: fix lookup of uncached 64-bit inode on 32-bit server
  exportfs: fix by getting inode number from ->getattr
  my slightly modified version of the 2nd patch
  modified again to remove buffer.ino reassignment
  backport of upstream submission
  upstream patch "exportfs: fix 32-bit nfsd handling of 64-bit inode numbers"
  upstream patch "vfs: split out vfs_getattr_nosec"

Description Trevor Cordes 2013-09-02 10:24:01 UTC
Description of problem:
Experienced readers, just skip to shell transcript in Steps To Reproduce.

I have a simple NFSv4 setup with 1 server and 1 client.  Both are F19 with the latest version of everything.  Ever since about the middle of F17's run (I think) I would get this strange behavior on the client.  If the client makes a directory on the server and writes some files to it, everything is fine at first.  Then after a period of time (I haven't been able to pin it down; sometimes it's a few minutes, sometimes maybe more), if the client accesses that same directory (like with ls), it fails with "no such file or directory".

Here's where it gets strange.  If you ls the same directory on the server, of course it appears ok.  Then if you immediately do the client ls again, it succeeds!  Always!  Then you can wait again (secs to mins), repeat on the client and it fails again.

After fiddling around, I also found doing a ls dir/.. on the client will "fix" the problem temporarily and then you can do a ls dir (on the client) without error.

This bug only ever *seems* to affect files/dirs created/altered by the client.  It never seems to hit files/dirs created/altered by the server.

Also, it only *seems* to affect files/dirs *recently* created/altered by the client, not ones done >1-2 days ago, though I can't be sure of that.  I only ever have problems on stuff I was doing that day or the day before.

Note, these aren't remove-delay/race bugs, as I am never removing any of the affected files/dirs on either client or server.  In fact, besides doing the test ls's shown below, I'm not accessing these files at all on the server during any of these tests.

I can't yet verify, but the problem *seems* exacerbated when there is a high stat load on the server via the same client.  For instance, if I do a find /mnt/z -ls | wc on the client it seems to speed up the "breaking" of the ls I am interested in.

I see no errors of relevance in /var/log/messages on either computer.

This all worked on F16 and for a long time on F17 (I think, but it may have been F15 and F16).  Something in F17's updates broke it and it's remained broken on F18 and F19 too.

Some *possibly* relevant bugs:
bug #949069
bug #158293
bug #150759
bug #228801

as per bug #949069 in a few minutes I am going to test with /proc/sys/fs/leases-enable = 0 on the server after a reboot.


Version-Release number of selected component (if applicable):
same on both client and server:
kernel-PAE-3.10.10-200.fc19.i686
nfs-utils-1.2.8-4.0.fc19.i686


How reproducible:
easy; happens to me several times an hour when working with NFS on the client, but I can't yet find a way to reliably trigger it with exact timing


Steps to Reproduce:
client:
>mkdir /mnt/z/a
>echo foo > /mnt/z/a/b
>ls /mnt/z/a
b
comment: ... wait ...
>ls /mnt/z/a
ls: cannot access /mnt/z/a: No such file or directory
comment: !!!!!!!

server:
>ls /z/a
b

client (immediately):
>ls /mnt/z/a
b
comment: !!!!!!!

comment: ... wait ...
>ls /mnt/z/a
ls: cannot access /mnt/z/a: No such file or directory
comment: !!!!!!!

>ls /mnt/z/a/..; ls /mnt/z/a
comment: contents of /mnt/z shown first, then
b
comment: !!!!!!!

(actual mount/directory names changed for brevity)


Actual results:
weird results and failures at each "comment: !!!!!!!" line above.


Expected results:
should just show the directory in any case, as the old kernel/NFSv4 did.


Additional info:
server: cat /etc/exports
/nfs 192.168.100.1(ro,async,no_subtree_check,no_root_squash,insecure,fsid=0)
/nfs/data 192.168.100.1(rw,async,no_subtree_check,no_root_squash,insecure,nohide)

client:
mount |grep /data
192.168.100.2:/data on /mnt/data type nfs4 (rw,nosuid,noatime,nodiratime,vers=4.0,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,port=0,timeo=15,retrans=2,sec=sys,clientaddr=192.168.100.1,local_lock=none,addr=192.168.100.2)

Comment 1 Trevor Cordes 2013-09-02 10:35:05 UTC
Rebooted server with:

fs.leases-enable=0   in /etc/sysctl.conf
(verified with: cat /proc/sys/fs/leases-enable)

No help whatsoever; the bug hit instantly.  This also indicates an interesting feature of the bug: it survives server reboots.  So it must be something really strange happening client-side.  (I mount NFS "hard", so the client just waits for the server to come back up, and never has to remount or even re-open open files.)

Not sure if /etc/sysctl.conf is applied before NFS startup, but this is the only place I could think to put it.

I also saw another ls oddity that I forgot to mention before:

>ll /mnt/z/
drwx------ 3 trevor trevor 25 Sep  1 05:00 ./
drwxrwx--- 3 trevor trevor 30 Sep  1 02:58 ../
d????????? ? ?      ?       ?            ? a/
...

So the client knows that a is in the /mnt/z dir, but it looks like it can't stat it.

Comment 2 J. Bruce Fields 2013-09-03 15:31:15 UTC
Once you get it in this state, it would be interesting to see if you could get a network trace.

So, run something like "tcpdump -w tmp.pcap -i eth0" (replace eth0 with the appropriate network interface), then run your "ll /mnt/z/", then kill tcpdump, then attach tmp.pcap (and/or take a look at it yourself with "wireshark tmp.pcap").

Or do the same while trying some of those other experiments (like running two "ls"'s on the client separated by an "ls" on the server).

It may be that the problem is all in the client caching in which case this may not show much, but more likely there's some error on the wire that it might be interesting to see.

Comment 3 Trevor Cordes 2013-09-04 09:55:03 UTC
I got a simple example (though not quite as simple as above) to trigger the bug while doing tcpdump.

My test dir tree:
#find /data/a -ls
6388232    0 drwxr-xr-x   3 trevor   trevor         47 Sep  4 00:27 /data/a
6388266    4 -rw-rw----   1 trevor   trevor          4 Sep  3 23:15 /data/a/b
8611263021    0 drwxrwx---   2 trevor   trevor         23 Sep  4 00:27 /data/a/aa
8611263022    4 -rw-rw----   1 trevor   trevor          4 Sep  4 00:27 /data/a/aa/bb
6388270    4 -rw-rw----   1 trevor   trevor          4 Sep  3 23:45 /data/a/d

pcaps attached.  All pcaps + ls's are executed on the client (pog).

pcap 1:
#ll /data/a
/bin/ls: cannot access /data/a/aa: No such file or directory
total 12
drwxr-xr-x  3 trevor trevor   47 Sep  4 00:27 ./
drwxr-xr-x 24 root   root   4096 Sep  3 23:14 ../
d?????????  ? ?      ?         ?            ? aa/
-rw-rw----  1 trevor trevor    4 Sep  3 23:15 b
-rw-rw----  1 trevor trevor    4 Sep  3 23:45 d

pcap 2:
#ll /data/a/aa
/bin/ls: cannot access /data/a/aa: No such file or directory

pcap 3: (.. fixes it)
#ll /data/a/aa/..
total 12
drwxr-xr-x  3 trevor trevor   47 Sep  4 00:27 ./
drwxr-xr-x 24 root   root   4096 Sep  3 23:14 ../
drwxrwx---  2 trevor trevor   23 Sep  4 00:27 aa/
-rw-rw----  1 trevor trevor    4 Sep  3 23:15 b
-rw-rw----  1 trevor trevor    4 Sep  3 23:45 d

pcap 4: (now fully fixed, until some time interval passes)
#ll /data/a/
total 12
drwxr-xr-x  3 trevor trevor   47 Sep  4 00:27 ./
drwxr-xr-x 24 root   root   4096 Sep  3 23:14 ../
drwxrwx---  2 trevor trevor   23 Sep  4 00:27 aa/
-rw-rw----  1 trevor trevor    4 Sep  3 23:15 b
-rw-rw----  1 trevor trevor    4 Sep  3 23:45 d

I should note that the server /data partition is very large: an 11T-usable RAID6 array, XFS.  It is also very full: 93% used.

Comment 4 Trevor Cordes 2013-09-04 09:56:01 UTC
Created attachment 793579 [details]
pcap #1 broken

Comment 5 Trevor Cordes 2013-09-04 09:56:29 UTC
Created attachment 793580 [details]
pcap #2 broken

Comment 6 Trevor Cordes 2013-09-04 09:57:01 UTC
Created attachment 793581 [details]
pcap #1 fixing

Comment 7 Trevor Cordes 2013-09-04 09:57:28 UTC
Created attachment 793582 [details]
pcap #4 fixed

Comment 8 Trevor Cordes 2013-09-04 09:58:45 UTC
(In reply to Trevor Cordes from comment #6)
> Created attachment 793581 [details]
> pcap #1 fixing

Sorry, that should read pcap #3 fixing

Comment 9 J. Bruce Fields 2013-09-04 16:24:25 UTC
Looking just at the first one with wireshark--the initial PUTFH in a GETATTR compound reply is returning ENOENT.  That's extremely weird.

I think that could only be returned by exportfs_decode_fh().

Based on the output it seems almost certain that filehandle is for /data/a/aa/, a directory.  (It might be useful to confirm this.  You can do that by collecting traffic for the whole test starting with the mount, and verifying that the filehandle used on the PUTFH that returns ENOENT was returned by an earlier lookup for "aa").

That means the ENOENT came from reconnect_path().

This is consistent with the fact that this can be cleared temporarily by a server-side lookup.  There's some problem with the logic that's meant to reconnect directories to their parents on use of a filehandle for a directory whose parent isn't in the dcache.  The lookup temporarily fixes the problem, then after the dentry falls out of the dcache the problem crops up again.

The "high stat load" you mention would create a lot of new dentries and hence be more likely to push this particular dentry out of the dcache.

If that's the case.... A server-side "echo 2 > /proc/sys/vm/drop_caches" at the right moment might give you a quicker reproducer?  Maybe something like:

  client# mount /data
  client# ll /data/a
  (ok results)
  server# echo 2 >/proc/sys/vm/drop_caches
  client# ll /data/a
  (bad results ?)

Comment 10 Trevor Cordes 2013-09-05 06:23:26 UTC
I'll try to get a capture including the mount shortly (a lot uses this mount so it's hard for me to completely umount it).

For your /proc/sys/vm/drop_caches test, yes you are right, setting it to 2 causes the bug to hit almost right away.

1:17am /data/a#find -ls
6388232    0 drwxrwx---   4 trevor   trevor         60 Sep  5 01:16 .
6388266    4 -rw-rw----   1 trevor   trevor          4 Sep  3 23:15 ./b
8611263021    0 drwxrwx---   2 trevor   trevor         23 Sep  4 00:27 ./aa
8611263022    4 -rw-rw----   1 trevor   trevor          4 Sep  4 00:27 ./aa/bb
6388270    4 -rw-rw----   1 trevor   trevor          4 Sep  3 23:45 ./d
51556539652    0 drwxrwx---   2 trevor   trevor         24 Sep  5 01:17 ./bb
51556539653    4 -rw-r-----   1 trevor   trevor          5 Sep  5 01:17 ./bb/bbb

(server: echo 2 ......)

1:18am /data/a#ll
/bin/ls: cannot access bb: No such file or directory
total 12
drwxrwx---  4 trevor trevor   60 Sep  5 01:16 ./
drwxr-xr-x 24 root   root   4096 Sep  3 23:14 ../
drwxrwx---  2 trevor trevor   23 Sep  4 00:27 aa/
-rw-rw----  1 trevor trevor    4 Sep  3 23:15 b
d?????????  ? ?      ?         ?            ? bb/
-rw-rw----  1 trevor trevor    4 Sep  3 23:45 d

1:18am /data/a#ll
/bin/ls: cannot access aa: No such file or directory
/bin/ls: cannot access bb: No such file or directory
total 12
drwxrwx---  4 trevor trevor   60 Sep  5 01:16 ./
drwxr-xr-x 24 root   root   4096 Sep  3 23:14 ../
d?????????  ? ?      ?         ?            ? aa/
-rw-rw----  1 trevor trevor    4 Sep  3 23:15 b
d?????????  ? ?      ?         ?            ? bb/
-rw-rw----  1 trevor trevor    4 Sep  3 23:45 d

Note how, within 1 min of an ll showing a failed bb, another ll then showed a failed aa as well.  This will make testing a lot easier.

Interesting, it seems you can't set drop_caches back to 0, so I guess I have to reboot after these tests.

Comment 11 Trevor Cordes 2013-09-05 06:26:33 UTC
Hmm, also interesting: I just did a:

ll aa/..

then a:

ll

and all entries, *including the bb one*, are ok (see below).  I would have guessed I'd have to do an ll aa/.. and then an ll bb/.. in order to get a clean ll in /data/a.  I think this fits with your parent dir theory.

1:24am /data/a#ll
total 12
drwxrwx---  4 trevor trevor   60 Sep  5 01:16 ./
drwxr-xr-x 24 root   root   4096 Sep  3 23:14 ../
drwxrwx---  2 trevor trevor   23 Sep  4 00:27 aa/
-rw-rw----  1 trevor trevor    4 Sep  3 23:15 b
drwxrwx---  2 trevor trevor   24 Sep  5 01:17 bb/
-rw-rw----  1 trevor trevor    4 Sep  3 23:45 d

Comment 12 Trevor Cordes 2013-09-05 08:13:30 UTC
This may be a red herring, but I've noticed that in most (all?) of my tests, the directories that disappear have very large inode #'s.

Especially on a test I recently did on another directory where the bug hit like this:

ll -i
/bin/ls: cannot access a: No such file or directory
/bin/ls: cannot access b: No such file or directory
/bin/ls: cannot access c: No such file or directory
total 0
12931758391 drwxrwx--- 6 trevor trevor 94 Sep  4 00:43 ./
     105654 drwx------ 6 trevor trevor 94 Sep  1 00:20 ../
 2148665645 drwxrwx--- 4 trevor trevor 59 Sep  3 05:05 z/
          ? d????????? ? ?      ?       ?            ? a/
          ? d????????? ? ?      ?       ?            ? b/
          ? d????????? ? ?      ?       ?            ? c/

After find'ing on the whole tree on the server to fix it, I can see the inode #'s:

ll -i
total 0
12931758391 drwxrwx--- 6 trevor trevor  94 Sep  4 00:43 ./
     105654 drwx------ 6 trevor trevor  94 Sep  1 00:20 ../
 2148665645 drwxrwx--- 4 trevor trevor  59 Sep  3 05:05 z/
15068757813 drwxrwx--- 3 trevor trevor  31 Sep  4 00:43 a/
40820128817 drwx------ 4 trevor trevor  40 Sep  3 01:05 b/
49438897470 drwxrwx--- 5 trevor trevor 154 Sep  4 00:35 c/

Note how the only one that didn't disappear was under 2^32.

The earlier tests I reported on (/data/a) also had very large inode #'s (>2^32) for the ones that disappear.  Maybe I reached some point in XFS, the kernel, or NFS in terms of capacity (rather than a specific kernel version), where I triggered a rare bug?  There must be some reason I am seeing this and no one else in the world seems to be?  I am also running 32-bit on both computers, which also might be rare these days.

My IUsed on /data is 23244196

Comment 13 J. Bruce Fields 2013-09-05 13:09:15 UTC
Good observation, yes, of the 5 subdirectories in that directory, the three that are causing problems are exactly the ones with inode numbers greater than 2^32.

So we've got some problem on xfs on a 32-bit machine looking up filehandles for directories that have 64-bit inode numbers and that are not already in cache.

Cc'ing a couple xfs developers in case this looks familiar to them.

Comment 14 J. Bruce Fields 2013-09-05 13:27:28 UTC
(In reply to Trevor Cordes from comment #10)
> Interesting, it seems you can't set drop_caches back to 0, so I guess I have
> to reboot after these tests.

No, writing to drop_caches is just a one-time event: certain caches are cleared when you write to it, but everything will return to normal as the cache repopulates.  I didn't realize that reads would show the last value written; that's interesting, but I don't think it means anything.

Comment 15 Eric Sandeen 2013-09-05 16:09:46 UTC
Upstream, xfs does enable > 2^32 inode numbers by default; now you'll get them if your fs is over... 2T or so, IIRC (depending on other fs geometry details).  (Inode numbers in xfs reflect their location on disk.)
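(Rough arithmetic, to put a number on "2T or so": an XFS inode number is essentially the inode's filesystem block number scaled by inodes-per-block, so with 4K blocks and 256-byte inodes (16 per block) the numbers cross 2^32 once inodes are allocated past about 2^28 blocks = 1TB into the filesystem; with 512-byte inodes it's about 2TB.  The exact threshold depends on block size, inode size, and AG geometry.)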

I'm woefully unfamiliar w/ the nfs side of things; do we have any idea where the actual failure is occurring?

Comment 16 Trevor Cordes 2013-09-06 07:22:24 UTC
I just ran another test with the aid of /proc/sys/vm/drop_caches = 2

I managed to make a test dir with some inodes <2^32 by deleting about 100G of files and about 150k inodes.  I guess only by clearing up space earlier in the array (as per Eric's comment) can I obtain small inode #'s.

Anyhow, I think I confirmed my hunch about the inode # bit-count being significant.  My new, larger, test dirs had only one inode <2^32 and it was the ONLY one not to get hit by this bug during the test.  About 20 other dirs, all >2^32, got messed up.  That plus my 2 other tests also showing this behavior should mean it's now pretty conclusive.

See attached ls results.

I must note that I've been at 9 to 10TB used on my 11TB array for a few years.  For sure I've been over 2TB for 5+ years.  Maybe capacity/used is a red herring and this is some bug introduced in a kernel code change.  Certainly I've been getting >2^32 inode #'s for many years before this bug hit.  I'm trying to think what else might have changed.

Comment 17 Trevor Cordes 2013-09-06 07:26:10 UTC
Created attachment 794600 [details]
output from ls showing bug hits only high inode #'s

Comment 18 Trevor Cordes 2013-09-06 07:36:56 UTC
Playing around with it some more

Comment 19 Trevor Cordes 2013-09-06 07:46:50 UTC
Changing summary to reflect new knowledge.

Playing some more, I found that one of the hunches from my comment #1 was wrong.  This bug hits regardless of who (client or server) creates the file/directory.  I had thought it only affected ones created by the client, but I just confirmed the bug hits on files created by the server that have never been looked at by the client until now.

And again, they all show >2^32 inode #'s (and ones under that are fine).  That is definitely where the bug is.  I hope that narrows down the code path.

Looking over some large directories that I was using before this bug hit, it looks like I was not having >2^32 inode #'s assigned at all.  This is even when my fs was at 9 or 10TB used!  There seems to be a zillion congregated around 2.1B, but none above.

Perhaps what happened is either a) a new XFS version became freer with its use of high inodes; or b) I reached some magical point in my XFS at around 10+TB that caused me to get high inodes.  (I guess I could try deleting files to get below 10TB to see if the bug goes away, but this is mostly valuable data here, not garbage.)

In either case, the bug we are seeing could have been there forever, and it's just that no one has ever reached the trigger.  This would be a heck of a bug to try to reproduce on a bare new system without coming up with some large sample fs's!

Comment 20 Eric Sandeen 2013-09-06 19:22:09 UTC
Yes, XFS defaulting to inode64 is relatively new.  We used to have to specify -o inode64 on the mount cmdline to get it.  Without that option, inodes were restricted to low filesystem blocks, which led to other problems (weird allocation & inode starvation) for very large filesystems.  But that's changed recently.

Anyway that's why you see so many inodes congregated at the low area; you were running on an old kernel that didn't allow 64 bit inodes by default, until now.

FWIW, you can now mount with -o inode32 to get the old behavior for newly created inodes, but any existing inodes with numbers > 32 bits will remain; it just won't create new ones.  You could then find inodes > 32 bits, copy them to a new file, and rename over the old one, to force them back under 32 bits.

That's just a workaround, though, for what appears to be a real bug somewhere.

Comment 21 J. Bruce Fields 2013-09-06 19:46:14 UTC
Oh, this is suspicious: to connect a directory we just looked up to its parent, we use fs/exportfs/expfs.c:get_name().  (Actually filesystems can define their own, but xfs uses that default.)

That default get_name() works by just scanning the parent directory for an entry with a matching inode number.  The context it passes through to the scan defines the inode number as an "unsigned long"....
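For reference, the relevant bits of fs/exportfs/expfs.c look roughly like this (paraphrased from the 3.11-era code, not an exact quote; get_name() fills in buffer.ino from child->d_inode->i_ino and then reads the directory with filldir_one as the callback):

        struct getdents_callback {
                struct dir_context ctx;
                char *name;             /* name that was found */
                unsigned long ino;      /* the inum we are looking for */
                int found;              /* inode matched? */
                int sequence;           /* sequence counter */
        };

        static int filldir_one(void *__buf, const char *name, int len,
                                loff_t pos, u64 ino, unsigned int d_type)
        {
                struct getdents_callback *buf = __buf;

                buf->sequence++;
                if (buf->ino == ino && len <= NAME_MAX) {
                        memcpy(buf->name, name, len);
                        buf->name[len] = '\0';
                        buf->found = 1;
                        return -1;      /* match found, stop iterating */
                }
                return 0;
        }

On a 32-bit kernel "unsigned long" is 32 bits (and inode->i_ino is an unsigned long too), so the inode number we search for is truncated before the scan even starts, while readdir hands filldir_one the full 64-bit number.  For example, 8611263021 (the /data/a/aa directory above) truncates to 8611263021 mod 2^32 = 21328429, so the comparison never matches and get_name() fails with -ENOENT.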

Comment 22 J. Bruce Fields 2013-09-06 20:36:46 UTC
Created attachment 794955 [details]
exportfs: fix lookup of uncached 64-bit inode on 32-bit server

I don't think this patch is really the right one, but it should work in practice.  Could you check whether the problem is still reproducible after applying this to your server's kernel?

Comment 23 Trevor Cordes 2013-09-08 10:29:08 UTC
Comment #22's patch seems to fix the problem.  Sorry for the delay, but it's been years since I compiled a kernel so I had to brush up a bit.

I rebooted into the patched kernel and did the echo 2 trick and waited about 15 mins and retried a few times and the find and ll tests that previously failed all worked ok now.  I'll report back if the bug hits, but it appears to be mitigated.

As you say in the patch comments, this kludge probably isn't the proper long-term solution.  Correct me if I'm wrong, but aren't we opening the door to false "hits" on inodes that have any number equal to the sought inode modulo 2^32?  Instead of a benign problem, would this not be disastrous as you would be ll'ing and possibly modifying an entirely wrong directory or file?

I assume the patched file is NFS code and not XFS code, so at least fixing this here in one place should handle the case for all underlying FS's.  Since I'm not a kernel hacker, I'm only stabbing in the dark here.  I leave it to the boffins to properly decipher.  (Thanks!)

Comment 24 J. Bruce Fields 2013-09-09 14:36:34 UTC
(In reply to Trevor Cordes from comment #23)
> Comment #22's patch seems to fix the problem.  Sorry for the delay, but it's
> been years since I compiled a kernel so I had to brush up a bit.

Thanks!  Hopefully having done it once, a second time will be easier....

> I rebooted into the patched kernel and did the echo 2 trick and waited about
> 15 mins and retried a few times and the find and ll tests that previously
> failed all worked ok now.  I'll report back if the bug hits, but it appears
> to be mitigated.
> 
> As you say in the patch comments, this kludge probably isn't the proper
> long-term solution.

Right.  I think what we should do is add a ->getattr() call instead of using the (unsigned long) inode->i_ino field as the target of our search; getattr knows how to return the full 64-bit inode.  I'll try to get a new patch out soon.

> Correct me if I'm wrong, but aren't we opening the door
> to false "hits" on inodes that have any number equal to the sought inode
> modulo 2^32?  Instead of a benign problem, would this not be disastrous as
> you would be ll'ing and possibly modifying an entirely wrong directory or
> file?

No, once the code finds a matching inode it actually tries a lookup of the name it found and checks to see whether the lookup got the right thing.  So you'd just get another spurious -ENOENT.

And I suspect the chances of one directory having inode numbers that are equal mod 2^32 are for practical purposes zero.

That said I think we should still fix this right on the chance some odd workload or inode allocation algorithm would prove me wrong.
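(For the curious, the safety check works conceptually like this -- a heavily simplified sketch of the idea, not the exact reconnect code:

        err = exportfs_get_name(mnt, parent, nbuf, dentry);  /* scan parent by inode number */
        if (err)
                return err;                     /* nothing matched: -ENOENT */
        tmp = lookup_one_len(nbuf, parent, strlen(nbuf));     /* look the found name back up */
        if (IS_ERR(tmp))
                return PTR_ERR(tmp);
        if (tmp != dentry) {                    /* the name belonged to some other inode */
                dput(tmp);
                return -ENOENT;                 /* spurious failure, never the wrong object */
        }

so a mod-2^32 collision can only produce another spurious -ENOENT, not a silently wrong dentry.)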

Comment 25 Eric Sandeen 2013-09-09 16:24:29 UTC
(In reply to J. Bruce Fields from comment #24)

...

> And I suspect the chances of one directory having inode numbers that are
> equal mod 2^32 are for practical purposes zero.

I'm not quite certain about that; at least for xfs, inode numbers aren't really random; they are reflective of the inode location on disk, and those inode locations are at least somewhat "regular".  It's probably still rare but not quite as rare as you've hinted at </handwave>.
 
> That said I think we should still fix this right on the chance some odd
> workload or inode allocation algorithm would prove me wrong.

*nod* - I think we need our own xfs_get_name() to fix it properly.

(or a 64-bit vfs i_ino!)

-Eric

Comment 26 J. Bruce Fields 2013-09-10 16:47:55 UTC
Created attachment 796070 [details]
exportfs: fix by getting inode number from ->getattr

Here's an attempt at a correct fix.  Would it be possible to try this one as well?

Comment 27 Trevor Cordes 2013-09-13 13:06:19 UTC
This new patch also seems to work ok and mitigate the bug.  I did the echo 2 test again and tried to reproduce the bug for 15 minutes and could not.  I will report back if the bug hits again, but I doubt it will.

Note, I hope it doesn't matter, but I'm working off the 3.10.10-200.fc19.i686.PAE kernel source rpms and had to modify the new patch's chunk 2 to make it apply, as the code below the change seems to have changed in vanilla as well.  If I instead need to be working off rawhide or something, let me know which one and where to get the srpm.  Thanks!

Comment 28 Trevor Cordes 2013-09-14 10:36:34 UTC
Oops!  The bug did just hit again!  I left it overnight (no problems yesterday) and just now one of the directories I normally work with was back to the same old symptoms.  Perhaps the echo 2 isn't enough to *always* trigger this, only enough to *usually* trigger it.

I confirmed, this is indeed with your 2nd patch in place (unless I did something wrong):
Linux piles 3.10.10-200.bz1003546.fc19.i686.PAE #1 SMP Fri Sep 13 05:48:13 CDT 2013 i686 i686 i386 GNU/Linux

It may be that I simply didn't wait long enough with your patch #1 for it to trigger, as I only ran it for about 1 hour.  So perhaps let's not draw any conclusions about patch #1 unless I test it again for a longer period, or we find a way to always trigger this bug on demand.

Sorry!  I thought we really had it solved there.

Comment 29 Trevor Cordes 2013-09-14 10:39:58 UTC
I found a way to, I think, reliably trigger the bug on demand.  Do the echo 2 thing, and then do a "find /bighugedir -ls |wc" on the server for about 1 min, which for sure uses up all the cache and causes the dirs I will test on the client to flush.

Then on the client simply do the normal finds and ls's on things not in /bighugedir.  Voila, always seems to trigger the bug on *every* dir I test.

Comment 30 J. Bruce Fields 2013-09-16 15:31:46 UTC
(In reply to Trevor Cordes from comment #29)
> I found a way to, I think, reliably trigger the bug on demand.  Do the echo
> 2 thing, and then do a "find /bighugedir -ls |wc" on the server for about 1
> min, which for sure uses up all the cache and causes the dirs I will test on
> the client to flush.
> 
> Then on the client simply do the normal finds and ls's on things not in
> /bighugedir.  Voila, always seems to trigger the bug on *every* dir I test.

I'm assuming it's still only a problem for directories with high inode numbers?

Could you attach the patch as you applied it?  (Since you said there was some fix required to get it applied to your kernel, I'd just like to make sure the fix was right.)

Comment 31 Trevor Cordes 2013-09-17 16:18:59 UTC
Created attachment 798881 [details]
my slightly modified version of the 2nd patch

Comment 32 Trevor Cordes 2013-09-17 16:24:25 UTC
I just changed the context beneath chunk 2 of the patch so it would apply to 3.10.10-200

Yes, the problem is still only on high inode numbers (confirmed).

Comment 33 J. Bruce Fields 2013-09-19 19:11:22 UTC
OK, yes, I see the problem: look a few lines down below the final chunk and you'll see that buffer.ino is set again, overriding the value we just set.  You just need to remove that second assignment to buffer.ino and I suspect it will work.

Anyway, I'm certain we understand the cause of the bug now.  There's still some question what the best fix might be--I need to think about this a little more, then send a patch upstream.  But I think more testing's not necessary at this time.

Thanks for all the excellent bug reporting and testing so far.

Comment 34 Trevor Cordes 2013-09-21 07:58:07 UTC
I took out the 2nd buffer.ino assignment and recompiled.  Tests were going ok, but then something weird/bad happened... now, for the first time ever that I know of, I had it do weird things on a < 2^32 inode.

I had rebooted the file server, and after all the tests were ok (with echo 2 and a big find as per comment #29) I was just doing normal usage.  In a terminal that was open from before the server reboot and had its cwd on the exported fs, that cwd disappeared and I was unable to do anything in it.  I moved up one level and ll'd and it was foobar, as we normally see:

          ? d?????????  ? ?      ?              ?            ? Z/

So I fix it by ll'ing the same thing on the server.  Then I ll -i:

    6027513 drwx------  2 trevor trevor     28672 Sep  9 01:20 Z/

In fact, the parent dir that contains Z had a > 2^32 inode dir in it and it listed perfectly fine at the moment Z was failing!

So either this updated patch breaks something new, or there's some weird issue I haven't seen before because the terminal had a cwd in a since-rebooted exported fs.  But I seriously doubt the latter is the case since I'm mounting hard and the client should always block until the server is back up, without messing up anything on the client.

I'll see what happens when I next am at this computer and report back.  If even low inodes are made wonky I'll be forced to switch back to an unpatched kernel I guess.

Comment 35 Trevor Cordes 2013-09-27 11:11:05 UTC
I had to switch back to the stock kernel pretty much right away as the wonky inode bug was really messing me up now that it was hitting small inodes.  Looks like the last patch makes things worse.

I'll attach the modified patch I compiled with following this.

Comment 36 Trevor Cordes 2013-09-27 11:12:32 UTC
Created attachment 803897 [details]
modified again to remove buffer.ino reassignment

Comment 37 J. Bruce Fields 2013-10-02 15:47:35 UTC
Upstream discussion at http://thread.gmane.org/gmane.linux.file-systems/78140

Comment 38 Trevor Cordes 2013-11-13 09:12:27 UTC
Any progress on this bug?  It's starting to become a bit of a bigger problem for me as I save more and more stuff to this fs.  I now run every 15s on the server a "find /big/inode/dirs /times/a/hundred" to try to keep the symptom from showing up on the client.  But my list of directories is getting larger.

I would like to run a backup (via nfs) of that fs so I was wondering, do you think this bug is present only on directories, or also on files?  I ask because I can do a find -type d on all the high-inode dirs a lot more easily than I can on dirs+files.  I'm getting the impression this is mainly a bug with directory entries.  I can't really think of a test to see if it will screw up files also (which would devastate my incremental backups).

I'm also thinking of switching to 64-bit (the hw is a P-D and capable) to just be done with it, but then I cannot easily test this bug any further.  (For uniformity I have reasons to stick with 32-bit, but can abandon it on this one box if required.)  Since I would like to help test this bug, I'm reluctant to switch, as I do believe it would be quite hard for someone to create a test system for this.

If I better understood the bug, perhaps I could have a go, but the code I've looked at so far and the mailing list discussion makes it seem out of my league.

Thanks!

Comment 39 J. Bruce Fields 2013-11-13 14:44:03 UTC
After further discussion upstream we settled on the solution of replacing the direct reference to i_ino with a getattr call in the common get_name helper:

http://marc.info/?i=20131106151003.GA21425@ZenIV.linux.org.uk%3E

That is now upstream as commit 950ee9566a5b6cc "exportfs: fix 32-bit nfsd handling of 64-bit inode numbers", so it will be included in 3.13.
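The heart of that change, roughly (paraphrased from memory; the upstream commit is the authoritative version):

        struct getdents_callback {
                ...
                u64 ino;        /* was "unsigned long ino" -- the inum we are looking for */
                ...
        };

        /* in get_name(), instead of buffer.ino = child->d_inode->i_ino: */
        struct kstat stat;
        struct path child_path = {
                .mnt = path->mnt,
                .dentry = child,
        };

        error = vfs_getattr_nosec(&child_path, &stat);
        if (error)
                return error;
        buffer.ino = stat.ino;          /* full 64-bit inode number from ->getattr */

so the directory scan in filldir_one now compares full 64-bit values even on a 32-bit server.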

I'll see if we can also get that into stable and that should get it included in Fedora a little sooner.

And, yes, I believe this affects only directories.

Thanks for your reporting and testing, both have been helpful.

Comment 40 Trevor Cordes 2013-11-19 12:02:08 UTC
Is it worth it me grabbing the patches and compiling and testing?  Or does the bug fix seem to be theoretically foolproof?

I ask because the previous patch attempt started doing really strange things on my server which had me sweating some bullets.  Though it's a bit of a pain for me to compile, deploy and test, I can do it.

Thanks for everything!

Comment 41 J. Bruce Fields 2013-11-19 15:40:42 UTC
(In reply to Trevor Cordes from comment #40)
> Is it worth it me grabbing the patches and compiling and testing?  Or does
> the bug fix seem to be theoretically foolproof?
> 
> I ask because the previous patch attempt started doing really strange things
> on my server which had me sweating some bullets.  Though it's a bit of a
> pain for me to compile, deploy and test, I can do it.

I'm a bit confused by comments above.

Once you removed the buffer.ino reassignment from your backport, was the problem fixed for you?

Comment 42 J. Bruce Fields 2013-11-19 20:20:14 UTC
(In reply to J. Bruce Fields from comment #41)
> Once you removed the buffer.ino reassignment from your backport, was the
> problem fixed for you?

Sorry, I somehow overlooked the description of the problem in comment #34.

There is one important difference between the patch you tested and the one that went upstream: the patch you tested calls vfs_getattr(), which does a security_inode_getattr(path->mnt, path->dentry) before calling ->getattr.

I believe that the only reason that would cause a problem would be if selinux was erroring out.

If you can retest with the new version upstream (commits b7a6ec52dd4eced4a9bcda9ca85b3c8af84d3c90 and 950ee9566a5b6cc45d15f5fe044bab4f1e8b62cb, or I can attach a backport), that would be helpful.

However it might be enough to retest with the kernel you tested before, but with selinux disabled.  Let me look into that....
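For reference, the "vfs: split out vfs_getattr_nosec" patch makes the split look roughly like this (paraphrased; vfs_getattr keeps the security hook, vfs_getattr_nosec skips it):

        int vfs_getattr_nosec(struct path *path, struct kstat *stat)
        {
                struct inode *inode = path->dentry->d_inode;

                if (inode->i_op->getattr)
                        return inode->i_op->getattr(path->mnt, path->dentry, stat);

                generic_fillattr(inode, stat);
                return 0;
        }

        int vfs_getattr(struct path *path, struct kstat *stat)
        {
                int retval;

                retval = security_inode_getattr(path->mnt, path->dentry);
                if (retval)
                        return retval;
                return vfs_getattr_nosec(path, stat);
        }

The upstream get_name() fix uses the _nosec variant, the idea being that this internal directory scan shouldn't fail just because a security module objects to the getattr.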

Comment 43 J. Bruce Fields 2013-11-19 20:39:32 UTC
Created attachment 826289 [details]
backport of upstream submission

I can't tell whether putting selinux in permissive mode would be sufficient to work around the problem here; you might need to actually boot with selinux disabled (by adding selinux=0 to the kernel command line).

If one of those two things fixes the problem, then we're good.

If not, then there may be some further problem that we don't understand, and it might be worth testing the patch as it was actually applied upstream (backport attached; basically just the two commits I mentioned in the previous comment, with the same buffer.ino fix that was needed for the previous patch you tested).

Comment 44 Trevor Cordes 2013-11-21 12:28:42 UTC
My selinux (to the best of my knowledge) is and always has been disabled on the server and client.

#sestatus 
SELinux status:                 disabled
#getenforce 
Disabled

However, I do not boot with the kernel command-line option "selinux=0".  Do I need to do that?  I'm always so frustrated and confused with selinux, just trying to figure out what mode it is in and whether that means it's really off.

Do I need to boot with selinux=0 to make sure?

On the server, client or both?

I don't mind testing again with the new attachment 826289 [details], if I can get it added as a patch to my kernel rpmbuild setup.  However, I'd like to know the above first, to be on the safe side.

Also, messing around with this stuff, what is the worst (likely) case scenario in terms of damage?  If it can foobar my entire 11TB fs, that would be very bad.  If it just confuses the client, that's still bad, but a little better.

The main issue I hit during my testing was that low inodes were starting to show the bug, which a) never happened before or since when using the stock kernel, and b) screwed up a lot more of my day to day programs which are using the low inode dirs almost exclusively.  Granted, I didn't test at length or triply-verify, because as soon as I saw it affecting low inodes I got cold feet and rebooted to stock.

Comment 45 J. Bruce Fields 2013-11-21 15:31:27 UTC
(In reply to Trevor Cordes from comment #44)
> My selinux (to the best of my knowledge) is and always has been disabled on
> the server and client.
> 
> #sestatus 
> SELinux status:                 disabled
> #getenforce 
> Disabled
> 
> However, I do not boot with linux command "selinux=0".  Do I need to do
> that?

From my quick look at the kernel code I'm not completely sure.  It might be that the result is the same, but since booting to your patched kernel with selinux=0 is probably easy enough to do (compared to building a new kernel, anyway), that would be worth trying.

> On the server, client or both?

Just on the server.

> I don't mind testing again, the new attachment 826289 [details] if I can get
> it to add as a patch to my kernel rpmbuild setup.  However, I'd like to know
> the above first to be on the safe side.
> 
> Also, messing around with this stuff, what is the worst (likely) case
> scenario in terms of damage?  If it can foobar my entire 11TB fs, that would
> be very bad.  If it just confuses the client, that's still bad, but a little
> better.

It's just returning errors on the client's attempt to use those files.  Nothing's 100%, but I have a hard time seeing how this would corrupt the filesystem.  (So worst case might be that the unexpected error causes your application itself to do something stupid to your data.)

> The main issue I hit during my testing was that low inodes were starting to
> show the bug, which a) never happened before or since when using the stock
> kernel, and b) screwed up a lot more of my day to day programs which are
> using the low inode dirs almost exclusively.  Granted, I didn't test at
> length or triply-verify, because as soon as I saw it affecting low inodes I
> got cold feet and rebooted to stock.

Right, understood.  If it is a security module problem then it would indeed affect low inodes too.

(Also: I'm assuming you're not using yama, or tomoyo, or smack, or some other odd security module.)

Comment 46 Trevor Cordes 2013-11-28 22:22:40 UTC
I'm having trouble applying the latest patch against linux-3.11.9-201.fc19.  Hunks 2 & 4 are failing:
--- fs/exportfs/expfs.c
+++ fs/exportfs/expfs.c
@@ -255,6 +255,11 @@
        int error;
        struct file *file;
        struct getdents_callback buffer;
+	struct kstat stat;
+	struct path child_path = {
+               .mnt = path->mnt,
+               .dentry = child,
+	};

	error = -ENOTDIR;
        if (!dir || !S_ISDIR(dir->i_mode))
@@ -285,7 +300,6 @@
                goto out_close;

        buffer.name = name;
-	buffer.ino = child->d_inode->i_ino;
        buffer.found = 0;
        buffer.sequence = 0;
        while (1) {


Looking at the source, it looks like it has changed (or is not changed enough) for the patch to apply.  What kernel version is the patch based on?  Perhaps I should use an older or newer source base.  I'm using the latest .src.rpm as provided in F19's repos, to allow a simple SPEC edit and an rpmbuild.  If I'm approaching this the wrong way, let me know.

The relevant areas in the source file I have are:

        struct file *file;
        struct getdents_callback buffer = {
               	.ctx.actor = filldir_one,
                .name = name,
                .ino = child->d_inode->i_ino
        };

        error = -ENOTDIR;

[...]

                goto out_close;

        buffer.sequence = 0;
        while (1) {

Comment 47 J. Bruce Fields 2013-12-04 22:51:55 UTC
Created attachment 832896 [details]
upstream patch "exportfs: fix 32-bit nfsd handling of 64-bit inode numbers"

Upstream's version of the patch (attached) should apply fine to 3.11.9.

Comment 48 J. Bruce Fields 2013-12-04 23:00:18 UTC
Created attachment 832897 [details]
upstream patch "vfs: split out vfs_getattr_nosec"

But, I forgot, you'll also need to apply this patch first.

Comment 49 Trevor Cordes 2013-12-18 06:51:37 UTC
I applied the new patches to 3.11.10 and compiled with the stock Fedora src.rpm.  I've been running the patched kernel on my file server for 2 days now and so far so good.  No problems with large inodes.  No problems with small ones... so far.  I would say this set of patches is working much better than the previous one.

I have not made any changes to my selinux setup (it's still disabled but possibly present), nor am I using any other MAC/security system.

I will continue to use the patched kernel and report back if I see problems.  But for now it's looking like it's fixed.

A big thank you for all the hard work!!

Comment 50 J. Bruce Fields 2013-12-18 19:42:02 UTC
Thanks for the testing!  Patch has also been submitted for stable now.

Comment 51 Trevor Cordes 2015-01-09 08:10:10 UTC
I have not seen this bug since my last report, and have used several stock Fedora kernels since.  Thanks for all your prompt and very useful help.  We squashed that one good.  Closing.