Bug 1132610

Summary: nfsd does not release free space of a file created with dd oflag=direct where there was no space left on device even after manual deletion
Product: Red Hat Enterprise Linux 7
Reporter: Harshula Jayasuriya <harshula>
Component: kernel
Assignee: J. Bruce Fields <bfields>
kernel sub component: NFS
QA Contact: JianHong Yin <jiyin>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: agaikwad, bcodding, bfields, dwysocha, fsorenso, harshula, jaeshin, jlayton, moagrawa, mthacker, plambri, rhandlin, rkothiya, sfroemer, steved, swhiteho, xzhou, yoyang
Version: 7.0
Keywords: Reproducer
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kernel-3.10.0-838.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1562716 (view as bug list)
Environment:
Last Closed: 2018-04-10 19:16:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1203710, 1205790, 1298243, 1385242, 1420851, 1469559, 1562716    

Description Harshula Jayasuriya 2014-08-21 16:22:33 UTC
Description of problem:

* For convenience, the same RHEL 7 system is used as both the NFS server and the client. To reproduce the problem, use direct I/O and create a file larger than the available free space on the NFS export. Then delete the file and observe that the free space is not reclaimed.

Version-Release number of selected component (if applicable):

* kernel-3.10.0-123.6.3.el7.x86_64
* nfs-utils-1.3.0-0.el7.x86_64

Steps to Reproduce:

# systemctl start nfs-server.service

# mount -t nfs -o vers=4 NFS-SERVER:/mnt2/export /mnt/test

# dd if=/dev/zero of=/mnt/test/zeroes bs=1M oflag=direct
dd: error writing ‘/mnt/test/zeroes’: No space left on device

# df
Filesystem                                 1K-blocks    Used Available Use% Mounted on
...
/dev/sda1                                     508588  508568        20 100% /mnt2
NFS-SERVER:/mnt2/export                       508672  508544       128 100% /mnt/test

# rm /mnt/test/zeroes

# df
Filesystem                                 1K-blocks    Used Available Use% Mounted on
...
/dev/sda1                                     508588  508568        20 100% /mnt2
NFS-SERVER:/mnt2/export                       508672  508544       128 100% /mnt/test

# ls /mnt/test/zeroes
ls: cannot access /mnt/test/zeroes: No such file or directory

# ls -a  /mnt/test/
.  ..  .nfs000000000001812400000002

# rm /mnt/test/.nfs000000000001812400000002 
rm: cannot remove ‘/mnt/test/.nfs000000000001812400000002’: Device or resource busy

# umount /mnt/test/

# df
Filesystem                        1K-blocks    Used Available Use% Mounted on
...
/dev/sda1                            508588  508568        20 100% /mnt2

# rm /mnt2/export/.nfs000000000001812400000002 

# df
Filesystem                        1K-blocks    Used Available Use% Mounted on
...
/dev/sda1                            508588  508568        20 100% /mnt2


# systemctl stop nfs-server.service

# df
Filesystem                        1K-blocks    Used Available Use% Mounted on
...
/dev/sda1                            508588  201700    306888  40% /mnt2

Comment 5 Harshula Jayasuriya 2014-11-18 05:17:05 UTC
Hi Bruce, Do you want me to work on this?

Comment 6 J. Bruce Fields 2014-11-18 14:22:32 UTC
(In reply to Harshula Jayasuriya from comment #5)
> Hi Bruce, Do you want me to work on this?

That would be fantastic if you could.  I'll continue watching it.

Comment 15 J. Bruce Fields 2015-07-21 15:39:37 UTC
I've had some luck reproducing this between recent upstream client and server.  In the process I also noticed what appears to be a separate leak on the server side.

I suspect that server leak is the same one we ran across independently in bug 1236736, fixed by the patch https://bugzilla.redhat.com/attachment.cgi?id=1050149&action=diff

Therefore any further investigation should probably be done using kernels with that patch applied, to avoid being confused by that server bug.

Comment 17 J. Bruce Fields 2015-09-03 19:35:25 UTC
I've confirmed that I can reproduce this upstream with client and server running v4.2-rc3, which includes Al's 75a6f82a0d10 "freeing unlinked file indefinitely delayed".  (Actually, on a kernel including my patches for 4.3).

The reproducer isn't 100% reliable.  It's more reliable the more data written (equivalently, the more space free to write in).  I think I've always reproduced when the filesystem has a gig free.  But I'm not always able to reproduce when there's less free space.

Examination of a network trace shows a WRITE reply with NFS4ERR_NOSPC.

Comment 18 Harshula Jayasuriya 2015-09-04 09:07:35 UTC
NOTES:

* I've had difficulty reproducing this on RHEL 6 & 7 recently.
* I am able to reproduce it on Fedora 22, kernel 4.1.6-200.fc22.x86_64 (which includes the patch 75a6f82a0d10).
* The export that I use only has about 330 MB of free space and is on a separate partition from the root filesystem.
* On my new test VM (Fedora 22), NFSv3 and NFSv4 behaviour differs in a very subtle way. With an NFSv3 mount, one does NOT have to stop the NFS service before the space on the export is recovered, whereas with NFSv4 the space is only recovered after stopping the NFS service.

NFSv3:
------------------------------------------------------------
...
[root@nfsclient ~]# umount /mnt/test
[root@nfsclient ~]# df
Filesystem                          1K-blocks     Used Available Use% Mounted on
devtmpfs                              1013820        0   1013820   0% /dev
tmpfs                                 1024596        0   1024596   0% /dev/shm
tmpfs                                 1024596      824   1023772   1% /run
tmpfs                                 1024596        0   1024596   0% /sys/fs/cgroup
/dev/mapper/fedora_nfsclient-root    18762184 10342564   7530392  58% /
tmpfs                                 1024596        4   1024592   1% /tmp
/dev/vda1                              487652   457956         0 100% /boot
tmpfs                                  204920        0    204920   0% /run/user/0

[root@nfsclient ~]# rm /boot/export/.nfs000000000000ff8f00000005 
rm: remove regular file ‘/boot/export/.nfs000000000000ff8f00000005’? y

[root@nfsclient ~]# df
Filesystem                          1K-blocks     Used Available Use% Mounted on
devtmpfs                              1013820        0   1013820   0% /dev
tmpfs                                 1024596        0   1024596   0% /dev/shm
tmpfs                                 1024596      824   1023772   1% /run
tmpfs                                 1024596        0   1024596   0% /sys/fs/cgroup
/dev/mapper/fedora_nfsclient-root    18762184 10342564   7530392  58% /
tmpfs                                 1024596        4   1024592   1% /tmp
/dev/vda1                              487652   135688    322268  30% /boot
tmpfs                                  204920        0    204920   0% /run/user/0

[root@nfsclient ~]# systemctl stop nfs
------------------------------------------------------------

NFSv4
------------------------------------------------------------
[root@nfsclient ~]# rm /boot/export/.nfs000000000000ff8f00000004 
rm: remove regular file ‘/boot/export/.nfs000000000000ff8f00000004’? y

[root@nfsclient ~]# df
Filesystem                          1K-blocks     Used Available Use% Mounted on
devtmpfs                              1013820        0   1013820   0% /dev
tmpfs                                 1024596        0   1024596   0% /dev/shm
tmpfs                                 1024596      824   1023772   1% /run
tmpfs                                 1024596        0   1024596   0% /sys/fs/cgroup
/dev/mapper/fedora_nfsclient-root  18762184 10342556   7530400  58% /
tmpfs                                 1024596        4   1024592   1% /tmp
/dev/vda1                              487652   457956         0 100% /boot
tmpfs                                  204920        0    204920   0% /run/user/0

[root@nfsclient ~]# systemctl stop nfs

[root@nfsclient ~]# df
Filesystem                          1K-blocks     Used Available Use% Mounted on
devtmpfs                              1013820        0   1013820   0% /dev
tmpfs                                 1024596        0   1024596   0% /dev/shm
tmpfs                                 1024596      824   1023772   1% /run
tmpfs                                 1024596        0   1024596   0% /sys/fs/cgroup
/dev/mapper/fedora_nfsclient-root  18762184 10342556   7530400  58% /
tmpfs                                 1024596        4   1024592   1% /tmp
/dev/vda1                              487652   135688    322268  30% /boot
tmpfs                                  204920        0    204920   0% /run/user/0
------------------------------------------------------------

Comment 25 J. Bruce Fields 2016-01-14 17:17:58 UTC
Since nfs_direct_req holds a reference on open and lock contexts I was curious whether this could be explained by a leak of nfs_direct_req.

Those are allocated from their own slab, so you can test this by watching the first number in the output of "grep nfs_direct /proc/slabinfo"; I found that it dropped back to zero after reproducing the problem, so I don't think there's an nfs_direct_req leak.
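
For reference, here is why a leaked nfs_direct_req would pin the file, and why slabinfo is a good probe. This is a condensed sketch of the release path (based on fs/nfs/direct.c of that era, not the verbatim source):

static void nfs_direct_req_free(struct kref *kref)
{
        struct nfs_direct_req *dreq =
                container_of(kref, struct nfs_direct_req, kref);

        /* These puts are exactly what a leaked dreq would withhold. */
        if (dreq->l_ctx != NULL)
                nfs_put_lock_context(dreq->l_ctx);
        if (dreq->ctx != NULL)
                put_nfs_open_context(dreq->ctx);
        /* nfs_direct_req has its own slab (nfs_direct_cachep), which is
         * why the first column of /proc/slabinfo exposes a leak. */
        kmem_cache_free(nfs_direct_cachep, dreq);
}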

Comment 26 J. Bruce Fields 2016-01-19 17:59:35 UTC
(In reply to Harshula Jayasuriya from comment #18)
> * On my new test VM, Fedora 22, NFS v3 and v4 behaviour has a very subtle
> difference. With an NFS v3 mount, one does NOT have stop the NFS service
> before the space on the export is recovered. Whereas, with NFS v4, the space
> is only recovered after stopping the NFS service.

My working hypothesis for this difference, by the way, is that whatever reference leak we have here that prevents unlinking the sillyrenamed .nfs file also (in the v4 case) prevents the client from sending a CLOSE when it should.  That leaves the server holding a reference on the unlinked file (which persists even if you unlink the .nfs file).  In the v4 case, even after I umount I still see RENEW calls from the client, and df still shows the space as used.  The space is finally deallocated after rebooting the client.

Comment 27 Harshula Jayasuriya 2016-01-26 23:12:57 UTC
Hi Bruce,

Looks like nfs_commitdata_release() is not being called when this bug is triggered. That would explain the reference count not dropping back to zero.

void nfs_commitdata_release(struct nfs_commit_data *data)
{
        put_nfs_open_context(data->context);    /* drops the open-context reference */
        nfs_commit_free(data);
}

nfs_commitdata_release
- put_nfs_open_context
  - __put_nfs_open_context

Comment 28 J. Bruce Fields 2016-02-01 23:26:46 UTC
(In reply to Harshula Jayasuriya from comment #27)
> Looks like nfs_commitdata_release() is not being called when this bug is
> triggered. That would explain reference count not dropping back to zero.

Huh, that's interesting, but I'm not seeing a bug there yet....

As far as I can tell commit is done by nfs_generic_commit_list in both direct and non-direct cases, and that does the right thing: nfs_init_commit does a get_nfs_open_context, then nfs_commit_list calls rpc_run_task with nfs_commit_ops, which has rpc_release method set to nfs_commitdata_release().  (And rpc_run_task should guarantee ->rpc_release is always called, even if it fails completely.)
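
In other words, the get and the put should pair up like this. A condensed sketch (based on fs/nfs/write.c of that era, not the verbatim source; in the real code the release may go through a thin wrapper):

/* The "get": one open-context reference is taken per commit. */
void nfs_init_commit(struct nfs_commit_data *data, struct list_head *head,
                     struct pnfs_layout_segment *lseg,
                     struct nfs_commit_info *cinfo)
{
        struct nfs_page *first = nfs_list_entry(head->next);

        data->context = get_nfs_open_context(first->wb_context);
        /* ... */
}

/* The "put": rpc_run_task() guarantees ->rpc_release always runs, even
 * if task setup fails, so the reference should always be dropped. */
static const struct rpc_call_ops nfs_commit_ops = {
        .rpc_call_prepare = nfs_commit_prepare,
        .rpc_call_done    = nfs_commit_done,
        .rpc_release      = nfs_commitdata_release,
};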

Comment 29 Harshula Jayasuriya 2016-02-05 15:00:27 UTC
Hi Bruce,

I observed that while testing some very specific corner cases. I was trying to minimise the quantity of I/O needed to reproduce the bug.

1) bs=512K oflag=direct

Vs

2) bs=256K oflag=direct

with only 268 KiB of space remaining. (1) would reproduce the bug; (2) did not. I guess the lack of an nfs_commitdata_release call can be explained by bs being larger than the available space.

Comment 30 Harshula Jayasuriya 2016-02-05 15:16:34 UTC
Also ... perhaps if you have a moment could you please double check the following observations:

* Can't reproduce the bug when dd's bs <= wsize.
* When bs > wsize, if the first wsize-sized RPC WRITE from a bs-sized syscall fails, the bug does not reproduce.

Where wsize was 128K.

I was at LCA this week. I'll take another look at this next week.

Comment 31 J. Bruce Fields 2016-02-12 23:04:06 UTC
Oh, that's fantastic, I hadn't been able to get such a good reproducer.  And, yes, it also works for me.  In more detail:

  - on the server:
    - create a small xfs filesystem and export it.
    - create a file "filler" and write to it until out of space
    - use "truncate" to adjust space until there's 268K left
  - on client:
    - mount -o vers=3,sec=sys,wsize=131072 server:/export /mnt
    - dd if=/dev/zero of=/mnt/zeroes bs=512K oflag=direct
    - rm /mnt/zeroes

At this point an "ls -a" of the export shows a .nfs sillyrename file that shouldn't be there.

> perhaps if you have a moment could you please double check the following observations:

I think that's consistent with what I've seen, but I haven't checked each of those cases yet.

Comment 32 JianHong Yin 2016-05-19 10:50:36 UTC
Reproduced:
  https://beaker.engineering.redhat.com/jobs/1342863

[06:26:31 root@ ~~]# dd if=/dev/zero of=$nfsmp/zeros bs=1025K oflag=direct
dd: error writing ‘/mnt/nfsmp-bugid-810/zeros’: No space left on device
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.133586 s, 0.0 kB/s
--------------------------------------------------------------------------------
[06:28:11 root@ ~~]# rm $nfsmp/zeros
--------------------------------------------------------------------------------
[06:28:12 root@ ~~]# df -Th
Filesystem                                               Type      Size  Used Avail Use% Mounted on
/dev/mapper/rhel_ibm--hs23--02-root                      xfs        50G  2.1G   48G   5% /
devtmpfs                                                 devtmpfs  898M     0  898M   0% /dev
tmpfs                                                    tmpfs     908M     0  908M   0% /dev/shm
tmpfs                                                    tmpfs     908M  8.6M  899M   1% /run
tmpfs                                                    tmpfs     908M     0  908M   0% /sys/fs/cgroup
/dev/sda1                                                xfs       497M  135M  363M  28% /boot
/dev/mapper/rhel_ibm--hs23--02-home                      xfs       414G   33M  413G   1% /home
/dev/loop0                                               xfs       497M  497M   20K 100% /exportdir-bugid-810
ibm-hs23-02.rhts.eng.bos.redhat.com:/exportdir-bugid-810 nfs4      497M  497M  256K 100% /mnt/nfsmp-bugid-810
--------------------------------------------------------------------------------
[06:28:12 root@ ~~]# ls -al $nfsmp|grep "\.nfs"
-rw-r--r--. 1 root root   1049600 May 19 06:28 .nfs000000000000008400000001
:: [   FAIL   ] :: Running 'ls -al $nfsmp|grep "\.nfs"' (Expected 1, got 0)

Comment 34 J. Bruce Fields 2016-06-16 18:16:38 UTC
Note Christoph Hellwig ran across what looks like the same problem and has a reproducer in the form of a patch against xfstests:

  http://lkml.kernel.org/r/20160616150146.GA14015@infradead.org

Comment 40 Steve Dickson 2016-11-21 16:26:59 UTC
The upstream commit 

commit f1c6bb2cb8b81013e8979806f8e15e3d53efb96d
Author: Jeff Layton <jlayton>
Date:   Tue Apr 15 06:17:49 2014 -0400

    locks: allow __break_lease to sleep even when break_time is 0

Comment 41 Harshula Jayasuriya 2016-11-21 23:38:07 UTC
Hi Steve,

Are you sure f1c6bb2cb8b81013e8979806f8e15e3d53efb96d fixes this bug? IIRC, Bruce, Christoph and I have reproduced the bug on kernels that contain that patch.

cya,
#

Comment 42 Steve Dickson 2016-12-05 16:17:29 UTC
(In reply to Harshula Jayasuriya from comment #41)
> Hi Steve,
> 
> Are you sure f1c6bb2cb8b81013e8979806f8e15e3d53efb96d fixes this bug? IIRC,
> Bruce, Christoph and I have reproduced the bug on kernels that contain that
> patch.
I did not know this... removing the 'Patch' keyword.

Comment 53 J. Bruce Fields 2017-09-25 21:31:45 UTC
Unfortunately, the last time I looked at this I had a lot of trouble making any progress.  It looked to me like some references taken in nfs_create_request were not getting put, but I couldn't figure out why not.

I'll take another look, but not till after this week.

We do at least have a reliable reproducer.  So one simple thing to do would be to just retest the latest upstream to see if the problem has already been fixed.  Then if that doesn't turn up any clues I'll add some more instrumentation.

Comment 54 JianHong Yin 2017-09-26 07:24:24 UTC
(In reply to J. Bruce Fields from comment #53)
> Unfortunately last time I looked at this I had a lot of trouble making any
> progress.  It looked to me like some references taken in nfs_create_request
> were not gettng put, but I couldn't figure out why not.
> 
> I'll take another look, but not till after this week.
> 
> We do at least have a reliable reproducer.  So one simple thing to do would
> be just retest the latest upstream to see if the problem has already be
> fixed.  Then if that doesn't turn up any clues I'll add some more
> instrumentation.

JFYI:
 Upstream(v4.14-rc2) test pass:
  https://beaker.engineering.redhat.com/jobs/2068520
  https://beaker.engineering.redhat.com/jobs/2068460

Comment 55 J. Bruce Fields 2017-09-26 12:56:34 UTC
(In reply to Yin.JianHong from comment #54)
> JFYI:
>  Upstream(v4.14-rc2) test pass:
>   https://beaker.engineering.redhat.com/jobs/2068520
>   https://beaker.engineering.redhat.com/jobs/2068460

Oh, thanks, that's good to know, so that means this may have been fixed upstream by a commit in the range v4.6..v4.14-rc2.

I took a quick skim through "git log v4.6..v4.14-rc2 fs/nfs" and didn't see an obvious candidate, but there are for example plenty of O_DIRECT changes.

It would be worth starting a bisect on that range, if it's quick to run the tests.

Comment 57 JianHong Yin 2017-09-27 07:44:52 UTC
(In reply to J. Bruce Fields from comment #55)
> (In reply to Yin.JianHong from comment #54)
> > JFYI:
> >  Upstream(v4.14-rc2) test pass:
> >   https://beaker.engineering.redhat.com/jobs/2068520
> >   https://beaker.engineering.redhat.com/jobs/2068460
> 
> Oh, thanks, that's good to know, so that means this may have been fixed
> upstream by a commit in the range v4.6..v4.14-rc2.
Hi Bruce

I'm very sorry! I reproduced it again on v4.14-rc2:
 https://beaker.engineering.redhat.com/jobs/2070538
 *the reproducer is not 100% reliable.

Comment 61 Steffen Froemer 2017-11-02 14:22:54 UTC
any updates on this?

Comment 62 J. Bruce Fields 2017-11-03 13:48:50 UTC
(In reply to Steffen Froemer from comment #61)
> any updates on this?

I have a reliable reproducer and am actively working on this.  If I haven't found the cause by the end of the day I'll try to at least update with my notes.

Comment 63 J. Bruce Fields 2017-11-03 21:37:32 UTC
Based on tracing all I know is that nfs_direct_write_schedule_iovec()'s calls to nfs_create_request() are taking references to the only nfs_open_context created during this test, and those reference counts are not all getting put.  The code there is quite a thicket of function pointers, but it appears to me that the puts should be done by the release method of the rpc_task created by nfs_initiate_pgio().  That release method should be nfs_pgio_release(), which eventually calls nfs_unlock_and_release_request().

I'm adding some tracing in an attempt to follow the struct nfs_page's through the write path, from their creation by nfs_create_request() to their destruction in nfs_unlock_and_release_request(), and see whether that shows what's happening.
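
For anyone following along, the lifecycle being traced looks roughly like this (condensed, signatures simplified, not the verbatim kernel source):

/* Creation: each nfs_page takes a reference on the open context. */
struct nfs_page *nfs_create_request(struct nfs_open_context *ctx,
                                    struct page *page,
                                    unsigned int offset, unsigned int count)
{
        struct nfs_page *req = nfs_page_alloc();

        req->wb_context = get_nfs_open_context(ctx);
        kref_init(&req->wb_kref);
        /* ... */
        return req;
}

/* Destruction: reached from nfs_pgio_release(), the rpc_task's
 * ->rpc_release method; the final kref put frees the request and drops
 * the open-context reference taken above. */
void nfs_unlock_and_release_request(struct nfs_page *req)
{
        nfs_unlock_request(req);
        nfs_release_request(req);       /* kref_put(&req->wb_kref, ...) */
}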

It doesn't look like a leak of the nfs_page's themselves, though, since they're allocated from a slab, and /proc/slabinfo is not showing a leak.

I may just need to write up what I've figured out so far and see if I can get help from Trond or someone, as even after spending a few days staring at it this code is still pretty opaque to me.

Comment 64 Steve Whitehouse 2017-11-07 14:10:23 UTC
Do you have any further updates? Was Trond able to assist?

Comment 65 J. Bruce Fields 2017-11-08 22:31:36 UTC
(In reply to Steve Whitehouse from comment #64)
> Do you have any further updates? Was Trond able to assist?

I'd like to spend another day collecting data first, but needed to clear a backlog of upstream review.  Hoping to be back to this tomorrow.

Comment 66 Steve Whitehouse 2017-11-14 14:35:07 UTC
Any further updates? I'm trying to figure out whether we should fix this or defer it at this stage.

Comment 67 J. Bruce Fields 2017-11-14 16:59:27 UTC
(In reply to Steve Whitehouse from comment #66)
> Any further updates? I'm trying to figure out whether we should fix this or
> defer it at this stage.

No updates.  I'll look at it some more today.  Realistically it may not get figured out before the Thanksgiving holidays.

Comment 75 J. Bruce Fields 2017-12-08 22:36:51 UTC
I experimented some more, and tried emailing upstream to see if anyone else has ideas:

  https://marc.info/?l=linux-nfs&m=151277138820047

(In reply to J. Bruce Fields from comment #63)
> It doesn't look like a leak of the nfs_page's themselves, though, since
> they're allocated from a slab, and /proc/slabinfo is not showing a leak.

It appears I was wrong about this.  I am now seeing an nfs_page leak, visible as nonzero numbers in the first columns of /proc/slabinfo or from tracing or printk's.

It seems to reproduce whenever a single write() system call results in multiple WRITE requests, some of which return ENOSPC.

I can therefore simplify the reproducer by using a smaller wsize and a filesystem with less free space; currently I'm reproducing with a 4K wsize and a single 64K write() to a filesystem where df reports 60K free.  The write() call returns 40K, and a network trace shows 16 consecutive 4K writes, the first 10 succeeding and the last 6 returning ENOSPC.

Comment 76 J. Bruce Fields 2017-12-14 21:33:56 UTC
I think I have at least an idea where the problem is, with the help of some printk's and the bcc "stackcount" tool (handy for getting an idea of which callers may be doing gets or puts that are out of balance).

The callback that handles a direct write reply is fs/nfs/direct.c:nfs_direct_write_completion().  One of the things it does is schedule followup COMMIT calls.  A successful NFS_UNSTABLE reply results in taking a reference on each nfs_page and calling nfs_mark_request_commit to add it to a list of pages, associated with the dreq, that are to be committed.  But if a WRITE fails, nfs_direct_write_completion sets dreq->flags to zero.  This causes nfs_direct_write_schedule_work to skip any commits.  So if a single write results in multiple WRITE requests, some succeeding and some failing, it appears that we end up with references taken on nfs_pages that are never put by the commit completion code.

At least, that's my best guess for now.  I'm not sure what the best fix is.  Just removing the "dreq->flags = 0" from nfs_direct_write_completion fixes the problem for my reproducer, but I suspect that's not correct and would cause different problems in other cases.
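
To make the suspected imbalance concrete, here is the shape of the two functions involved. Again a condensed sketch (based on fs/nfs/direct.c of that era, not the verbatim source):

static void nfs_direct_write_completion(struct nfs_pgio_header *hdr)
{
        struct nfs_direct_req *dreq = hdr->dreq;
        /* ... */
        if (test_bit(NFS_IOHDR_ERROR, &hdr->flags)) {
                dreq->flags = 0;        /* a failed WRITE cancels the commit phase */
                dreq->error = hdr->error;
        }
        /* ... then, for each page of a successful unstable WRITE: */
        kref_get(&req->wb_kref);        /* extra ref, to be put on COMMIT completion */
        nfs_mark_request_commit(req, hdr->lseg, &cinfo, hdr->ds_commit_idx);
        /* ... */
}

static void nfs_direct_write_schedule_work(struct work_struct *work)
{
        struct nfs_direct_req *dreq =
                container_of(work, struct nfs_direct_req, work);
        int flags = dreq->flags;

        dreq->flags = 0;
        switch (flags) {
        case NFS_ODIRECT_DO_COMMIT:
                nfs_direct_commit_schedule(dreq);       /* would put those refs */
                break;
        case NFS_ODIRECT_RESCHED_WRITES:
                nfs_direct_write_reschedule(dreq);
                break;
        default:
                /* flags was zeroed by the failed WRITE, so pages already
                 * on the commit list keep their extra references forever. */
                nfs_direct_complete(dreq);
        }
}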

Comment 77 Steve Whitehouse 2018-01-16 09:52:17 UTC
Bruce, this is a blocker for 7.5; can you add a status update here, since there has not been one for a little while now?  We will need to either get this fixed shortly or push it off to 7.6 if we think that we cannot do it for 7.5.

Comment 78 J. Bruce Fields 2018-01-16 15:08:58 UTC
(In reply to Steve Whitehouse from comment #77)
> Bruce, this is a blocker for 7.5, can you add a status update here since
> there has not been one for a little while now. We will need to either get
> this fixed shortly or push if off to 7.6 if we think that we cannot do it
> for 7.5.

I have a candidate fix, but haven't gotten any review upstream.  And this isn't an area I'm sufficiently familiar with to be confident about proceeding without upstream review.

I've tried again.

Comment 79 Steve Whitehouse 2018-01-16 15:17:20 UTC
Can someone else from the team assist? I assume that this is something that should still make 7.5 if we can get the patches through the system in time? What are the consequences if we miss?

Comment 80 J. Bruce Fields 2018-01-16 16:46:18 UTC
Latest posting is here: https://marc.info/?l=linux-nfs&m=151611528314226&w=2

Comment 89 Bruno Meneguele 2018-01-26 18:28:19 UTC
Patch(es) committed to the kernel repository, and an interim kernel build is undergoing testing

Comment 91 Bruno Meneguele 2018-01-29 12:47:03 UTC
Patch(es) available on kernel-3.10.0-838.el7

Comment 98 errata-xmlrpc 2018-04-10 19:16:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1062