Bug 1132610
Summary: nfsd does not release free space of a file created with dd oflag=direct when there was no space left on the device, even after manual deletion

Product: Red Hat Enterprise Linux 7
Component: kernel
Kernel sub component: NFS
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 7.0
Keywords: Reproducer
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kernel-3.10.0-838.el7
Reporter: Harshula Jayasuriya <harshula>
Assignee: J. Bruce Fields <bfields>
QA Contact: JianHong Yin <jiyin>
CC: agaikwad, bcodding, bfields, dwysocha, fsorenso, harshula, jaeshin, jlayton, moagrawa, mthacker, plambri, rhandlin, rkothiya, sfroemer, steved, swhiteho, xzhou, yoyang
Cloned to: 1562716 (view as bug list)
Last Closed: 2018-04-10 19:16:03 UTC
Type: Bug
Bug Blocks: 1203710, 1205790, 1298243, 1385242, 1420851, 1469559, 1562716
Description
Harshula Jayasuriya
2014-08-21 16:22:33 UTC
Hi Bruce, Do you want me to work on this?

(In reply to Harshula Jayasuriya from comment #5)
> Hi Bruce, Do you want me to work on this?

That would be fantastic if you could. I'll continue watching it.

I've had some luck reproducing this between recent upstream client and server. In the process I also noticed what appears to be a separate leak on the server side. I suspect that server leak is the same one we ran across independently in bug 1236736, fixed by the patch https://bugzilla.redhat.com/attachment.cgi?id=1050149&action=diff Therefore any further investigation should probably be done using kernels with that patch applied, to avoid being confused by that server bug.

I've confirmed that I can reproduce this upstream with client and server running v4.2-rc3, which includes Al's 75a6f82a0d10 "freeing unlinked file indefinitely delayed". (Actually, on a kernel including my patches for 4.3.)

The reproducer isn't 100% reliable. It's more reliable the more data is written (equivalently, the more free space there is to write into). I think I've always reproduced it when the filesystem has a gig free, but I'm not always able to reproduce it when there's less free space. Examination of a network trace shows a WRITE reply with NFS4ERR_NOSPC.

NOTES:
* I've had difficulty reproducing this on RHEL 6 & 7 recently.
* I am able to reproduce it on Fedora 22, kernel 4.1.6-200.fc22.x86_64 (which includes the patch 75a6f82a0d10).
* The export that I use only has about 330 MB of free space and is on a separate partition from the root filesystem.
* On my new test VM (Fedora 22), NFS v3 and v4 behaviour has a very subtle difference. With an NFS v3 mount, one does NOT have to stop the NFS service before the space on the export is recovered. Whereas, with NFS v4, the space is only recovered after stopping the NFS service.

NFSv3:
------------------------------------------------------------
...
[root@nfsclient ~]# umount /mnt/test
[root@nfsclient ~]# df
Filesystem                        1K-blocks     Used Available Use% Mounted on
devtmpfs                            1013820        0   1013820   0% /dev
tmpfs                               1024596        0   1024596   0% /dev/shm
tmpfs                               1024596      824   1023772   1% /run
tmpfs                               1024596        0   1024596   0% /sys/fs/cgroup
/dev/mapper/fedora_nfsclient-root  18762184 10342564   7530392  58% /
tmpfs                               1024596        4   1024592   1% /tmp
/dev/vda1                            487652   457956         0 100% /boot
tmpfs                                204920        0    204920   0% /run/user/0
[root@nfsclient ~]# rm /boot/export/.nfs000000000000ff8f00000005
rm: remove regular file ‘/boot/export/.nfs000000000000ff8f00000005’? y
[root@nfsclient ~]# df
Filesystem                        1K-blocks     Used Available Use% Mounted on
devtmpfs                            1013820        0   1013820   0% /dev
tmpfs                               1024596        0   1024596   0% /dev/shm
tmpfs                               1024596      824   1023772   1% /run
tmpfs                               1024596        0   1024596   0% /sys/fs/cgroup
/dev/mapper/fedora_nfsclient-root  18762184 10342564   7530392  58% /
tmpfs                               1024596        4   1024592   1% /tmp
/dev/vda1                            487652   135688    322268  30% /boot
tmpfs                                204920        0    204920   0% /run/user/0
[root@nfsclient ~]# systemctl stop nfs
------------------------------------------------------------

NFSv4:
------------------------------------------------------------
[root@nfsclient ~]# rm /boot/export/.nfs000000000000ff8f00000004
rm: remove regular file ‘/boot/export/.nfs000000000000ff8f00000004’? y
[root@nfsclient ~]# df
Filesystem                        1K-blocks     Used Available Use% Mounted on
devtmpfs                            1013820        0   1013820   0% /dev
tmpfs                               1024596        0   1024596   0% /dev/shm
tmpfs                               1024596      824   1023772   1% /run
tmpfs                               1024596        0   1024596   0% /sys/fs/cgroup
/dev/mapper/fedora_nfsclient-root  18762184 10342556   7530400  58% /
tmpfs                               1024596        4   1024592   1% /tmp
/dev/vda1                            487652   457956         0 100% /boot
tmpfs                                204920        0    204920   0% /run/user/0
[root@nfsclient ~]# systemctl stop nfs
[root@nfsclient ~]# df
Filesystem                        1K-blocks     Used Available Use% Mounted on
devtmpfs                            1013820        0   1013820   0% /dev
tmpfs                               1024596        0   1024596   0% /dev/shm
tmpfs                               1024596      824   1023772   1% /run
tmpfs                               1024596        0   1024596   0% /sys/fs/cgroup
/dev/mapper/fedora_nfsclient-root  18762184 10342556   7530400  58% /
tmpfs                               1024596        4   1024592   1% /tmp
/dev/vda1                            487652   135688    322268  30% /boot
tmpfs                                204920        0    204920   0% /run/user/0
------------------------------------------------------------

Since nfs_direct_req holds a reference on open and lock contexts, I was curious whether this could be explained by a leak of nfs_direct_req. Those are allocated from their own slab, so you can test this by watching the first number of "grep nfs_direct /proc/slabinfo"; I found that it dropped back to zero after reproducing the problem, so I don't think there's an nfs_direct_req leak.

(In reply to Harshula Jayasuriya from comment #18)
> * On my new test VM, Fedora 22, NFS v3 and v4 behaviour has a very subtle
> difference. With an NFS v3 mount, one does NOT have to stop the NFS service
> before the space on the export is recovered. Whereas, with NFS v4, the space
> is only recovered after stopping the NFS service.

My working hypothesis for this difference, by the way, is just that whatever reference leak we have here that prevents unlinking the sillyrenamed .nfs file also (in the v4 case) prevents the client from sending a CLOSE when it should. That leaves the server holding a reference on the unlinked file (which persists even if you unlink the .nfs file).
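The slabinfo check described above can be sketched as a small parser. This is a toy helper, not part of the bug report; the sample line and its values are made up for illustration. What is real is the /proc/slabinfo layout: the first number after the slab name is active_objs, so a leak shows up as a count that never drops back to zero.

```python
def active_objects(slabinfo_text, slab_name):
    """Return the active_objs count for the first slab whose name starts
    with slab_name, or None if no such slab exists."""
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith(slab_name):
            return int(fields[1])  # first number after the name: active_objs
    return None

# Illustrative line in /proc/slabinfo format (values invented for the example)
sample = "nfs_direct_cache      12     22    360   22    2 : tunables 0 0 0"
print(active_objects(sample, "nfs_direct"))  # prints 12
```

On a live system one would feed this the contents of /proc/slabinfo before and after the reproducer and compare the counts.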
In the v4 case, even after I umount I still see RENEW calls from the client, and df still shows the space as used. The space is finally deallocated after rebooting the client.

Hi Bruce, Looks like nfs_commitdata_release() is not being called when this bug is triggered. That would explain the reference count not dropping back to zero.

    void nfs_commitdata_release(struct nfs_commit_data *data)
    {
            put_nfs_open_context(data->context);
            nfs_commit_free(data);
    }

nfs_commitdata_release -> put_nfs_open_context -> __put_nfs_open_context

(In reply to Harshula Jayasuriya from comment #27)
> Looks like nfs_commitdata_release() is not being called when this bug is
> triggered. That would explain the reference count not dropping back to zero.

Huh, that's interesting, but I'm not seeing a bug there yet.... As far as I can tell commit is done by nfs_generic_commit_list in both the direct and non-direct cases, and that does the right thing: nfs_init_commit does a get_nfs_open_context, then nfs_commit_list calls rpc_run_task with nfs_commit_ops, which has its rpc_release method set to nfs_commitdata_release(). (And rpc_run_task should guarantee ->rpc_release is always called, even if it fails completely.)

Hi Bruce, I observed that during testing of a very specific corner case. I was trying to minimise the quantity of I/O needed to reproduce the bug:

1) bs=512K oflag=direct vs 2) bs=256K oflag=direct, with only 268KiB of space remaining. (1) would reproduce the bug, (2) did not. I guess the lack of an nfs_commitdata_release can be explained by the bs being larger than the available space.

Also, perhaps if you have a moment could you please double-check the following observations (where wsize was 128K):
* Can't reproduce the bug when dd's bs <= wsize.
* When bs > wsize, if the first wsize RPC WRITE from a bs-size syscall fails, the bug does not reproduce.

I was at LCA this week. I'll take another look at this next week.
Oh, that's fantastic, I hadn't been able to get such a good reproducer. And, yes, it also works for me. In more detail:
- on the server:
- create a small xfs filesystem and export it.
- create a file "filler" and write to it until out of space
- use "truncate" to adjust space until there's 268K left
- on client:
- mount -o vers=3,sec=sys,wsize=131072 server:/export /mnt
- dd if=/dev/zero of=/mnt/zeroes bs=512K oflag=direct
- rm /mnt/zeroes
At this point an "ls -a" of the export shows a .nfs sillyrename file that shouldn't be there.
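The client-side steps above might be consolidated into a script along these lines. This is only a sketch: the `reproduce` function name and its server/export/mountpoint arguments are placeholders invented here, and the server-side setup (a small xfs export filled by a "filler" file and then truncated so roughly 268K remain) is assumed to be done already.

```shell
# Sketch only -- run as root on the client after the server-side setup.
# The function and its arguments are hypothetical placeholders.
reproduce() {
    server=${1:?usage: reproduce <server> <export> <mountpoint>}
    export_path=${2:?}
    mnt=${3:?}
    mount -o vers=3,sec=sys,wsize=131072 "$server:$export_path" "$mnt"
    # bs (512K) must exceed wsize (128K) for the bug to trigger
    dd if=/dev/zero of="$mnt/zeroes" bs=512K oflag=direct
    rm -f "$mnt/zeroes"
    # a leftover .nfs* sillyrename file here indicates the leak
    ls -a "$mnt"
}
```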
> perhaps if you have a moment could you please double check the following observations:
I think that's consistent with what I've seen, but I haven't checked each of those cases yet.
Reproduced: https://beaker.engineering.redhat.com/jobs/1342863

[06:26:31 root@ ~~]# dd if=/dev/zero of=$nfsmp/zeros bs=1025K oflag=direct
dd: error writing ‘/mnt/nfsmp-bugid-810/zeros’: No space left on device
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.133586 s, 0.0 kB/s
--------------------------------------------------------------------------------
[06:28:11 root@ ~~]# rm $nfsmp/zeros
--------------------------------------------------------------------------------
[06:28:12 root@ ~~]# df -Th
Filesystem                                               Type      Size  Used Avail Use% Mounted on
/dev/mapper/rhel_ibm--hs23--02-root                      xfs        50G  2.1G   48G   5% /
devtmpfs                                                 devtmpfs  898M     0  898M   0% /dev
tmpfs                                                    tmpfs     908M     0  908M   0% /dev/shm
tmpfs                                                    tmpfs     908M  8.6M  899M   1% /run
tmpfs                                                    tmpfs     908M     0  908M   0% /sys/fs/cgroup
/dev/sda1                                                xfs       497M  135M  363M  28% /boot
/dev/mapper/rhel_ibm--hs23--02-home                      xfs       414G   33M  413G   1% /home
/dev/loop0                                               xfs       497M  497M   20K 100% /exportdir-bugid-810
ibm-hs23-02.rhts.eng.bos.redhat.com:/exportdir-bugid-810 nfs4      497M  497M  256K 100% /mnt/nfsmp-bugid-810
--------------------------------------------------------------------------------
[06:28:12 root@ ~~]# ls -al $nfsmp|grep "\.nfs"
-rw-r--r--. 1 root root 1049600 May 19 06:28 .nfs000000000000008400000001
:: [ FAIL ] :: Running 'ls -al $nfsmp|grep "\.nfs"' (Expected 1, got 0)

Note Christoph Hellwig ran across what looks like the same problem and has a reproducer in the form of a patch against xfstests: http://lkml.kernel.org/r/20160616150146.GA14015@infradead.org

The upstream commit:

commit f1c6bb2cb8b81013e8979806f8e15e3d53efb96d
Author: Jeff Layton <jlayton>
Date:   Tue Apr 15 06:17:49 2014 -0400

    locks: allow __break_lease to sleep even when break_time is 0

Hi Steve, Are you sure f1c6bb2cb8b81013e8979806f8e15e3d53efb96d fixes this bug? IIRC, Bruce, Christoph and I have reproduced the bug on kernels that contain that patch.
cya,

(In reply to Harshula Jayasuriya from comment #41)
> Hi Steve,
>
> Are you sure f1c6bb2cb8b81013e8979806f8e15e3d53efb96d fixes this bug? IIRC,
> Bruce, Christoph and I have reproduced the bug on kernels that contain that
> patch.

I did not know this... removing the 'Patch' keyword.

Unfortunately the last time I looked at this I had a lot of trouble making any progress. It looked to me like some references taken in nfs_create_request were not getting put, but I couldn't figure out why not. I'll take another look, but not till after this week. We do at least have a reliable reproducer, so one simple thing to do would be just to retest the latest upstream to see if the problem has already been fixed. Then if that doesn't turn up any clues I'll add some more instrumentation.

(In reply to J. Bruce Fields from comment #53)
> Unfortunately the last time I looked at this I had a lot of trouble making any
> progress. It looked to me like some references taken in nfs_create_request
> were not getting put, but I couldn't figure out why not.
>
> I'll take another look, but not till after this week.
>
> We do at least have a reliable reproducer, so one simple thing to do would
> be just to retest the latest upstream to see if the problem has already been
> fixed. Then if that doesn't turn up any clues I'll add some more
> instrumentation.

JFYI: Upstream (v4.14-rc2) test pass:
https://beaker.engineering.redhat.com/jobs/2068520
https://beaker.engineering.redhat.com/jobs/2068460

(In reply to Yin.JianHong from comment #54)
> JFYI:
> Upstream (v4.14-rc2) test pass:
> https://beaker.engineering.redhat.com/jobs/2068520
> https://beaker.engineering.redhat.com/jobs/2068460

Oh, thanks, that's good to know, so that means this may have been fixed upstream by a commit in the range v4.6..v4.14-rc2. I took a quick skim through "git log v4.6..v4.14-rc2 fs/nfs" and didn't see an obvious candidate, but there are for example plenty of O_DIRECT changes.
It would be worth starting a bisect on that range, if it's quick to run the tests.

(In reply to J. Bruce Fields from comment #55)
> (In reply to Yin.JianHong from comment #54)
> > JFYI:
> > Upstream (v4.14-rc2) test pass:
> > https://beaker.engineering.redhat.com/jobs/2068520
> > https://beaker.engineering.redhat.com/jobs/2068460
>
> Oh, thanks, that's good to know, so that means this may have been fixed
> upstream by a commit in the range v4.6..v4.14-rc2.

Hi Bruce, I'm very sorry! I reproduced it again on v4.14-rc2: https://beaker.engineering.redhat.com/jobs/2070538
* the reproducer is not 100% effective

Any updates on this?

(In reply to Steffen Froemer from comment #61)
> any updates on this?

I have a reliable reproducer and am actively working on this. If I haven't found the cause by the end of the day I'll try to at least update with my notes.

Based on tracing, all I know is that nfs_direct_write_schedule_iovec()'s calls to nfs_create_request() are taking references to the only nfs_open_context created during this test, and those reference counts are not all getting put. The code there is quite a thicket of function pointers, but it appears to me that the puts should be done by the release method of the rpc_task created by nfs_initiate_pgio(). That release method should be nfs_pgio_release(), which eventually calls nfs_unlock_and_release_request(). I'm adding some tracing in an attempt to follow the struct nfs_page's through the write path, from their creation by nfs_create_request() to their destruction in nfs_unlock_and_release_request(), and see if that shows what's happening.

It doesn't look like a leak of the nfs_page's themselves, though, since they're allocated from a slab and /proc/slabinfo is not showing a leak. I may just need to write up what I've figured out so far and see if I can get help from Trond or someone, as even after spending a few days staring at it this code is still pretty opaque to me.

Do you have any further updates?
Was Trond able to assist?

(In reply to Steve Whitehouse from comment #64)
> Do you have any further updates? Was Trond able to assist?

I'd like to spend another day collecting data first, but needed to clear a backlog of upstream review. Hoping to be back to this tomorrow.

Any further updates? I'm trying to figure out whether we should fix this or defer it at this stage.

(In reply to Steve Whitehouse from comment #66)
> Any further updates? I'm trying to figure out whether we should fix this or
> defer it at this stage.

No updates. I'll look at it some more today. Realistically it may not get figured out before the Thanksgiving holidays.

I experimented some more, and tried emailing upstream to see if anyone else has ideas: https://marc.info/?l=linux-nfs&m=151277138820047

(In reply to J. Bruce Fields from comment #63)
> It doesn't look like a leak of the nfs_page's themselves, though, since
> they're allocated from a slab, and /proc/slabinfo is not showing a leak.

It appears I was wrong about this. I am now seeing an nfs_page leak, visible as nonzero numbers in the first columns of /proc/slabinfo, or from tracing or printk's. It seems to reproduce whenever a single write() system call results in multiple WRITE requests, some of which return ENOSPC. I can therefore simplify the reproducer by using a smaller wsize and a filesystem with less free space; currently I'm reproducing with a 4K wsize and a single 64K write() to a filesystem where df reports 60K free. The write() call returns 40K, and a network trace shows 16 consecutive 4K writes, the first 10 succeeding and the last 6 returning ENOSPC.

I think I have at least an idea where the problem is, with the help of some printk's and the bcc "stackcount" tool (handy for getting an idea which callers may be doing gets or puts that are out of balance). The callback that handles a direct write reply is fs/nfs/direct.c:nfs_direct_write_completion(). One of the things it does is schedule followup COMMIT calls.
A successful NFS_UNSTABLE reply will result in taking a reference on any nfs_pages and calling nfs_mark_request_commit to add them to a list of pages to be committed, associated with the dreq. But if a WRITE fails, nfs_direct_write_completion sets dreq->flags to zero, which causes nfs_direct_write_schedule_work to skip any commits. So if a single write results in multiple WRITE requests, some succeeding and some failing, it appears that we end up with references taken on nfs_pages that are never put by the commit completion code.

At least, that's my best guess for now. I'm not sure what the best fix is. Just removing the "dreq->flags = 0" from nfs_direct_write_completion fixes the problem for my reproducer, but I suspect that's not correct and would cause different problems in other cases.

Bruce, this is a blocker for 7.5; can you add a status update here, since there has not been one for a little while now? We will need to either get this fixed shortly or push it off to 7.6 if we think that we cannot do it for 7.5.

(In reply to Steve Whitehouse from comment #77)
> Bruce, this is a blocker for 7.5; can you add a status update here, since
> there has not been one for a little while now? We will need to either get
> this fixed shortly or push it off to 7.6 if we think that we cannot do it
> for 7.5.

I have a candidate fix, but haven't gotten any review upstream. And this isn't an area I'm sufficiently familiar with to be confident about proceeding without upstream review. I've tried again.

Can someone else from the team assist? I assume that this is something that should still make 7.5 if we can get the patches through the system in time? What are the consequences if we miss?
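The hypothesized failure mode can be modeled with a toy refcount simulation. This is plain Python, not kernel code; the classes and flag value are stand-ins invented for illustration, and the numbers mirror the 64K/4K reproducer above (16 WRITEs, the first 10 succeeding, the last 6 hitting ENOSPC).

```python
NFS_ODIRECT_DO_COMMIT = 1  # stand-in for the real flag value

class Dreq:
    """Toy stand-in for nfs_direct_req: a flags word and a commit list."""
    def __init__(self):
        self.flags = 0
        self.commit_list = []

def direct_write_completion(dreq, pages, statuses):
    """Model of nfs_direct_write_completion: a successful unstable WRITE
    takes a reference and queues the page for COMMIT; a failed WRITE
    clears dreq->flags."""
    for page, ok in zip(pages, statuses):
        if ok:
            page["refs"] += 1
            dreq.commit_list.append(page)   # nfs_mark_request_commit
            dreq.flags = NFS_ODIRECT_DO_COMMIT
        else:
            dreq.flags = 0                  # error path clears the flag

def direct_write_schedule_work(dreq):
    """Model of nfs_direct_write_schedule_work: commits (and their
    reference puts) happen only while the flag is still set."""
    if dreq.flags == NFS_ODIRECT_DO_COMMIT:
        for page in dreq.commit_list:
            page["refs"] -= 1               # commit completion puts the ref

pages = [{"refs": 0} for _ in range(16)]    # one nfs_page per 4K WRITE
dreq = Dreq()
direct_write_completion(dreq, pages, [True] * 10 + [False] * 6)
direct_write_schedule_work(dreq)
print(sum(p["refs"] for p in pages))        # prints 10: refs never put
```

Because the last failing WRITE leaves dreq.flags at zero, the commit step is skipped and the ten references taken for the successful WRITEs are never dropped, matching the leaked-nfs_page observation.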
Latest posting is here: https://marc.info/?l=linux-nfs&m=151611528314226&w=2

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing.

Patch(es) available on kernel-3.10.0-838.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1062