Red Hat Bugzilla – Bug 834808
nfs over openvpn (and cgroup?) locks up
Last modified: 2014-06-18 03:42:35 EDT
Created attachment 593938 [details]
Description of problem:
NFSv4 mounts over openvpn over ethernet lock up on any larger transfer.
It is easier (perhaps only possible) to reproduce when cgroup is also in use.
Version-Release number of selected component (if applicable):
kernel-3.3.8-1.fc16.x86_64, but this has been a problem for at least a year.
It typically happens when copying a larger file (this time after 1.3 GB) client->server.
I reproduced it twice, but not on the third attempt just now.
Steps to Reproduce:
100Mb ethernet, openvpn (comp-lzo no), client=X220 notebook, 8GB RAM, SSD disk.
host1s:/ /host1 nfs defaults,noauto,nfsvers=4,resvport 0 0
client:/etc/cgconfig.conf - as attached.
Use user jkratoch (as defined in cgconfig) - but I did not put the cp command into any cgroup class, so cgroup may not matter.
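The steps above can be sketched as a short script. This is only a sketch of the reproducer, assuming the VPN (tun interface) is already up and the fstab entry above is in place; the file size and paths are assumptions, and it must run as root on the client:

```shell
#!/bin/sh
# Reproducer sketch (assumes the fstab line "host1s:/ /host1 nfs ..." above
# and a running openvpn tunnel; paths and sizes are illustrative only).
mount /host1                                         # NFSv4 mount over the VPN
dd if=/dev/urandom of=/tmp/big.bin bs=1M count=1500  # somewhat over 1.3 GB
cp /tmp/big.bin /host1/tmp/                          # client -> server copy
# While the copy runs, watch whether openvpn falls into D state:
ps -o pid=,stat=,comm= -C openvpn
```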
openvpn is stuck in D state.
[<ffffffffa08484a9>] nfs_wait_bit_killable+0x39/0x90 [nfs]
[<ffffffffa085737a>] nfs_commit_inode+0xaa/0x250 [nfs]
[<ffffffffa0845f14>] nfs_release_page+0x84/0xa0 [nfs]
ps axwfl|grep 27760
1 0 27760 1 20 0 42156 856 nfs_wa Ds ? 0:33 /usr/sbin/openvpn --daemon --writepid /var/run/openvpn/vidhome.pid --cd /etc/openvpn/ --config vidhome.conf
State: D (disk sleep)
Uid: 0 0 0 0
Gid: 0 0 0 0
VmPeak: 42160 kB
VmSize: 42156 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 964 kB
VmRSS: 856 kB
VmData: 656 kB
VmStk: 136 kB
VmExe: 544 kB
VmLib: 5672 kB
VmPTE: 108 kB
VmSwap: 0 kB
27760 (openvpn) D 1 27760 27760 0 -1 1077946432 190 0 0 0 2262 1065 0 0 20 0 1 0 102118072 43167744 214 18446744073709551615 4194304 4748500 140733664299952 140733664296248 237421666339 0 0 4096 18947 18446744072107623593 0 0 17 0 0 0 0 0 0 8948508 8959424 26247168
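The state and memory figures above come from /proc. They can be gathered with a few lines like these (the PID is a placeholder; substitute the stuck openvpn PID, 27760 in this report, and note that /proc/PID/stack is typically readable only by root):

```shell
#!/bin/sh
# Inspect a possibly-hung process via /proc. We use our own PID here so the
# snippet runs anywhere; replace $pid with the stuck openvpn PID in practice.
pid=$$
grep '^State:' "/proc/$pid/status"                 # "D" = uninterruptible sleep
grep -E '^Vm(Peak|Size|RSS):' "/proc/$pid/status"  # memory figures as above
# Kernel stack of the task (the nfs_wait_bit_killable trace); needs root:
cat "/proc/$pid/stack" 2>/dev/null || echo "reading /proc/$pid/stack needs root"
```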
No lockup.
Therefore I normally use scp instead of cp over NFS.
Any other interesting log messages on client or server?
Can you watch the NFS traffic in wireshark, and if so, what do you see?
(You'd want to be able to watch the traffic that's going over the vpn, not the on-the-wire traffic. I'm not sure how to do that.)
Isn't the backtrace clear enough?
openvpn wants to send a packet and needs memory for it; the filesystem tries to flush out some pages, which invokes NFS, which tries to contact the server - but over openvpn, which is itself now locked up.
(In reply to comment #1)
> Any other interesting log messages on client or server?
> Can you watch the NFS traffic in wireshark, and if so, what do you see?
I no longer have the ethernet connection set up (only WiFi now).
> (You'd want to be able to watch the traffic that's going over the vpn, not
> the on-the-wire traffic. I'm not sure how to do that.)
I can just sniff the openvpn interface instead of the ethernet interface.
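Sniffing the tun interface directly would look something like this (the interface name tun0 and the standard NFS port 2049 are assumptions; capturing requires root). Because the tun device carries the traffic after decryption, the capture shows the plaintext NFS RPCs rather than the encrypted on-the-wire packets:

```shell
# Capture decrypted NFS traffic on the VPN's tun interface rather than
# the physical NIC, then open the capture file in wireshark afterwards.
tcpdump -i tun0 -s 0 -w /tmp/nfs-over-vpn.pcap 'port 2049'
```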
(cc'ing Trond in case he's interested)
You may want to report this upstream to the linux-nfs mailing list. This looks somewhat similar to a case we've been working on upstream there:
There, the reclaim of the page was recursing back into NFS from rpciod, so we fixed it (we think, anyway) by setting PF_FSTRANS in rpciod to prevent doing the commit there in that case.
The connection to NFS is less clear here, but the problem is similar. How is the NFS client to know that we need this process in order to complete the commit? I'm beginning to think we might be best off never doing a blocking commit from ->releasepage.
Created attachment 597103 [details]
patchset -- don't block in releasepage when recursing back from a fs transaction or memory allocation
This patchset is still an RFC, but is probably worth testing to see if it fixes the problem. I suspect that it will fix the issue for you, but it's unclear whether it might cause other issues when memory is tight and we need to commit some NFS pages in order to satisfy an allocation.
Jan can you test this and let me know if it helps at all?
Jan, have you been able to test this patch? I'd like to know whether it helped you at all...
I would have to move the computer across the house; I have moved house in the meantime.
Unfortunately I filed this bug only around the time I moved.
I expected you also had other reproducers for it. If not, I can temporarily move the computer across the house for the test. I have verified that over 802.11b the problem does not happen.
Comment 4 probably fixes the bug; I will be able to say so more certainly in a few more weeks.
I first reproduced the hang with kernel-3.4.7-1.fc16.x86_64 (see below).
BTW, with kernel-3.4.7-1.fc16.x86_64 I had (IIRC) 2 hard lockups without any traces in either netconsole or kdump. I cannot say they were due to this bug, but normally I do not have such hangs. With kernel-3.4.7-1.openvpn.fc16.x86_64 I also had one hard lockup; I do not know its cause.
With kernel-3.4.7-1.openvpn.fc16.x86_64 (patched per Comment 4) I have been unable to reproduce the bug so far. But reproduction is not easy on a freshly booted OS; it is most reproducible on a system that has been running for about a month.
09:35:21 up 2 days, 39 min, 26 users, load average: 0.17, 0.20, 0.16
15:36:59 up 1 day, 17:26, 23 users, load average: 5.67, 3.20, 1.99
[<ffffffffa0873e29>] nfs_wait_bit_killable+0x39/0x90 [nfs]
[<ffffffffa088386a>] nfs_commit_inode+0x9a/0x300 [nfs]
[<ffffffffa0872364>] nfs_release_page+0x84/0xa0 [nfs]
[<ffffffffa037fc4f>] tun_get_user+0x15f/0x4e0 [tun]
[<ffffffffa0380674>] tun_chr_aio_write+0x64/0x90 [tun]
# Mass update to all open bugs.
Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.
Please retest with this kernel, and let us know if your problem has been fixed.
In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with. Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.
If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
similar but distinct issue. (Please don't clone this bug; a fresh bug referencing this one in a comment is sufficient.)
With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.
Jan, you posted a comment with just a kernel version in it and reopened the bug. Can you elaborate?
I got a lockup looking exactly like the one described above.
Unfortunately I SIGKILLed the process before I captured its /proc/PID/stack.
My guess is it was the same hang, but I cannot confirm it in any way.
This automatic closing of bugs does not make sense without any known change in the code related to this bug report.
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '16'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 16's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 16 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to click on
"Clone This Bug" and open it against that version of Fedora.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
The process we are following is described here:
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.
Thank you for reporting this bug and we are sorry it could not be fixed.
[<ffffffffa000a559>] rpc_wait_bit_killable+0x39/0x90 [sunrpc]
[<ffffffffa000a51d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
[<ffffffffa077bdfa>] nfs_initiate_commit+0xea/0x110 [nfs]
[<ffffffffa077dbcb>] nfs_generic_commit_list+0x8b/0xf0 [nfs]
[<ffffffffa077dd25>] nfs_commit_inode+0xf5/0x160 [nfs]
[<ffffffffa076f0ee>] nfs_release_page+0x8e/0xb0 [nfs]
It has reproduced on its own.
(In reply to Jan Kratochvil from comment #15)
> [<ffffffffa000a559>] rpc_wait_bit_killable+0x39/0x90 [sunrpc]
> [<ffffffffa000a51d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
> [<ffffffffa077bdfa>] nfs_initiate_commit+0xea/0x110 [nfs]
> [<ffffffffa077dbcb>] nfs_generic_commit_list+0x8b/0xf0 [nfs]
> [<ffffffffa077dd25>] nfs_commit_inode+0xf5/0x160 [nfs]
> [<ffffffffa076f0ee>] nfs_release_page+0x8e/0xb0 [nfs]
> [<ffffffff81133a12>] try_to_release_page+0x32/0x50
> [<ffffffff8114768f>] shrink_page_list+0x5ff/0x990
> [<ffffffff81148027>] shrink_inactive_list+0x157/0x410
> [<ffffffff81148869>] shrink_lruvec+0x229/0x4e0
> [<ffffffff81148b86>] shrink_zone+0x66/0x180
> [<ffffffff81149010>] do_try_to_free_pages+0x100/0x620
> [<ffffffff81149838>] try_to_free_pages+0xf8/0x180
> [<ffffffff8113d805>] __alloc_pages_nodemask+0x6a5/0xae0
> [<ffffffff8117c308>] alloc_pages_current+0xb8/0x190
> [<ffffffff811862e0>] new_slab+0x2d0/0x3a0
> [<ffffffff8165aeaf>] __slab_alloc+0x393/0x560
> [<ffffffff8118a62f>] __kmalloc_node_track_caller+0xbf/0x2a0
> [<ffffffff815454dc>] __kmalloc_reserve.isra.51+0x3c/0xa0
> [<ffffffff81547a6c>] __alloc_skb+0x7c/0x290
> [<ffffffff81540221>] sock_alloc_send_pskb+0x1d1/0x350
> [<ffffffff815403b5>] sock_alloc_send_skb+0x15/0x20
> [<ffffffff8158eb38>] __ip_append_data.isra.38+0x5c8/0xa00
> [<ffffffff8159115d>] ip_make_skb+0xdd/0x120
> [<ffffffff815b8d3b>] udp_sendmsg+0x2db/0xa10
> [<ffffffff815c3e63>] inet_sendmsg+0x63/0xb0
> [<ffffffff8153aa30>] sock_sendmsg+0xb0/0xe0
> [<ffffffff8153df5d>] sys_sendto+0x12d/0x180
> [<ffffffff8166afd9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
I assume the stack trace is from the VPN daemon itself and the NFS mount is being done over the VPN? If so, then I simply don't see a way to reasonably fix this. The core of the problem is a disconnect between the daemon handling your VPN and the NFS subsystem.
If the kernel somehow knew that it shouldn't try to flush NFS pages in order to handle memory allocations for this particular daemon, then that would work, but there isn't a way to do that as far as I can tell. Another idea might be to never allow the nfs_releasepage to block, but that might cause other problems.
It might be best to report this bug to the linux-nfs mailing list and see whether anyone there has ideas. Until then, I think the only thing you can do here is to try to prevent this by tuning the VM to keep the kernel out of direct reclaim.
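Keeping the kernel out of direct reclaim could be attempted with sysctl settings along these lines. This is a sketch under assumptions, not a recommendation: the specific values are guesses to be tuned for the machine (8 GB RAM in this report), and they require root:

```shell
# Larger free-memory reserve: allocations (e.g. openvpn's socket buffers)
# are then less likely to enter direct reclaim and recurse into NFS writeback.
sysctl -w vm.min_free_kbytes=65536
# Start background writeback earlier and cap dirty memory lower, so fewer
# dirty NFS pages are outstanding when memory gets tight.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
```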
(In reply to Jeff Layton from comment #16)
> I assume the stack trace is from the VPN daemon itself and the NFS mount is
> being done over the VPN?
> It might be best to report this bug to the linux-nfs mailing list and see
> whether anyone there has ideas.
I can post it there, but I am not a kernel developer. https://bugzilla.kernel.org seems more appropriate to me.
One of the goals of package maintainership is to act as a qualified interface to upstream development.
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 18 kernel bugs.
Fedora 18 has now been rebased to 3.11.4-101.fc18. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you have moved on to Fedora 19, and are still experiencing this issue, please change the version to Fedora 19.
If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale.
It has been over a month since we asked you to test the 3.11 kernel updates and let us know whether your issue has been resolved or is still a problem. At that point the bug was set to needinfo. Because the needinfo flag is still set, we assume either this is no longer a problem or you cannot provide additional information to help us resolve the issue. As a result, we are closing this bug with insufficient data. If this is still a problem, we apologize; feel free to reopen the bug and provide more information so that we can work towards a resolution.
If you experience different issues, please open a new bug report for those.