Bug 834808
Summary: nfs over openvpn (and cgroup?) locks up

Product: [Fedora] Fedora
Component: kernel
Version: 18
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: unspecified
Keywords: Reopened
Reporter: Jan Kratochvil <jan.kratochvil>
Assignee: Jeff Layton <jlayton>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: bfields, gansalmon, itamar, jan.kratochvil, jforbes, jlayton, jonathan, kernel-maint, madhu.chinakonda, steved, trond.myklebust
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Type: Bug
Regression: ---
Last Closed: 2013-11-27 15:58:07 UTC
Any other interesting log messages on client or server? Can you watch the NFS traffic in wireshark, and if so, what do you see? (You'd want to be able to watch the traffic that's going over the vpn, not the on-the-wire traffic. I'm not sure how to do that.)

Isn't the backtrace clear enough? openvpn wants to send a packet and needs memory for it; the kernel tries to flush out some pages to satisfy the allocation, which invokes NFS, which tries to contact the server - but over openvpn, which is now locked up.

(In reply to comment #1)
> Any other interesting log messages on client or server?

No.

> Can you watch the NFS traffic in wireshark, and if so, what do you see?

I no longer have the ethernet connection set up (only WiFi now).

> (You'd want to be able to watch the traffic that's going over the vpn, not
> the on-the-wire traffic. I'm not sure how to do that.)

I can just sniff the openvpn interface instead of the ethernet interface.

(cc'ing Trond in case he's interested)

You may want to report this upstream to the linux-nfs ml. This looks somewhat similar to a case that we've been working on upstream there: http://marc.info/?l=linux-nfs&m=134089555815444&w=2

There, the reclaim of the page was recursing back into NFS from rpciod, so we fixed it (we think, anyway) by setting PF_FSTRANS in rpciod to prevent doing the commit there in that case. The connection to NFS is less clear here, but the problem is similar. How is the NFS client to know that we need this process in order to complete the commit? I'm beginning to think we might be best off never doing a blocking commit from ->releasepage. Thoughts?

Created attachment 597103 [details]
patchset -- don't block in releasepage when recursing back from a fs transaction or memory allocation
This patchset is still an RFC, but is probably worth testing to see if it fixes the problem. I suspect that it will fix the issue for you, but it's unclear whether it might cause other issues when memory is tight and we need to commit some NFS pages in order to satisfy an allocation.
Jan can you test this and let me know if it helps at all?
Jan, have you been able to test this patch at all? I'd like to know whether it helped you at all...

Testing requires moving the computer across the house, and I have moved house in the meantime; unfortunately I filed this only around the time I moved. I expected you also had other reproducers for it. If not, I can temporarily move the computer across the house for the test. I have verified that over 802.11b the problem does not happen.

The patch from comment 4 probably fixes the bug; I can say so more certainly after a few more weeks. I first reproduced the hang with kernel-3.4.7-1.fc16.x86_64 (see below). BTW, with kernel-3.4.7-1.fc16.x86_64 I had (IIRC) 2 hard lockups without any traces in either netconsole or kdump. I cannot say they were due to this bug, but normally I do not have such hangs. With kernel-3.4.7-1.openvpn.fc16.x86_64 I also had one hard lockup; I do not know its cause.

With kernel-3.4.7-1.openvpn.fc16.x86_64 carrying the patch from comment 4 (https://koji.fedoraproject.org/koji/taskinfo?taskID=4366480) I have been unable to reproduce the bug so far. But reproduction is not easy on a freshly booted OS; it is most reproducible on a system that has been running for a month or so.
 09:35:21 up 2 days, 39 min, 26 users, load average: 0.17, 0.20, 0.16

kernel-3.4.7-1.fc16.x86_64:

 15:36:59 up 1 day, 17:26, 23 users, load average: 5.67, 3.20, 1.99

[<ffffffffa0873e29>] nfs_wait_bit_killable+0x39/0x90 [nfs]
[<ffffffffa088386a>] nfs_commit_inode+0x9a/0x300 [nfs]
[<ffffffffa0872364>] nfs_release_page+0x84/0xa0 [nfs]
[<ffffffff81122832>] try_to_release_page+0x32/0x50
[<ffffffff81135fa2>] shrink_page_list+0x792/0x9a0
[<ffffffff811365cb>] shrink_inactive_list+0x18b/0x520
[<ffffffff81137138>] shrink_mem_cgroup_zone+0x448/0x5d0
[<ffffffff81137336>] shrink_zone+0x76/0xa0
[<ffffffff811376d4>] do_try_to_free_pages+0x104/0x5b0
[<ffffffff81137e5b>] try_to_free_pages+0xab/0x170
[<ffffffff8112ba3e>] __alloc_pages_nodemask+0x5ae/0x950
[<ffffffff81163b36>] alloc_pages_current+0xb6/0x120
[<ffffffff8116edd5>] new_slab+0x245/0x2f0
[<ffffffff815f490e>] __slab_alloc+0x30f/0x424
[<ffffffff81171d97>] __kmalloc_node_track_caller+0x97/0x1f0
[<ffffffff814e0e08>] __alloc_skb+0x78/0x230
[<ffffffff814db72f>] sock_alloc_send_pskb+0x1df/0x320
[<ffffffffa037fc4f>] tun_get_user+0x15f/0x4e0 [tun]
[<ffffffffa0380674>] tun_chr_aio_write+0x64/0x90 [tun]
[<ffffffff8118326a>] do_sync_write+0xda/0x120
[<ffffffff81183b63>] vfs_write+0xb3/0x180
[<ffffffff81183e8a>] sys_write+0x4a/0x90
[<ffffffff816047e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version. Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.
If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug; a fresh bug referencing this bug in a comment is sufficient.)

With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.

kernel-3.6.10-2.fc16.x86_64

Jan, you posted a comment with just a kernel version in it and reopened the bug. Can you elaborate?

I got a lockup looking exactly like the one described above. Unfortunately I SIGKILLed the process before I took its /proc/PID/stack. My guess is that it was the same hang, but I cannot confirm it in any way. This automatic closing of bugs does not make sense without any known change in the code related to the bug report.

This message is a reminder that Fedora 16 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events.
Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.

kernel-3.9.9-201.fc18.x86_64

[<ffffffffa000a559>] rpc_wait_bit_killable+0x39/0x90 [sunrpc]
[<ffffffffa000a51d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
[<ffffffffa077bdfa>] nfs_initiate_commit+0xea/0x110 [nfs]
[<ffffffffa077dbcb>] nfs_generic_commit_list+0x8b/0xf0 [nfs]
[<ffffffffa077dd25>] nfs_commit_inode+0xf5/0x160 [nfs]
[<ffffffffa076f0ee>] nfs_release_page+0x8e/0xb0 [nfs]
[<ffffffff81133a12>] try_to_release_page+0x32/0x50
[<ffffffff8114768f>] shrink_page_list+0x5ff/0x990
[<ffffffff81148027>] shrink_inactive_list+0x157/0x410
[<ffffffff81148869>] shrink_lruvec+0x229/0x4e0
[<ffffffff81148b86>] shrink_zone+0x66/0x180
[<ffffffff81149010>] do_try_to_free_pages+0x100/0x620
[<ffffffff81149838>] try_to_free_pages+0xf8/0x180
[<ffffffff8113d805>] __alloc_pages_nodemask+0x6a5/0xae0
[<ffffffff8117c308>] alloc_pages_current+0xb8/0x190
[<ffffffff811862e0>] new_slab+0x2d0/0x3a0
[<ffffffff8165aeaf>] __slab_alloc+0x393/0x560
[<ffffffff8118a62f>] __kmalloc_node_track_caller+0xbf/0x2a0
[<ffffffff815454dc>] __kmalloc_reserve.isra.51+0x3c/0xa0
[<ffffffff81547a6c>] __alloc_skb+0x7c/0x290
[<ffffffff81540221>] sock_alloc_send_pskb+0x1d1/0x350
[<ffffffff815403b5>] sock_alloc_send_skb+0x15/0x20
[<ffffffff8158eb38>] __ip_append_data.isra.38+0x5c8/0xa00
[<ffffffff8159115d>] ip_make_skb+0xdd/0x120
[<ffffffff815b8d3b>] udp_sendmsg+0x2db/0xa10
[<ffffffff815c3e63>] inet_sendmsg+0x63/0xb0
[<ffffffff8153aa30>] sock_sendmsg+0xb0/0xe0
[<ffffffff8153df5d>] sys_sendto+0x12d/0x180
[<ffffffff8166afd9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

It has reproduced on its own.

(In reply to Jan Kratochvil from comment #15)
> kernel-3.9.9-201.fc18.x86_64
>
> [<ffffffffa000a559>] rpc_wait_bit_killable+0x39/0x90 [sunrpc]
> [... rest of the stack trace quoted from comment #15 ...]
> [<ffffffffffffffff>] 0xffffffffffffffff

I assume the stack trace is from the VPN daemon itself and the NFS mount is being done over the VPN? If so, then I simply don't see a way to reasonably fix this. The core of the problem is a disconnect between the daemon handling your VPN and the NFS subsystem. If the kernel somehow knew that it shouldn't try to flush NFS pages in order to handle memory allocations for this particular daemon, that would work, but there isn't a way to do that as far as I can tell.

Another idea might be to never allow nfs_releasepage to block, but that might cause other problems. It might be best to report this bug to the linux-nfs mailing list and see whether anyone there has ideas. Until then, I think the only thing you can do here is to try to prevent this by tuning the VM to keep the kernel out of direct reclaim.

(In reply to Jeff Layton from comment #16)
> I assume the stack trace is from the VPN daemon itself and the NFS mount is
> being done over the VPN?

Yes.

> It might be best to report this bug to the linux-nfs mailing list and see
> whether anyone there has ideas.

I can post it there, but I am not a kernel developer. https://bugzilla.kernel.org seems more appropriate to me. One of the goals of package maintainership is to act as a qualified interface to upstream development.

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 18 kernel bugs.

Fedora 18 has now been rebased to 3.11.4-101.fc18. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 19 and are still experiencing this issue, please change the version to Fedora 19. If you experience different issues, please open a new bug report for those.

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. It has been over a month since we asked you to test the 3.11 kernel updates and let us know if your issue has been resolved or is still a problem. When this happened, the bug was set to needinfo. Because the needinfo flag is still set, we assume either this is no longer a problem or you cannot provide additional information to help us resolve the issue. As a result we are closing with insufficient data.

If this is still a problem, we apologize; feel free to reopen the bug and provide more information so that we can work towards a resolution. If you experience different issues, please open a new bug report for those.
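On the suggestion above of tuning the VM to keep the kernel out of direct reclaim: a sketch of what such tuning could look like as a sysctl fragment. The specific values are illustrative assumptions, not tested recommendations from this report.

```conf
# /etc/sysctl.d/90-nfs-over-vpn.conf -- illustrative values only.
# Keep a larger free-page reserve so kswapd reclaims in the background
# before allocations (such as openvpn's skb allocations) fall into
# direct reclaim and hit the NFS commit path.
vm.min_free_kbytes = 131072
# Start writeback of dirty (including NFS) pages earlier, so fewer
# dirty pages accumulate that reclaim would have to commit synchronously.
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
```

The fragment would be applied with `sysctl --system`. This only makes the deadlock less likely; it does not remove the underlying recursion.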
Created attachment 593938 [details]
/etc/cgconfig.conf

Description of problem:
NFSv4 mounts over openvpn over ethernet lock up on any larger transfer. It is easier (only?) reproducible when cgroup is also used.

Version-Release number of selected component (if applicable):
kernel-3.3.8-1.fc16.x86_64, but this has been a problem for at least a year.

How reproducible:
Often: copy a larger file (it happened after 1.3GB now) client->server. I reproduced it two times but not the third time now.

Steps to Reproduce:
100Mb ethernet, openvpn (comp-lzo no), client = X220 notebook, 8GB RAM, SSD disk.

server:/etc/exports:
/ 192.168.67.2(rw,no_root_squash,async,fsid=0,secure)

client:/etc/fstab:
host1s:/ /host1 nfs defaults,noauto,nfsvers=4,resvport 0 0

client:/etc/cgconfig.conf - as attached. Use user jkratoch (as defined in cgconfig) - but I did not put the cp command into any cgroup class, so cgroup may not matter.

Actual results:
openvpn is stuck in D state.

cat /proc/pid-of-openvpn/stack
[<ffffffffa08484a9>] nfs_wait_bit_killable+0x39/0x90 [nfs]
[<ffffffffa085737a>] nfs_commit_inode+0xaa/0x250 [nfs]
[<ffffffffa0845f14>] nfs_release_page+0x84/0xa0 [nfs]
[<ffffffff811205f2>] try_to_release_page+0x32/0x50
[<ffffffff81133e62>] shrink_page_list+0x792/0x9a0
[<ffffffff81134484>] shrink_inactive_list+0x184/0x4f0
[<ffffffff81134fc8>] shrink_mem_cgroup_zone+0x448/0x5d0
[<ffffffff811351c6>] shrink_zone+0x76/0xa0
[<ffffffff81135564>] do_try_to_free_pages+0x104/0x5c0
[<ffffffff81135cfb>] try_to_free_pages+0xab/0x170
[<ffffffff811299dc>] __alloc_pages_nodemask+0x53c/0x8f0
[<ffffffff81161273>] alloc_pages_current+0xa3/0x110
[<ffffffff8116c2e5>] new_slab+0x245/0x2f0
[<ffffffff815ed28d>] __slab_alloc+0x32b/0x440
[<ffffffff8116f2b7>] __kmalloc_node_track_caller+0x97/0x1f0
[<ffffffff814d9678>] __alloc_skb+0x78/0x230
[<ffffffff814d3fbc>] sock_alloc_send_pskb+0x1ac/0x2f0
[<ffffffff814d4115>] sock_alloc_send_skb+0x15/0x20
[<ffffffff81520861>] __ip_append_data+0x711/0xa10
[<ffffffff81522bd0>] ip_make_skb+0xd0/0x110
[<ffffffff815464ef>] udp_sendmsg+0x2df/0x990
[<ffffffff81550f24>] inet_sendmsg+0x64/0xb0
[<ffffffff814cedc7>] sock_sendmsg+0x117/0x130
[<ffffffff814d22bd>] sys_sendto+0x13d/0x190
[<ffffffff815fc5e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

ps axwfl | grep 27760
1 0 27760 1 20 0 42156 856 nfs_wa Ds ? 0:33 /usr/sbin/openvpn --daemon --writepid /var/run/openvpn/vidhome.pid --cd /etc/openvpn/ --config vidhome.conf

cat /proc/27760/status
Name: openvpn
State: D (disk sleep)
Tgid: 27760
Pid: 27760
PPid: 1
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
Utrace: 0
FDSize: 64
Groups:
VmPeak: 42160 kB
VmSize: 42156 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 964 kB
VmRSS: 856 kB
VmData: 656 kB
VmStk: 136 kB
VmExe: 544 kB
VmLib: 5672 kB
VmPTE: 108 kB
VmSwap: 0 kB
Threads: 1
SigQ: 2/61764
SigPnd: 0000000000000000
ShdPnd: 0000000000004000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180004a03
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed: ff
Cpus_allowed_list: 0-7
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 380550
nonvoluntary_ctxt_switches: 12410

cat /proc/27760/stat
27760 (openvpn) D 1 27760 27760 0 -1 1077946432 190 0 0 0 2262 1065 0 0 20 0 1 0 102118072 43167744 214 18446744073709551615 4194304 4748500 140733664299952 140733664296248 237421666339 0 0 4096 18947 18446744072107623593 0 0 17 0 0 0 0 0 0 8948508 8959424 26247168

Expected results:
No lock up.

Additional info:
Therefore normally I use scp instead of NFS cp.
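The reproduction steps above could be sketched as a shell session. This is a hedged sketch based only on the details in this report; the hostnames, mount point, and file size are taken or assumed from the description, and the commands require root on the respective machines.

```shell
# server: export the root filesystem to the VPN client (matches /etc/exports above)
exportfs -o rw,no_root_squash,async,fsid=0,secure 192.168.67.2:/

# client: mount over the openvpn tunnel and push a large file;
# in this report, transfers of ~1.3GB were enough to trigger the hang
mount -t nfs -o nfsvers=4,resvport host1s:/ /host1
dd if=/dev/urandom of=/host1/tmp/bigfile bs=1M count=2048

# if openvpn wedges, it shows up in D state; capture its kernel stack
ps axwl | awk '$10 ~ /^D/'
cat /proc/$(pgrep -x openvpn)/stack
```

Capturing /proc/PID/stack before killing the hung process is what makes reports like comment #15 diagnosable.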