Bug 834799
Summary: NFS write performance with default options absolutely terrible on F16<->F16

Product:          Fedora
Component:        kernel
Version:          16
Hardware:         i686
OS:               Linux
Status:           CLOSED CURRENTRELEASE
Severity:         medium
Priority:         unspecified
Reporter:         Adam Pribyl <covex>
Assignee:         nfs-maint
QA Contact:       Fedora Extras Quality Assurance <extras-qa>
CC:               bfields, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone: ---
Target Release:   ---
Doc Type:         Bug Fix
Type:             Bug
Regression:       ---
Last Closed:      2012-10-23 18:17:27 UTC
Description (Adam Pribyl, 2012-06-23 18:02:34 UTC)
Would it be possible to get a look at what the server threads are doing at the time? Are they stuck waiting somewhere (in which case sysrq-t output might be interesting), or are they burning CPU (in which case maybe something like perf could give us an idea where they're spending their time)? Can you reproduce the same problem on F17?

I am not sure I can provide what you ask for, as I probably do not know how to do that. Here is a perf top on the server side during the transfer:

  26.97%  [sunrpc]     [k] svc_recv
  10.09%  [sunrpc]     [k] svc_tcp_has_wspace
   7.57%  [sunrpc]     [k] svc_xprt_enqueue
   6.57%  [kernel]     [k] local_bh_enable_ip
   4.57%  [sunrpc]     [k] svc_xprt_release
   3.98%  [kernel]     [k] _raw_spin_unlock_bh
   2.48%  [kernel]     [k] _raw_spin_lock_bh
   2.23%  [kernel]     [k] __list_del_entry
   2.02%  [kernel]     [k] kfree
   1.90%  [sunrpc]     [k] svc_release_skb
   1.81%  [unknown]    [.] 0xb776e55f
   1.66%  [sunrpc]     [k] svc_xprt_received
   1.65%  [kernel]     [k] __list_add
   1.59%  [kernel]     [k] page_address
   1.51%  [sunrpc]     [k] svc_xprt_put
   1.48%  [kernel]     [k] _cond_resched
   1.18%  [kernel]     [k] local_bh_disable
   1.18%  [sunrpc]     [k] svc_pool_for_cpu
   1.01%  [sunrpc]     [k] svc_reserve
   0.71%  [kernel]     [k] kthread_should_stop
   0.70%  [kernel]     [k] acpi_idle_do_entry
   0.68%  libperl.so   [.] 0x000462c3
   0.58%  [nfsd]       [k] nfsd4_encode_components

top usually shows:

  1954 root  20  0  0  0  0  D  28.7  0.0   3:50.38  nfsd
  1955 root  20  0  0  0  0  R  25.1  0.0   3:45.21  nfsd
  1957 root  20  0  0  0  0  D  21.1  0.0   3:50.41  nfsd
  1953 root  20  0  0  0  0  D  10.9  0.0   3:55.68  nfsd
  1959 root  20  0  0  0  0  D  10.9  0.0   3:50.18  nfsd
  1960 root  20  0  0  0  0  D   1.7  0.0   3:51.02  nfsd
   374 root  20  0  0  0  0  S   0.7  0.0  36:08.36  md125_raid1
   867 root  20  0  0  0  0  D   0.7  0.0   0:07.37  jbd2/md125-8

It looks like they are burning CPU; at least top shows the %CPU used.

Thanks! Yes, that gives me a better idea where to look... I may not get the chance soon, though. You say you didn't see this problem (or saw only a milder version of it, I'm not clear?) on F15. What was the last kernel version that worked for you on F15? Also, are you positive that it was a server upgrade that caused the problem (did the client version stay the same throughout)? Another approach might be to start bisecting to figure out exactly which kernel version started failing. If you watch the client-server communication with wireshark, what does the network traffic look like after the problem starts? (You could upload an example; I'd need the full binary data.)

Created attachment 594188 [details]
output of sysrq t

Attached is the output of sysrq-t during transfer.
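For anyone reproducing these diagnostics, the data above can be gathered roughly as follows (a sketch, not taken verbatim from this report; run as root on the NFS server and assumes the perf package is installed and sysrq is enabled):

```
# Show which kernel symbols are consuming CPU while the slow
# transfer is in progress (interactive; quit with 'q'):
perf top

# Trigger a sysrq "t" dump of every task's stack into the kernel
# ring buffer, then save it for attachment:
echo t > /proc/sysrq-trigger
dmesg > sysrq-t.txt
```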
Well, the performance on F15 was better, but definitely not good. On a Gbit network I'd expect write performance to be at the array's speed. I've updated only the server; the client remained the same, though its kernel also changed to 3.4.2 (x86_64) in the meantime. I don't think that is of great importance, as the performance is bad overall.

Are the top and sysrq-t both taken *after* the performance got bad? Also: all the above seems to be with exports that have the "async" export option set in /etc/exports, is that correct? Might be worth seeing what the performance is like without that. The sysrq-t trace shows 8 threads, mostly waiting for writes. You probably want more than 8 threads, though I doubt increasing the count will fix the problem.

Yes, this is when the performance got bad. This should be with the default export options, which means sync. When I add "async" to the export, the performance is much better: almost what is expected. The strange thing with the threads is that while on F16 iotop shows roughly 60 kB/s of write per thread, on F15 it was roughly 600 kB/s per thread (still too low, however). I know I can increase NFS write performance with various options like "insecure" and "async", but the default behaviour is quite frustrating. I understand there could be some penalty as a trade-off for secure and sync, but this is somewhat too much. Also, the load on the server is really abnormal, close to 10, and increasing the thread count makes it even worse. The client often gets stuck while these slow writes happen on the server. When I did this test with a 4 GB file, I had to hard-reset the client, as it became completely unresponsive. I know that waiting long enough for the write to finish would also have made it wake up again, but I was not willing to wait for 4 GB to be transferred at 60 kB/s.

# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version.
Please retest with this kernel, and let us know if your problem has been fixed. In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered. If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug; a fresh bug referencing this one in a comment is sufficient.)

Given the prominence of svc_recv, svc_tcp_has_wspace, and svc_xprt_enqueue in the perf top output, this is almost certainly the bug fixed by upstream commit d10f27a750 "svcrpc: fix svc_xprt_enqueue/svc_recv busy-looping", which is included in 3.6. I'm therefore closing this bug under the assumption that it is fixed; however, any confirmation, or information to the contrary, is still welcome.
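For reference, the two server-side knobs discussed in this report can be sketched as follows. The path, network range, and thread count below are illustrative assumptions, not values from the reporter's setup:

```
# /etc/exports -- the default sync behaviour commits every write to
# stable storage before the server replies; async lets the server
# acknowledge writes that are still only in its page cache (much
# faster, but data can be lost if the server crashes):
/srv/export  192.168.0.0/24(rw,sync)
/srv/export  192.168.0.0/24(rw,async)

# /etc/sysconfig/nfs (Fedora of this era) -- raise the nfsd thread
# count from the default of 8; takes effect on nfs service restart:
RPCNFSDCOUNT=16
```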