Bug 455408
Summary: NFS performance problems on s390x

Product: Red Hat Enterprise Linux 5
Reporter: Jan Wildeboer <jwildebo>
Component: kernel
Assignee: Peter Staubach <staubach>
Status: CLOSED WONTFIX
QA Contact:
Severity: high
Docs Contact:
Priority: medium
Version: 5.2
CC: al.koch, bhinson, epasch, fleite, jlayton, jwest, loeh, rwheeler, sardella, smolinski, sprabhu, sputhenp, steved, tao, thoss
Target Milestone: rc
Target Release: ---
Hardware: s390x
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-02-08 15:57:19 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 533192
Attachments:
Comment 4
Jeff Layton
2008-07-15 14:15:58 UTC
New info from IBM: So far we have found that the delays we are experiencing come from the open and close operations. The actual I/O write performs at sufficient levels; however, the open/close take 10 minutes for NFSv3, and the close in NFSv4 takes 2 minutes 40 seconds. I have attached a system trace of the write operation for your reference (stracelog.txt added as attachment).

Even more info, does this help? Problem identification in short words:
- Linux sends data out of sequence
- z/OS expects data sequentially
- NFS has to put the data blocks in the right order; this is done in a cache
- if the next part of a data set does not arrive before the cache fills, then we get all of the observed problems (partial records, timeouts, etc.)

Solution proposition in short words:
- increase the cache window (an NFS server attribute) to 255 (or 256, I'm not sure personally); this is the maximum
- find out if there is a way to force Linux to send data sequentially

Well, the Solaris system may already have the name cached in its DNLC, but it will still generate a GETATTR as part of the close-to-open processing. I am somewhat surprised that Linux generates the LOOKUP, since it could already have the name cached. What happens if "ls -lasi <thenameofthefiletobeopenedandwrittento>" is run before the dd command?

Created attachment 311935 [details]
STRACE
This is the strace; see the comment for details.
Created attachment 311936 [details]
TCPDUMP
This is the corresponding TCPDUMP; see the comment for details.
Comments 17 and 18 contain attachments for our latest test.

* Test objective:
  - copy (cp) a local 2 GB file to a preallocated (on z/OS) file on the NFS share
* Test parameters:
  - NSCD stopped
  - no "ls -lisa" on the target file
  - mount was done with: mount -o intr "gplex1-nfssv.appl.ubs.ch:GZ,binary,space(150,20),cyls,norlse" /mnt/t329823/test001
* Test observations:
  - TCPDUMP was started with: tcpdump -vvXx -s 256 -i any "host gplex1-nfssv.appl.ubs.ch" -w /sapmnt/spool/nfs-t329823/capture-20080716-001.txt
  - test started at 11:03 with: time strace -f -T -tt -o /sapmnt/spool/strace-20080716-001.txt cp /sapmnt/spool/nfs-t329823/test.file /mnt/t329823/test001/H000.SAP.S1.W0.S10.LINUX.CHECK; date
  - the long wait on open() could not be observed (completed in <0.038868>)
  - at 11:04 all read/write operations were finished and the close of the target file was attempted:
    7110 11:04:17.072872 close(4) = -512 <1974.150421>
    7110 11:37:11.223332 --- SIGINT (Interrupt) @ 0 (0) ---
  - after more than 30 minutes we cancelled the copy; during the wait we saw only commit requests with "nfsstat -c", about 40

This is the output of the tcpdump and cp commands:

[root@TLG101 ~]# tcpdump -vvXx -s 256 -i any "host gplex1-nfssv.appl.ubs.ch" -w /sapmnt/spool/nfs-t329823/capture-20080716-001.txt
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 256 bytes
437833 packets captured
907246 packets received by filter
31572 packets dropped by kernel

[cy1adm@TLG101 mnt]$ time strace -f -T -tt -o /sapmnt/spool/strace-20080716-001.txt cp /sapmnt/spool/nfs-t329823/test.file /mnt/t329823/test001/H000.SAP.S1.W0.S10.LINUX.CHECK; date
cp: closing `/mnt/t329823/test001/H000.SAP.S1.W0.S10.LINUX.CHECK': Bad file descriptor
real 33m47.731s
user 0m8.658s
sys 0m23.458s
Wed Jul 16 11:37:11 CEST 2008

I will proceed now with "nscd" turned on.

Results of the second test: same configuration, but NSCD turned on.
This time we cancelled earlier, but the observations were the same:
- no long wait on open
- fast read/write
- wait on close
- no NFS requests visible with nfsstat except commits

[root@TLG101 nfs-t329823]# tcpdump -vvXx -s 256 -i any "host gplex1-nfssv.appl.ubs.ch" -w /sapmnt/spool/nfs-t329823/capture-20080716-002.t
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 256 bytes
264478 packets captured
614111 packets received by filter
85153 packets dropped by kernel

[cy1adm@TLG101 mnt]$ date; time strace -f -T -tt -o /sapmnt/spool/strace-20080716-002.txt cp /sapmnt/spool/nfs-t329823/test.file /mnt/t329823/test0002/H000.SAP.S1.W0.S11.LINUX.CHECK; date
Wed Jul 16 12:19:59 CEST 2008
cp: closing `/mnt/t329823/test0002/H000.SAP.S1.W0.S11.LINUX.CHECK': Bad file descriptor
real 12m22.807s
user 0m8.578s
sys 0m23.222s
Wed Jul 16 12:32:22 CEST 2008

The logs will follow in the next two attachments.

We are getting closer. Just to make sure: is nfslockd up and running? Any interesting messages in /var/log/messages? Jan

Created attachment 311939 [details]
STRACE TEST 002
If it is known that the NFS clients will _not_ be sharing files, then it may be interesting to try the "nocto" mount option on the Linux NFS client. However, if the NFS clients are sharing files, then this option may easily lead to inconsistent caches among the clients and should be avoided.

Looking at the capture and strace from comment #17 and comment #18: lookups in this capture are fairly speedy, so perhaps the long lookup times in the earlier capture were an anomaly of some sort. They're still rather disturbing though and probably need to be addressed. nscd should have no effect on this test once the open() is complete.

The big delay in the strace seems to be the close() of the destination file, which was apparently interrupted (probably ^C'ed by the user):

7110 11:04:17.072872 close(4) = -512 <1974.150421>
7110 11:37:11.223332 --- SIGINT (Interrupt) @ 0 (0) ---

...as a side note, that looks like a bug -- we shouldn't return -ERESTARTSYS to userspace, ever. Regardless, that's unrelated to any performance problem. The reason the close takes so long is probably that it has to flush all of the dirtied pages to the server before it can return.

Looking again at the capture, I'll first note that it seems to be incomplete. I don't see a write call starting at offset 32768, for instance. Still, we can try to do some analysis on what we have.

Stats for WRITE calls:
- # of calls: 61770
- min service response time: .00012s
- max service response time: 100s
- avg service response time: .51s

...half a second to reply to a write seems pretty long, but it's probably just skewed by some packets that have extraordinarily long response times. 100s is a *really* long time for the server to not respond.

COMMIT calls:
- # of calls: 5
- min SRT: .93s
- max SRT: 20.5s
- avg SRT: 12.7s

The commit stats aren't great either. More details in the capture: things seem to chug along pretty speedily until we hit the write in frame 288378. We end up doing a retransmission in frame 288382 (60s later), so there's a 60s delay where nothing was happening while we waited for the server to catch up. It looks like things pick back up again in frame 288409, but it's hard to be sure what happened there -- it looks like some frames were lost from the capture. It's possible the client stalled out here for some other reason, but it could be that more writes were stalled waiting for replies. There's a similar 20s delay around frame 295455, a 10s delay at 295674... We could probably find more if we wanted. These "short" delays add up, though.

The problem, most likely, is that the server is in some cases just not sending us timely responses to our calls. IIRC, RPC slots are a fixed resource -- we can only deal with a certain number of outstanding calls being on the wire. I don't see any hard evidence of the client stalling out here while the server is just waiting around for packets. At almost every point where I see the capture stall, there are outstanding RPC calls on the wire. The exceptions seem to be places where the capture is missing packets -- that might be indicative of a client-side performance issue that would affect NFS performance, but we can't be certain from this info.

If they're concerned about the ordering of write calls, one thing that might be interesting is doing this test with the "sync" mount option, which should turn all of the writes into synchronous writes. It might slow down performance in some cases, but it should ensure that writes are issued on the wire in the order received.

As to why Solaris is faster, I'm not sure -- we don't have any Solaris captures to compare this to. We also don't have any information on how fast Solaris actually is. How long does Solaris take to copy a 2G file to this server?

Finally, I'd like to know exactly what kernel they're running here. Also, what arch is the client?

Created attachment 311944 [details]
TCPDUMP TEST 002
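The per-call service response times quoted in the capture analysis above come from pairing each RPC call with its reply. As an illustration only (this is not the tool actually used in this bug, and the input format is an assumption), the pairing logic can be sketched by matching calls to replies on their RPC XID:

```python
# Sketch: compute NFS service response times (reply time minus call time)
# by pairing RPC calls and replies on their XID. Input records are
# (timestamp_seconds, xid, kind) tuples where kind is "call" or "reply".
# Hypothetical illustration of the stats computed in the analysis above.

def srt_stats(records):
    calls = {}       # xid -> timestamp of the outstanding call
    latencies = []
    for ts, xid, kind in records:
        if kind == "call":
            calls[xid] = ts
        elif kind == "reply" and xid in calls:
            latencies.append(ts - calls.pop(xid))
    if not latencies:
        return None
    return {
        "count": len(latencies),
        "min": min(latencies),
        "max": max(latencies),
        "avg": sum(latencies) / len(latencies),
    }

if __name__ == "__main__":
    sample = [
        (0.000, 0x10, "call"),
        (0.010, 0x10, "reply"),   # fast write
        (0.020, 0x11, "call"),
        (60.020, 0x11, "reply"),  # a stalled write, as seen around frame 288378
    ]
    print(srt_stats(sample))
```

In practice the timestamps and XIDs would be extracted from the capture (e.g. from tshark output) rather than hard-coded.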
Hi all, here are some first answers to your latest questions.

* Yes, nfslockd is running, as far as I know for all the tests done so far.
* Interesting messages: the only interesting message, which I do not quite understand, is:
  Jul 16 13:48:03 tlg101 setroubleshoot: SELinux is preventing ethtool (ifconfig_t) "read write" to socket (initrc_t). For complete SELinux messages. run sealert -l fb82ac8d-fcd1-4dd1-90b2-e04654589d5d
* Arch and kernel version:
  [t329823@TLG101 ~]$ uname -a
  Linux TLG101 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:58 EDT 2008 s390x s390x s390x GNU/Linux
  [t329823@TLG101 ~]$ arch
  s390x
* We are doing tests with the "sync" option right now...

AFAIK, Solaris hasn't been ported to s390x, so I assume they're testing Solaris on a different arch entirely. Do the Solaris machine and the Linux s390x client have comparable amounts of RAM? I'd still like to see the times from the Solaris box for a similar test (a capture, and maybe some truss output with timing stats similar to the ones in the strace, might also be interesting).

I think there is a Solaris port in progress, but anyway, the Solaris tests were done on SPARC. The Solaris machine has 2 GB of physical memory; our Linux guest has 16 GB. We have a snoop trace from the machine -- can you use that? No syscall trace, though.

Ok, it sounds like both machines have enough RAM to cache the entire 2G file in memory, then. I didn't really think that was a problem anyway. It looks like the write syscalls all complete in a timely fashion, which probably means that memory pressure isn't an issue. Still, it's good to know... If you have the snoop capture, I'd like to have a look (particularly if it's a binary capture). Let's not worry about a syscall trace from Solaris at the moment; we can get that later if we think it's needed. Since the problem is that the Linux client is slower than Solaris here, we need to see how long the Solaris client took to do the same 2G copy to this server.
Also, it would be good to know what mount options were used on the Solaris client, and what the network topologies look like, i.e. mostly how each client is accessing the server. Are they both using ethernet or something else? Is it the same for both clients?

I am checking right now whether we can publish the snoop trace; the problem is that it was a full trace on the machine, and it may contain sensitive information. This is the mount option for Solaris:

*****.ch:RZ,text,nofastfilesize,xlat(oemvs311) - /data/sap/RZ nfs - no intr

We did a copy on Monday: a 2.4 GB file from the same NFS to the same share in about 4.5 minutes. The same copy on Linux never completed, resp. was cancelled after a long time. We have done similar tests since from zLinux, and the longest time I waited on a close() operation was over 30 minutes before I cancelled it. On the Linux side we use OSA adapters and have HiperSockets; on Solaris, AFAIK, plain ethernet...

This is my analysis of two strace outputs, and my recommendations as copied from the IBM PMR:

Hi Team, we have analysed the strace data submitted. In both cases we have a syscall (open/close) taking a long time and then being interrupted by SIGINT. We suppose the source of the SIGINT was the operator terminating the command with Ctrl-C. In both cases the syscall completed successfully after SIGINT:
- the open call returns a valid file descriptor (1)
- the close call returns ERESTARTSYS as an indication that it had been interrupted by SIGINT, and a following close() call returns EBADF as an indication that the file has already been closed.

This looks as if the syscall was already completed in both cases, but just refusing to return to userspace -- which is very unusual behavior. Alternatively, the SIGINT (Ctrl-C) could have been handled incorrectly, causing the syscall to return correctly although it might not have been completed correctly beforehand.
The latter theory seems to be wrong, because at least after the open call the remote file could be written successfully. We now need to find out what the processes are doing when they receive the SIGINT and subsequently return to userspace. Can you please:

1) confirm that the operator pressed Ctrl-C to terminate the stalled command.
2) recreate the stalling command and then issue:
   ps axo euid,ruid,tty,tpgid,sess,pgrp,pid,ppid,tpgid,stat,pcpu,wchan:15,comm:25
   From this command we can see in which kernel function the syscall hangs before Ctrl-C is pressed. Then you can abort the test by hitting Ctrl-C.

One more question: are you running any security modules which might hang during execution of the syscall? Are you able to run dbginfo.sh and submit the output to ECURep?

Regards, Holger

Created attachment 311962 [details]
strace #2 - with blocked 'close()' call
In strace #2 we have a close() syscall taking 33 minutes before being
interrupted by SIGINT, then returning -512 (ERESTARTSYS) but apparently having
already completed successfully, since the subsequent close(4) call returns EBADF,
which indicates that the fd has already been removed from the list of open
files.
7110 11:04:17.072850 read(3, "", 4096) = 0 <0.000004>
7110 11:04:17.072872 close(4) = -512 <1974.150421>
7110 11:37:11.223332 --- SIGINT (Interrupt) @ 0 (0) ---
7110 11:37:11.223357 close(4) = -1 EBADF (Bad file descriptor)
<0.000004>
It is very unlikely that the operator hit Ctrl-C at just the moment
the syscall happened to complete ;(
I will have to ask IBM support for access data to the second strace, which I mentioned in today's phone call, and in which the open call was blocking... I will attach that info to the bugzilla. Regards, Holger

Actually, I found the strace in attachment 'strace log' below https://bugzilla.redhat.com/attachment.cgi?id=311849

15:39:40.533041 close(1) = 0
15:39:40.533061 open("/data/sap_native/GZ/H000.SAP.S1.W0.S11.LINUX.CHECK", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1
15:48:09.478212 --- SIGINT (Interrupt) @ 0 (0) ---

Here we see the open call apparently being interrupted by the SIGINT, but returning a valid fd, which is later on used for writing... Regards, Holger

The close() call is pretty clearly broken. We should never return -ERESTARTSYS to userspace. The close is almost certainly taking a long time because the client is flushing pages to the server before it can (or should) return.

The open() is more interesting. I suspect that the open() is taking a long time because of one of these slow LOOKUP calls (as shown in the capture in comment #1). As to why it returned a valid fd, I'm not sure. It might be interesting to see if we can reproduce that somehow (maybe doctor an NFS server with artificial delays in LOOKUP calls). The signal handling is a separate issue from any performance concerns, however.

Created attachment 311973 [details]
simple perl script to detect out-of-order write calls
Here's a quick perl script to detect out of order write calls. I ran it like
this:
$ tshark -r /tmp/capture-20080716-001.txt | grep 'WRITE Call' | /tmp/seq.pl
It just detects when a WRITE call has an offset that's lower than the offset in
the previous write call. From this, I see 568 writes that have offsets lower
than the previous one.
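The attached perl script itself is not shown inline. The same check can be sketched in a few lines of Python -- a reimplementation of the described logic, not the attachment itself, and the input format (one WRITE offset per call, in capture order) is an assumption:

```python
# Sketch: count WRITE calls whose file offset is lower than the offset
# of the previous WRITE call, i.e. out-of-order writes as seen on the
# wire. Mirrors the logic described for the attached perl script.

def count_out_of_order(offsets):
    out_of_order = 0
    prev = None
    for off in offsets:
        if prev is not None and off < prev:
            out_of_order += 1
        prev = off
    return out_of_order

if __name__ == "__main__":
    # hypothetical offsets as they might appear in a capture:
    # the 16384 block arrives after the 32768 block
    print(count_out_of_order([0, 32768, 16384, 49152, 65536]))  # -> 1
```

The offsets would normally be fed in from something like `tshark -r capture.txt | grep 'WRITE Call'` with the offset field extracted from each line.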
It appears that the client is not handling NFS3ERR_JUKEBOX correctly and is treating it as a normal error in some path. That's a bug, and we should address it. That said, I suspect that the problem is that the NFS client is constantly getting JUKEBOX errors returned and is never transmitting the data that the server needs. I might also suggest checking the server code to ensure that it is handling the out-of-cache situation correctly, so that when a missing block does come in, it can handle it and write out some data to the file.

Jeff, it appears that I should take this one for the time being. If you disagree, please let me know. I would appreciate your continued assistance, though.

All yours Peter, but I'll add this that I already wrote up... To clarify, we have several separate problems:

1) the problem that the z-series server doesn't deal well with out-of-order writes. This is the problem that was discussed on today's concall. Peter is going to explore whether we can send writes in a more ordered fashion.

2) the fact that the z-series server occasionally takes a *very* long time to reply to LOOKUPs (the capture in comment #1 shows this). I believe this is the likely cause for open() calls that take a long time.

3) the Linux client is not handling signals the way it should during open/close. close() should not return -ERESTARTSYS. There's also the question of why we get a valid fd when the process is signaled while hung on an open() call (perhaps the RPC LOOKUP call was reissued and returned immediately?). This is probably related to the JUKEBOX errors that Peter mentions.

4) a separate problem (possibly related to bug 281241?) that's causing captures to miss some packets. For instance, in capture-20080716-001.txt, I don't see a WRITE packet that starts at offset 32k, but I'm pretty sure that one should be there (unless maybe the capture was killed before it was sent).
Some of these should probably be split into separate BZs, since they are separate problems.

When benchmarking the performance problems they're having, I'd suggest that they do not signal the cp. I believe it's confusing the timing of the syscalls and is probably making them kill off the captures before all the data has been sent. Just let them run -- we should treat the signal handling problems separately, since that's really a different problem.

Also, 30+ minutes is a really long time to have to wait. They may want to cut down the size of the file they're copying, since I think they can probably demonstrate the issue with smaller file sizes (512m or 256m or so is probably sufficient). That should make for faster testing and smaller syscall traces/captures.

Also, on #4 above: I see several 'TCP ACKed lost segment' packets in wireshark for that capture, which leads me to believe that the captures are incomplete. If they have the latest libpcap, then they may want to open a bug for that and see if the cause can be determined. Having complete captures will make all of these problems much easier to troubleshoot.

Hi All, we have done intensive testing with the "sync" mount option today. As we mentioned yesterday, our first tests with sync (which did not produce any usable results) were spoiled by other issues. We tried copies (with dd and cp) of one, 10, 15, and 20 files in parallel -- copies from local files to NFS as well as from one NFS file to another. The local files were 2.7 GB, the NFS file 2.4 GB.

We did not see:
- long lookups
- any long waits or hangs on open() or close()
- any "incomplete records" warnings in the NFS server

Performance for a single copy was within the Solaris time frame, about 4 minutes with variations of about 45 seconds. Of course, for parallel copies we saw increased elapsed time.

We saw two issues during our tests:
- Massive parallel copies (15 or 20) resulted in "Short on storage" errors in the NFS server. This resulted in "Input/Output Error" in the copy processes on the Linux side. This is fine by me: if the server can't handle write requests, it has to abort and inform the client. For me this is correct behaviour.
- A high CPU utilization on the NFS server, about 10% per file write/copy. With the parallel tests we reached a limit of 70%-80%, which AFAIK is due to restrictions in our system/other workload running. We do not know yet what CPU utilization we see with parallel copies from Solaris clients, so maybe this is not a real issue.

We are preparing tests with the SAP application, for which we will hopefully have results for the confcall. Also, we need to clarify whether the performance for parallel copies is acceptable. But so far it looks like using "sync" is the right way to go. I will create an attachment with a summary of our tests; if you need details, please tell me.

I have not yet done a ps axo euid,ruid,tty,tpgid,sess,pgrp,pid,ppid,tpgid,stat,pcpu,wchan:15,comm:25 on a stalled copy process -- would that still be helpful? As I mentioned before, we have not seen the stalled close() with the sync option. Do you need any additional information or traces for the moment? I guess we should wait and see the results we get from the application tests; maybe "sync" really is a solution here.

Created attachment 312044 [details]
Summary tests with "SYNC" option
This is a short summary of the tests with the sync mount option
Here are the results from our latest efforts. We were able to run the functional test in SAP (copy one NFS file to another), which is the test case that originally made us aware of the problems. That means we successfully copied a 2.4 GB file. Performance is about 2.5 times longer than on Solaris (keep in mind we have the "sync" option on Linux, not on Solaris). To be sure we don't have an unstable problem, we will do more tests tomorrow, with parallel copies and with the system under load. So, we can confirm the functional test case is successful so far; performance is still an issue. We did not do any TCP traces yet, because I am waiting on feedback from Jan on the libpcap issue.

Good day, we redid a test with the "ASYNC" option to get another clean set of documentation for our original problem. I will send the attachments later; it's a zip file this time. Here are the test setup and test log:

* Mount options: text,nofastfilesize,xlat(oemvs311) nfs rw,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=*** 0 0
* tcpdump taken with: tcpdump -vvXx -s 256 -i any "host **our-nfs-server**" -w /tmp/capture-20080718-001.txt
* The copy was done with dd and straced: date; time strace -f -T -tt /tmp.txt dd if=/mnt/archive/test/test-736.file of=/mnt/nfstest001/data0001 bs=16k
  and started on: Fri Jul 18 14:16:01 CEST 2008
* A ps axo ... was taken every second with: while [ true ]; do date >> pslog.txt; ps axo euid,ruid,tty,tpgid,sess,pgrp,pid,ppid,tpgid,stat,pcpu,wchan:15,comm:25 >> pslog.txt; sleep 1; done
* When the hang on the close was observed, I executed the ps command several times, saved as closelog?.txt
* The NFS server showed a CPU load of about 30-40%, later dropping to 15%-20%, still doing EXCP even while the client hung on the close
* nfsstat -c shows continuous write requests on the Linux side
* 14:22: the NFS server reported that the client died on the Linux side:
  GFSA824W (TCPNFSSV) NETWORK FILE SYSTEM CLIENT DID NOT SEND COMPLETE RECORDS FOR OFFSET
  The message was repeated afterwards...
* The client dd program was interrupted with Ctrl-C by the operator at ca. 14:25:
  date; time strace -f -T -tt -o /sapmnt/spool/nfs-t329823/strace-20080718-001.txt dd if=/mnt/archive/test/test-736.file of=/mnt/nfstest001/data0001 bs=16k
  dd: closing output file `/mnt/nfstest001/data0001': Bad file descriptor
  real 8m54.931s
  user 0m3.217s
  sys 0m14.889s
* After that we still saw ongoing write requests with nfsstat -c

There are some more comments I want to make regarding comments 43 and 44 from Jeff Layton:

> 2) the fact that the z-series server occasionally takes a *very* long time to
> reply to LOOKUPs (the capture in comment #1 shows this). I believe this is the
> likely cause for open() calls that take a long time.

We have not experienced this today or yesterday. We believe this may be due to target files being exclusively locked by the NFS server, even after processing has ended; AFAIK there is a write timeout implemented.

> 4) a separate problem (possibly related to bug 281241 ?), that's causing
> captures to miss some packets. For instance, in capture-20080716-001.txt, I
> don't see a WRITE packet that starts at offset 32k, but I'm pretty sure that one
> should be there (unless maybe the capture was killed before it was sent).

I can't access bug 281241, so no idea. Is this the problem with libpcap that Jan mentioned?

> When benchmarking for the performance problems they're having, I'd suggest that
> they do not signal the cp. I believe it's confusing the timing of the syscalls
> and is probably making them kill off the captures before all the data has been
> sent.
> Just let them run -- we should treat the signal handling problems
> separately since that's really a different problem.

By the way, I am the one who is signalling the cp :) Well, after 30 or so minutes of hanging in a close, I saw no use in waiting longer; that's why I wanted to abort the cp. We did let them run, sometimes overnight, and the next morning it was still hanging in the close -- so what do you hope to gain by not aborting?

> Also, 30+ mins is a really long time to have to wait. They may want to cut down
> the size of the file they're copying since I think they can probably demonstrate
> the issue with smaller file sizes (512m or 256m or so is probably sufficient to
> demonstrate the problem). That should make for faster testing and smaller
> syscall traces/captures.

This is a bit tricky. We see the wait/hang in close only for big files sent async (if they produce the "did not send complete records" situation).

> Also on #4 above, I see several 'TCP ACKed lost segment' packets in wireshark
> for that capture, which leads me to believe that the captures are incomplete. If
> they have the latest libpcap, then they may want to open a bug for that and see

We seem to have libpcap statically compiled into tcpdump; the version is tcpdump-3.9.4-12.el5. This is also the newest source package I could find on http://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/ Is this the newest version?

Since the NFS synchronous writes avoid the server issue but come with a 2-3x slowdown, we have suggested that we might be able to tune the VM system to approach the in-order behaviour of sync writes by managing the page cache more aggressively. Specifics: the tuning parameters involved are:

/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_expire_centisecs
/proc/sys/vm/dirty_writeback_centisecs

You can set them dynamically with the sysctl command or by writing the values through /proc, e.g. "echo <new value> > /proc/sys/vm/dirty_writeback_centisecs".
If these settings work, you can make them persist by adding them to /etc/sysctl.conf. We should start by testing the most aggressive settings to see if we can reproduce the correctness that we saw with the NFS sync mode. If so, we can gradually relax the settings to see if they have a positive impact on performance without impacting the NFS server and producing out-of-order requests. Specific suggestion:

- start out with dirty_background_ratio at 1 (1% of the page cache can be dirty before async writeout is started)
- keep dirty_expire_centisecs and dirty_writeback_centisecs (in 1/100th of a second units) at or above 100 if possible, but feel free to tune them all the way down to 1 if that makes the NFS server happy.

Minor nit -- it's not necessarily "incorrect" to flush pages out of order. The NFSv3 spec doesn't prohibit this. Also, RPC retransmissions can cause writes to arrive out of order at the server even if we queue them up in order. I'd consider the avenue that Peter is pursuing an optimization rather than a bug fix... That said, the VM tuning suggestions make sense and are definitely worth testing. I suspect that they probably won't change the ordering of the writes significantly, but they may cause the client to flush pages that are in the "gaps" more quickly. That may help keep the server from digging too deeply into its "backlog cache" (or whatever the term is).

To answer Alex...

> We have not experienced this today or yesterday. We believe this may be due to
> target files beeing exclusively locked by the NFS Server, even after
> processing has ended, AFAIK there is a write timeout implemented.

Ok, makes sense. Hopefully you have a workaround -- I doubt there's much we can do to mitigate this from the client side.

> I cant access bug 281241, so no idea. Is this the problem with libpcap Jan
> mentioned?

Yes. The BZ is marked private (sorry). tcpdump-3.9.4-12.el5 is the latest AFAIK. I may be mistaken about the missing packets.
It may be due to some other factor besides libpcap problems, or the capture might have just been canceled before the expected writes were done.

> We did let them run, sometimes overnight, and next morning it was still
> hanging in the close, so what do you hope to gain by not aborting?

My hope is that by not aborting we'll know that all of the writes have completed and that the capture is also complete. If we then find that the capture seems to be incomplete, then we know that there's a problem with libpcap (or something related). Hmm... I would think that the test should eventually run to completion, but perhaps we're hitting a livelock sort of situation where the server's cache is full and it's returning NFS3ERR_JUKEBOX. We end up retrying the same set of writes, but they're not the ones that the server needs. If that's the case, then the VM tuning that Ric suggests may be helpful. The work that Peter is investigating would certainly be, but we don't have an ETA for that (or even know whether it's feasible).

> This is a bit tricky. We see the wait/hang in close only for big files send
> async (if they produce the "did not send complete records" situation.

That makes sense. It sounds like you need to use a file size on par with the cache size. If that's the case, so be it...

Regarding comment #49: yes, indeed. The Linux NFS client conforms to the NFSv3 specification. I also believe that the MVS NFS server probably conforms to the NFSv3 specification as well. However, that server is less flexible in the requests that it can handle, so its level of interoperability is reduced. The changes that I am working on should be considered an optimization and not a bug fix. They will hopefully increase the interoperability of the Linux NFS client and this particular server.

Created attachment 312352 [details]
SAP strace
Strace of the SAP process doing a copy to an NFS share (was requested to
observe the open syscall). Please note that SAP reuses processes, so in the
beginning this process is waiting to be activated, and it takes a while before
the real job runs.
Created attachment 312358 [details]
Solaris_snoop_Trace_part1
Split with the "split" command. Just concatenate this and part 2, then
ungzip-untar the result.
Created attachment 312359 [details]
Solaris_snoop_Trace_part2
part 2 of the split snoop trace
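As an illustration, a small sketch of the split-and-reassemble procedure the two attachment comments describe (the file names and sizes here are placeholders, not the actual attachments):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the original gzipped snoop trace: ~100 KB of junk data.
head -c 102400 /dev/urandom > snoop.out
tar czf snoop_trace.tgz snoop.out

# Split into pieces, as was done for the attachments (suffixes aa, ab, ...).
split -b 60k snoop_trace.tgz Solaris_snoop_Trace_part

# Reassembly: concatenate the parts in order, verify, then ungzip-untar.
cat Solaris_snoop_Trace_part* > rejoined.tgz
cmp snoop_trace.tgz rejoined.tgz && echo "parts reassembled OK"
tar xzf rejoined.tgz
```

Shell globbing expands `Solaris_snoop_Trace_part*` in lexical order, so the parts concatenate back in the order split produced them.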
Created attachment 312656 [details]
Draft Proposed patch
Here is a draft patch that 1) modifies the NFS client behavior to kick off
WRITE requests when it "fills" a complete block, 2) allows the WRITE to
be processed asynchronously and as an UNSTABLE write, and 3) adds a
VM tunable to be able to specify the amount of dirty memory in the
system in megabytes instead of as a percentage of the total memory.
The patch described in Comment #57 is intended to be a draft patch and a workaround for the situation. It is not complete. I think that we also need to add some better flow control to the NFS client to limit the amount of outstanding WRITE data to a server. I will be working on that.

With respect to the tunable described in Comment #57, I am not sure yet what the right value is. I need to verify with the IBM NFS server folks what the size of the cache really is, and then I can identify a value to start testing with.

It turns out that the size of the cache on the NFS server is 256 slots, or 8MB total. When UBS or IBM tests, I would suggest setting max_queue_depth to 4 ("echo 4 > /proc/sys/vm/max_queue_depth"). Hopefully, by limiting the amount of dirty memory in the system to 4MB, the client will be able to transmit data to the server in a sequential enough fashion for the server to be able to keep up.

A test kernel based on 2.6.18-99.el5 for the s390x architecture, containing the following patches:

- close system call returns -ERESTARTSYS (Jeff Layton) [455729]
- Patch from https://bugzilla.redhat.com/process_bug.cgi#c57 (Peter Staubach) [455408]

is available at http://people.redhat.com/sprabhu/bz455408/kernel-2.6.18-99.el5.bz455408.1.s390x.tgz

This compressed file contains the following files:

kernel-2.6.18-99.el5.bz455408.1.s390x.rpm
kernel-devel-2.6.18-99.el5.bz455408.1.s390x.rpm
kernel-headers-2.6.18-99.el5.bz455408.1.s390x.rpm

I tested the patched kernel; here are the results. I mounted the NFS share async:

mount -t nfs -o async,intr,rsize=32768,wsize=32768 **OURNFSSERVER"":'********.nfs,text,nofastfilesize,xlat (oemvs311)' /mnt/nfstest001

I started with a setting of "echo 4 > /proc/sys/vm/max_queue_depth"; it was set to zero initially. First I did copies with increasing file sizes: 164 KB, 1.6 MB, 16 MB, 164 MB, 1.6 GB and finally 2.0 GB, and could not reproduce the usual problem. I then did several consecutive copies with 2.0 GB, also with no problem.
I started to copy 2 files in parallel, and IOWAIT (as reported by IOSTAT) increased dramatically to about 64%. With 5 parallel copies we get close to 90%, and the system becomes sluggish and unresponsive (e.g. opening a file in vim or opening a man page takes minutes). My understanding is that a small value for max_queue_depth would force the kernel to write pages out pretty aggressively, but how this contributes to the IO wait I do not know. Also, the transfer times went up a bit.

Next I tried a setting of 16:

echo 16 > /proc/sys/vm/max_queue_depth

and copied 2 files in parallel; we see an IO wait of around 65%. The system still felt very unresponsive, so I went up to

echo 64 > /proc/sys/vm/max_queue_depth

and copied 2 files in parallel. This resulted in our error situation: the NFS server complains that the client "did not send complete records". The IO wait went up to nearly 100% until I cancelled the copies after more than 40 minutes.

So with a setting of 64 and two parallel copies we get into the range of our error again; lower settings result in massive IO waits (at least I suspect that bigger settings make the system more responsive and reduce the IO wait).

As testing this is a bit tedious, because reproducing our error often requires a restart (sometimes of both Linux and the NFS server), I would be grateful for some advice:

- IBM did test with a setting of "0"; does this mean to not buffer anything, or does it effectively disable the tuning of this parameter?
- Does it make sense to try intermediate values between 16 and 32?
- What traces/dumps (if any) would you need? Which test should we reproduce then? One that leads to an error? A single copy? Parallel copies? Which setting for /proc/sys/vm/max_queue_depth?

Sorry, make that "reported by MPSTAT", not IOSTAT, in the previous comment...
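Since reproducing the error is tedious, a rough, hypothetical harness for repeating the parallel-copy tests could look like the sketch below. DEST and SIZE_MB default to small local values so the script is safe to dry-run; point DEST at the NFS mount and set SIZE_MB=2048 to match the 2.0 GB tests described above. The /proc path only exists on kernels carrying the draft patch and is skipped otherwise.

```shell
#!/bin/sh
# Usage: sh nfs_parallel_test.sh [max_queue_depth] [num_parallel_copies]
DEPTH=${1:-4}                 # candidate /proc/sys/vm/max_queue_depth value
COPIES=${2:-2}                # number of parallel copies
DEST=${DEST:-$(mktemp -d)}    # placeholder; use the NFS mount for real tests
SIZE_MB=${SIZE_MB:-1}         # use 2048 to match the 2.0 GB test files

# The tunable only exists on kernels carrying the draft patch.
TUNABLE=/proc/sys/vm/max_queue_depth
[ -w "$TUNABLE" ] && echo "$DEPTH" > "$TUNABLE"

# Launch the copies in parallel and wait for all of them to finish.
i=1
while [ "$i" -le "$COPIES" ]; do
    dd if=/dev/zero of="$DEST/testfile.$i" bs=1024k count="$SIZE_MB" 2>/dev/null &
    i=$((i + 1))
done
wait
echo "all $COPIES copies of ${SIZE_MB}MB finished in $DEST"
```

Running it under "time" for each (DEPTH, COPIES) pair would make the IO wait and duration comparisons between settings repeatable.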
Draft Proposed patch
Here is a new draft patch. It adds flow control to the NFS client to
limit the number of outstanding WRITE requests made to the server and
to block applications from creating more dirty data when there are
already enough WRITE requests outstanding to the server.
I have attached a new patch that I would like to see tested. This patch adds a new tunable, nfs_max_outstanding_writes. It can be manipulated either via the sysctl command or via /proc/sys/fs/nfs/nfs_max_outstanding_writes. The default value is 0, which indicates that no flow control should be done.

I would suggest setting this tunable to 8 when trying the tests in the UBS environment. This seems to result, in my testing, in a sequence of WRITE requests that I think the IBM MVS NFS server should be able to handle. The testing should be tried without manipulating the other new tunable, max_queue_depth.

There is one other new variable, which is used to monitor the number of concurrent NFS WRITE requests outstanding. It is nfs_peak_outstanding_writes and is accessible via /proc/sys/fs/nfs/nfs_peak_outstanding_writes. After running the testing, or perhaps even during the testing, this pathname can be cat'd to see the maximum number of concurrent outstanding WRITE requests that were made.

Created attachment 313803 [details]
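A sketch of the suggested setup and monitoring steps. Both /proc paths come from the draft patch and are absent on a stock kernel, so the script falls back to a message rather than failing:

```shell
#!/bin/sh
# Set the proposed flow-control tunable to 8 (the value suggested for the
# UBS tests) and read back the peak counter. Only works on patched kernels.
TUNABLE=/proc/sys/fs/nfs/nfs_max_outstanding_writes
PEAK=/proc/sys/fs/nfs/nfs_peak_outstanding_writes

if [ -w "$TUNABLE" ]; then
    echo 8 > "$TUNABLE"
    msg="nfs_max_outstanding_writes=$(cat "$TUNABLE")"
else
    msg="tunable not present: patched kernel required"
fi
echo "$msg"

# During or after a test run, check the high-water mark of concurrent WRITEs.
[ -r "$PEAK" ] && echo "peak outstanding writes: $(cat "$PEAK")"
```

The equivalent sysctl form on a patched kernel would be "sysctl -w fs.nfs.nfs_max_outstanding_writes=8".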
Patch 2 tested on 20080808
I tested your new kernel patch today. I will put this information into bugzilla as well.

Tests with one and 2 parallel copies of 2.0 GB files finished successfully. 5 parallel copies of 2.0 GB resulted in the "partial records" error on the NFS server side. I guess tuning nfs_max_outstanding_writes higher than 8 will not help here?

Do you have any other suggestions on what we can test? Are you working on a new patch? If yes, when do you think it will be available? Do you need more information (traces, dumps, etc.)? During the tests I logged output from sar, iostat and mpstat; I can send them if you think they would be helpful.

Please find the test results attached.
Draft proposed patch
Here is a new patch which is designed to attempt to address the
causes of the out of order WRITE requests.
The first change, in nfs_release_page, addresses a situation where
kswapd finds a page, through its LRU algorithms, and requests NFS
to release it. Kswapd does not take the offset into the file into
account, so the page that it finds may be at a very high offset in
the file and there may be pages in the file, at lower offsets, which
are still dirty. The solution is to flush all of the pages in the
file, starting at offset 0.
The second change is to add some synchronization so that when one
thread has acquired a list of pages which need to get flushed, it
can generate all of the WRITE requests for them, before another
thread can start generating WRITE requests for another range of
dirty pages in the file.
The third change is to add some instrumentation to attempt to detect
out of order WRITE requests being generated.
Please note that the instrumentation will only work when access to
the file is strictly sequential. It will trigger on non-sequential
access and the information generated from these accesses should be
ignored.
Also please note that the synchronization code is not bulletproof.
It is timing and process scheduling dependent. This is still being
looked into. It mostly seems to work when there are only two threads
writing regions of the file, but may not for three and is not even
guaranteed to work for two. It should be worth trying though.
Lastly, there is no flow control when the NFS client starts to generate
WRITE requests. It will basically start sending as much dirty data as
it has. The use of a transport protocol like TCP may help to avoid
overwhelming the server, but it is possible that some flow control may
be required anyway.
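The out-of-order detection described above lives in the kernel; as an illustration of the same check done after the fact, the sketch below scans a list of on-the-wire WRITE offsets (synthetic sample data, not from a real trace) and flags any request whose offset is lower than its predecessor's. As noted above, this only makes sense for strictly sequential access.

```shell
# Synthetic offsets, one per WRITE request in wire order; the fourth
# request goes backwards in the file.
printf '0\n32768\n65536\n32768\n98304\n' |
awk 'NR > 1 && $1 < prev {
         printf "out of order at request %d: %d < %d\n", NR, $1, prev
     }
     { prev = $1 }'
# -> out of order at request 4: 32768 < 65536
```

The same one-liner could be fed offsets extracted from a tcpdump/snoop capture to confirm whether a given run actually sent its WRITEs sequentially.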
A test kernel based on 2.6.18-99.el5 for the s390x architecture, containing the following patches:

- close system call returns -ERESTARTSYS (Jeff Layton) [455729]
- Patch from https://bugzilla.redhat.com/process_bug.cgi#c73 (Peter Staubach) [455408]

is available at http://people.redhat.com/sprabhu/bz455408/kernel-2.6.18-99.el5.bz455408.3.s390x.tgz

This compressed file contains the following files:

kernel-2.6.18-99.el5.bz455408.3.s390x.rpm
kernel-devel-2.6.18-99.el5.bz455408.3.s390x.rpm
kernel-headers-2.6.18-99.el5.bz455408.3.s390x.rpm

Created attachment 316301 [details]
Draft Proposed patch
This patch was developed and has been tested with the -109 kernel.
A test kernel based on 2.6.18-113.el5 for the s390x architecture, containing the following patch:

- Patch from https://bugzilla.redhat.com/process_bug.cgi#c76 (Peter Staubach) [455408]

This also includes the earlier patch by jlayton:

* Tue Jul 29 2008 Don Zickus <dzickus> [2.6.18-101.el5]
- [fs] vfs: wrong error code on interrupted close syscalls (Jeff Layton) [455729]

It is available at http://people.redhat.com/sprabhu/bz455408/kernel-2.6.18-113.el5.bz455408.4.s390x.tgz

This compressed file contains the following files:

kernel-2.6.18-113.el5.bz455408.4.s390x.rpm
kernel-devel-2.6.18-113.el5.bz455408.4.s390x.rpm
kernel-headers-2.6.18-113.el5.bz455408.4.s390x.rpm

Created attachment 316870 [details]
Draft Proposed patch
Here is the updated patch. The serial access detection code wasn't
right, so it wasn't flushing requests as it went along.
kernel-2.6.18-115.el5.bz455408.5.src.rpm is currently building; current build information is available at https://brewweb.devel.redhat.com/taskinfo?taskID=1477553 .

A test kernel based on the 2.6.18-113.el5 for the s390x architecture, containing the following patch:

- Patch from https://bugzilla.redhat.com/process_bug.cgi#c80 (Peter Staubach) [455408]

This also includes the earlier patch by jlayton:

* Tue Jul 29 2008 Don Zickus <dzickus> [2.6.18-101.el5]
- [fs] vfs: wrong error code on interrupted close syscalls (Jeff Layton) [455729]

It is available at http://people.redhat.com/sprabhu/bz455408/kernel-2.6.18-115.el5.bz455408.5.s390x.tgz

This contains the following files:

kernel-2.6.18-115.el5.bz455408.5.s390x.rpm
kernel-devel-2.6.18-115.el5.bz455408.5.s390x.rpm
kernel-headers-2.6.18-115.el5.bz455408.5.s390x.rpm

The note above incorrectly says that the kernel is based on 2.6.18-113.el5; the new kernel is based on 2.6.18-115.el5 (cut-and-paste error).

Created attachment 317393 [details]
Draft Proposed patch
Here is a new draft proposed patch. It is simpler, in some respects,
and seems to have better performance characteristics. It seems worth
trying.
A new test kernel based on 2.6.18-116.el5 for the s390x architecture is available, containing the following patch:

- Patch from https://bugzilla.redhat.com/show_bug.cgi?id=455408#c84 (Peter Staubach)

This also includes the earlier patch by Jeff Layton:

* Tue Jul 29 2008 Don Zickus <dzickus> [2.6.18-101.el5]
- [fs] vfs: wrong error code on interrupted close syscalls (Jeff Layton) [455729]

A compressed tar archive is available at http://people.redhat.com/jmoyer/bz455408/kernel-2.6.18-116.el5.bz455408.6.s390x.tgz and contains the following files:

- kernel-2.6.18-116.el5.bz455408.6.s390x.rpm
- kernel-devel-2.6.18-116.el5.bz455408.6.s390x.rpm
- kernel-headers-2.6.18-116.el5.bz455408.6.s390x.rpm

Regards,
Fábio Olivé

Looking over this patch. So with the nfs_is_serial() check below, we're only going to initiate a write when we have sequential writes:

+	} else if (nfs_is_serial(inode, pos) &&
+			nfs_max_outstanding_writes) {
+		if (NFS_I(inode)->ndirty >= NFS_SERVER(inode)->wpages) {
+			nfs_flush_inode_async(inode, 0,
+				NFS_SERVER(inode)->wpages);
+		}
+	}

...shouldn't we also consider something similar for random I/O? If I have a bunch of dirty pages that were written out of order by the app, shouldn't we try to flush them to the server as soon as we're able to stitch together a complete write request?

Hmmm. Maybe? Got any suggestions? The problem with doing it for random access is that the pages will be scattered all over the file, causing the client to generate lots of little WRITE requests to the server. I think it may be better to wait and let the application do flushing based on its own criteria, thus perhaps maximizing performance and/or correctness.

Updating PM score.

I've had the latest patch for this in my test kernels for some time. I recently started looking at bug 516490.
The test there is a simple dd test:

$ time dd if=/dev/zero of=/file/on/nfs bs=1024k count=5000

...the patch seems to make that test run 5-6 times slower:

Unpatched:
----------
$ time dd if=/dev/zero of=/mnt/dantu/ddtest.out bs=1024k count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 123.011 seconds, 42.6 MB/s

real    2m3.660s
user    0m0.028s
sys     0m7.684s

Patched:
--------
$ time dd if=/dev/zero of=/mnt/dantu/ddtest.out bs=1024k count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 744.942 seconds, 7.0 MB/s

real    12m26.704s
user    0m0.000s
sys     0m1.648s

...I think that problem will need to be addressed before we can take this for RHEL.

Customer has closed the case; closing this BZ as we have no way to reproduce in-house.