Bug 455408

Summary: NFS performance problems on s390x
Product: Red Hat Enterprise Linux 5
Reporter: Jan Wildeboer <jwildebo>
Component: kernel
Assignee: Peter Staubach <staubach>
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Version: 5.2
CC: al.koch, bhinson, epasch, fleite, jlayton, jwest, loeh, rwheeler, sardella, smolinski, sprabhu, sputhenp, steved, tao, thoss
Target Milestone: rc
Target Release: ---
Hardware: s390x
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-02-08 15:57:19 UTC
Bug Blocks: 533192
Attachments (no flags set):
- simple perl script to detect out-of-order write calls
- Draft Proposed patch
- Draft Proposed patch
- Draft proposed patch
- Draft Proposed patch
- Draft Proposed patch
- Draft Proposed patch

Comment 4 Jeff Layton 2008-07-15 14:15:58 UTC
My suggestion at this point is to redo this test and get some strace information
at the same time. For instance:

      $ strace -o /tmp/dd.strace -f -T -tt dd if=test.file \
        of=/data/sap/GZ/H000.SAP.S1.W0.SM7.LINUX.CHECK bs=736 count=50000

...and at the same time, get another capture. This would allow us to correlate
what's going on on the wire with the timing that userspace is seeing at the
system call level.

Based on what I'm seeing in this capture, my suspicion is that open() syscalls
account for the bulk of the slowness here due to the slow LOOKUP NFS calls, but
it would be good to confirm that.


Comment 6 Jan Wildeboer 2008-07-15 15:49:20 UTC
New info from IBM:

So far we have found that the delays we are experiencing are from delays 
in open and close operations. The actual I/O write performs at sufficient 
levels however the open/close take 10 minutes for NFS v3 and the close in 
NFS v4 takes 2 minutes 40 seconds. I have attached a system trace of the 
write operation for your reference.

(stracelog.txt added as attachment)

Comment 7 Jan Wildeboer 2008-07-15 15:52:07 UTC
Even more info, does this help?

Problem identification in short words:
- Linux sends data in out-of-sequence order
- zOS is expecting data sequentially
- NFS has to put the data blocks in the right order
- this is done in a cache
- if the next part of a data set does not arrive before the cache is filled,
then we get all kinds of the observed problems (partial records,
timeouts, etc.)

Solution proposition in short words:
- increase cache-window (NFS server attribute) to 255 (or 256 I'm not sure
personally) (this is the maximum)
- find out if there's a possibility to force Linux to send data
sequentially

Comment 13 Peter Staubach 2008-07-15 17:38:58 UTC
Well, the Solaris system may already have the name cached in their DNLC,
but they will generate a GETATTR as part of the close-to-open processing.

I am somewhat surprised that the Linux client generates the LOOKUP since it
could already have the name cached.

What happens if "ls -lasi <thenameofthefiletobeopenedandwrittento>"
is run before the dd command?

Comment 17 Alex Koch 2008-07-16 09:59:22 UTC
Created attachment 311935 [details]
STRACE

This is the strace; details are in the comment below.

Comment 18 Alex Koch 2008-07-16 10:04:51 UTC
Created attachment 311936 [details]
TCPDUMP

This is the corresponding TCPDUMP; details are in the comment below.

Comment 19 Alex Koch 2008-07-16 10:14:08 UTC
Comments 17 and 18 contain attachments to our latest test.

* Test objective: 
-copy (cp) a local 2 GB file to a preallocated (on z/OS) file on the NFS Share.

* Test parameters:
-NSCD stopped
-No "ls -lisa" on target file
-mount was done w/ mount -o intr "gplex1-nfssv.appl.ubs.ch:GZ,binary,space
(150,20),cyls,norlse" /mnt/t329823/test001

Test observation:
-TCPDUMP was started with tcpdump -vvXx -s 256  -i any "host gplex1-
nfssv.appl.ubs.ch" -w /sapmnt/spool/nfs-t329823/capture-20080716-001.txt
-Test started on 11:03 with time strace -f -T -tt -o /sapmnt/spool/strace-
20080716-001.txt cp /sapmnt/spool/nfs-
t329823/test.file /mnt/t329823/test001/H000.SAP.S1.W0.S10.LINUX.CHECK; date
-The long wait on open() could not be observed (completed in <0.038868>)
-At 11:04 all read/write operations were finished and an attempt was made to
close the target file
7110  11:04:17.072872 close(4)          = -512 <1974.150421>
7110  11:37:11.223332 --- SIGINT (Interrupt) @ 0 (0) ---
-After more than 30 minutes we cancelled the copy. During the wait we saw only
commit requests (about 40) with "nfsstat -c"

This is the output of the tcpdump and cp command:
[root@TLG101 ~]# tcpdump -vvXx -s 256  -i any "host gplex1-nfssv.appl.ubs.ch" -
w /sapmnt/spool/nfs-t329823/capture-20080716-001.txt
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 256 
bytes
437833 packets captured
907246 packets received by filter
31572 packets dropped by kernel


[cy1adm@TLG101 mnt]$ time strace -f -T -tt -o /sapmnt/spool/strace-20080716-
001.txt cp /sapmnt/spool/nfs-
t329823/test.file /mnt/t329823/test001/H000.SAP.S1.W0.S10.LINUX.CHECK; date
cp: closing `/mnt/t329823/test001/H000.SAP.S1.W0.S10.LINUX.CHECK': Bad file 
descriptor

real    33m47.731s
user    0m8.658s
sys     0m23.458s
Wed Jul 16 11:37:11 CEST 2008


I will proceed now with the "nscd" turned on.

Comment 20 Alex Koch 2008-07-16 10:36:43 UTC
Results of second test:

Same configuration but NSCD turned on. This time we cancelled earlier, but the 
observations were the same:
-no long wait on open
-fast read/write
-wait on close
-no nfs requests visible with nfsstat except commits



[root@TLG101 nfs-t329823]# tcpdump -vvXx -s 256  -i any "host gplex1-
nfssv.appl.ubs.ch" -w /sapmnt/spool/nfs-t329823/capture-20080716-002.txt
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 256 
bytes
264478 packets captured
614111 packets received by filter
85153 packets dropped by kernel

[cy1adm@TLG101 mnt]$ date; time strace -f -T -tt -o /sapmnt/spool/strace-
20080716-002.txt cp /sapmnt/spool/nfs-
t329823/test.file /mnt/t329823/test0002/H000.SAP.S1.W0.S11.LINUX.CHECK; date
Wed Jul 16 12:19:59 CEST 2008
cp: closing `/mnt/t329823/test0002/H000.SAP.S1.W0.S11.LINUX.CHECK': Bad file 
descriptor

real    12m22.807s
user    0m8.578s
sys     0m23.222s
Wed Jul 16 12:32:22 CEST 2008

The logs will follow with the next two attachments

Comment 21 Jan Wildeboer 2008-07-16 10:39:07 UTC
We are getting closer.

Just to make sure - nfslockd is up and running? Any interesting messages in
/var/log/messages?

Jan

Comment 22 Alex Koch 2008-07-16 10:40:32 UTC
Created attachment 311939 [details]
STRACE TEST 002

Comment 23 Peter Staubach 2008-07-16 11:08:59 UTC
If it is known that NFS clients will _not_ be sharing files, then
it may be interesting to try the "nocto" mount option on the Linux
NFS client.  However, if the NFS clients are sharing files, then
this option may easily lead to inconsistent caches among the clients
and should be avoided.
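
For illustration only, a mount using this option might look like the sketch
below; the server export and mount point are placeholders, not the ones from
this report:

      # nocto skips the close-to-open attribute revalidation on open(),
      # so only use it when no other client modifies the same files.
      mount -t nfs -o nocto,intr server.example.com:/export /mnt/nfstest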

Comment 24 Jeff Layton 2008-07-16 11:39:38 UTC
Looking at the capture and strace from comment #17 and comment #18:

Lookups in this capture are fairly speedy, so perhaps the long lookup times in
the earlier capture were an anomaly of some sort. They're still rather
disturbing though and probably need to be addressed. nscd should have no effect
on this test once the open() is complete.

The big delay in the strace seems to be the close() of the destination file,
which was apparently interrupted (probably ^c'ed by the user):

7110  11:04:17.072872 close(4)          = -512 <1974.150421>
7110  11:37:11.223332 --- SIGINT (Interrupt) @ 0 (0) ---

...as a side note, that looks like a bug -- we shouldn't return -ERESTARTSYS to
userspace, ever. Regardless though, that's unrelated to any performance problem.

The reason the close takes so long is probably because it has to flush all of
the dirtied pages to the server before it can return.

Looking again at the capture, I'll first note that it seems to be incomplete. I
don't see a write call starting at offset 32768, for instance. Still, we can try
to do some analysis on what we have:

stats for WRITE calls:
# of calls: 61770
min service response time: .00012s
max service response time: 100s
avg service response time: .51s

...half a second to reply to a write seems pretty long, but it's probably just
skewed by some packets that have extraordinarily long response times. 100s is a
*really* long time for the server to not respond.

COMMIT calls:
# of calls: 5
min SRT: .93s
max SRT: 20.5s
avg SRT: 12.7s

The commit stats aren't great either.
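
(For reference, per-procedure service response time summaries like the above
can be pulled from a capture with tshark's ONC-RPC SRT statistics tap, assuming
the installed tshark build includes it; NFS is RPC program 100003, version 3:)

      # Print min/max/avg service response times per NFSv3 procedure.
      tshark -q -r capture-20080716-001.txt -z rpc,srt,100003,3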

More details in the capture:

Things seem to chug along pretty speedily until we hit the write in frame
288378. We end up doing a retransmission in frame 288382 (60s later), so there's
a 60s delay where nothing was happening while we waited for the server to catch
up. It looks like things pick back up again in frame 288409, but it's hard to be
sure what happened there -- looks like some frames were lost from the capture.
It's possible the client stalled out here for some other reason, but it could be
that more writes were stalled waiting for replies.

There's a similar, 20s delay around frame 295455. A 10s delay at 295674...

We could probably find more if we wanted. These "short" delays though add up.

The problem, most likely, is that the server is in some cases just not sending
us timely responses to our calls. IIRC, RPC slots are a fixed resource -- we can
only deal with a certain number of outstanding calls being on the wire.

I don't see any hard evidence of the client stalling out here while the server
is just waiting around for packets. At almost every point where I see the
capture stall, there are outstanding RPC calls on the wire. The exceptions seem
to be places where the capture is missing packets -- that might be indicative of
client side performance issue that would affect NFS performance, but we can't be
certain from this info.

If they're concerned about ordering of write calls, one thing that might be
interesting is doing this test with the "sync" mount option, which should turn
all of the writes into synchronous writes. It might slow down performance in
some cases, but should ensure that writes are issued on the wire in the order
received.
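
A mount for such a test could look like the sketch below (paths and the other
options are only illustrative; "sync" is the point here):

      # "sync" makes each write() synchronous, so WRITE calls go out on the
      # wire in the order the application issued them.
      mount -t nfs -o sync,intr server.example.com:/export /mnt/nfstest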

As to why solaris is faster, I'm not sure -- we don't have any solaris captures
to compare this to. We also don't have any information on how fast solaris
actually is. How long does solaris take to copy a 2G file to this server?

Finally, I'd like to know exactly what kernel they're running here. Also, what
arch is the client?

Comment 25 Alex Koch 2008-07-16 11:56:41 UTC
Created attachment 311944 [details]
TCPDUMP TEST 002

Comment 26 Alex Koch 2008-07-16 12:23:14 UTC
Hi all,

here are some first answers for your latest questions

* Yes nfslockd is running, as far as I know for all the tests done so far

* Interesting message: The only interesting message I do not quite
   understand is:
  Jul 16 13:48:03 tlg101 setroubleshoot: SELinux is preventing ethtool 
  (ifconfig_t) "read write" to socket (initrc_t). For complete SELinux 
  messages. run sealert -l fb82ac8d-fcd1-4dd1-90b2-e04654589d5d

* Arch and Kernelversion:
  [t329823@TLG101 ~]$ uname -a 
  Linux TLG101 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:58 EDT 2008 s390x s390x
    s390x GNU/Linux
  [t329823@TLG101 ~]$ arch
   s390x
* We are doing tests with "sync" option right now....





Comment 27 Jeff Layton 2008-07-16 13:02:46 UTC
AFAIK, solaris hasn't been ported to s390x, so I assume they're testing solaris
on a different arch entirely. Do the solaris machine and the linux s390x client
have comparable amounts of RAM? 

I'd still like to see the times from the solaris box for a similar test (a
capture and maybe some truss output with timing stats similar to the ones in the
strace might also be interesting).


Comment 28 Alex Koch 2008-07-16 13:14:27 UTC
I think there is a Solaris port in progress, but anyway, Solaris tests were 
done on SPARC.

The Sol. machine has 2GB RAM physical memory, our Linux guest has 16 GB.

We have a snoop trace from the machine, can you use that? No syscall trace 
though. 

Comment 29 Jeff Layton 2008-07-16 13:57:22 UTC
Ok, sounds like both machines have enough ram to cache the entire 2G file in
memory then. I didn't really think that was a problem anyway. It looks like the
write syscalls all complete in a timely fashion which probably means that memory
pressure isn't an issue. Still, it's good to know...

If you have the snoop capture I'd like to have a look (particularly if it's a
binary capture). Let's not worry about a syscall trace from solaris at the
moment; we can get that later if we think it's needed.


Comment 30 Jeff Layton 2008-07-16 14:41:12 UTC
Since the problem is that the linux client is slower than solaris here, we need
to see how long the solaris client took to do the same 2G copy to this server.


Comment 31 Jeff Layton 2008-07-16 14:55:50 UTC
Also, it would be good to know what mount options were used on the solaris client.


Comment 32 Peter Staubach 2008-07-16 15:21:35 UTC
And what the network topologies look like, ie. mostly how each client is
accessing the server.  Are they both using ethernet or something else?
Is it the same for both clients?

Comment 33 Alex Koch 2008-07-16 15:23:45 UTC
I am checking right now if we can publish the snoop trace; the problem is that
it was a full trace on the machine, and it may contain sensitive information.

This is the mount option for Solaris

*****.ch:RZ,text,nofastfilesize,xlat(oemvs311) -       /data/sap/RZ    nfs     -
       no      intr


We have done a copy on Monday, a 2.4GB file from the same NFS to the same share,
in about 4.5 minutes. The same copy on Linux never completed, or rather was
cancelled after a long time. We have done similar tests since from zLinux, and
the longest time I waited on a close() operation was over 30 minutes before I
cancelled it.

Comment 34 Alex Koch 2008-07-16 15:26:09 UTC
From the Linux side we use OSA adapters and have HiperSockets.

On Solaris, AFAIK, plain Ethernet...

Comment 35 Holger Smolinski 2008-07-16 15:50:39 UTC
This is my analysis of two strace outputs, and my recommendations as copied from
the IBM PMR:
Hi Team,

we have analysed the strace data submitted. In both cases we have a
syscall (open/close) taking a long time and then being interrupted by
SIGINT. We suppose the source of the SIGINT was the operator terminating
the command with Ctrl-C.

In both cases the syscall has completed successfully after SIGINT:

- the open call returns a valid file descriptor (1)

- the close call returns ERESTARTSYS as an indication that it had been
  interrupted by SIGINT, and a following close() call returns EBADF
  as an indication that the file has already been closed.

This looks as if the syscall was already completed in both cases, but
just refusing to return to userspace - which is very unusual behavior.
Alternatively the SIGINT (Ctrl-C) could have been handled incorrectly,
causing the syscall to return correctly although it might not have been
completed correctly before. The latter theory seems to be wrong, because
at least after the open call the remote file could be written successfully.

We now need to find out what the processes are doing when they
receive the SIGINT and subsequently return to userspace.

Can you please:

1) confirm that the operator pressed Ctrl-C to terminate the stalled
   command.

2) recreate the stalling command and then issue
   ps axo euid,ruid,tty,tpgid,sess,pgrp,pid,ppid,tpgid,stat,pcpu,wchan:15,comm:25
   From this command we can see in which kernel function the syscall hangs
   before the Ctrl-C was pressed. Then you can abort the test by hitting
   Ctrl-C.

One more question: Are you running any security modules which might
hang during execution of the syscall? Are you able to run dbginfo.sh
and submit the output to ECURep?

Regards
Holger

Comment 36 Holger Smolinski 2008-07-16 15:57:11 UTC
Created attachment 311962 [details]
strace #2 - with blocked 'close()' call

In strace #2 we have a close() syscall taking 33 minutes before being
interrupted by SIGINT, then returning -512 (ERESTARTSYS) but apparently having
successfully completed already, since the subsequent close(4) call returns
EBADF, which indicates that the fd has already been removed from the list of
open files.
7110  11:04:17.072850 read(3, "", 4096) = 0 <0.000004>
7110  11:04:17.072872 close(4)		= -512 <1974.150421>
7110  11:37:11.223332 --- SIGINT (Interrupt) @ 0 (0) ---
7110  11:37:11.223357 close(4)		= -1 EBADF (Bad file descriptor)
<0.000004>
It is very unlikely that the operator hit Ctrl-C just at the moment the syscall
happened to complete ;(

Comment 37 Holger Smolinski 2008-07-16 16:02:41 UTC
I will have to ask IBM support for access data to the second strace, which I
mentioned in today's phone call, and in which the open call was blocking... I
will attach that info to the bugzilla.
Regards
  Holger

Comment 38 Holger Smolinski 2008-07-16 16:15:27 UTC
Actually I found the strace in attachment 'strace log' below
https://bugzilla.redhat.com/attachment.cgi?id=311849

15:39:40.533041 close(1)                = 0
15:39:40.533061 open("/data/sap_native/GZ/H000.SAP.S1.W0.S11.LINUX.CHECK",
O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1
15:48:09.478212 --- SIGINT (Interrupt) @ 0 (0) ---

Here we see the open call apparently being interrupted by the SIGINT, but
returning a valid fd, which is later on used for writing...

Regards
  Holger

Comment 39 Jeff Layton 2008-07-16 16:40:36 UTC
The close() call is pretty clearly broken. We should never return -ERESTARTSYS
to userspace. The close is almost certainly taking a long time because the
client is flushing pages to the server before it can (or should) return.

The open() is more interesting. I suspect that the open() is taking a long time
because of one of these slow LOOKUP calls (as shown in the capture in comment
#1). As to why it returned a valid fd, I'm not sure. It might be interesting to
see if we can reproduce that somehow (maybe doctor a nfs server with artificial
delays in LOOKUP calls).

The signal handling is a separate issue from any performance concerns, however.


Comment 40 Jeff Layton 2008-07-16 17:14:41 UTC
Created attachment 311973 [details]
simple perl script to detect out-of-order write calls

Here's a quick perl script to detect out of order write calls. I ran it like
this:

$ tshark -r /tmp/capture-20080716-001.txt | grep 'WRITE Call' | /tmp/seq.pl

It just detects when a WRITE call has an offset that's lower than the offset in
the previous write call. From this, I see 568 writes that have offsets lower
than the previous one.
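
The attached script is not reproduced here, but a rough equivalent of the logic
described above (purely illustrative, and assuming tshark's summary line prints
the write offset as "Offset:<n>") would be:

      # Count WRITE calls whose offset is lower than the previous call's,
      # i.e. writes that went out in non-ascending offset order.
      tshark -r capture-20080716-001.txt | grep 'WRITE Call' | \
        awk '{
               if (match($0, /Offset:[ ]*[0-9]+/)) {
                   off = substr($0, RSTART, RLENGTH)
                   gsub(/[^0-9]/, "", off)      # keep just the digits
                   if (seen && off + 0 < prev)
                       count++
                   prev = off + 0
                   seen = 1
               }
             }
             END { printf "%d out-of-order WRITE calls\n", count + 0 }'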

Comment 41 Peter Staubach 2008-07-16 17:21:08 UTC
It appears that the client is not handling NFS3ERR_JUKEBOX correctly
and is treating it as a normal error in some path.  That's a bug and
we should address that.

That said, I suspect that the problem is that the NFS client is
constantly getting JUKEBOX errors returned and is never transmitting
the data that the server needs.

I might also suggest checking the server code to ensure that it is
handling the out of cache situation correctly so that when a missing
block does come in, it can handle it and write out some data to the
file.

Comment 42 Peter Staubach 2008-07-16 17:22:16 UTC
Jeff, it appears that I should take this one for the time being.  If you
disagree, please let me know.

I would appreciate your continued assistance though.

Comment 43 Jeff Layton 2008-07-16 17:32:37 UTC
All yours Peter, but I'll add this that I already wrote up...

To clarify, we have several separate problems:

1) the problem that the z-series server doesn't deal well with out-of-order
writes. This is the problem that was discussed on today's concall. Peter is
going to explore whether we can send writes in a more ordered fashion.

2) the fact that the z-series server occasionally takes a *very* long time to
reply to LOOKUPs (the capture in comment #1 shows this). I believe this is the
likely cause for open() calls that take a long time.

3) the linux client is not handling signals in the way it should be during
open/close. close() should not return -ERESTARTSYS. There's also the question of
why we get a valid fd when the process is signaled while hung on an open() call
(perhaps the RPC LOOKUP call was reissued and returned immediately?). This is
probably related to the JUKEBOX errors that Peter mentions.

4) a separate problem (possibly related to bug 281241 ?), that's causing
captures to miss some packets. For instance, in capture-20080716-001.txt, I
don't see a WRITE packet that starts at offset 32k, but I'm pretty sure that one
should be there (unless maybe the capture was killed before it was sent).

Some of these should probably be split into separate BZ's since they are
separate problems.

When benchmarking for the performance problems they're having, I'd suggest that
they do not signal the cp. I believe it's confusing the timing of the syscalls
and is probably making them kill off the captures before all the data has been
sent. Just let them run -- we should treat the signal handling problems
separately since that's really a different problem.

Also, 30+ mins is a really long time to have to wait. They may want to cut down
the size of the file they're copying since I think they can probably demonstrate
the issue with smaller file sizes (512m or 256m or so is probably sufficient to
demonstrate the problem). That should make for faster testing and smaller
syscall traces/captures.


Comment 44 Jeff Layton 2008-07-16 17:35:29 UTC
Also on #4 above, I see several 'TCP ACKed lost segment' packets in wireshark
for that capture, which leads me to believe that the captures are incomplete. If
they have the latest libpcap, then they may want to open a bug for that and see
if the cause can be determined. Having complete captures will make all of
these problems much easier to troubleshoot.


Comment 45 Alex Koch 2008-07-17 12:09:29 UTC
Hi All,

we have done intensive testing with the "sync" mount option today. Like we 
mentioned yesterday, our first tests with sync (which did not produce any 
usable results) were spoiled by other issues. 

We tried copies (w/ dd and cp) for one, 10, 15 and 20 files in parallel, copying
from local files to NFS as well as from an NFS file to an NFS file. Local files
were 2.7GB, the NFS file 2.4GB.

We did not see:
-long lookups
-any long waits or hangs on open() or close()
-any "incomplete records" warnings in the NFS Server

Performance for single copy was within the Solaris time frame, about 4 minutes 
with variations of about 45 seconds. Of course for parallel copies we saw 
increased elapsed time

We saw two issues during our tests:

-Massive parallel copies (15 or 20) resulted in "Short on storage" errors in
the NFS Server. This resulted in "Input/Output Error" in the copy processes on
the Linux side. This is fine by me: if the server can't handle write requests,
it has to abort and inform the client. For me this is correct behaviour.

-A high CPU utilization on the NFS Server, about 10% per file write/copy. With
the parallel tests we reached a limit of 70%-80%, which AFAIK is due to
restrictions in our system/other workload running. We do not know yet what CPU
utilization we see with parallel copies from Solaris Clients, so maybe this is
not a real issue.

We are preparing to do tests with the SAP application for which we hopefully 
have results for the confcall. Also we need to clarify if the performance for 
parallel copies is acceptable. But so far it looks like using "sync" is the 
right way to go.

I will create an attachment with a summary of our tests, if you need details, 
please tell me so. 

I have not done a
ps axo euid,ruid,tty,tpgid,sess,pgrp,pid,ppid,tpgid,stat,pcpu,wchan:15,comm:25
yet on a stalled copy process; would that still be helpful? Like I mentioned
before, we have not seen this stalled close() with the sync option.

Do you need any additional information or traces for the moment?

I guess we should wait and see what results we get from the application
tests; maybe "sync" really is a solution here.

Comment 46 Alex Koch 2008-07-17 12:10:56 UTC
Created attachment 312044 [details]
Summary tests with "SYNC" option

This is a short summary of the tests with the sync mount option

Comment 47 Alex Koch 2008-07-17 17:02:48 UTC
Here the results from our latest efforts.

We were able to run the functional test in SAP (copy one NFS file to another), 
which is the test case that originally made us aware of the problems. That 
means we successfully copied a 2.4GB file. The copy takes about 2.5 times
longer than on Solaris (keep in mind we have the "sync" option on Linux, not
on Solaris).

To be sure we don't have an intermittent problem, we will do more tests
tomorrow, with parallel copies and bringing the system under load.

So, we can confirm the functional test case is successful so far, performance 
is still an issue. 

We did not do any tcp traces yet, because I am waiting on feedback from Jan for 
the libpcap issue.



Comment 48 Alex Koch 2008-07-18 13:06:29 UTC
Good day,

we redid a test with the "ASYNC" option to get another clean set of
documentation for our original problem. I will send the attachments later. It's
a zip file this time.



Here the test setup and testlog:

* Mount options:
text,nofastfilesize,xlat(oemvs311) nfs
rw,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=***
0 0

* tcpdump taken with 
tcpdump -vvXx -s 256  -i any "host **our-nfs-server**" -w
/tmp/capture-20080718-001.txt

* The copy was done with dd and straced:
date; time strace -f -T -tt /tmp.txt dd if=/mnt/archive/test/test-736.file
of=/mnt/nfstest001/data0001 bs=16k
and started on: Fri Jul 18 14:16:01 CEST 2008

* A ps axo ... was taken every second with
while [ true ]; do date >> pslog.txt; ps axo
euid,ruid,tty,tpgid,sess,pgrp,pid,ppid,tpgid,stat,pcpu,wchan:15,comm:25 >>
pslog.txt; sleep 1; done 

* When the hang on the close was observed, I executed the ps ... command several
times and saved the output as closelog?.txt

* The NFS Server showed a CPU load of about 30-40%, later dropping to 15%-20%,
still doing EXCP even while the client hung on the close

* nfsstat -c shows continuous write requests on the linux side

* 14:22: the NFS Server reported "Client died" on the linux side:
GFSA824W  (TCPNFSSV) NETWORK FILE SYSTEM CLIENT
DID NOT SEND COMPLETE RECORDS FOR OFFSET
The message was repeated afterwards ...

* The client dd program was interrupted with CONTROL-C by the operator at ca. 14:25
date; time strace -f -T -tt -o /sapmnt/spool/nfs-t329823/strace-20080718-001.txt
dd if=/mnt/archive/test/test-736.file of=/mnt/nfstest001/data0001 bs=16k
dd: closing output file `/mnt/nfstest001/data0001': Bad file descriptor
real    8m54.931s
user    0m3.217s
sys     0m14.889s

* After that we saw still ongoing write requests with nfsstat -c




Comment 49 Alex Koch 2008-07-18 13:31:04 UTC
There are some more comments I want to make regarding comment 43 and 44 from
Jeff Layton

>2) the fact that the z-series server occasionally takes a *very* long time to
>reply to LOOKUPs (the capture in comment #1 shows this). I believe this is the
>likely cause for open() calls that take a long time.
We have not experienced this today or yesterday. We believe this may be due to
target files being exclusively locked by the NFS Server, even after processing
has ended; AFAIK there is a write timeout implemented.

>4) a separate problem (possibly related to bug 281241 ?), that's causing
>captures to miss some packets. For instance, in capture-20080716-001.txt, I
>don't see a WRITE packet that starts at offset 32k, but I'm pretty sure that >one
>should be there (unless maybe the capture was killed before it was sent).
I can't access bug 281241, so no idea. Is this the problem with libpcap Jan
mentioned?

>When benchmarking for the performance problems they're having, I'd suggest that
>they do not signal the cp. I believe it's confusing the timing of the syscalls
>and is probably making them kill off the captures before all the data has been
>sent. Just let them run -- we should treat the signal handling problems
>separately since that's really a different problem.
By the way, I am the one who's signalling the cp :)
Well, after 30 or more minutes of hanging in a close I saw no use in waiting
longer; that's why I wanted to abort the cp. We did let them run, sometimes
overnight, and next morning it was still hanging in the close, so what do you
hope to gain by not aborting?

>Also, 30+ mins is a really long time to have to wait. They may want to cut down
>the size of the file they're copying since I think they can probably >demonstrate
>the issue with smaller file sizes (512m or 256m or so is probably sufficient to
>demonstrate the problem). That should make for faster testing and smaller
>syscall traces/captures.
This is a bit tricky. We see the wait/hang in close only for big files sent
async (if they produce the "did not send complete records" situation).

>Also on #4 above, I see several 'TCP ACKed lost segment' packets in wireshark
>for that capture, which leads me to believe that the captures are incomplete. >If
>they have the latest libpcap, then they may want to open a bug for that and see
We seem to have libpcap statically compiled into tcpdump, the version is
tcpdump-3.9.4-12.el5. This is also the newest source package I could find on
http://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/
Is this the newest version?




Comment 50 Ric Wheeler 2008-07-18 15:42:32 UTC
Since the NFS synchronous writes avoid the server issue, but have a 2-3x
slowdown, we have suggested that we might be able to tune the VM system to help
us approach the absolute promise of in order writes by managing the page cache
more aggressively.

Specifics:

The tuning parameters involved are:

/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_expire_centisecs
/proc/sys/vm/dirty_writeback_centisecs

You can set them dynamically from the sysctl command or by setting the values
through /proc:

by "echo <new value> > /proc/sys/vm/dirty_writeback_centisecs" for example.

If these settings work, you can make the settings persist by adding them to
/etc/sysctl.conf.

We should start by testing the most aggressive setting to see if we can
reproduce the correctness that we saw with the NFS sync mode. If so, we can
gradually relax the settings to see if they have a positive impact on
performance without impacting the NFS server and out of order requests.

Specific suggestion:

- start out with dirty_background_ratio at 1 (1% of the page cache can be dirty
before async writeout is started)
- keep dirty_expire_centisecs and dirty_writeback_centisecs (in 1/100th of a
second units) at or above 100 if possible, but feel free to tune them all the
way down to 1 if that makes the NFS server happy.
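
As a concrete starting point, the aggressive settings suggested above would be
applied like this (and can then be relaxed step by step):

      # Start async writeout once 1% of the page cache is dirty.
      echo 1 > /proc/sys/vm/dirty_background_ratio
      # Expire and write back dirty pages after 1 second (100 centiseconds);
      # lower these further only if the NFS server still falls behind.
      echo 100 > /proc/sys/vm/dirty_expire_centisecs
      echo 100 > /proc/sys/vm/dirty_writeback_centisecs

      # To make the settings persistent, add the equivalent keys to
      # /etc/sysctl.conf:
      #   vm.dirty_background_ratio = 1
      #   vm.dirty_expire_centisecs = 100
      #   vm.dirty_writeback_centisecs = 100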


Comment 51 Jeff Layton 2008-07-18 16:45:34 UTC
Minor nit -- it's not necessarily "incorrect" to flush pages out of order. The
NFSv3 spec doesn't prohibit this. Also, RPC retransmissions can cause writes to
arrive out of order at the server, even if we queue them up in order. I'd
consider the avenue that Peter is pursuing an optimization rather than a bug fix...

That said, the VM tuning suggestions make sense, and are definitely worth
testing. I suspect that they probably won't change the ordering of the writes
significantly but it may cause the client to flush pages that are in the "gaps"
more quickly. That may help keep the server from digging too deeply into its
"backlog cache" (or whatever the term is).


Comment 52 Jeff Layton 2008-07-18 17:12:49 UTC
To answer Alex...

> We have not experienced this today or yesterday. We believe this may be due to
> target files beeing exclusively locked by the NFS Server, even after
> processing has ended, AFAIK there is a write timeout implemented. 

Ok, makes sense. Hopefully you have a workaround -- I doubt there's much we can
do to mitigate this from the client side.

> I cant access bug 281241, so no idea. Is this the problem with libpcap Jan
> mentioned?

Yes. The BZ is marked private (sorry). tcpdump-3.9.4-12.el5 is the latest AFAIK.
I may be mistaken about the missing packets. It may be due to some other factor
besides libpcap problems, or the capture might have just been canceled before
the expected writes were done.

> We did let them run, sometimes overnight, and next morning it was still
> hanging in the close, so what do you hope to gain by not aborting?

My hope is that by not aborting that we'll know that all of the writes have
completed and that the capture is also complete. If we then find that the
capture seems to be incomplete then we know that there's a problem with libpcap
(or something related).

Hmm...I would think that the test should eventually run to completion, but
perhaps we're hitting a livelock sort of situation where the server's cache is
full and it's returning NFS3ERR_JUKEBOX. We end up retrying the same set of
writes but they're not the ones that the server needs. If that's the case then
the VM tuning that Ric suggests may be helpful. The work that Peter is
investigating would certainly be, but we don't have an ETA for that (or even
know whether it's feasible).

> This is a bit tricky. We see the wait/hang in close only for big files send
> async (if they produce the "did not send complete records" situation.

That makes sense. It sounds like you need to use a file size on par with the
cache size. If that's the case, so be it...


Comment 53 Peter Staubach 2008-07-21 12:06:43 UTC
Regarding Comment #49, yes, indeed.  The Linux NFS client conforms to
the NFSv3 specification.  I also believe that the MVS NFS server probably
also conforms to the NFSv3 specification.  However, that server is less
flexible in the requests that it can handle, so its level of interoperability
is reduced.

The changes that I am working on should be considered to be an optimization
and not a bug fix.  They will hopefully increase the interoperability of
the Linux NFS client and this particular server.

Comment 54 Alex Koch 2008-07-22 14:37:52 UTC
Created attachment 312352 [details]
SAP strace

Strace of the SAP process doing a copy to an NFS share (was requested to
observe the open syscall). Please note that SAP reuses processes, so in the
beginning this process is waiting to be activated, and it takes a while before
the real job runs.

Comment 55 Alex Koch 2008-07-22 15:25:23 UTC
Created attachment 312358 [details]
Solaris_snoop_Trace_part1

split with "split" command. just cut this and part 2 together and then
ungzip-untar it.

Comment 56 Alex Koch 2008-07-22 15:29:52 UTC
Created attachment 312359 [details]
Solaris_snoop_Trace_part2

part 2 of the split snoop trace

Comment 57 Peter Staubach 2008-07-25 15:41:50 UTC
Created attachment 312656 [details]
Draft Proposed patch

Here is a draft patch that 1) modifies the NFS client behavior to kick off
WRITE requests when it "fills" a complete block, 2) allows the WRITE to
be processed asynchronously and as an UNSTABLE write, and 3) adds a
VM tunable to be able to specify the amount of dirty memory in the
system in megabytes instead of as a percentage of the total memory.

Comment 58 Peter Staubach 2008-07-25 15:43:29 UTC
The patch described in Comment #57 is intended to be a draft patch and
a workaround for the situation.  It is not complete.  I think that we
also need to add some better flow control to the NFS client to limit
the amount of outstanding WRITE data to a server.  I will be working
on that.

Comment 59 Peter Staubach 2008-07-25 15:46:07 UTC
With respect to the tunable described in Comment #57, I am not sure yet
what the right value is.  I need to verify with the IBM NFS server folks
what the size of the cache really is and then I can identify a value to
start testing with.

Comment 62 Peter Staubach 2008-07-25 17:10:04 UTC
It turns out that the size of the cache on the NFS server is 256 slots
or 8MB total.

When UBS or IBM tests, I would suggest setting max_queue_depth to 4.
("echo 4 > /proc/sys/vm/max_queue_depth")

Hopefully, by limiting the amount of dirty memory in the system to 4MB,
the client will be able to transmit data to the server in a sequential
enough fashion for the server to be able to keep up.

Comment 64 Sachin Prabhu 2008-07-25 18:27:51 UTC
A test kernel based on the 2.6.18-99.el5 for the s390x architecture containing
the following patches

- close system call returns -ERESTARTSYS (Jeff Layton) [455729]
- Patch from https://bugzilla.redhat.com/process_bug.cgi#c57 (Peter Staubach)
[455408]

is available at
http://people.redhat.com/sprabhu/bz455408/kernel-2.6.18-99.el5.bz455408.1.s390x.tgz

This compressed file contains the following files
kernel-2.6.18-99.el5.bz455408.1.s390x.rpm
kernel-devel-2.6.18-99.el5.bz455408.1.s390x.rpm
kernel-headers-2.6.18-99.el5.bz455408.1.s390x.rpm


Comment 67 Alex Koch 2008-07-31 15:40:27 UTC
I tested the patched kernel, here are the results

I mounted the NFS share async:
mount -t nfs -o async,intr,rsize=32768,wsize=32768 
**OURNFSSERVER"":'********.nfs,text,nofastfilesize,xlat
(oemvs311)' /mnt/nfstest001


I started with a setting of "echo 4 > /proc/sys/vm/max_queue_depth"; it was set
to zero initially.

First I did copies with increasing file sizes of 164 KB, 1.6 MB, 16 MB, 164 MB,
1.6 GB and finally 2.0 GB, and could not reproduce the usual problem.

I then did several consecutive copies with 2.0GB also no problem. 

I started to copy 2 files in parallel, and IOWAIT (as reported by IOSTAT)
increased dramatically to about 64%. With 5 parallel copies we get close to 90%
and the system behaves sluggishly and unresponsively (e.g. opening a file in vim
or opening a man page takes minutes).

My understanding is that a small value for max_queue_depth would force the 
kernel to write pages out pretty aggressively but how this contributes to the 
IO Wait I do not know. Also the transfer times went up a bit.

Next I tried a setting of 16: echo 16 > /proc/sys/vm/max_queue_depth
and copied 2 files in parallel and we see an IO Wait of around 65%

The system still felt very unresponsive so I went up to 
 echo 64 > /proc/sys/vm/max_queue_depth
and copied 2 files in parallel. This resulted in our error situation: the NFS
Server complains that the client "did not send complete records". The IO Wait
went up to nearly 100% until I cancelled the copies after more than 40 minutes.

So with a setting of 64 and two parallel copies we get into the range of our
error again; lower settings result in massive IO waits (at least I suspect that
bigger settings make the system more responsive and reduce the IO wait).

As testing this is a bit tedious, because reproducing our error often requires a
restart (sometimes of both Linux and the NFS server), I would be grateful if
someone has some advice:

-IBM did test with a setting of "0"; does this mean not to buffer anything, or
does this effectively disable the tuning of this parameter?

-Does it make sense to try intermediate values between 16 and 32?

-What traces/dumps (if any) would you need? Which test should we reproduce then?
  -One that leads to an error?
  -A single copy?
  -Parallel copies?
  -Which setting for /proc/sys/vm/max_queue_depth?

Comment 68 Alex Koch 2008-07-31 15:43:23 UTC
Sorry, make that "reported by MPSTAT", not IOSTAT, in the previous comment...

Comment 69 Peter Staubach 2008-08-06 15:40:01 UTC
Created attachment 313588 [details]
Draft Proposed patch

Here is a new draft patch.  It adds flow control to the NFS client to
limit the number of outstanding WRITE requests made to the server and
to block applications from creating more dirty data when there are
already enough WRITE requests outstanding to the server.

Comment 70 Peter Staubach 2008-08-06 15:48:53 UTC
I have attached a new patch that I would like to see tested.

This patch adds a new tunable, nfs_max_outstanding_writes.  It
can be manipulated either via the sysctl command or via
/proc/sys/fs/nfs/nfs_max_outstanding_writes.  The default value
is 0, which indicates that no flow controlling should be done.

I would suggest setting this tunable to 8 when trying the tests
in the UBS environment.  This seems to result, in my testing,
in a sequence of WRITE requests that I think that the IBM MVS
NFS server should be able to handle.

The testing should be tried without manipulating the other new
tunable, max_queue_depth.

There is 1 other new variable which is used to monitor the
number of concurrent NFS WRITE requests which are outstanding.
It is nfs_peak_outstanding_writes and is accessible via
/proc/sys/fs/nfs/nfs_peak_outstanding_writes.  After running
the testing, or perhaps even during the testing, this pathname
can be cat'd to see the maximum number of concurrent
outstanding WRITE requests that have been made.
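
For example (assuming the patched test kernel from this bug, which provides
these /proc paths):

      # Enable flow control: allow at most 8 outstanding WRITE requests.
      echo 8 > /proc/sys/fs/nfs/nfs_max_outstanding_writes

      # ... run the copy test, then check the peak number of concurrent
      # outstanding WRITE requests that was observed:
      cat /proc/sys/fs/nfs/nfs_peak_outstanding_writes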

Comment 71 Alex Koch 2008-08-08 12:21:29 UTC
Created attachment 313803 [details]
Patch 2 tested on 20080808

Comment 72 Alex Koch 2008-08-08 12:22:09 UTC
I tested your new kernel patch today. I will put this information into bugzilla as well. Tests with one and 2 parallel copies of 2.0GB files finished successfully. 5 parallel copies of 2.0GB resulted in the "partial records" error on the NFS Server side.

I guess tuning nfs_max_outstanding_writes higher than 8 will not help here? Do you have any other suggestions on what we can test? Are you working on a new patch? If yes, when do you think will it be available?

Do you need more information? (Traces, Dumps etc). During the tests I logged output from sar, iostat and mpstat, I can send them if you think they would be helpful. 

Please find the test results attached ("Patch 2 tested on 20080808").

Comment 73 Peter Staubach 2008-08-19 12:18:33 UTC
Created attachment 314535 [details]
Draft proposed patch

Here is a new patch which is designed to attempt to address the
causes of the out of order WRITE requests.

The first change, in nfs_release_page, addresses a situation where
kswapd finds a page, through its LRU algorithms, and requests NFS
to release it.  Kswapd does not take the offset into the file into
account, so the page that it finds may be at a very high offset in
the file and there may be pages in the file, at lower offsets, which
are still dirty.  The solution is to flush all of the pages in the
file, starting at offset 0.

The second change is to add some synchronization so that when one
thread has acquired a list of pages which need to get flushed, it
can generate all of the WRITE requests for them, before another
thread can start generating WRITE requests for another range of
dirty pages in the file.

The third change is to add some instrumentation to attempt to detect
out of order WRITE requests being generated.

Please note that the instrumentation will only work when access to
the file is strictly sequential.  It will trigger on non-sequential
access and the information generated from these accesses should be
ignored.

Also please note that the synchronization code is not bullet proof.
It is timing and process scheduling dependent.  This is still being
looked into.  It mostly seems to work when there are only two threads
writing regions of the file, but may not for three and is not even
guaranteed to work for two.  It should be worth trying though.

Lastly, there is no flow control when the NFS client starts to generate
WRITE requests.  It will basically start sending as much dirty data as
it has.  The use of a transport protocol like TCP may help to avoid
overwhelming the server, but it is possible that some flow control may
be required anyway.

Comment 74 Sachin Prabhu 2008-08-21 21:39:18 UTC
A test kernel based on the 2.6.18-99.el5 for the s390x architecture containing
the following patches

- close system call returns -ERESTARTSYS (Jeff Layton) [455729]
- Patch from https://bugzilla.redhat.com/process_bug.cgi#c73 (Peter Staubach)
[455408]

is available at
http://people.redhat.com/sprabhu/bz455408/kernel-2.6.18-99.el5.bz455408.3.s390x.tgz

This compressed file contains the following files
kernel-2.6.18-99.el5.bz455408.3.s390x.rpm
kernel-devel-2.6.18-99.el5.bz455408.3.s390x.rpm
kernel-headers-2.6.18-99.el5.bz455408.3.s390x.rpm

Comment 76 Peter Staubach 2008-09-10 11:32:45 UTC
Created attachment 316301 [details]
Draft Proposed patch

This patch was developed and has been tested with the -109 kernel.

Comment 77 Sachin Prabhu 2008-09-13 08:52:00 UTC
A test kernel based on the 2.6.18-113.el5 for the s390x architecture containing
the following patches

- Patch from https://bugzilla.redhat.com/process_bug.cgi#c76 (Peter Staubach)
[455408]

This also includes the earlier patch by jlayton
* Tue Jul 29 2008 Don Zickus <dzickus> [2.6.18-101.el5]
- [fs] vfs: wrong error code on interrupted close syscalls (Jeff Layton ) [455729]

is available at
http://people.redhat.com/sprabhu/bz455408/kernel-2.6.18-113.el5.bz455408.4.s390x.tgz

This compressed file contains the following files
kernel-2.6.18-113.el5.bz455408.4.s390x.rpm
kernel-devel-2.6.18-113.el5.bz455408.4.s390x.rpm
kernel-headers-2.6.18-113.el5.bz455408.4.s390x.rpm

Comment 80 Peter Staubach 2008-09-16 18:04:11 UTC
Created attachment 316870 [details]
Draft Proposed patch

Here is the updated patch.  The serial access detection code wasn't
right, so it wasn't flushing requests as it went along.

Comment 81 Fabio Olive Leite 2008-09-16 20:42:24 UTC
kernel-2.6.18-115.el5.bz455408.5.src.rpm is currently building, current build information is available at https://brewweb.devel.redhat.com/taskinfo?taskID=1477553 .

Comment 82 Sachin Prabhu 2008-09-17 13:48:49 UTC
A test kernel based on the 2.6.18-113.el5 for the s390x architecture containing
the following patch

- Patch from https://bugzilla.redhat.com/process_bug.cgi#c80 (Peter Staubach)
[455408]

This also includes the earlier patch by jlayton
* Tue Jul 29 2008 Don Zickus <dzickus> [2.6.18-101.el5]
- [fs] vfs: wrong error code on interrupted close syscalls (Jeff Layton )
[455729]

is available at
http://people.redhat.com/sprabhu/bz455408/kernel-2.6.18-115.el5.bz455408.5.s390x.tgz

This contains the following files
kernel-2.6.18-115.el5.bz455408.5.s390x.rpm
kernel-devel-2.6.18-115.el5.bz455408.5.s390x.rpm
kernel-headers-2.6.18-115.el5.bz455408.5.s390x.rpm

Comment 83 Sachin Prabhu 2008-09-17 14:32:11 UTC
The note above incorrectly mentions that the kernel is based on 2.6.18-113.el5. The new kernel is based on 2.6.18-115.el5.

Comment 84 Peter Staubach 2008-09-22 17:52:47 UTC
Created attachment 317393 [details]
Draft Proposed patch

Here is a new draft proposed patch.  It is simpler, in some respects,
and seems to have better performance characteristics.  It seems worth
trying.

Comment 85 Fabio Olive Leite 2008-09-23 18:24:17 UTC
A new test kernel based on 2.6.18-116.el5 for the s390x architecture is available, containing the following patch:

- Patch from https://bugzilla.redhat.com/show_bug.cgi?id=455408#c84 (Peter Staubach)

This also includes the earlier patch by Jeff Layton:

* Tue Jul 29 2008 Don Zickus <dzickus> [2.6.18-101.el5]
- [fs] vfs: wrong error code on interrupted close syscalls (Jeff Layton) [455729]

A compressed tar archive is available at http://people.redhat.com/jmoyer/bz455408/kernel-2.6.18-116.el5.bz455408.6.s390x.tgz and contains the following files:

- kernel-2.6.18-116.el5.bz455408.6.s390x.rpm
- kernel-devel-2.6.18-116.el5.bz455408.6.s390x.rpm
- kernel-headers-2.6.18-116.el5.bz455408.6.s390x.rpm

Regards,
Fábio Olivé

Comment 86 Jeff Layton 2008-10-17 15:05:49 UTC
Looking over this patch. So with the nfs_is_serial() check below, we're only going to initiate a write when we have sequential writes:

+               } else if (nfs_is_serial(inode, pos) &&
+                          nfs_max_outstanding_writes) {
+                       if (NFS_I(inode)->ndirty >= NFS_SERVER(inode)->wpages) {
+                               nfs_flush_inode_async(inode, 0,
+                                               NFS_SERVER(inode)->wpages);
+                               }
+               }


...shouldn't we also consider something similar for random I/O? If I have a bunch of dirty pages that were written out of order by the app, shouldn't we try to flush them to the server as soon as we're able to stitch together a complete write request?

Comment 87 Peter Staubach 2008-10-17 15:10:31 UTC
Hmmm.  Maybe?  Got any suggestions?

The problem with doing it for random access is that the pages will be
scattered all over the file, thus causing the client to generate lots
of little WRITE requests to the server.  I think that it may be better
to wait and let the application do flushing based on its own criteria,
thus perhaps maximizing performance and/or correctness.

Comment 88 RHEL Program Management 2009-02-16 15:40:52 UTC
Updating PM score.

Comment 93 Jeff Layton 2009-11-24 12:54:14 UTC
I've had the latest patch for this in my test kernels for some time. I recently started looking at bug 516490. The test there is a simple dd test:

$ time dd if=/dev/zero of=/file/on/nfs bs=1024k count=5000

...the patch seems to make that test run 5-6 times slower:

Unpatched:
----------
$ time dd if=/dev/zero of=/mnt/dantu/ddtest.out bs=1024k count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 123.011 seconds, 42.6 MB/s

real	2m3.660s
user	0m0.028s
sys	0m7.684s

Patched:
--------
$ time dd if=/dev/zero of=/mnt/dantu/ddtest.out bs=1024k count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 744.942 seconds, 7.0 MB/s

real	12m26.704s
user	0m0.000s
sys	0m1.648s

...I think that problem will need to be addressed before we can take this for RHEL.

Comment 98 Brad Hinson 2010-02-08 15:57:19 UTC
Customer has closed the case, closing this BZ as we have no way to reproduce in-house.