Bug 436004 - 50-75 % drop in nfs-server rewrite performance compared to rhel 4.6+
Summary: 50-75 % drop in nfs-server rewrite performance compared to rhel 4.6+
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Eric Sandeen
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 391501 448130 448685
 
Reported: 2008-03-04 20:52 UTC by Sandeep K. Shandilya
Modified: 2009-10-16 12:33 UTC
CC: 24 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 19:36:38 UTC
Target Upstream Version:
Embargoed:


Attachments
info for the rhel 51 client mounting the rhel46 nfs server (2.35 KB, text/plain)
2008-03-04 20:52 UTC, Sandeep K. Shandilya
rhel 51 client mounting a rhel 51 server. (2.43 KB, text/plain)
2008-03-04 21:38 UTC, Sandeep K. Shandilya
testing with rhel 4.6 server and rhel 51 and rhel 46 clients (1.77 MB, application/x-gzip)
2008-04-09 17:20 UTC, Sandeep K. Shandilya
rhel 5.1 server with rhel46 and rhel51 client performance (1.72 MB, application/x-gzip)
2008-04-09 20:14 UTC, Sandeep K. Shandilya
NFS/IOzone best practices (241.10 KB, application/pdf)
2008-04-25 19:40 UTC, jacob liberman
vmstat 1 output for 1, 2, 4g files on rhel 4.6 and rhel 5.1 (7.50 KB, application/x-gzip)
2008-04-30 22:44 UTC, Sandeep K. Shandilya


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:0225 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update 2009-01-20 16:06:24 UTC

Description Sandeep K. Shandilya 2008-03-04 20:52:06 UTC
Description of problem:
A 50-75% drop in NFS server performance when running the iozone benchmark.
The client is RHEL 5.1 (NFSv3) and the server is RHEL 5.1 (NFSv3), both running
the latest updates. The same performance drop is observed when the client is
RHEL 4 Update 6.
Attached is the output of iozone and nfsstat on the RHEL 4.6 server and the RHEL 5.1
server. The configuration of the RHEL 5.1 system is the same as the RHEL 4.6 system
in all respects.

The iozone command line is:
iozone -S 1024 -s 8g -i 0 -i 1 -r 64k -f /mnt/nfs/test_file
The server has 4 GB of RAM and the client has 4 GB of RAM. The client is an
AMD box (a PowerEdge 1435) and the server is a PowerEdge 1950.

Version-Release number of selected component (if applicable):
nfs-utils-1.0.9-24.el5


How reproducible:
always


Steps to Reproduce:

1. Configure the NFS server and client (rw share) with default NFS settings.
2. Mount the share on the client.
3. Download iozone from http://www.iozone.org/src/current/iozone-3-291.src.rpm
4. Run iozone against the mount with the command line above (see the sketch below).
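
A minimal end-to-end sketch of steps 1-4; the server name and export path are
placeholders, and the iozone command line is the one from the description:

   mount -t nfs nfsserver:/data /mnt/nfs
   iozone -S 1024 -s 8g -i 0 -i 1 -r 64k -f /mnt/nfs/test_file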
  
Actual results:
There is a performance drop on the order of 50-75%.

Expected results:
There should be only a minor difference in performance between RHEL 5 and RHEL 4.

Additional info:
From the nfsstat output we see that the number of COMMIT operations is twice
that of the RHEL 4.6 server.

Comment 1 Sandeep K. Shandilya 2008-03-04 20:52:06 UTC
Created attachment 296800 [details]
info for the rhel 51 client mounting the rhel46 nfs server

Comment 2 Sandeep K. Shandilya 2008-03-04 21:38:55 UTC
Created attachment 296812 [details]
rhel 51 client mounting a rhel 51 server.

Compare this attachment with the previous one (id=296800) and observe that the
number of COMMITs is higher in this case with the rhel 51 server.

Comment 3 Peter Staubach 2008-03-11 17:29:05 UTC
This appears to be a duplicate of bz321111.  Please try those changes and
then let me know if the situation is not addressed.

Comment 4 Sandeep K. Shandilya 2008-03-12 17:31:26 UTC
I have yet to try comment #3; I will try it out and get back.
However, further investigation has revealed that RHEL 5.x is about 50% slower in raw
block I/O performance compared to RHEL 4.x, using the command below:
time dd if=/dev/zero of=/dev/sda1 bs=1k count=$((1024 * 1024 * 8))
Running the same benchmark directly on a filesystem also yields the same
result: RHEL 5.x is slow compared to RHEL 4.


Comment 5 Sandeep K. Shandilya 2008-03-13 11:48:04 UTC
I tried the test with the new kernel; there is no change in performance. RHEL 5
performance is still 50% slower than RHEL 4.
I have already attached the output of nfsstat in the attachments in comments #1 and
#2. It looks like this is a different problem altogether.
Comment #4 seems worth investigating.

Comment 6 Peter Staubach 2008-03-13 13:55:29 UTC
I don't think that I understand this problem yet.

There was a comment discussing the difference in the number of COMMIT
operations.  The client generates COMMIT operations, it decides when to
issue them, and this is server independent.

Comment #4 talks about raw block i/o.  This has nothing to do with
NFS.

So, is this perceived to be an NFS problem or something else?

Comment 7 Sandeep K. Shandilya 2008-03-31 16:49:09 UTC
This is perceived to be an NFS problem. The fix in bz321111 does not correct the
problem. A 2.6.24.2 kernel on the NFS server does NOT reproduce this issue.


Comment 8 Peter Staubach 2008-03-31 17:33:15 UTC
There have been a great number of changes to NFS made since RHEL-5 was
cut.  Many will be impossible to backport due to kABI concerns or will
simply be too risky.

Despite this being considered an NFS issue, I am still concerned
about the reported changes in raw disk performance.  If that
performance is down, then there is very little that we, in NFS, can
do to compensate.

Comment 9 Sandeep K. Shandilya 2008-04-06 09:08:32 UTC
I want to rule out the talk about changes in raw disk performance. It was an
error on my part: the RHEL 5 box was on a hardware RAID with 2 stripes whereas the
RHEL 4.6 box was on a hardware RAID with 4 stripes. I made changes to the RAID
configuration (both RHEL 5 and RHEL 4.6 have 4 stripes now) and observed that it
was indeed an NFS server issue.

Comment 10 Peter Staubach 2008-04-07 12:23:05 UTC
Just to be clear: Comment #4 and Comment #5 are considered to be red
herrings, and the raw disk i/o performance, for both RHEL-4 and
RHEL-5, on similar hardware and configuration, is relatively equivalent?

Comment 11 Sandeep K. Shandilya 2008-04-09 17:20:51 UTC
Created attachment 301853 [details]
testing with rhel 4.6 server and rhel 51 and rhel 46 clients

Here is the data with the rhel 4.6 server

Comment 12 Sandeep K. Shandilya 2008-04-09 17:55:57 UTC
My test setup

rhel 46 server
rhel 46 and rhel 51 clients

The tarball attached contains the following files
rhel45_server.txt (dmi and nfsstat on the nfs server)
rhel46_client_rhel46server.txt (dmi, iozone test result, nfsstat on rhel 46
client connecting to rhel 46 server)
rhel51_client_rhel465serv.txt (dmi, iozone test result, nfsstat on rhel 51
client connecting to rhel 46 server)
sosreport-sshandilya.436004-19214-a71c83_rhel46_server.tar.bz2 
sosreport-sshandilya.436004-578754-741be7_rhel46_client.tar.bz2
sosreport-sshsandilya.436004-60005-bf549d_rhel51_client.tar.bz2

As you can see, there is a drop in NFS read performance in the case of the rhel 5.1
client connecting to the rhel 4.6 server.

More test results will follow, in which I test rhel 5.1 as the NFS server
with rhel 4.6 and rhel 5.1 as clients.

Comment 13 Sandeep K. Shandilya 2008-04-09 20:14:02 UTC
Created attachment 301892 [details]
rhel 5.1 server with rhel46 and rhel51 client performance

Here is the test setup for rhel 5.1 as the NFS server:

rhel 5.1 NFS server
rhel 5.1 and rhel 4.6 clients

rhel46_client_rhel51serv.txt (dmi, nfsstat and iozone test output)
rhel51_client_rhel51serv.txt (dmi, nfsstat and iozone test output)
rhel51_server.txt (dmi, nfsstat output)
sosreport-sshandilya.436004-114080-efa7c4_rhel51_client.tar.bz2 (sos report of
the rhel 51 client with the rhel 51 server)
sosreport-sshandilya.436004-24116-45cbe1_rhel46_client.tar.bz2 (sos report
rhel46 client and rhel 51 server)
sosreport-sshandilya.436004-89616-e06d7a_rhel51_server.tar.bz2 (sos report of
rhel 51 server)

As you can see, when rhel 5.1 is the NFS server it does not matter what you
have as the client (rhel 4 / rhel 5); performance is always bad (compared with
comment #11).

Comment 15 John Feeney 2008-04-17 20:11:55 UTC
Sandeep,

We looked at the data you previously reported (thanks) and now have
more questions to help us get to the bottom of this:
1. What type of servers were being used? Were the servers (4.6 and 5.1) 
   exactly the same? hardware? configuration? network?
2. Any chance of providing the data on the raw disk throughput
    mentioned in comment #9, to prove there is no difference there?
3. Could you provide iozone data when run directly on each server?
4. It is presumed that you are using ext3 so how does ext3 performance 
    compare when run on RHEL4 and RHEL5 with identical hardware?
5. If ext3 is not being used, what is then?
6. What are the export options being used on each server?
7. What mount options on each client? And which are actually being used? 
   See /proc/mounts on each system.

Well, that's it for now. As you can surmise, we would like to get some info
on each variable in this puzzle, since there is no use looking in the wrong
place for something.


Comment 16 Sandeep K. Shandilya 2008-04-18 07:19:09 UTC
My replies:

1. The servers were both PowerEdge 1950s with
4 GB RAM and 2 Intel quad-core CPUs at 2.33 GHz with 6144 KB cache. You can check this in the
dmidecode output that I have attached. The RAID controller is a PERC 5/i
(megaraid_sas). I double-checked the hardware RAID level and the disks on the RAID
controller.

2. The iozone output from running directly on the servers is below.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
rhel 4.6 nfs server (megaraid_sas, 00.00.03.13)
-----------------------------------------------
        Command line used: iozone -S 6144 -s 8g -i 0 -i 1 -r 64k -f /data/test_file
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 6144 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
              KB  reclen   write  rewrite     read   reread
         8388608      64    34646    32890   109452   113864

rhel 5.1 nfs server (megaraid_sas version 00.00.03.10)
------------------------------------------------------
Command line used: iozone -S 6144 -s 8g -i 0 -i 1 -r 64k -f /data/test_file
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 6144 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
              KB  reclen   write  rewrite     read   reread
         8388608      64   129699    98165   119781   119918
3. This is covered by the iozone output in the answer to question 2 above.
+++++++++++++++++++++++

4. Both NFS servers are on LVM2.
+++++++++++++++++++++++++++++++++++++++++
rhel 4 nfs server.
------------------
Disk /dev/sda: 145.4 GB, 145492017152 bytes
255 heads, 63 sectors/track, 17688 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          65      522081   83  Linux
/dev/sda2              66       17688   141556747+  8e  Linux LVM
  ACTIVE            '/dev/VolGroup00/root' [58.59 GB] inherit
  ACTIVE            '/dev/VolGroup00/swap' [8.00 GB] inherit


rhel 5 nfs server.
------------------
Disk /dev/sda: 145.4 GB, 145492017152 bytes
255 heads, 63 sectors/track, 17688 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          65      522081   83  Linux
/dev/sda2              66       17688   141556747+  8e  Linux LVM
  ACTIVE            '/dev/VolGroup00/root' [58.59 GB] inherit
  ACTIVE            '/dev/VolGroup00/swap' [8.00 GB] inherit
5. Both servers have ext3 partitions; here is the output of /proc/mounts:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

rhel 4 nfs server
-----------------
rootfs / rootfs rw 0 0
/proc /proc proc rw,nodiratime 0 0
none /dev tmpfs rw 0 0
/dev/root / ext3 rw 0 0
none /dev tmpfs rw 0 0
none /selinux selinuxfs rw 0 0
/proc /proc proc rw,nodiratime 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
/sys /sys sysfs rw 0 0
none /dev/pts devpts rw 0 0
/dev/sda1 /boot ext3 rw 0 0
none /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0

rhel 5.1 nfs server
-------------------
rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/sda1 /boot ext3 rw,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
/etc/auto.misc /misc autofs
rw,fd=6,pgrp=3215,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs rw,fd=11,pgrp=3215,timeout=300,minproto=5,maxproto=5,indirect 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0

6. Export options
++++++++++++++++++
rhel 4.6
--------
/data           172.16.64.0/24(rw,wdelay,no_root_squash)

rhel 5.1
--------
/data          
172.16.64.0/24(rw,wdelay,no_root_squash,no_subtree_check,anonuid=65534,anongid=65534)

7. mount options
++++++++++++++++
on the rhel 5.1 client connecting to rhel 5.1 server is
-------------------------------------------------------
172.16.64.164:/data /mnt/nfs nfs
rw,vers=3,rsize=32768,wsize=32768,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.64.164
0 0
on the rhel 4.6 client connecting to rhel 4.6 server is
-------------------------------------------------------
172.16.64.203:/data /mnt/nfs nfs
rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,proto=tcp,timeo=600,retrans=5,addr=172.16.64.203
0 0

Comment 17 Peter Staubach 2008-04-18 12:21:51 UTC
Thank you for the information.

I have some more questions concerning the information though.

The response to question #2 was not answered.  The data posted was for
an iozone run made on top of a file system.  The question concerned
the raw bandwidth available directly to the storage.

The response to question #3 or #2, depending upon how you view it,
seems incomplete or, at best, I don't know how to read it.  There
seem to be many columns and only a few numbers.  That said, does
it appear that the write and rewrite numbers for RHEL-4 are _much_
slower than those for RHEL-5?

I am still unclear on the file system type being used.  The iozone
tests were run on /data, but I don't see /data listed in the
/proc/mounts output for either server.  Is /data just a directory
on the root file system?

The mount options listed for the clients are good, but I also
need the mount options used when the 4.6 client talks to the 5.1
server.

Comment 18 Peter Staubach 2008-04-18 12:33:32 UTC
So, let's try one more time.

The goal of these questions is to attempt to rule out hardware differences
between the two server systems.  We need to ensure that the storage
subsystems perform with roughly the same performance characteristics.
Hence, we need to know raw hardware bandwidth.

Next, we'd like to rule out any differences due to the local file system
being on each server.  Thus, we need the performance characteristics as
run locally, on the exported file system, on each server.  We need _all_
of the numbers which got generated.

If any of the above information shows significant differences, then we
need to investigate them prior to investigating the NFS server.

Once the hardware and file systems on each server show roughly the
same performance characteristics, then we need to check to see how the
NFS client is mounting each server.  We need to use 1 client for testing
against each server.  The use of 2 clients just introduces more potential
differences, which makes resolving this issue impossible.  1 client,
please.

I'd like to see the mount options used by that client when mounting
each server.  Please don't vary the options, like using nointr and
changing the retrans values.  Please use the same options, the default
set.

We then need to ensure that each server is exporting the file system
in the same fashion.  You can add "no_subtree_check", but if you do,
then do it on both servers, please.  It would be easiest to just use
no options and let the options default.
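
As an illustration only (the /data path and client network are taken from
earlier comments in this bug), an identical minimal export line on both
servers could be:

   /data 172.16.64.0/24(rw)

followed by "exportfs -ra" on each server to re-export.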

After all that, the output from running iozone on that 1 client,
against each server, would be required.  The client needs to be
otherwise quiesced, i.e. nothing else running.

After each run of iozone, we need the "nfsstat -s" statistics from
the server being tested.  It would be best if the server was rebooted
immediately before being tested and then not used for any other NFS
traffic.

I am assuming that the network that these three systems are connected
to is gigabit ethernet and that it is otherwise quiet, so there
should be no network effects.  If this is not true, please let me know.

The goal here is to reduce the variables in the information down to
a situation where we can really determine what the issue is and where
it might be.

Comment 19 Sandeep K. Shandilya 2008-04-18 13:31:38 UTC
(In reply to comment #17)
> Thank you for the information.
> 
> I have some more questions concerning the information though.
> 
> The response to question #2 was not answered.  The data posted was for
> an iozone run made on top of a file system.  The question concerned
> the raw bandwidth available directly to the storage.
I will get this data in an update.
> 
> The response to question #3 or #2, depending upon how you view it,
> seems incomplete or, at best, I don't know how to read it.  There
> seem to be many columns and only a few numbers.  That said, does
> it appear that the write and rewrite numbers for RHEL-4 are _much_
> slower than those for RHEL-5?

Read the numbers as follows:
file size, record size, write, rewrite, read, reread.
Yes, you are right that the write and rewrite numbers for RHEL-4 are much
slower than RHEL-5. Read performance is the same.

> 
> I am still unclear on the file system type being used.  The iozone
> tests were run on /data, but I don't see /data listed in the
> /proc/mounts output for either server.  Is /data just a directory
> on the root file system?
Yes, /data is a directory on the root file system; if you want it to be a
separate file system, that can be done.
> 
> The mount options listed for the clients are good, but I also
> need the mount options used when the 4.6 client talks to the 5.1
> server.
Yes, I will update you with this data within a couple of hours.



Comment 20 Sandeep K. Shandilya 2008-04-18 18:49:06 UTC
(In reply to comment #17)
> Thank you for the information.
> 
> I have some more questions concerning the information though.
> 
> The response to question #2 was not answered.  The data posted was for
> an iozone run made on top of a file system.  The question concerned
> the raw bandwidth available directly to the storage.
Raw performance output:

on the rhel 4.6 nfs server
--------------------------
[root@localhost ~]# time dd if=/dev/zero of=/dev/VolGroup00/test bs=64k count=131072
131072+0 records in
131072+0 records out
    
real    4m43.574s
user    0m0.034s
sys     0m10.683s

[root@localhost ~]# time dd of=/dev/null if=/dev/VolGroup00/test bs=64k count=131072
131072+0 records in
131072+0 records out

real    1m19.554s
user    0m0.038s
sys     0m9.879s

on the rhel 5.1 nfs server
--------------------------
[root@localhost ~]# time dd if=/dev/zero of=/dev/VolGroup00/test bs=64k count=131072

131072+0 records in
131072+0 records out
8589934592 bytes (8.6 GB) copied, 80.4467 seconds, 107 MB/s

real    1m20.497s
user    0m0.059s
sys     0m10.809s
[root@localhost ~]# 
[root@localhost ~]# time dd of=/dev/null if=/dev/VolGroup00/test bs=64k count=131072
131072+0 records in
131072+0 records out
8589934592 bytes (8.6 GB) copied, 79.5618 seconds, 108 MB/s

real    1m19.620s
user    0m0.055s
sys     0m9.212s
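
For comparison, the RHEL 4.6 write above works out to 8589934592 bytes /
283.6 s, i.e. roughly 30 MB/s, versus the ~107 MB/s reported for RHEL 5.1,
a 3-4x difference in raw write speed.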

> I am still unclear on the file system type being used.  The iozone
> tests were run on /data, but I don't see /data listed in the
> /proc/mounts output for either server.  Is /data just a directory
> on the root file system?
Yes, /data was on the root file system.
> 
> The mount options listed for the clients are good, but I also
> need the mount options used when the 4.6 client talks to the 5.1
> server.

172.16.64.164:/test /mnt/nfs nfs
rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,proto=tcp,timeo=600,retrans=5,addr=172.16.64.164
0 0

Comment 21 Sandeep K. Shandilya 2008-04-18 19:08:05 UTC
Here are the NFS iozone results, the nfsstat output on server and client, and the
mount options for the rhel 5.1 client.
rhel 5.1 client data
++++++++++++++++++++
rhel 5.1 nfs server
-------------------
performance
filesize  record size  write rewrite  read  reread
8388608      64   43010   15196    28017    27862
nfsstat on client
Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
0          0          0          0          0       

Client rpc stats:
calls      retrans    authrefrsh
1055608    0          0       

Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 5         0% 1         0% 9         0% 10        0% 0         0% 
read         write        create       mkdir        symlink      mknod        
524290   49% 524356   49% 2         0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
2         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 2         0% 0         0% 6930      0% 

Client nfs v4:
null         read         write        commit       open         open_conf    
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
open_noat    open_dgrd    close        setattr      fsinfo       renew        
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
setclntid    confirm      lock         lockt        locku        access       
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
getattr      lookup       lookup_root  remove       rename       link         
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
symlink      create       pathconf     statfs       readlink     readdir      
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
server_caps  delegreturn  
0         0% 0         0% 
nfsstat on rhel 5.1 server

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1055611    0          0          0          0       

Server nfs v3:
null         getattr      setattr      lookup       access       readlink     
2         0% 5         0% 1         0% 9         0% 10        0% 0         0% 
read         write        create       mkdir        symlink      mknod        
524289   49% 524356   49% 2         0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
2         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 3         0% 0         0% 6930      0% 

mount options
172.16.64.164:/test /mnt/nfs nfs
rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.64.164
0 0

rhel 4.6 nfs server data
------------------------
iozone benchmark.
filesize  recordsize write	rewrite	  read	   reread
8388608      64      35491   30219    63876    65253 


nfsstat output on rhel 5.1 client
Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
0          0          0          0          0       

Client rpc stats:
calls      retrans    authrefrsh
1051656    0          0       

Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 5         0% 1         0% 9         0% 10        0% 0         0% 
read         write        create       mkdir        symlink      mknod        
524290   49% 524344   49% 2         0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
2         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 2         0% 0         0% 2990      0% 

Client nfs v4:
null         read         write        commit       open         open_conf    
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
open_noat    open_dgrd    close        setattr      fsinfo       renew        
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
setclntid    confirm      lock         lockt        locku        access       
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
getattr      lookup       lookup_root  remove       rename       link         
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
symlink      create       pathconf     statfs       readlink     readdir      
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
server_caps  delegreturn  
0         0% 0         0% 

nfsstat output on rhel 4.6 server
Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1051660    0          0          0          0       

Server nfs v3:
null         getattr      setattr      lookup       access       readlink     
3         0% 5         0% 1         0% 9         0% 10        0% 0         0% 
read         write        create       mkdir        symlink      mknod        
524289   49% 524344   49% 2         0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
2         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 3         0% 0         0% 2990      0% 


mount options rhel 51 client on rhel 4.6 server.
172.16.64.203:/test /mnt/nfs nfs
rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.64.203
0 0



Comment 22 Sandeep K. Shandilya 2008-04-18 19:11:04 UTC
Here are the NFS iozone results, the nfsstat output on server and client, and the
mount options for the rhel 4.6 client.

rhel 4.6 client data
+++++++++++++++++++

rhel 5.1 Server
--------------
iozone performance
filesize  recordsize write  rewrite   read    reread
8388608      64      57326   24917    34115    34225

nfsstat on rhel 4.6 client
Client rpc stats:
calls      retrans    authrefrsh
1048723    0          0       

Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 14        0% 1         0% 3         0% 11        0% 0         0% 
read         write        create       mkdir        symlink      mknod        
524292   49% 524316   49% 2         0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
2         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 1         0% 0         0% 81        0% 

nfsstat on rhel 5.1 nfs server
Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1048728    0          0          0          0       

Server nfs v3:
null         getattr      setattr      lookup       access       readlink     
2         0% 14        0% 1         0% 3         0% 11        0% 0         0% 
read         write        create       mkdir        symlink      mknod        
524292   49% 524316   49% 2         0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
2         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 2         0% 0         0% 81        0% 

mount options

172.16.64.164:/test /mnt/nfs nfs
rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,proto=tcp,timeo=600,retrans=5,addr=172.16.64.164
0 0

rhel 4.6 client
---------------
iozone performance
filesize  recordsize write  rewrite   read    reread
8388608      64   24786   25949    92873    94640

nfsstat on rhel 4.6 client
Client rpc stats:
calls      retrans    authrefrsh
1048735    0          0       

Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 14        0% 1         0% 4         0% 10        0% 0         0% 
read         write        create       mkdir        symlink      mknod        
524292   49% 524319   49% 2         0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
2         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 1         0% 0         0% 90        0% 

nfsstat on rhel 4.6 server
Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1048738    0          0          0          0       

Server nfs v3:
null         getattr      setattr      lookup       access       readlink     
1         0% 14        0% 1         0% 4         0% 10        0% 0         0% 
read         write        create       mkdir        symlink      mknod        
524291   49% 524319   49% 2         0% 0         0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
2         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
fsstat       fsinfo       pathconf     commit       
0         0% 2         0% 0         0% 90        0% 

mount options
172.16.64.203:/test /mnt/nfs nfs
rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,proto=tcp,timeo=600,retrans=5,addr=172.16.64.203
0 0

Comment 23 Sandeep K. Shandilya 2008-04-18 19:18:06 UTC
(In reply to comment #18)
> So, let's try one more time.
> 
> The goal of these questions is to attempt to rule out hardware differences
> between the two server systems.  We need to ensure that the storage
> subsystems perform with roughly the same performance characteristics.
> Hence, we need to know raw hardware bandwidth.
> 
> Next, we'd like to rule out any differences due to the local file system
> being on each server.  Thus, we need the performance characteristics as
> run locally, on the exported file system, on each server.  We need _all_
> of the numbers which got generated.
> 
> If any of the above information shows significant differences, then we
> need to investigate them prior to investigating the NFS server.
> 
> Once the hardware and file systems on each server show roughly the
> same performance characteristics, then we need to check to see how the
> NFS client is mounting each server.  We need to use 1 client for testing
> against each server.  The use of 2 clients just introduces more potential
> differences, which makes resolving this issue impossible.  1 client,
> please.
Check out comment #21 and #22.
> 
> I'd like to see the mount options used by that client when mounting
> each server.  Please don't vary the options, like using nointr and
> changing the retrans values.  Please use the same options, the default
> set.
Comments #21 and #22 contain mount options using the default set.
> 
> I am assuming that the network that these three systems are connected
> to is gigabit ethernet and that it is otherwise quiet, so there
> should be no network effects.  If this is not true, please let me know.
The network is gigabit ethernet. It's quiet.



Comment 24 Peter Staubach 2008-04-22 20:35:02 UTC
Thank you for all of the information.  It seems contradictory in many
respects, so we need to boil things down a bit further.

First, any idea why writing to the partition on RHEL-4 takes between
3 and 4 times as long as it did on RHEL-5?

What exports options are being used on each server?

Are the server systems identical, hardware-wise?

Comment 25 John Feeney 2008-04-22 22:36:55 UTC
Pardon me if I jump in and ask some questions too but I have to know...

Is the data consistent when the same test is executed in the same environment?
If not, how much variance is measured? Can an average be calculated across,
say, three runs? I guess I am with Peter in that it is hard to find a pattern,
so I was wondering how much fluctuation was occurring from one identical test
to the next. Pick a client and a server, run the test three times, calculate the
variance, and average the read results if the variance isn't too great.

Excuse me if that data has already been presented, but I did not see it.

Comment 26 Sandeep K. Shandilya 2008-04-23 10:24:34 UTC
(In reply to comment #24)
> Thank you for all of the information.  It seems contradictory in many
> respects, so we need to boil things down a bit further.
> 
> First, any idea why writing to the partition on RHEL-4 takes between
> 3 and 4 times as long as it did on RHEL-5?
This could be another issue that we could track in another bugzilla.
In spite of this, NFS write performance differs by a smaller margin
than direct I/O does.
> 
> What exports options are being used on each server?
/nfsshare 172.16.64.0/24(rw, no_root_squash)
> 
> Are the server systems identical, hardware-wise?
Yes, they are identical in all respects;
please check the attachments in comments #11 and #13.

I have been able to reproduce the issue using just dd over NFS.

Here is a simpler method to reproduce the issue (see the sketch below):
1. Take two systems with any configuration.
2. Install rhel 4.6 on the server and create an NFS share.
3. Create a big file of size = (2 * system RAM) on the server.
4. Mount the share on the client (the client is rhel 5.1).
5. Read the file on the client using the command:
   #time dd if=/mnt/nfs/bigfile of=/dev/null
6. Repeat the same experiment after upgrading the server to rhel 5.1.
7. Observe the results. In my case rhel 5.1 is way slower.
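
A compressed sketch of those steps; the server address, export path, and file
name are placeholders (an 8 GB file, assuming a 4 GB server):

   # on the server (first rhel 4.6, then again after upgrading it to rhel 5.1)
   dd if=/dev/zero of=/data/bigfile bs=1M count=8192
   # on the rhel 5.1 client
   mount -t nfs 172.16.64.x:/data /mnt/nfs
   time dd if=/mnt/nfs/bigfile of=/dev/null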

I have tried this experiment about 10 times and there is hardly any variation in
the data.
I have tried various memory configurations and I still see the same result:
read performance is always lower.

Could I do something with oprofile to get you more data? Would it help?

Comment 29 John Feeney 2008-04-23 22:52:54 UTC
Sandeep,
As a result of the last weekly call, I discussed this situation with several
parties involved and reviewed the data you have provided in the past. In
addition to our increased testing to be done on this bugzilla, one thing that
was determined was the possible involvement of virtual memory in this puzzle. 

Thus, I reviewed your sosreports from comments #2 and #3 to compare vm entries in
/proc and found them mostly similar, but there were a couple of differences. We
have contacted the vm people to take a look at this.

As your data has shown, RHEL5.1 write throughput measured locally appears to have
improved over RHEL4.6 (I refer to comments #16 and #20), but we wonder whether the
writes are actually being written to disk, or whether they are being cached and the app
finishes before they are completely written, hence the perceived "improvement"
in write speed. We believe there is an option to iozone that makes the
app wait for the sync to take place; perhaps that might prove our theory to be
true. This "ending before flushing is complete" may or may not be happening, but if
it is, it would be an example of a subtle difference between RHEL4 and RHEL5
that makes this degradation tough to pin down.

On the other hand, the rewrite data shows a degradation too. We speculate that
this degradation could be due to the rewrite having to wait for the flushing
of data before being able to access it.

In addition, you have convinced me that the server hardware and configurations
are the same, and as I stated during the call, I like the fact that your new
procedure calls for an upgrade of the server rather than introducing a
replacement, thus removing another variable (the server hardware).

Also, we think you are on the right track by using dd instead of iozone as
proposed in comment #26, since iozone has to create a file before writing it. Our
QE people liked the fact that the file will now already be created, which should
reduce the caching variable that file creation introduces.

So with all that stated, we would like you to use an existing (8 GB) file and
drop the cache before using it ("sync" and then "echo 3 >
/proc/sys/vm/drop_caches"). Once this is done, we would like you to run dd tests
locally, with no NFS, and monitor the dirty cache via the "echo m
> /proc/sysrq-trigger" command. I grep for "dirty" in /var/log/messages to see
how the value associated with "dirty" changes as the test proceeds; that is,
echo m and then grep /var/log/messages every couple of seconds. We think it
is possible that your 8 GB file is causing an inordinate number of dirty pages to
be flushed out, causing a performance bottleneck. This could cause further
trouble in the i/o subsystem, but let's figure out whether there is a caching
issue first. It was suggested that the value of dirty be monitored as the test
proceeds so one can get a feel for how it changes over the course of time. A
sketch of that loop follows.
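
A minimal sketch of the requested monitoring loop, assuming a pre-existing test
file at the placeholder path /data/bigfile:

   sync
   echo 3 > /proc/sys/vm/drop_caches        # drop the page cache before the run
   dd if=/data/bigfile of=/dev/null bs=64k &
   while kill -0 $! 2>/dev/null; do
       echo m > /proc/sysrq-trigger         # dump memory info to the kernel log
       grep -i dirty /var/log/messages | tail -n 1
       sleep 2
   done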

Once the dirty values have been obtained for local execution, the same could be
done in an NFS setting. It was also suggested that the tests be done with direct
I/O to take the cache variable out of the picture.

It was suggested that the file size be reduced to, say, 1 GB, which would reduce the
length of your test but may also reduce the pressure on the cache and dirty
page handling, if those are the culprits. If a smaller file works remarkably
better than a large one, it will give us all a better feel for the ramifications
of this bugzilla.

In addition, perhaps you could provide more detail on the striping being used,
e.g. hardware and software RAID details. We know it's a PERC 5/i, megaraid_sas,
and 4 stripes. I can see from the /var/log/messages file in the sosreport that
Fujitsu disks are being used, but any other details would be helpful.



Comment 31 Larry Woodman 2008-04-24 17:23:13 UTC
This has probably been asked/investigated, but is this an NFS-only problem?  In
other words, do we see a significant performance drop running the same iozone test
on a 5.2 system without any NFS involvement at all?  Also, do we see a similar
problem if the server has 8GB or even 12GB of RAM?  Since we are over-committing
the server's RAM, perhaps it's a caching issue?

Larry Woodman


Comment 32 Sandeep K. Shandilya 2008-04-24 18:25:24 UTC
(In reply to comment #31)
> This has probably been asked/investigated, but is this an NFS-only problem?  In
> other words, do we see a significant performance drop running the same iozone test
This is an NFS-only problem. I don't see the issue if I scp the same file (8 GB
in size) over the same network or run the stress directly on the server.

> on a 5.2 system without any NFS involvement at all?  Also, do we see a similar
> problem if the server has 8GB or even 12GB of RAM?  Since we are over-committing
> the server's RAM, perhaps it's a caching issue?
>
Instead of increasing the system RAM, I did something slightly different: I
reduced the file size from 8 GB to 2 GB on both rhel46 and rhel51, and the problem
does NOT reproduce. What you seem to be thinking is right; it might be a caching issue
with respect to NFS.
> Larry Woodman
> 



Comment 33 Sandeep K. Shandilya 2008-04-24 18:30:14 UTC
(In reply to comment #29)
> Sandeep,
> As a result of the last weekly call, I discussed this situation with several
> parties involved and reviewed the data you have provided in the past. In
> addition to our increased testing to be done on this bugzilla, one thing that
> was determined was the possible involvement of virtual memory in this puzzle. 
> 
> Thus, I reviewed your sosreport from comments #2 and #3 to compare vm entries in
> /proc but found them to be similar but there were a couple of differnces. We
> have contacted the vm people to take a look at this.
> 
> As your data has shown, RHEL5.1 write throughput done locally appears to have
> improved over RHEL4.6 (I refer to comments #16 and #20) but we wonder if the
> writes are actually being written to disk or they are being cached and the app
> finishes before they are completely written, hence the perceived "improvement"
> in write speed. We believe there might be an option to iozone that requires the
> app wait for the sync to take place. Perhaps that might prove our theory to be
> true. This "ending before flushing is complete" may or may not be true, but if
> it is, it would be an example of a subtle difference between RHEL4 and RHEL5
> that makes this degradation tough to pin down. 
> 
> On the other hand, the rewrite data shows a degradation, too. We are speculating
> if this degradation could be due to the rewrite having to wait for the flushing
> of data before being able to access it.  
> 
> In addition, you have convinced me that the server hardware and configurations
> are the same and as I stated during the call, I like the fact that your new
> procedure calls for an upgrade of the server rather than introduce a
> replacement. Thus, removing another variable (the server hardware).
> 
> Also, we think you are on the right track by using dd instead of iozone as
> proposed in comment #26 since iozone has to create a file before writing it. Our
> QE people liked the fact that now the file will already created and this should
> reduce the caching variable that file creation introduces.
> 
> So with that all stated, we would like you to use an existing (8Gb) file and
> then dump the cache before using this file. ("sync" and then "echo 3 >
> /proc/sys/vm/drop_caches") Once this is done, we would like you to run dd tests
> locally, no nfs, and monitor the dirty cache by the "echo m
> >/proc/sysrq-trigger" command. I grep for "dirty" in /var/log/messages to see
> how the value associated with "dirty" changes as the test proceeds, that is,
> echo m and then grep /var/log/messages every couple of seconds. We think that it
> is possible that your 8Gb file is causing an inordinate amount of dirty pages to
> be flushed out, thus causing a performance bottleneck. This possibly could cause
> further trouble in the i/o subsystem but let's figure out if there is a caching
> issue first. It was suggested that the value of dirty be monitored as the test
> proceeds so one can get a feel for how it changes over the course of time.  
> 
> Once the dirty values have been attained for local execution, the same could be
> done in a NFS setting. It was also suggested that the tests be done with direct
> I/O to take the cache variable out of the picture. 
> 
> It was suggested that the file size be reduced to, say, 1 GB, which would reduce the
> length of your test but may also reduce the pressure on the cache and dirty
> page handling, if those are the culprits. If a smaller file works remarkably
> better than a large one, it will give us all a better feel for the ramifications
> of this bugzilla.
The performance is the SAME on both rhel 46 and rhel 51 if the file size is set to
1 GB! I have replied to Larry Woodman in comment #32. If it is still required
that I analyze the dirty page behaviour, I can do so.



Comment 34 jacob liberman 2008-04-25 19:40:21 UTC
Created attachment 303813 [details]
NFS/IOzone best practices

Comment 35 jacob liberman 2008-04-25 20:01:00 UTC
Hi Sandeep,

We (HPC engineering) saw similar slow read performance with the RHEL4 NFS client on
iozone.  I attached a doc that describes the correct way to measure NFS performance
with iozone.  Some highlights from our experience:

1) Make sure the underlying hardware is alike: same number of disks, speed, RAID
configuration, RAID cache policy (write-through versus write-back, battery present).

Until you do this there is no point in measuring standard deviation, etc.

2) Use the -c and -U switches for iozone.  -c forces a commit before the write
completes, eliminating server-side caching, and -U unmounts the filesystem between
tests to force a sync.

3) Other iozone options can force synchronous file operations.  This will
flatten out performance differences (which I think is what you want) but is not a
good idea for benchmarking since it carries an undue penalty.

4) You are using a file size larger than the combined RAM to circumvent file system
caching.  This is a good idea.  In our experience, the RAM must be physically
removed from the boxes; capping RAM with "mem=" does not work for iozone.

5) We saw wide performance variation across record sizes. 512k gave the best and most
consistent performance in our config.

So after you verify the hardware, read the NFS/iozone best practices and then rerun
the tests with the additional switches: first -Uc and then the synchronous options
(an example follows).
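
As an illustration only, the command line from the original report with those
switches added might look like this (the /mnt/nfs mount point is an
assumption):

   iozone -S 1024 -s 8g -i 0 -i 1 -r 64k -c -U /mnt/nfs -f /mnt/nfs/test_file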

Please let me know if any of this is helpful.  I am happy to provide more
details if necessary.



Comment 36 Barry Marson 2008-04-27 22:26:59 UTC
The local test bed server here at Red Hat/Westford has been set up, and I ran
preliminary tests to establish local baseline numbers.  RHEL4.6 and RHEL5.1
builds on the system were tested, and they mount a common LVM volume.  The
system has 4GB of RAM.  The IOzone command was:

iozone -S 1024 -s 8g -i 0 -i 1 -r 64k

    RHEL     RUN Filesize Recsize   write rewrite     read   reread
    ----------------KB-------KB------KB------KB--------KB------KB--
    RHEL4.6 run1  8388608      64   57052   51086    57699    57881
    RHEL4.6 run2  8388608      64   57672   43608    57347    57821
    RHEL4.6 run3  8388608      64   58045   42814    57855    58029

    RHEL5.1 run1  8388608      64   59734   50124    57506    57804
    RHEL5.1 run2  8388608      64   61441   48304    57003    57840
    RHEL5.1 run3  8388608      64   60100   47918    57843    54687

As you can see, the only blip was the rewrites on RHEL4.6.

Barry

Comment 37 Sandeep K. Shandilya 2008-04-28 17:03:26 UTC
(In reply to comment #36)
> system has 4GB of RAM.  The IOzone command was:
> 
> iozone -S 1024 -s 8g -i 0 -i 1 -r 64k

What happens if you just check read performance with
time dd if=/mnt/nfs/bigfile of=/dev/null? In our case the rhel46 and rhel51
servers give results that differ by tens of seconds.

Also, can we check direct I/O on the same two servers? I guess the systems
are AMD? I see that you have set the -S param to 1024.
What client are you using?
I am using rhel 5.1 systems with kernel version 2.6.18-53.1.14.el5 and rhel 4.6 with
2.6.9-67.0.7.ELsmp, the latest on RHN.


Comment 38 Larry Woodman 2008-04-30 18:00:44 UTC
Sandeep (in reply to comment #32), can you get us "vmstat 1" output while
running the test with both the 8GB and 2GB file sizes, on both RHEL5-U2 and
whatever base you were using that does not show this problem?

Larry Woodman

Comment 39 Sandeep K. Shandilya 2008-04-30 22:33:32 UTC
Hello all

More updates on my testing.
Here is some data, taking into account that
the system RAM is 2 GB (enabling me to run tests faster), that I test only reads
using time dd if=<file on share> of=/dev/null,
and that the client is rhel 5.1.
                        File size in GB
                         1      2      4      8       16
rhel51_direct, MB/sec    91.84  82.45  96.56  102.81  98.20
rhel46_direct, MB/sec    91.18  81.40  94.12  102.94  98.27
rhel51srv, MB/sec        34.80  34.10  35.60  35.40   34.90
rhel46srv, MB/sec        86.50  79.40  89.30  97.50   93.50

I observed one interesting thing.
If I don't remove server-side caching (i.e. file size <= system RAM/2), or if I
rerun the same test with the same file after a remount of the share, I get
the same results on rhel 5.1 and rhel 4.6.
Here is the sequence of events on the rhel 5.1 client mounting a rhel 5.1 server:
[root@localhost ~]# mount 172.16.64.164:/test /mnt/nfs
[root@localhost ~]# dd if=/mnt/nfs/disc_1g_01.dat of=/dev/null
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 30.3699 seconds, 35.4 MB/s
You have mail in /var/spool/mail/root
[root@localhost ~]# umount /mnt/nfs
[root@localhost ~]# mount 172.16.64.164:/test /mnt/nfs
[root@localhost ~]# dd if=/mnt/nfs/disc_1g_01.dat of=/dev/null
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 9.16824 seconds, 117 MB/s
[root@localhost ~]# umount /mnt/nfs
[root@localhost ~]# mount 172.16.64.164:/test /mnt/nfs
[root@localhost ~]# dd if=/mnt/nfs/disc_4g_01.dat of=/dev/null
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 123.168 seconds, 34.9 MB/s

I hope this helps us get closer to the root cause.

Comment 40 Sandeep K. Shandilya 2008-04-30 22:44:14 UTC
Created attachment 304277 [details]
vmstat 1 output for 1, 2, 4g files on rhel 4.6 and rhel 5.1

Comment 41 Sandeep K. Shandilya 2008-04-30 22:45:49 UTC
(In reply to comment #38)
> Sandeep(in reply to comment #32), can you get us a "vmstat 1" output while
> running the test with both the 8GB and 2GB file sizes on both RHEL5-U2 and
> whatever base you were using that does not show this problem???
> 
> Larry Woodman
Comment #40 contains what you need. It's with the 5.1 server, 4.6 server, and 5.1 client.



Comment 42 Sandeep K. Shandilya 2008-04-30 23:13:04 UTC
> Comment #40 contains what you need. It's with the 5.1 server, 4.6 server, and 5.1
> client.

I get the same performance with rhel 5.2 snapshot6.



Comment 43 Barry Marson 2008-05-01 12:35:10 UTC
We have been able to replicate the issue here.  Below are the results on our 
test bed.

 IOzone command used:  4GB file, 64KB record size
 Systems booted with 2GB RAM (client and server)

 IOZONE FLAGS     KERNEL_OF_SERVER    WRITE  REWRITE     READ   REREAD
------------------------------------------------------------------------
default           2.6.18-53.1.14.el5  40079    18866    25212    25746
default           2.6.9-67.0.7.ELsmp  32572    38126    57213    57367

close             2.6.18-53.1.14.el5  35053    18530    26101    25208
close             2.6.9-67.0.7.ELsmp  32354    31711    56344    57652

osync+close       2.6.18-53.1.14.el5  23579     5163    27054    26249
osync+close       2.6.9-67.0.7.ELsmp  20056    31009    57906    56855

dio+close         2.6.18-53.1.14.el5   5048     2933    32304    32129
dio+close         2.6.9-67.0.7.ELsmp   5080     5560    42672    42338

dio+osync+close   2.6.18-53.1.14.el5   5049     2944    30670    32122
dio+osync+close   2.6.9-67.0.7.ELsmp   5050     5556    43756    43533
------------------------------------------------------------------------

We have some more investigating to do.  I have vmstats of both client and 
server during the runs.

Unfortunately we are literally in the middle of packing up for our move and 
probably cannot look further into this until early next week.

Barry

Comment 44 Marizol Martinez 2008-05-21 16:49:14 UTC
*** To our customers  ***

Both, Red Hat and Dell, are actively working to understand and address this
issue in a timely fashion. We appreciate your patience and request that you
refer to this bugzilla for the latest information on our progress.

Thank you.

Comment 46 Eric Sandeen 2008-05-21 17:49:57 UTC
Barry asked me to look at this one...

He noted that there was a lot of read activity going on during the rewrite phase
of an iozone test.

I used blktrace to keep track of what was being read, and tested iozone with a 2G
file size and 64k IO size, write and rewrite, from a client to a rhel5 server with
only 500M of memory.
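
For reference, a sketch of that kind of blktrace capture, with /dev/sdb
standing in for the server's exported data device:

   blktrace -d /dev/sdb -o rewrite &     # capture block-layer events during the run
   # ... run the iozone write/rewrite from the client ...
   kill %1
   blkparse -i rewrite | less            # decode and inspect the trace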

On stock rhel5, blktrace showed that during the rewrite phase (which rewrites 2G
of data), 2G of *reads* were being issued, and that each read IO was exactly 4k.
 This looked to me like the ll_rw_block in __block_prepare_write, which gets
called when we are writing a partial block.

Further, looking at the iovecs set up by nfsd, the first was of non-block-size:

0: iov_len 3940
1: iov_len 4096
...
 
which means that all later iovecs were not aligned.  But this is the same
behavior as on rhel4 and upstream, which were tested and found not to have the
problem.

So I chased down the iovec submission path to generic_file_buffered_write, and
how the start and end for ->prepare_write() were set up, because it was this
partial-block write that was causing the reads (in the read-modify-write).

This stuck out at me in the diff:

+               /*
+                * Limit the size of the copy to that of the current segment,
+                * because fault_in_pages_readable() doesn't know how to walk
+                * segments.
+                */
+               bytes = min(bytes, cur_iov->iov_len - iov_base);


This upstream mod:

[PATCH] knfsd: stop NFSD writes from being broken into lots of little writes to
files ...
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=29dbb3fc8020f025bc38b262ec494e19fd3eac02

fixes the iozone test for me, and the comments say:

When NFSD receives a write request, the data is typically in a number of
1448 byte segments and writev is used to collect them together.

Unfortunately, generic_file_buffered_write passes these to the filesystem
one at a time, so an e.g.  32K over-write becomes a series of partial-page
writes to each page, causing the filesystem to have to pre-read those pages
- wasted effort.

generic_file_buffered_write handles one segment of the vector at a time as
it has to pre-fault in each segment to avoid deadlocks.  When writing from
kernel-space (and nfsd does) this is not an issue, so
generic_file_buffered_write does not need to break an iovec from nfsd into
little pieces.

This patch avoids the splitting when  get_fs is KERNEL_DS as it is
from NFSd.

The regression was introduced by an upstream change,

[PATCH] generic_file_buffered_write(): deadlock on vectored write
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6527c2bdf1f833cc18e8f42bd97973d583e4aa83

which is not present in RHEL4.

Testing with this patch in place, I go from around 10MB/s on the iozone rewrite
test to around 40MB/s, on par with the original write.

Another note: for smaller file sizes in iozone testing this isn't obvious,
because all the pages stay in cache from the initial write and don't have to be
re-read from disk.

-Eric

Comment 47 Barry Marson 2008-05-21 18:46:03 UTC
The patch definitely solves the rewrite issue.  Rewrite performance improved
80-100%, depending on which RHEL5 kernel we were testing.

This issue was actually noticed while looking at the read/reread performance.
There still seems to be an issue with read performance, and we are trying a patch
which keeps metadata cached in memory longer.

stay tuned.

Barry

Comment 49 RHEL Program Management 2008-05-21 20:20:36 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 50 Barry Marson 2008-05-21 20:36:02 UTC
Present status is that read/reread performance is still an issue.  Reads are
still off by 40%.  So while writes are now performing well, we still have a
serious read issue.  I have a slew of tests to run to try to isolate this.

Barry

Comment 51 Eric Sandeen 2008-05-21 21:02:10 UTC
On the read side...

On my test setup, with 8 nfsd threads I am seeing about 20MB/s for read and
reread, and a fairly high seek rate, to the tune of around 200 seeks/s.

If I restrict to only 1 nfsd thread, I get 55MB/s and the seek rate is
substantially lower.

Additionally, if we look at the block IO stats for 1 thread:


Total (iozone_xfs_read_1thread_full):
 Reads Queued:      31,254,    4,000MiB	 Writes Queued:           0,        0KiB
 Read Dispatches:   31,115,    4,000MiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:   31,115,    4,000MiB	 Writes Completed:        0,        0KiB
 Read Merges:          139,   17,792KiB	 Write Merges:            0,        0KiB
 IO unplugs:        25,777        	 Timer unplugs:           0

Throughput (R/W): 74,414KiB/s / 0KiB/s

vs. 8 threads:

Total (iozone_xfs_read_full):
 Reads Queued:     121,516,    4,000MiB	 Writes Queued:           0,        0KiB
 Read Dispatches:   65,893,    4,000MiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:   65,893,    4,000MiB	 Writes Completed:        0,        0KiB
 Read Merges:       55,503,    1,768MiB	 Write Merges:            0,        0KiB
 IO unplugs:       125,270        	 Timer unplugs:           0

Throughput (R/W): 32,108KiB/s / 0KiB/s

we can see that this results in a very different IO pattern... with 1 thread
doing larger IOs.
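
(Put differently: 4,000MiB over 31,115 dispatches averages roughly 132KiB per
IO with 1 thread, versus about 62KiB per IO across 65,893 dispatches with 8
threads.)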

Comment 52 Eric Sandeen 2008-05-21 21:17:31 UTC
I'm going to hazard a guess that on the read side, sharing read requests across
the nfsds is defeating readahead.

With 8 threads, for a block range I see requests issued like:

  8,21   1        6     0.000019677  4059  D   R 630773105 + 64 [nfsd]
  8,21   1       16     0.019685010  4059  D   R 630773297 + 256 [nfsd]
  8,21   1       28     0.049531279  4059  D   R 630773553 + 32 [nfsd]
  8,21   1       34     0.049584655  4060  D   R 630773585 + 64 [nfsd]
  8,21   1       40     0.049614579  4061  D   R 630773649 + 64 [nfsd]
  8,21   1       46     0.049652358  4058  D   R 630773713 + 64 [nfsd]
.... 
and more.

With 1 thread:

  8,21   1        6     0.635870590  4309  D   R 630773105 + 64 [nfsd]
  8,21   1       14     0.662722860  4309  D   R 630773169 + 384 [nfsd]
  8,21   1       23     0.684639922  4309  D   R 630773553 + 512 [nfsd]

this looks like a growing readahead window.
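
That fits the readahead heuristic, which keys off strictly sequential offsets
in the per-open-file readahead state.  Hand-waving the details (a sketch of
the idea, not the actual mm/readahead.c code; the helper names are made up):

        struct file_ra_state *ra = &filp->f_ra;

        if (offset == ra->prev_page + 1) {
                /* strictly sequential: keep growing the window, as in
                 * the 64 -> 384 -> 512 sector trace from the 1-thread run */
                grow_readahead_window(ra);      /* hypothetical helper */
        } else {
                /* out-of-order offset -- e.g. several nfsd threads
                 * interleaving reads on one file -- looks random, so
                 * the window collapses back to small IOs */
                shrink_readahead_window(ra);    /* hypothetical helper */
        }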

-Eric

Comment 53 Eric Sandeen 2008-05-21 21:40:51 UTC
Hmm, that might have been slightly anomalous, but I do still see the
single-thread case consistently issuing larger IOs, usually to the tune of
256 sectors vs. 64.

-Eric

Comment 54 Sandeep K. Shandilya 2008-05-22 10:39:44 UTC
(In reply to comment #52)
> I'm going to hazard a guess that on the read side, sharing read requests across
> the nfsds is defeating readahead.

Yes, I have confirmed this: with one thread, rhel5 server performance is
equal to rhel4 performance.

sandeep

Comment 57 Barry Marson 2008-05-22 18:33:18 UTC
Here's the matrix of nfsd thread counts I have come up with.  These were run
with my simzone tool (100 lines vs. iozone's 3000); throughput values below
are in KB/s.

nfsd    +--------- RHEL4 -67 ---------+-------- RHEL5  -88 ---------
threads | iwrite rewrite read  reread | iwrite rewrite read   reread
--------+-----------------------------+-----------------------------
1         47427  40593   75514 75764    40963  40465   75229  75697
2         41241  38589   75411 75526    41434  41385   13113  13218
4         42007  38706   70648 69157    46322  38657   16706  16715
8         36787  39707   56489 56650    43524  42155   31778  32141
16                                      44875  39903   45675  45682
32                                      42315  39434   45942  46185

As you can see, with a single nfsd thread RHEL5 read performance matches
RHEL4's.  The biggest disparity is in read performance, especially at 2
threads.  Cranking up the thread count in RHEL5 does improve read
performance, but it never reaches RHEL4's.

I'm testing the 2 nfsd threads with a 1-cpu booted server.  What I see is
read performance (~15K) similarly low to the 8-cpu booted server's (~13K).

Barry 

Comment 58 John Feeney 2008-05-22 19:22:27 UTC
I applied the patch found in comment #46 and built rpms in brew; they can be
found on my people page.  I explained how to access them in a separate email,
and they said they would test it.

Comment 62 Eric Sandeen 2008-05-23 17:01:46 UTC
Changing the summary to narrow this bug down to the rewrite portion so we can
keep that moving along; will file a new bug shortly to cover the read perf
regression.

Comment 63 Eric Sandeen 2008-05-23 17:07:47 UTC
Read perf is bug #448130

Thanks,
-Eric

Comment 65 Peter Grandi 2008-05-26 15:50:44 UTC
I have largely the same problem.  My RHEL5 server has a 6-drive MD RAID10
device with a filesystem on top.  Locally I get 150MB/s writes using
'dd bs=32k'; over NFS, about 5-10MB/s.

On exactly the same server I get 85MB/s writes if I write to a non-MD
filesystem.  So it looks like an interaction between NFS and MD, likely as
described in #46 above.

Comment 66 Eric Sandeen 2008-05-27 02:37:43 UTC
md is likely getting unaligned IO as well.

John, can you point Peter at your test RPMs too?

-Eric

Comment 67 Ben England 2008-05-27 13:37:59 UTC
I have been using a workaround described below, and have observed no regression
in RHEL5.1 single-threaded NFS reads when using this workaround.  This seems
consistent with the preceding results in this bug report -- i.e. 1 nfsd thread
is much faster.

The workaround is to add this line to the /etc/rc.local boot script and then
run that script:

# for n in /sys/block/sd*/queue/iosched/slice_idle ; do echo 1 > $n ; done

This parameter did not exist in the RHEL4 CFQ I/O scheduler.  A similar
effect can be achieved by using the deadline or noop scheduler, but for
writes we have seen better results with CFQ.

The purpose of this workaround is to minimize the overhead imposed by CFQ
when multiple threads are reading from the same file.  NFS uses a thread pool
to service RPCs, so a sequential single-threaded read at the application
layer becomes a multi-threaded read at the NFS server.  CFQ treats threads as
if they were application processes, but here they are not, so the default
8 ms delay before switching to a different thread's requests, set by the
slice_idle block device tuning parameter, is unreasonable.  Others have seen
this problem, including the author of CFQ.

http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/msg05066.html
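
As a back-of-the-envelope illustration (not a measurement): if CFQ idles the
full 8 ms on every switch between nfsd thread queues, and each slice services
a single 64KiB RPC, the stream is limited to about 64KiB / 8ms = 8MB/s --
the right order of magnitude for the slowdowns reported here.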

More research needs to be done on the effect of setting this parameter to
zero; until we do a systematic test of all known workloads with this value, I
would not recommend it as a general solution.

Reproducer: a 43% improvement, from 24.7 to 35.4 MB/s, was observed using
this simple test done with 2 hosts running RHEL5.1 connected by a 1-Gb
Ethernet link.  The NFS server exported a partition on the system disk,
/dev/sda3, mounted as an ext3 file system.  No NFS or ext3 tuning was used.
The workload was:

# dd of=/dev/null bs=64k count=16k if=/mnt/nfsext3/f



Comment 68 Eric Sandeen 2008-05-27 15:50:04 UTC
Thanks for the pointer to that thread; that's interesting beyond just nfs
performance...

Comment 70 John Feeney 2008-05-27 19:33:48 UTC
Per comment #66, please find rpms with the write-side patch in mm/filemap.c
for commit 29dbb3fc8020f025bc38b262ec494e19fd3eac02 at
http://people.redhat.com/jfeeney/.bz436004

Comment 73 Don Zickus 2008-07-09 21:11:38 UTC
in kernel-2.6.18-95.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 74 Sandeep K. Shandilya 2008-07-11 17:41:49 UTC
(In reply to comment #73)
> in kernel-2.6.18-95.el5
> You can download this test kernel from http://people.redhat.com/dzickus/el5
This kernel has been tested; rewrite performance is good.

Comment 75 Ben England 2008-07-14 16:06:01 UTC
The 2.6.18-96.el5 test kernel does not address comment 67.  I retested with
this kernel and found the same problem once again.  Should it have fixed
this?  Is there any kernel that addresses this problem?

Comment 76 Eric Sandeen 2008-07-14 17:09:14 UTC
Ben, we split this bug in two: a new one for read (bug 448130) and one for
rewrite (this bug).

The test kernel mentioned in this bug specifically addresses the rewrite
issue, which was causing read-modify-write and thereby massive write
slowdowns.  There is not yet a test kernel which addresses the cfq issue you
mentioned in comment #67.

Thanks,
-Eric

Comment 78 jacob liberman 2008-09-26 05:08:24 UTC
John (#70) -- Do you have compiled GFS packages for this test kernel?  (kmod-gfs2 + any other GFS pkgs with kernel dependencies.)

We have some folks trying to export a GFS share over NFS.

Thanks again,  

Jacob Liberman
Dell HPC Engineering

Comment 79 John Feeney 2008-09-29 17:50:26 UTC
No, Jacob. I just build kernels. Sorry.

Comment 80 Subhendu Ghosh 2008-09-30 00:58:26 UTC
Jacob - the GFS kmods have used the driver update model since RHEL 5.1, so the existing kmods should be usable on the test kernels.

-regards
Subhendu

Comment 81 Chris Ward 2008-10-21 13:01:43 UTC
Attention Partners! 

RHEL 5.3 public Beta will be released soon. This URGENT priority/severity bug should have a fix in place in the recently released Partner Alpha drop, available at ftp://partners.redhat.com. If you haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure the highest possible quality bits in the upcoming Beta drop.

Thanks, more information about Beta testing to come.

 - Red Hat QE Partner Management

Comment 84 errata-xmlrpc 2009-01-20 19:36:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

