Description of problem:
A 50-75% drop in NFS server performance when running the iozone benchmark. The client is RHEL 5.1 NFS v3; the server is RHEL 5.1 NFS v3 running the latest updates. The same performance drop is observed when the client is RHEL 4 Update 6. Attached is the output of iozone and nfsstat on a RHEL 4.6 server and a RHEL 5.1 server. The configuration of the RHEL 5.1 system is the same as the RHEL 4.6 system in all respects.

The iozone command line is:
iozone -S 1024 -s 8g -i 0 -i 1 -r 64k -f /mnt/nfs/test_file

The server has 4 GB of RAM and the client has 4 GB of RAM. The client is an AMD box (a PowerEdge 1435) and the server is a PowerEdge 1950.

Version-Release number of selected component (if applicable):
nfs-utils-1.0.9-24.el5

How reproducible:
Always

Steps to Reproduce:
1. Configure the NFS server and client (rw share) with default NFS settings.
2. Mount the share on the client.
3. Download iozone from http://www.iozone.org/src/current/iozone-3-291.src.rpm and run the command line above.

Actual results:
There is a performance drop on the order of 50-75%.

Expected results:
There should be only a minor difference in performance between RHEL 5 and RHEL 4.

Additional info:
From the nfsstat output we see that the number of COMMIT operations is twice that of the RHEL 4.6 server.
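For reference, the reproduction on the client boils down to the following (a sketch; server:/data and /mnt/nfs are placeholders for the actual export and mount point):

# mount the export with default NFSv3 options, then run the benchmark
mount -t nfs server:/data /mnt/nfs
iozone -S 1024 -s 8g -i 0 -i 1 -r 64k -f /mnt/nfs/test_file

# compare operation counts afterwards
nfsstat -c        # on the client
nfsstat -s        # on the server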
Created attachment 296800 [details] info for the rhel 51 client mounting the rhel46 nfs server
Created attachment 296812 [details] rhel 51 client mounting a rhel 51 server. Compare this attachment with the previous one (id=296800) and observe that the number of COMMITs is higher in this case with the rhel 51 server.
This appears to be a duplicate of bz321111. Please try those changes and then let me know if the situation is not addressed.
Yet to try out comment #3; will try it and get back. But further investigation has revealed that RHEL 5.x is about 50% slower in raw block I/O performance than RHEL 4.x, using the command below:

time dd if=/dev/zero of=/dev/sda1 bs=1k count=$((1024 * 1024 * 8))

Running the same benchmark directly on a filesystem also yields the same results: RHEL 5.x is slow compared to RHEL 4.
I tried the test with the new kernel. There is no change in performance; RHEL 5 is still 50% slower than RHEL 4. I have already attached the output of nfsstat in the attachments in comments #1 and #2. It looks like this is a different problem altogether. Comment #4 seems to be worth investigating.
I don't think that I understand this problem yet. There was a comment discussing the difference in the number of COMMIT operations. The client generates COMMIT operations when it decides to, and this is server independent. Comment #4 talks about raw block I/O, which has nothing to do with NFS. So, is this perceived to be an NFS problem or something else?
This is perceived to be an nfs problem. The fix in bz321111 does not correct the problem. A 2.6.24.2 kernel on the NFS server does NOT reproduce this issue.
There have been a great number of changes to NFS made since RHEL-5 was cut. Many will be impossible to backport due to kABI concerns or will be simply too risky to do. Despite this being considered to be an NFS issue, I am still concerned about the talk about the changes in raw disk performance. If that performance is down, then there is very little that we, in NFS, can do to compensate.
I want to rule out the talk about the changes in raw disk performance. It was an error on my part: the RHEL 5 system was on a hardware RAID with 2 stripes, whereas the RHEL 4.6 system was on a hardware RAID with 4 stripes. I made changes to the RAID configuration (both RHEL 5 and RHEL 4.6 have 4 stripes now) and observed that it is indeed an NFS server issue.
Just to be clear: Comment #4 and Comment #5 are considered to be red herrings, and the raw disk I/O performance, for both RHEL-4 and RHEL-5, on similar hardware and configuration, is relatively equivalent?
Created attachment 301853 [details] testing with rhel 4.6 server and rhel 51 and rhel 46 clients

Here is the data with the rhel 4.6 server.
My test setup: rhel 46 server with rhel 46 and rhel 51 clients. The attached tarball contains the following files:

rhel45_server.txt (dmi and nfsstat on the nfs server)
rhel46_client_rhel46server.txt (dmi, iozone test result, nfsstat on the rhel 46 client connecting to the rhel 46 server)
rhel51_client_rhel465serv.txt (dmi, iozone test result, nfsstat on the rhel 51 client connecting to the rhel 46 server)
sosreport-sshandilya.436004-19214-a71c83_rhel46_server.tar.bz2
sosreport-sshandilya.436004-578754-741be7_rhel46_client.tar.bz2
sosreport-sshsandilya.436004-60005-bf549d_rhel51_client.tar.bz2

As you can see, there is a drop in NFS read performance in the case of the rhel 5.1 client connecting to the rhel 4.6 server. More test results will follow, where I test rhel 5.1 as the NFS server with rhel 4.6 and rhel 5.1 as clients.
Created attachment 301892 [details] rhel 5.1 server with rhel46 and rhel51 client performance

Here is the test setup for rhel 5.1 as the NFS server: rhel 5.1 NFS server with rhel 5.1 and rhel 4.6 clients.

rhel46_client_rhel51serv.txt (dmi, nfsstat and iozone test output)
rhel51_client_rhel51serv.txt (dmi, nfsstat and iozone test output)
rhel51_server.txt (dmi, nfsstat output)
sosreport-sshandilya.436004-114080-efa7c4_rhel51_client.tar.bz2 (sos report of the rhel 51 client against the rhel 51 server)
sosreport-sshandilya.436004-24116-45cbe1_rhel46_client.tar.bz2 (sos report of the rhel 46 client against the rhel 51 server)
sosreport-sshandilya.436004-89616-e06d7a_rhel51_server.tar.bz2 (sos report of the rhel 51 server)

As you can see, when rhel 5.1 is the NFS server it does not matter what you have as the client (rhel 4 / rhel 5); performance is always bad (compare with comment #11).
Sandeep, we looked at the data you previously reported (thanks) and now have more questions, to help us get to the bottom of this:

1. What type of servers were being used? Were the servers (4.6 and 5.1) exactly the same hardware? Configuration? Network?
2. Any chance of providing the data on the raw disk throughput mentioned in comment #9, to prove there is no difference there?
3. Could you provide iozone data when run directly on each server?
4. It is presumed that you are using ext3, so how does ext3 performance compare when run on RHEL 4 and RHEL 5 with identical hardware?
5. If ext3 is not being used, what is?
6. What are the export options being used on each server?
7. What mount options on each client? And which are actually being used? See /proc/mounts on each system.

Well, that's it for now. As you can surmise, we would like to get some info on each variable in this puzzle, since there is no use looking in the wrong place for something.
My replies:

1. The servers were both PowerEdge 1950s: 4 GB RAM, 2 Intel quad-core CPUs at 2.33 GHz with 6144 KB cache. You can check this in the dmidecode output that I have attached. The RAID controller is a PERC 5/i (megaraid_sas). I double-checked the hardware RAID level and the disks on the RAID controller.

2. The iozone output from running directly on the servers is here:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

rhel 4.6 nfs server (megaraid_sas, 00.00.03.13)
-----------------------------------------------
	Command line used: iozone -S 6144 -s 8g -i 0 -i 1 -r 64k -f /data/test_file
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 6144 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
                                                        random  random    bkwd   record   stride
          KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
     8388608      64   34646   32890  109452   113864

rhel 5.1 nfs server (megaraid_sas version 00.00.03.10)
------------------------------------------------------
	Command line used: iozone -S 6144 -s 8g -i 0 -i 1 -r 64k -f /data/test_file
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 6144 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
                                                        random  random    bkwd   record   stride
          KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
     8388608      64  129699   98165  119781   119918

3. I have covered this in 2.
+++++++++++++++++++++++

4. Both NFS servers are on lvm2:
+++++++++++++++++++++++++++++++++++++++++

rhel 4 nfs server
-----------------
Disk /dev/sda: 145.4 GB, 145492017152 bytes
255 heads, 63 sectors/track, 17688 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          65      522081   83  Linux
/dev/sda2              66       17688   141556747+  8e  Linux LVM

ACTIVE '/dev/VolGroup00/root' [58.59 GB] inherit
ACTIVE '/dev/VolGroup00/swap' [8.00 GB] inherit

rhel 5 nfs server
-----------------
Disk /dev/sda: 145.4 GB, 145492017152 bytes
255 heads, 63 sectors/track, 17688 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          65      522081   83  Linux
/dev/sda2              66       17688   141556747+  8e  Linux LVM

ACTIVE '/dev/VolGroup00/root' [58.59 GB] inherit
ACTIVE '/dev/VolGroup00/swap' [8.00 GB] inherit

5.
Both servers have ext3 partitions; here is the output of /proc/mounts:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

rhel 4 nfs server
-----------------
rootfs / rootfs rw 0 0
/proc /proc proc rw,nodiratime 0 0
none /dev tmpfs rw 0 0
/dev/root / ext3 rw 0 0
none /dev tmpfs rw 0 0
none /selinux selinuxfs rw 0 0
/proc /proc proc rw,nodiratime 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
/sys /sys sysfs rw 0 0
none /dev/pts devpts rw 0 0
/dev/sda1 /boot ext3 rw 0 0
none /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0

rhel 5.1 nfs server
-------------------
rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
devpts /dev/pts devpts rw 0 0
/dev/sda1 /boot ext3 rw,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
/etc/auto.misc /misc autofs rw,fd=6,pgrp=3215,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs rw,fd=11,pgrp=3215,timeout=300,minproto=5,maxproto=5,indirect 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0

6. Export options
++++++++++++++++++

rhel 4.6
--------
/data 172.16.64.0/24(rw,wdelay,no_root_squash)

rhel 5.1
--------
/data 172.16.64.0/24(rw,wdelay,no_root_squash,no_subtree_check,anonuid=65534,anongid=65534)

7. Mount options
++++++++++++++++

On the rhel 5.1 client connecting to the rhel 5.1 server:
172.16.64.164:/data /mnt/nfs nfs rw,vers=3,rsize=32768,wsize=32768,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.64.164 0 0

On the rhel 4.6 client connecting to the rhel 4.6 server:
172.16.64.203:/data /mnt/nfs nfs rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,proto=tcp,timeo=600,retrans=5,addr=172.16.64.203 0 0
Thank you for the information. I have some more questions concerning it, though.

The response to question #2 was not answered: the data posted was for an iozone run made on top of a file system, whereas the question concerned the raw bandwidth available directly to the storage.

The response to question #3 (or #2, depending upon how you view it) seems incomplete or, at best, I don't know how to read it. There seem to be many columns and only a few numbers. That said, does it appear that the write and rewrite numbers for RHEL-4 are _much_ slower than those for RHEL-5?

I am still unclear on the file system type being used. The iozone tests were run on /data, but I don't see /data listed in the /proc/mounts output for either server. Is /data just a directory on the root file system?

The mount options listed for the clients are good, but I also need the mount options used when the 4.6 client talks to the 5.1 server.
So, let's try one more time.

The goal of these questions is to rule out hardware differences between the two server systems. We need to ensure that the storage subsystems perform with roughly the same performance characteristics; hence, we need to know the raw hardware bandwidth.

Next, we'd like to rule out any differences due to the local file system on each server. Thus, we need the performance characteristics as run locally, on the exported file system, on each server. We need _all_ of the numbers which got generated.

If any of the above information shows significant differences, then we need to investigate them prior to investigating the NFS server.

Once the hardware and file systems on each server show roughly the same performance characteristics, then we need to check how the NFS client is mounting each server. We need to use 1 client for testing against each server. The use of 2 clients just introduces more potential differences, which makes resolving this issue impossible. 1 client, please. I'd like to see the mount options used for that client when mounting each server. Please don't vary the options, like using nointr and changing the retrans values. Please use the same options, the default set.

We then need to ensure that each server is exporting the file system in the same fashion. You can add "no_subtree_check", but if you do, then do it on both servers, please. It would be easiest to just use no options and let them default.

After all that, the output from running iozone on that 1 client, against each server, would be required. The client needs to be otherwise quiesced, i.e. nothing else running. After each run of iozone, we need the "nfsstat -s" statistics from the server being tested. It would be best if the server was rebooted immediately before being tested and then not used for any other NFS traffic.

I am assuming that the network these three systems are connected to is gigabit ethernet and that it is otherwise quiet, so there should be no network effects. If this is not true, please let me know.

The goal here is to reduce the variables down to a situation where we can really determine what issue exists and where it might be.
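Spelled out, a single test pass against one server might look like this (a sketch; 172.16.64.x and /data stand in for whichever server and export are under test):

# on the client, after the server has been freshly rebooted
mount 172.16.64.x:/data /mnt/nfs
grep nfs /proc/mounts        # record the mount options actually in effect
iozone -S 1024 -s 8g -i 0 -i 1 -r 64k -f /mnt/nfs/test_file
umount /mnt/nfs

# then on the server
nfsstat -s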
(In reply to comment #17)
> The response to question #2 was not answered: the data posted was for
> an iozone run made on top of a file system, whereas the question concerned
> the raw bandwidth available directly to the storage.

I will get this data in an update.

> That said, does it appear that the write and rewrite numbers for RHEL-4
> are _much_ slower than those for RHEL-5?

Read the numbers as follows: file size, record size, write, rewrite, read, reread. Yes, you are right that the write and rewrite numbers for RHEL-4 are much slower than those for RHEL-5. Read performance is the same.

> Is /data just a directory on the root file system?

Yes, /data is a directory on the root file system; if you want it to be a separate file system, that could be done.

> I also need the mount options used when the 4.6 client talks to the 5.1 server.

Yes, I will update you with this data in a couple of hours.
(In reply to comment #17)
> The response to question #2 was not answered: the data posted was for
> an iozone run made on top of a file system, whereas the question concerned
> the raw bandwidth available directly to the storage.

Raw performance output:

on the rhel 4.6 nfs server
--------------------------
[root@localhost ~]# time dd if=/dev/zero of=/dev/VolGroup00/test bs=64k count=131072
131072+0 records in
131072+0 records out

real    4m43.574s
user    0m0.034s
sys     0m10.683s

[root@localhost ~]# time dd of=/dev/null if=/dev/VolGroup00/test bs=64k count=131072
131072+0 records in
131072+0 records out

real    1m19.554s
user    0m0.038s
sys     0m9.879s

on the rhel 5.1 nfs server
--------------------------
[root@localhost ~]# time dd if=/dev/zero of=/dev/VolGroup00/test bs=64k count=131072
131072+0 records in
131072+0 records out
8589934592 bytes (8.6 GB) copied, 80.4467 seconds, 107 MB/s

real    1m20.497s
user    0m0.059s
sys     0m10.809s

[root@localhost ~]# time dd of=/dev/null if=/dev/VolGroup00/test bs=64k count=131072
131072+0 records in
131072+0 records out
8589934592 bytes (8.6 GB) copied, 79.5618 seconds, 108 MB/s

real    1m19.620s
user    0m0.055s
sys     0m9.212s

> Is /data just a directory on the root file system?

Yes, /data was on the root file system.

> I also need the mount options used when the 4.6 client talks to the 5.1 server.

172.16.64.164:/test /mnt/nfs nfs rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,proto=tcp,timeo=600,retrans=5,addr=172.16.64.164 0 0
Here is the NFS iozone output, nfsstat on server and client, and mount options for the rhel 5.1 client.

rhel 5.1 client data
++++++++++++++++++++

rhel 5.1 nfs server
-------------------
iozone performance:

filesize  recordsize  write  rewrite  read   reread
8388608   64          43010  15196    28017  27862

nfsstat on the client:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
0          0          0          0          0

Client rpc stats:
calls      retrans    authrefrsh
1055608    0          0

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 5       0% 1       0% 9       0% 10      0% 0       0%
read       write      create     mkdir      symlink    mknod
524290 49% 524356 49% 2       0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
2       0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 6930    0%

Client nfs v4:
null       read       write      commit     open       open_conf
0       0% 0       0% 0       0% 0       0% 0       0% 0       0%
open_noat  open_dgrd  close      setattr    fsinfo     renew
0       0% 0       0% 0       0% 0       0% 0       0% 0       0%
setclntid  confirm    lock       lockt      locku      access
0       0% 0       0% 0       0% 0       0% 0       0% 0       0%
getattr    lookup     lookup_root remove    rename     link
0       0% 0       0% 0       0% 0       0% 0       0% 0       0%
symlink    create     pathconf   statfs     readlink   readdir
0       0% 0       0% 0       0% 0       0% 0       0% 0       0%
server_caps delegreturn
0       0% 0       0%

nfsstat on the rhel 5.1 server:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1055611    0          0          0          0

Server nfs v3:
null       getattr    setattr    lookup     access     readlink
2       0% 5       0% 1       0% 9       0% 10      0% 0       0%
read       write      create     mkdir      symlink    mknod
524289 49% 524356 49% 2       0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
2       0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 3       0% 0       0% 6930    0%

mount options:
172.16.64.164:/test /mnt/nfs nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.64.164 0 0

rhel 4.6 nfs server
-------------------
iozone performance:

filesize  recordsize  write  rewrite  read   reread
8388608   64          35491  30219    63876  65253

nfsstat on the rhel 5.1 client:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
0          0          0          0          0

Client rpc stats:
calls      retrans    authrefrsh
1051656    0          0

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 5       0% 1       0% 9       0% 10      0% 0       0%
read       write      create     mkdir      symlink    mknod
524290 49% 524344 49% 2       0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
2       0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 2990    0%

Client nfs v4:
(all zeros, same layout as above)

nfsstat on the rhel 4.6 server:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1051660    0          0          0          0

Server nfs v3:
null       getattr    setattr    lookup     access     readlink
3       0% 5       0% 1       0% 9       0% 10      0% 0       0%
read       write      create     mkdir      symlink    mknod
524289 49% 524344 49% 2       0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
2       0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 3       0% 0       0% 2990    0%

mount options (rhel 51 client on the rhel 4.6 server):
172.16.64.203:/test /mnt/nfs nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=172.16.64.203 0 0
Here is the NFS iozone output, nfsstat on server and client, and mount options for the rhel 4.6 client.

rhel 4.6 client data
+++++++++++++++++++

rhel 5.1 server
---------------
iozone performance:

filesize  recordsize  write  rewrite  read   reread
8388608   64          57326  24917    34115  34225

nfsstat on the rhel 4.6 client:

Client rpc stats:
calls      retrans    authrefrsh
1048723    0          0

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 14      0% 1       0% 3       0% 11      0% 0       0%
read       write      create     mkdir      symlink    mknod
524292 49% 524316 49% 2       0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
2       0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 81      0%

nfsstat on the rhel 5.1 nfs server:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1048728    0          0          0          0

Server nfs v3:
null       getattr    setattr    lookup     access     readlink
2       0% 14      0% 1       0% 3       0% 11      0% 0       0%
read       write      create     mkdir      symlink    mknod
524292 49% 524316 49% 2       0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
2       0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 81      0%

mount options:
172.16.64.164:/test /mnt/nfs nfs rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,proto=tcp,timeo=600,retrans=5,addr=172.16.64.164 0 0

rhel 4.6 server
---------------
iozone performance:

filesize  recordsize  write  rewrite  read   reread
8388608   64          24786  25949    92873  94640

nfsstat on the rhel 4.6 client:

Client rpc stats:
calls      retrans    authrefrsh
1048735    0          0

Client nfs v3:
null       getattr    setattr    lookup     access     readlink
0       0% 14      0% 1       0% 4       0% 10      0% 0       0%
read       write      create     mkdir      symlink    mknod
524292 49% 524319 49% 2       0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
2       0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 1       0% 0       0% 90      0%

nfsstat on the rhel 4.6 server:

Server rpc stats:
calls      badcalls   badauth    badclnt    xdrcall
1048738    0          0          0          0

Server nfs v3:
null       getattr    setattr    lookup     access     readlink
1       0% 14      0% 1       0% 4       0% 10      0% 0       0%
read       write      create     mkdir      symlink    mknod
524291 49% 524319 49% 2       0% 0       0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
2       0% 0       0% 0       0% 0       0% 0       0% 0       0%
fsstat     fsinfo     pathconf   commit
0       0% 2       0% 0       0% 90      0%

mount options:
172.16.64.203:/test /mnt/nfs nfs rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,proto=tcp,timeo=600,retrans=5,addr=172.16.64.203 0 0
(In reply to comment #18)
> Once the hardware and file systems on each server show roughly the
> same performance characteristics, then we need to check how the NFS
> client is mounting each server. We need to use 1 client for testing
> against each server. ... 1 client, please.

Check out comments #21 and #22.

> I'd like to see the mount options used for that client when mounting
> each server. Please don't vary the options, like using nointr and
> changing the retrans values. Please use the same options, the default set.

Comments #21 and #22 contain mount options using the default set.

> I am assuming that the network these three systems are connected to is
> gigabit ethernet and that it is otherwise quiet, so there should be no
> network effects. If this is not true, please let me know.

The network is gigabit ethernet, and it is quiet.
Thank you for all of the information. It seems contradictory in many respects, so we need to boil things down a bit further.

First, any idea why writing to the partition on RHEL-4 takes between 3 and 4 times as long as it does on RHEL-5?

What export options are being used on each server?

Are the server systems identical, hardware-wise?
Pardon me if I jump in and ask some questions too, but I have to know... Is the data consistent when the same test is executed in the same environment? If not, how much variance is measured? Can an average be calculated across, say, three runs?

I guess I am with Peter in that it is hard to find a pattern, so I was wondering how much fluctuation was occurring from one identical test to the next. Pick a client and a server, run the test three times, calculate the variance, and average the read results if the variance isn't too great. Excuse me if that data has already been presented, but I did not see it.
(In reply to comment #24)
> First, any idea why writing to the partition on RHEL-4 takes between
> 3 and 4 times as long as it does on RHEL-5?

This could be another issue that we could track in a separate bugzilla. In spite of this, NFS write performance differs by a smaller margin than direct I/O does.

> What export options are being used on each server?

/nfsshare 172.16.64.0/24(rw, no_root_squash)

> Are the server systems identical, hardware-wise?

Yes, they are identical in all respects; please check the attachments in comments #11 and #13.

I have been able to reproduce the issue by just using dd over NFS. Here is a simpler method to reproduce it (a consolidated script is sketched below):

1. Take two systems with any configuration.
2. Install rhel 4.6 on the server and create an NFS share.
3. Create a big file of size = (2 * system RAM) on the server.
4. Mount the share on the client (client is rhel 5.1).
5. Read the file on the client using the command: time dd if=/mnt/nfs/bigfile of=/dev/null
6. Repeat the same experiment after upgrading the server to rhel 5.1.
7. Observe the results. In my case rhel 5.1 is way slower.

I have tried this experiment about 10 times and there is hardly any variation in the data. I have tried various memory configurations, and I still see the same result: read performance is always lower. Could I do something with oprofile and get you more data? Would it help?
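Consolidated, the reproducer looks roughly like this (a sketch; 172.16.64.x and /nfsshare are placeholders for the systems above):

#!/bin/sh
# run on the client; 172.16.64.x:/nfsshare is a placeholder export
mount -t nfs 172.16.64.x:/nfsshare /mnt/nfs
# bigfile must already exist on the server at ~2x the server's RAM, e.g.
# created there beforehand with: dd if=/dev/zero of=/nfsshare/bigfile bs=1M count=8192
time dd if=/mnt/nfs/bigfile of=/dev/null
umount /mnt/nfs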
Sandeep,

As a result of the last weekly call, I discussed this situation with several parties involved and reviewed the data you have provided in the past. In addition to our increased testing to be done on this bugzilla, one thing that was determined was the possible involvement of virtual memory in this puzzle. Thus, I reviewed your sosreports from comments #2 and #3 to compare vm entries in /proc and found them to be similar, but there were a couple of differences. We have contacted the vm people to take a look at this.

As your data has shown, RHEL 5.1 write throughput done locally appears to have improved over RHEL 4.6 (I refer to comments #16 and #20), but we wonder whether the writes are actually being written to disk or are being cached, so that the app finishes before they are completely written, hence the perceived "improvement" in write speed. We believe there is an option to iozone that makes the app wait for the sync to take place; perhaps that might prove our theory true. This "ending before flushing is complete" may or may not be happening, but if it is, it would be an example of a subtle difference between RHEL 4 and RHEL 5 that makes this degradation tough to pin down.

On the other hand, the rewrite data shows a degradation too. We are speculating whether this degradation could be due to the rewrite having to wait for the flushing of data before being able to access it.

In addition, you have convinced me that the server hardware and configurations are the same, and as I stated during the call, I like the fact that your new procedure calls for an upgrade of the server rather than introducing a replacement, thus removing another variable (the server hardware).

Also, we think you are on the right track by using dd instead of iozone, as proposed in comment #26, since iozone has to create a file before writing it. Our QE people liked the fact that the file will now already be created, which should reduce the caching variable that file creation introduces.

So with all that stated, we would like you to use an existing (8 GB) file and then dump the cache before using it ("sync" and then "echo 3 > /proc/sys/vm/drop_caches"). Once this is done, we would like you to run dd tests locally, no NFS, and monitor the dirty cache with the "echo m > /proc/sysrq-trigger" command. I grep for "dirty" in /var/log/messages to see how the value associated with "dirty" changes as the test proceeds; that is, echo m and then grep /var/log/messages every couple of seconds. We think it is possible that your 8 GB file is causing an inordinate amount of dirty pages to be flushed out, causing a performance bottleneck. This could possibly cause further trouble in the I/O subsystem, but let's figure out if there is a caching issue first. It was suggested that the value of dirty be monitored as the test proceeds so one can get a feel for how it changes over the course of time.

Once the dirty values have been attained for local execution, the same could be done in an NFS setting. It was also suggested that the tests be done with direct I/O to take the cache variable out of the picture.

It was suggested that the file size be reduced to, say, 1 GB, which would reduce the length of your test but may also reduce the pressure on the cache and dirty page handling, if these are the culprits. If a smaller file works remarkably better than a large one, it will give us all a better feel for the ramifications of this bugzilla.

In addition, perhaps you could provide more detail on the striping being used, e.g. hardware and software RAID details. We know it's a PERC 5/i, megaraid_sas, and 4 stripes. I can see from the /var/log/messages file in the sosreport that Fujitsu disks are being used, but any other details would be helpful.
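As a concrete starting point, the server-side monitoring described above could be scripted like this (a sketch; /data/test_file is a placeholder for the existing 8 GB file, and sysrq may first need enabling):

# make sure sysrq is enabled so "echo m" works
echo 1 > /proc/sys/kernel/sysrq

# start cold: flush and drop the page cache
sync
echo 3 > /proc/sys/vm/drop_caches

# run the local dd test in the background...
dd if=/dev/zero of=/data/test_file bs=64k count=131072 &

# ...and sample the dirty-page figure every couple of seconds while it runs
while kill -0 $! 2>/dev/null; do
    echo m > /proc/sysrq-trigger
    grep -i dirty /var/log/messages | tail -1
    sleep 2
done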
This has probably been asked/investigated, but is this an NFS-only problem? In other words, do we see a significant performance drop running the same iozone test on a 5.2 system without any NFS involvement at all? Also, do we see a similar problem if the server has 8GB or even 12GB of RAM? Since we are over-committing the server's RAM, perhaps it's a caching issue?

Larry Woodman
(In reply to comment #31)
> This has probably been asked/investigated, but is this an NFS-only problem?

This is an NFS-only problem. I don't see the issue if I scp the same file (size 8 GB) over the same network or run the stress directly on the server.

> Also, do we see a similar problem if the server has 8GB or even 12GB of RAM?
> Since we are over-committing the server's RAM, perhaps it's a caching issue?

Instead of increasing the system RAM, I did something slightly different: I reduced the file size from 8 GB to 2 GB on both rhel46 and rhel51, and the problem does NOT reproduce. What you seem to be thinking is right; might it be a caching issue with respect to NFS?
(In reply to comment #29)
> It was suggested that the file size be reduced to, say, 1 GB, which would
> reduce the length of your test but may also reduce the pressure on the cache
> and dirty page handling, if these are the culprits. If a smaller file works
> remarkably better than a large one, it will give us all a better feel for the
> ramifications of this bugzilla.

The performance is the SAME on both rhel 46 and rhel 51 if the file size is set to 1 GB! I have replied to Larry Woodman in comment #32. If it is still required that I analyze the dirty page behaviour, I could do this.
Created attachment 303813 [details] NFS/IOzone best practices
Hi Sandeep,

We (HPC engineering) saw similarly slow read performance with the RHEL4 NFS client on iozone. I attached a doc that describes the correct way to measure NFS performance with iozone. Some highlights from our experience:

1) Make sure the underlying hardware is alike: same number of disks, speed, RAID configuration, RAID cache policy (write-through versus write-back, battery present). Until you do this there is no point in measuring standard deviation, etc.

2) Use the -c and -U switches for iozone. -c forces a commit before the write completes, to eliminate server-side caching, and -U unmounts the filesystem between tests to force a sync.

3) Other iozone options can force synchronous file operations. This will flatten out performance differences (which I think is what you want) but is not a good idea for benchmarking since it carries an undue penalty.

4) You are using a file size larger than combined RAM to circumvent file system caching. This is a good idea. In our experience, RAM must be physically removed from the boxes; capping RAM with "mem=" does not work for iozone.

5) We saw wide performance variation across record sizes. 512k gave the best and most consistent performance in our config.

So after you verify the hardware, read the NFS/iozone best practices and then rerun the tests with the additional switches: first -Uc and then the synchronous options, as sketched below. Please let me know if any of this is helpful. I am happy to provide more details if necessary.
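To make that concrete, a rerun of the command line from the original report with those switches could look like this (a sketch; /mnt/nfs is a placeholder mount point, -U takes the mount point as its argument, and -U generally needs a matching /etc/fstab entry so iozone can remount):

# -c includes close()/commit in the timing; -U unmounts and remounts between tests
iozone -S 1024 -s 8g -i 0 -i 1 -r 64k -c -U /mnt/nfs -f /mnt/nfs/test_file

# a synchronous follow-up run could add -o (O_SYNC writes)
iozone -S 1024 -s 8g -i 0 -i 1 -r 64k -c -o -f /mnt/nfs/test_file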
The local test bed server here at Red Hat/Westford has been set up, and I ran preliminary tests to establish local baseline numbers. RHEL4.6 and RHEL5.1 builds on the system were tested, and they mount a common LVM volume. The system has 4GB of RAM. The IOzone command was:

iozone -S 1024 -s 8g -i 0 -i 1 -r 64k

RHEL      RUN    Filesize  Recsize   write   rewrite   read    reread
                 ---KB---  ---KB---  --KB--  ---KB---  --KB--  --KB--
RHEL4.6   run1   8388608   64        57052   51086     57699   57881
RHEL4.6   run2   8388608   64        57672   43608     57347   57821
RHEL4.6   run3   8388608   64        58045   42814     57855   58029
RHEL5.1   run1   8388608   64        59734   50124     57506   57804
RHEL5.1   run2   8388608   64        61441   48304     57003   57840
RHEL5.1   run3   8388608   64        60100   47918     57843   54687

As you can see, the only blip was the rewrites on RHEL4.6.

Barry
(In reply to comment #36)
> system has 4GB of RAM. The IOzone command was:
>
> iozone -S 1024 -s 8g -i 0 -i 1 -r 64k

What happens if you just check read performance with: time dd if=/mnt/nfs/bigfile of=/dev/null? In our case the results for both the rhel46 and rhel51 servers are repeatable to within tenths of a second. Also, can we check direct I/O on the same two servers? I guess that the systems are AMD? I see that you have set the -S param to 1024. What client are you using? I am using rhel 51 systems with kernel version 2.6.18-53.1.14.el5 and rhel 46 with 2.6.9-67.0.7.ELsmp, the latest on RHN.
Sandeep (in reply to comment #32), can you get us "vmstat 1" output while running the test with both the 8GB and 2GB file sizes, on both RHEL5-U2 and whatever base you were using that does not show this problem?

Larry Woodman
Hello all,

More updates on my testing. Here is some data, this time with the system RAM set to 2 GB, enabling me to do tests faster. I test only reads, using time dd if=<file on share> of=/dev/null; the client is rhel 5.1.

File size in GB:         1      2      4       8      16
rhel51_direct, MB/sec  91.84  82.45  96.56  102.81  98.20
rhel46_direct, MB/sec  91.18  81.40  94.12  102.94  98.27
rhel51srv,     MB/sec  34.80  34.10  35.60   35.40  34.90
rhel46srv,     MB/sec  86.50  79.40  89.30   97.50  93.50

I observed one interesting thing: if I don't remove server-side caching, i.e. file size <= system RAM / 2, or if I run the same test with the same file over a remount of the share, I get the same results on rhel 5.1 and rhel 4.6. Here is the sequence of events on the rhel 5.1 client mounting a rhel 5.1 server:

[root@localhost ~]# mount 172.16.64.164:/test /mnt/nfs
[root@localhost ~]# dd if=/mnt/nfs/disc_1g_01.dat of=/dev/null
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 30.3699 seconds, 35.4 MB/s
[root@localhost ~]# umount /mnt/nfs
[root@localhost ~]# mount 172.16.64.164:/test /mnt/nfs
[root@localhost ~]# dd if=/mnt/nfs/disc_1g_01.dat of=/dev/null
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 9.16824 seconds, 117 MB/s
[root@localhost ~]# umount /mnt/nfs
[root@localhost ~]# mount 172.16.64.164:/test /mnt/nfs
[root@localhost ~]# dd if=/mnt/nfs/disc_4g_01.dat of=/dev/null
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 123.168 seconds, 34.9 MB/s

I hope this helps get us closer to the root cause.
Created attachment 304277 [details] vmstat 1 output for 1, 2, 4g files on rhel 4.6 and rhel 5.1
(In reply to comment #38)
> Sandeep (in reply to comment #32), can you get us "vmstat 1" output while
> running the test with both the 8GB and 2GB file sizes, on both RHEL5-U2 and
> whatever base you were using that does not show this problem?
>
> Larry Woodman

Comment #40 contains what you need. It's with the 5.1 server, the 4.6 server, and a 5.1 client.
> Comment #40 contains what you need. It's with the 5.1 server, the 4.6 server,
> and a 5.1 client.

I get the same performance with the rhel 5.2 snapshot6.
We have been able to replicate the issue here. Below are the results on our test bed. The IOzone runs used a 4GB file with a 64KB record size, and the systems were booted with 2GB of RAM (client and server).

IOZONE FLAGS     KERNEL_OF_SERVER      WRITE   REWRITE   READ    REREAD
------------------------------------------------------------------------
default          2.6.18-53.1.14.el5    40079   18866     25212   25746
default          2.6.9-67.0.7.ELsmp    32572   38126     57213   57367
close            2.6.18-53.1.14.el5    35053   18530     26101   25208
close            2.6.9-67.0.7.ELsmp    32354   31711     56344   57652
osync+close      2.6.18-53.1.14.el5    23579    5163     27054   26249
osync+close      2.6.9-67.0.7.ELsmp    20056   31009     57906   56855
dio+close        2.6.18-53.1.14.el5     5048    2933     32304   32129
dio+close        2.6.9-67.0.7.ELsmp     5080    5560     42672   42338
dio+osync+close  2.6.18-53.1.14.el5     5049    2944     30670   32122
dio+osync+close  2.6.9-67.0.7.ELsmp     5050    5556     43756   43533
------------------------------------------------------------------------

We have some more investigating to do. I have vmstats of both client and server during the runs. Unfortunately we are literally in the middle of packing up for our move and probably cannot look into this further until early next week.

Barry
*** To our customers *** Both, Red Hat and Dell, are actively working to understand and address this issue in a timely fashion. We appreciate your patience and request that you refer to this bugzilla for the latest information on our progress. Thank you.
Barry asked me to look at this one... He noted that there was a lot of read activity going on during the rewrite phase of an iozone test.

I used blktrace to keep track of what was being read, and tested iozone with a 2G file size and 64k I/O size, write and rewrite, from a client to a rhel5 server with only 500M of memory. On stock rhel5, blktrace showed that during the rewrite phase (which rewrites 2G of data), 2G of *reads* were being issued, and that each read I/O was exactly 4k. This looked to me like the ll_rw_block in __block_prepare_write, which gets called when we are writing a partial block.

Further, looking at the iovecs set up by nfsd, the first was of non-block-size:

0: iov_len 3940
1: iov_len 4096
...

which means that all later iovecs were not aligned. But this is the same behavior as on rhel4 and upstream, which were tested and found not to have the problem. So I chased down the iovec submission path to generic_file_buffered_write, and how the start & end for ->prepare_write() were set up, because it was this partial-block write that was causing the reads (in the read-modify-write). This stuck out at me in the diff:

+	/*
+	 * Limit the size of the copy to that of the current segment,
+	 * because fault_in_pages_readable() doesn't know how to walk
+	 * segments.
+	 */
+	bytes = min(bytes, cur_iov->iov_len - iov_base);

This upstream mod:

[PATCH] knfsd: stop NFSD writes from being broken into lots of little writes to files
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=29dbb3fc8020f025bc38b262ec494e19fd3eac02

fixes the iozone test for me, and the comments say:

    When NFSD receives a write request, the data is typically in a number
    of 1448 byte segments and writev is used to collect them together.
    Unfortunately, generic_file_buffered_write passes these to the
    filesystem one at a time, so an e.g. 32K over-write becomes a series
    of partial-page writes to each page, causing the filesystem to have
    to pre-read those pages - wasted effort.

    generic_file_buffered_write handles one segment of the vector at a
    time as it has to pre-fault in each segment to avoid deadlocks. When
    writing from kernel-space (and nfsd does) this is not an issue, so
    generic_file_buffered_write does not need to break an iovec from nfsd
    into little pieces.

    This patch avoids the splitting when get_fs is KERNEL_DS as it is
    from NFSd.

The regression was introduced by an upstream change:

[PATCH] generic_file_buffered_write(): deadlock on vectored write
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6527c2bdf1f833cc18e8f42bd97973d583e4aa83

which is not present in RHEL4.

Testing with this patch in place, I go from around 10MB/s on the iozone rewrite test to around 40MB/s, on par with the original write.

Another note: for smaller file sizes in iozone testing this isn't obvious, because all the pages stay in cache from the initial write and don't have to be re-read from disk.

-Eric
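For anyone wanting to reproduce the blktrace observation, the capture side looks roughly like this (a sketch, not the exact invocation used here; /dev/sdb stands in for whatever device backs the export):

# on the server, started just before the client's rewrite phase
blktrace -d /dev/sdb -o rewrite-trace
# stop with Ctrl-C once the phase completes, then inspect:
blkparse -i rewrite-trace | awk '$6 == "D"' | head   # dispatched requests; column 7 shows R/W
blkparse -i rewrite-trace | tail -30                 # per-device totals are printed at the end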
The patch definitely solves the rewrite issue. Rewrite performance improved 80-100%, depending on which RHEL5 kernel we were testing. This issue was actually noticed while looking at read/reread performance: there still seems to be an issue with read performance, and we are trying a patch which keeps the metadata cache in memory longer. Stay tuned.

Barry
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Present status is that read/reread performance is still an issue; reads are still off by 40%. So while writes are now performing, we do have a serious read issue. I have a slew of tests to run to try and isolate this.

Barry
On the read side... On my test setup, with 8 nfsd threads I see about 20MB/s for read and reread, and a fairly high seek rate, to the tune of around 200 seeks/s. If I restrict to only 1 nfsd thread, I get 55MB/s and the seek rate is substantially lower.

Additionally, if we look at the block IO stats for 1 thread:

Total (iozone_xfs_read_1thread_full):
 Reads Queued:      31,254,    4,000MiB  Writes Queued:          0,       0KiB
 Read Dispatches:   31,115,    4,000MiB  Write Dispatches:       0,       0KiB
 Reads Requeued:         0               Writes Requeued:        0
 Reads Completed:   31,115,    4,000MiB  Writes Completed:       0,       0KiB
 Read Merges:          139,   17,792KiB  Write Merges:           0,       0KiB
 IO unplugs:        25,777               Timer unplugs:          0

Throughput (R/W): 74,414KiB/s / 0KiB/s

vs. 8 threads:

Total (iozone_xfs_read_full):
 Reads Queued:     121,516,    4,000MiB  Writes Queued:          0,       0KiB
 Read Dispatches:   65,893,    4,000MiB  Write Dispatches:       0,       0KiB
 Reads Requeued:         0               Writes Requeued:        0
 Reads Completed:   65,893,    4,000MiB  Writes Completed:       0,       0KiB
 Read Merges:       55,503,    1,768MiB  Write Merges:           0,       0KiB
 IO unplugs:       125,270               Timer unplugs:          0

Throughput (R/W): 32,108KiB/s / 0KiB/s

we can see that this results in a very different IO pattern, with 1 thread doing larger IOs.
I'm going to hazard a guess that on the read side, sharing read requests across the nfsds is defeating readahead.

With 8 threads, for a block range I see requests issued like:

  8,21   1        6     0.000019677  4059  D   R 630773105 + 64 [nfsd]
  8,21   1       16     0.019685010  4059  D   R 630773297 + 256 [nfsd]
  8,21   1       28     0.049531279  4059  D   R 630773553 + 32 [nfsd]
  8,21   1       34     0.049584655  4060  D   R 630773585 + 64 [nfsd]
  8,21   1       40     0.049614579  4061  D   R 630773649 + 64 [nfsd]
  8,21   1       46     0.049652358  4058  D   R 630773713 + 64 [nfsd]

.... and more. With 1 thread:

  8,21   1        6     0.635870590  4309  D   R 630773105 + 64 [nfsd]
  8,21   1       14     0.662722860  4309  D   R 630773169 + 384 [nfsd]
  8,21   1       23     0.684639922  4309  D   R 630773553 + 512 [nfsd]

this looks like a growing readahead window.

-Eric
Hmm, that might have been slightly anomalous, but I do still see the single-thread case consistently issuing larger IOs, to the tune of 256 sectors vs. 64, usually.

-Eric
(In reply to comment #52)
> I'm going to hazard a guess that on the read side, sharing read requests
> across the nfsds is defeating readahead.

Yes, I have confirmed this: with one thread, rhel5 server performance is equal to rhel4 performance.

sandeep
Here's the matrix of nfsd thread counts I have come up with. These were run with my simzone tool (100 lines vs 3000 lines of iozone).

nfsd    +--------- RHEL4 -67 ----------+--------- RHEL5 -88 ---------
threads | iwrite rewrite  read  reread | iwrite rewrite  read  reread
--------+------------------------------+-----------------------------
   1    | 47427   40593  75514  75764  | 40963   40465  75229  75697
   2    | 41241   38589  75411  75526  | 41434   41385  13113  13218
   4    | 42007   38706  70648  69157  | 46322   38657  16706  16715
   8    | 36787   39707  56489  56650  | 43524   42155  31778  32141
  16    |                              | 44875   39903  45675  45682
  32    |                              | 42315   39434  45942  46185

As you can see, with a single nfsd thread RHEL5 read performance matches RHEL4. The biggest disparity is in read performance, especially at 2 threads. Cranking up the thread count in RHEL5 does improve read performance, but it never reaches RHEL4's level.

I'm testing the 2-nfsd-thread case with the server booted with 1 CPU. What I see is similar low read performance (~15,000 KB/s) to the server booted with 8 CPUs (~13,000 KB/s).

Barry
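For anyone repeating this matrix: the nfsd thread count can be varied on the fly, for example (a sketch; the counts shown are just examples):

# set the running server thread count immediately
rpc.nfsd 1                        # equivalently: echo 1 > /proc/fs/nfsd/threads

# to persist across restarts on RHEL, set RPCNFSDCOUNT in /etc/sysconfig/nfs,
# then restart the service
service nfs restart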
I applied the patch found in comment #46, built rpms in brew, and they can be found on my people page. I explained how to access them in a separate email. They said that they would test it.
Changing the summary to narrow this down to the rewrite portion so we can keep that moving along; I will file a new bug shortly to cover the read performance regression.
Read perf is bug #448130.

Thanks,
-Eric
I have largely the same problem. My RHEL5 server has a 6-drive MD RAID10 device with a filesystem on top. Locally I get 150MB/s writes using 'dd bs=32k'; over NFS, about 5-10MB/s. On exactly the same server I get 85MB/s writes over NFS if I write to a non-MD filesystem. So it looks like an interaction between NFS and MD, likely as described in comment #46 above.
md is likely getting unaligned IO as well. John, can you point Peter at your test RPMs too? -Eric
I have been using the workaround described below, and have observed no regression in RHEL5.1 single-threaded NFS reads when using it. This seems consistent with the preceding results in this bug report, i.e. 1 nfsd thread is much faster.

The workaround is to add this line to the /etc/rc.local boot script and then to run that script:

# for n in /sys/block/sd*/queue/iosched/slice_idle ; do echo 1 > $n ; done

This parameter did not exist in the RHEL4 CFQ I/O scheduler. A similar effect can be achieved with the deadline or noop scheduler (see the example after this comment), but for writes we have seen better results with CFQ.

The purpose of this workaround is to minimize the overhead imposed by CFQ when multiple threads are reading from the same file. NFS uses a thread pool to service RPCs, so a sequential single-threaded read at the application layer becomes a multi-threaded read at the NFS server. CFQ treats threads as if they were application processes, but here they are not, so the default delay of 8 ms before switching to a different thread's requests, represented by the slice_idle block device tuning parameter, is unreasonable. Others have seen this problem, including the author of CFQ:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/msg05066.html

More research needs to be done on the effect of setting this parameter to zero, so until we do a systematic test of all known workloads with this value I would not recommend it as a general solution.

Reproducer: a 43% improvement, from 24.7 to 35.4 MB/s, was observed using this simple test, done with 2 hosts running RHEL5.1 connected by a 1-Gb Ethernet link. The NFS server exported a partition on the system disk, /dev/sda3, mounted as an ext3 file system. No NFS or ext3 tuning was used. The workload was:

# dd of=/dev/null bs=64k count=16k if=/mnt/nfsext3/f
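For completeness, the scheduler-based alternative mentioned above amounts to switching the elevator per device at runtime (a sketch; sda is a placeholder device, and as noted, CFQ with slice_idle lowered behaved better for writes in our testing):

# the bracketed entry is the active scheduler
cat /sys/block/sda/queue/scheduler

# switch this device to the deadline (or noop) elevator
echo deadline > /sys/block/sda/queue/scheduler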
Thanks for the pointer to that thread; that's interesting beyond just NFS performance...
Per comment #66, please find rpms with the write-side patch in mm/filemap.c, for commit 29dbb3fc8020f025bc38b262ec494e19fd3eac02, in http://people.redhat.com/jfeeney/.bz436004
in kernel-2.6.18-95.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
(In reply to comment #73)
> in kernel-2.6.18-95.el5. You can download this test kernel from
> http://people.redhat.com/dzickus/el5

This kernel has been tested; rewrite performance is good.
The 2.6.18-96.el5 test kernel does not address comment #67. I retested with this kernel and found the same problem once again. Should it have fixed this? Is there any kernel that addresses this problem?
Ben, we split this bug in two: a new one for read (bug 448130) and this one for rewrite. The test kernel mentioned in this bug specifically addresses the rewrite issue, which was causing read-modify-write and thereby massive write slowdowns. There is not yet a test kernel which addresses the CFQ issue you mentioned in comment #67.

Thanks,
-Eric
John (#70) -- Do you have compiled GFS packages for this test kernel? (kmod-gfs2 + any other GFS pkgs with kernel dependencies.) We have some folks trying to export a GFS share over NFS. Thanks again, Jacob Liberman Dell HPC Engineering
No, Jacob. I just build kernels. Sorry.
Jacob - the GFS kmods have used the driver update model since RHEL 5.1, so the existing kmods should be usable on the test kernels.

regards,
Subhendu
Attention Partners! RHEL 5.3 public Beta will be released soon. This URGENT priority/severity bug should have a fix in place in the recently released Partner Alpha drop, available at ftp://partners.redhat.com. If you haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure the highest possible quality bits in the upcoming Beta drop. Thanks, more information about Beta testing to come. - Red Hat QE Partner Management
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html