From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
We've run into a problem with our RedHat NFS servers where local disk access blocks access from NFS clients. We have reproduced this with ext3, ext2, and jfs; with and without lvm; and with both a standalone SCSI disk and various 3ware RAID cards and configurations. In addition, we have reproduced this with the following kernels:
2.4.21-20.ELsmp
2.4.21-20.EL
2.4.21-4smp
2.4.21-23.ELsmp
2.4.21-25.ELsmp

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Test access from an nfs client to the nfs mounted disk: 'time touch /nfs/server/disk/test'
2. Start something locally on the NFS server that hits on the exported disk. We have done this by running bonnie++ locally on the server and by simply tarring or copying large files.
3. While the disk is being accessed locally, test access from the nfs client: 'time touch /nfs/server/disk/test2'

Actual Results:
Depending on the filesystem, etc., the local disk access effectively blocks nfs access. The 'touch /nfs/server/disk/test2' takes anywhere from 30 seconds to 3 minutes!

Expected Results:
The nfs IO should not be blocked by local disk access.

Additional info:
Bug #121434 seems like it may be related.
I made a typo in this bug report. The "Actual Results:" should have said:

The local disk access effectively blocks nfs access. Depending on the filesystem, etc., the 'touch /nfs/server/disk/test2' takes anywhere from 30 seconds to 3 minutes!
I've done a few more tests to try to narrow this down. First of all, this situation seems to be triggered by nfs access. In addition, when this happens, it also blocks subsequent local disk IO. Here are steps to show this behaviour (a rough script automating them follows this comment):

----
1. Test access from an nfs client to the nfs mounted disk:
   [root@client]# time touch /nfs/server/disk/clienttest
2. Test access from the nfs server to the exported disk:
   [root@server]# time touch /mnt/disk/testlocal
3. Start bonnie++ on the nfs server to test the exported disk:
   [root@server]# bonnie++ -d /mnt/disk -n 0 -r 1024 -s 2048 -f
4. Test IO from another shell on the nfs server to the exported disk:
   [root@server]# time touch /mnt/disk/testlocal2
5. Access the exported disk from the nfs client:
   [root@client]# time touch /nfs/server/disk/clienttest2
6. One last time, test access to the exported disk from the nfs server:
   [root@server]# time touch /mnt/disk/testlocal3
----

After these steps, you should see the last two 'touch' commands take over a minute to complete. We have also been able to reproduce this problem on RedHat 9 using kernel 2.4.20-31.9smp. On the other hand, RedHat 7.3 with the 2.4.20-28.7smp kernel does not show this problem.
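For reference, and not to be confused with the nfstest.sh attached below, a rough sketch of a script automating these steps might look like the following. The hostname 'client', the use of ssh to drive the client-side touches, and the 30-second settle time are all assumptions:

----
#!/bin/sh
# Sketch only: reproduce the blocked-IO behaviour described above.
CLIENT=client                 # assumed reachable via ssh
LOCALDIR=/mnt/disk            # exported filesystem on the server
NFSDIR=/nfs/server/disk       # same filesystem as mounted on the client

# Baseline timings with no load
( time touch $LOCALDIR/testlocal )            2> touchlocal1.log
ssh $CLIENT "time touch $NFSDIR/clienttest"   2> touchremote1.log

# Generate heavy local I/O on the exported filesystem
bonnie++ -d $LOCALDIR -n 0 -r 1024 -s 2048 -f &
BONNIE_PID=$!
sleep 30                      # give bonnie++ time to reach its write phase

# Timings while the local load is running
( time touch $LOCALDIR/testlocal2 )           2> touchlocal2.log
ssh $CLIENT "time touch $NFSDIR/clienttest2"  2> touchremote.log

wait $BONNIE_PID
----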
Would it be possible to post a vmstat and alt-sysrq-m and -t.
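For reference, one way to capture that information on the server while the hang is in progress (assuming the kernel was built with magic SysRq support) is roughly:

----
# Record memory/IO activity in the background
vmstat 2 > vmstat.log &

# Enable the magic SysRq interface, then request the dumps
echo 1 > /proc/sys/kernel/sysrq
echo m > /proc/sysrq-trigger     # same as alt-sysrq-m (memory info)
echo t > /proc/sysrq-trigger     # same as alt-sysrq-t (task states)

# The output lands in the kernel ring buffer / syslog
dmesg > sysrq-dumps.txt
----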
Created attachment 108726 [details] test script that shows blocked nfs IO
Created attachment 108727 [details] 'vmstat 2' results from nfs client - starting just before nfstest.sh script is run
Created attachment 108728 [details] alt-sysrq-m and alt-sysrq-t from nfs client while nfs blocking is occurring
Created attachment 108729 [details] alt-sysrq-m and alt-sysrq-t from nfs server while IO blocking is occurring
Are you unable to reproduce this? To summarize the tests I've done so far, this problem appears in RedHat 9, RHEL3, and Fedora Core 3. It does not occur in RH7.3 or Slackware 10.0. Here are the results of running the nfstest.sh script I previously attached:

[root@lnx108 fstest]# cat touchlocal1.log
real 0.00
user 0.00
sys 0.01

[root@lnx108 fstest]# cat touchlocal2.log
real 318.71
user 0.00
sys 0.00

[root@lnx233 root]# cat touchremote.log
real 333.70
user 0.00
sys 0.01
It appears things are getting stuck in the lvm or scsi code... and it does not appear to be an NFS issue although NFS might be needed to cause the hang....
I agree that we don't know where the problem really is. However, I can reproduce this using a plain ext3 partition on an IDE disk. I will post alt-sysrq-m and -t for this setup. Thanks for following up on this.
Created attachment 108806 [details] test script to produce blocked nfs I/O
Created attachment 108807 [details] 'vmstat 2' results from nfs client - no LVM and no SCSI
Created attachment 108808 [details] alt-sysrq-m and -t from nfs client while nfs IO is blocked - no LVM and no SCSI
Created attachment 108809 [details] alt-sysrq-m and -t from nfs server while IO is blocked - no LVM and no SCSI
... and possibly we're running into a combination of problems. I do see the blocked IO when using ext3 on an IDE disk with no LVM. However, at least in these results, the situation may not be quite as bad as when using lvm and/or scsi. After running the most recently attached test script with an ext3 partition on an ide disk with no lvm:

[root@lnx108 root]# cat touchlocal1.log
real 0.00
user 0.00
sys 0.00

[root@lnx108 root]# cat touchlocal2.log
real 57.71
user 0.00
sys 0.01

[root@lnx233 nfstests]# cat ~/touchremote.log
real 115.26
user 0.00
sys 0.00
For now, I'm going to reassign this to our SCSI guy....
Created attachment 108973 [details] alt-sysrq-p, -t, and -m from nfs server while IO is blocked - single SCSI disk with no LVM

Does this really look like a scsi problem? I get almost identical results when doing the same thing with a stand alone SCSI or IDE disk. After running bonnie++ on the server, touching the local filesystem, touching the remote filesystem, and then touching the local filesystem again (on a SCSI disk with no LVM) we get these times:

[root@lnx108 root]# cat touchlocal1.log
real 0m0.001s
user 0m0.000s
sys 0m0.000s

[root@lnx233 nfstests]# cat ~/touchremote.log
real 1m12.868s
user 0m0.000s
sys 0m0.000s

[root@lnx108 root]# cat touchlocal2.log
real 1m10.167s
user 0m0.000s
sys 0m0.000s
Created attachment 108974 [details] alt-sysrq-p, -t, and -m from nfs client while nfs IO is blocked - single SCSI disk with no LVM
I just want to add that the attachments in comment #19 and comment #20 were taken after updating both server and client to everything RHN released yesterday and today (Update 4?).
I have reproduced the problem (using RHEL 3). I'll be looking at it in detail after the New Year. It is most interesting that you see the same thing on FC3. The 2.6 kernel is a very different environment.
Here is what I found so far:

1. The problem can be seen when the NFS client and server run on the same machine. This excludes the network from the picture.

2. strace shows that the long delay in "touch" is during the utime syscall (see the strace example after this comment). Steve Dixon kindly wrote the attached test program to do utime in a loop.

3. I start the bonnie++ exerciser on the server, against the local filesystem. Then I run utime.c against the loopbacked nfs path to the fs under load. The utime program hangs in the utime syscall for 10's of seconds, on every iteration. When I run utime against the direct path to the fs it consistently takes about one second (the same time it takes when bonnie++ is not running).

4. When I turn off bonnie++ and run just utime against the direct path to the fs, vmstat shows about 12 "blocks out (bo)" every 5 sec. When I run utime against the loopbacked nfs path, vmstat shows 4700 "bo" per sec. The NFS server is doing a ton of writes for every utime syscall.

Assigning to Steve for further investigation into the NFS server.
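As a quick way for others to confirm item 2, an example of watching the syscall timing directly (the path is a placeholder, and on newer systems the timestamp update may show up as utimensat rather than utime/utimes):

----
# -T prints the time spent inside each syscall; filter to the timestamp calls
strace -T -e trace=utime,utimes touch /nfs/server/disk/test
----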
Created attachment 109496 [details] test program, loops in utime syscall
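The attached program itself is not reproduced here. As a rough shell approximation of the same idea (touch on an existing file issues a utime-family syscall, so timing it in a loop shows the same stall), one could run something like the following, with the path being a placeholder:

----
#!/bin/sh
# Rough stand-in for the attached utime.c: repeatedly update the timestamps
# of one existing file and report how long each update takes.
FILE=${1:-/nfs/server/disk/utime-test}
touch "$FILE"                     # make sure the file exists first

while true; do
    START=`date +%s`
    touch "$FILE"                 # timestamp-only update on an existing file
    END=`date +%s`
    echo "timestamp update took `expr $END - $START` seconds"
    sleep 1
done
----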
Just to be clear, in item 4, comment #24, there is no heavy I/O. The only thing running is utime.c. When utime.c runs over the NFS server it is causing 4700 blocks per second to be written. That seems excessive.
Hi, I have had this NFS problem as well. I have also been having unexpected crashes when the load on the affected disc drive was high. Turning off NFS has stopped these crashes happening. Someone else who has also been having this problem told me that using asynchronous NFS mounts instead of synchronous ones makes a big improvement.
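For what it's worth, the setting being referred to is presumably the sync/async behaviour of the export on the server side (the client mount options also have sync/async variants). Purely as an illustration, with placeholder host and path, and with the usual caveat that async exports can lose data if the server crashes:

----
# /etc/exports -- illustrative entry exporting the filesystem asynchronously
/mnt/disk    client.example.com(rw,async)

# after editing /etc/exports, re-export:
exportfs -ra
----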
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.