Bug 491177 - Heavy NFS load to one disk causes all I/O on a system to hang
Status: CLOSED DUPLICATE of bug 489889
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: x86_64 Linux
Priority: low
Severity: high
Target Milestone: rc
Assigned To: Peter Staubach
QA Contact: Red Hat Kernel QE team
Reported: 2009-03-19 14:11 EDT by Skylar Thompson
Modified: 2009-06-16 18:39 EDT (History)
CC: 9 users

Doc Type: Bug Fix
Last Closed: 2009-06-11 16:34:38 EDT

Attachments
Sysrq stack trace when NFS was hung on an RHEL 5.1 x86_64 system (694.92 KB, text/plain)
2009-03-19 14:11 EDT, Skylar Thompson
Description Skylar Thompson 2009-03-19 14:11:47 EDT
Created attachment 335899 [details]
Sysrq stack trace when NFS was hung on an RHEL 5.1 x86_64 system

Description of problem:
Heavy NFS load (over 1Gbps) to an RHEL5 NFS server can cause all I/O to hang, even to disks not receiving NFS traffic.

I've isolated two failure modes: at high I/O rates, all I/O stops, while at more moderate rates the nfsd threads just hang.

Version-Release number of selected component (if applicable):

Reproduced on RHEL 5.1-5.3, kernels 2.6.18-92 through 2.6.18-128

How reproducible:
This problem is very reproducible. It is easiest to reproduce by mounting the NFS filesystem over the loopback interface, but it can also be reproduced by bonding two gigE NICs and sending traffic from multiple client nodes. The problem is also reproducible on RHEL4, but only at higher I/O rates; so far I have only been able to generate enough I/O there using the loopback mount method.

Steps to Reproduce:
1. Create and mount an ext3 filesystem on separate disk(s) from the system disk(s).
2. Add an exports rule for 127.0.0.1:
/fu              127.0.0.1(rw,async,root_squash) 
3. Mount that filesystem over NFS via the loopback interface:
# mount -t nfs 127.0.0.1:/fu /mnt/fu -o nfsvers=3,tcp,hard,intr,rsize=32768,wsize=32768
4. Up the number of NFS threads to 512 in /etc/sysconfig/nfs:
RPCNFSDCOUNT=512
5. Start/Restart NFS:
# service nfs restart
6. Make a directory writable by a non-root user:
# mkdir -p /fu/user
# chown user:user /fu/user
7. Run this script to start some number of dd writers. I've managed to hang nfsd with just 128 writers, and to hang the entire system with 512 writers.

===

#!/bin/bash

# Number of concurrent dd writers to launch.
NUM=$1
if test -z "$NUM"; then
    echo "Provide number of dd's!"
    exit 2
fi

for ((i = 0; i < NUM; i++)); do
    dd if=/dev/zero of=/fu/user/$(hostname -s).$i bs=1M &
done

wait

===

8. Wait a few minutes for nfsd or the system to hang. I've seen the problem occur with as little as 4GB of data written out.

Actual results:
Depending on the number of writers, nfsd will hang, or all I/O on the system will hang. If vmstat and iostat are started before dd, vmstat will report no I/O while iostat still reports I/O. When the system hangs, I/O wait starts out normal, but eventually every CPU sits at 100% I/O wait and does not recover unless the dd processes are killed. It is at the point where I/O wait hits 100% that the system becomes unusable.
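Incidentally, the runaway I/O wait can be spotted from individual vmstat samples; below is a minimal sketch (my illustration, not from this report), assuming the default vmstat column layout where "wa" is the 16th field, and using 95% as an illustrative cutoff:

```shell
#!/bin/bash
# Sketch: classify a single `vmstat 1` data line as stalled or ok based on
# the "wa" (I/O wait) column. Assumes the default column order, where "wa"
# is the 16th field; the 95% threshold is an illustrative cutoff.
check_iowait() {
    echo "$1" | awk '{ if ($16 >= 95) print "stalled"; else print "ok" }'
}

# Example lines (values illustrative):
check_iowait "1 0 0 81920 12345 67890 0 0 120 340 250 400 5 3 90 2 0"   # prints "ok"
check_iowait "0 8 0 81920 12345 67890 0 0 0 0 250 400 1 2 2 95 0"      # prints "stalled"
```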

Expected results:
System performance should degrade gracefully as I/O is increased.

Additional info:
I have replicated this with DASD disks on a MegaRAID card, and against an EMC CX380 connected with Fibre Channel and PowerPath multipathing. It is reproducible at higher I/O rates on RHEL4 using the same hardware. The system disks are completely separate from the disks on which the dd tests are run. I've attached a sysrq stack trace from the system when nfsd was hung.
Comment 1 Skylar Thompson 2009-03-19 15:55:54 EDT
I should point out that this doesn't happen with all local I/O; if I point the dd writers at local disk rather than NFS, I never run into problems.
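For reference, the local-disk control test is the same writer load pointed at the local ext3 mount instead of the loopback NFS mount; a sketch with assumed paths (TARGET and NUM here are runnable stand-ins for the local /fu path and the 128+ writers used above):

```shell
#!/bin/bash
# Control test sketch: same dd writer pattern, but against a local directory,
# bypassing NFS. TARGET and NUM are hypothetical defaults chosen so the
# sketch runs anywhere; in the original setup TARGET would be the local
# /fu/user path and NUM would be 128 or more, with no count limit.
TARGET=${TARGET:-$(mktemp -d)}
NUM=${NUM:-4}

for ((i = 0; i < NUM; i++)); do
    dd if=/dev/zero of="$TARGET/local.$i" bs=1M count=16 2>/dev/null &
done
wait
```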
Comment 2 Sander Grendelman 2009-04-08 10:51:14 EDT
We are also encountering this bug.
System: Red Hat 5 update 3 (x86_64).

The problem occurs using the default number of nfsd processes (8).

We are writing two (Oracle RMAN) streams over a single gigabit connection at about 800 Mbit/s. The array we are writing to should be able to handle this amount of I/O.
Comment 3 Joseph W. Breu 2009-04-24 12:00:44 EDT
I am encountering the same error with a Dell MD3000 mounted locally on /mnt/DAS and mounted on the same server via NFS.

Steps to reproduce:

/usr/sbin/bonnie++ -m direct-das1 -d /cms/fileserver/iozone-test/ -n 20:1m:4m:400 -p 2 -u 0:0

/usr/sbin/bonnie++ -m direct-das1 -d /cms/fileserver/iozone-test/ -n 20:1m:4m:400 -y -u 0:0

in another window:

/usr/sbin/bonnie++ -m direct-das1 -d /cms/fileserver/iozone-test/ -n 20:1m:4m:400 -y -u 0:0


Using 32 nfsd threads

All nfsd threads hang and I/O to the DAS drops to 0.

We are running Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Comment 5 Sachin Prabhu 2009-05-22 11:32:42 EDT
Cleaned up one particular stack trace from c#1 that could be causing the problem:

 nfsd          S ffff810001036500     0  7960      1          7962  7956 (L-TLB)
 ffff810425fb3980 0000000000000046 ffff81042d738b30 ffff81043fc64000
 0000000000000286 0000000000000009 ffff810425f957a0 ffff81043fc26080
 0000006888690e53 000000000000a3ee ffff810425f95988 0000000619bfcc00
 Call Trace:
 [<ffffffff884010c7>] :nfs:nfs_wait_bit_interruptible+0x22/0x28
 [<ffffffff80063ac7>] __wait_on_bit+0x40/0x6e
 [<ffffffff80063b61>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff8009db4f>] wake_bit_function+0x0/0x23
 [<ffffffff8840108b>] :nfs:nfs_wait_on_request+0x56/0x70
 [<ffffffff88404a96>] :nfs:nfs_wait_on_requests_locked+0x70/0xca
 [<ffffffff88405a81>] :nfs:nfs_sync_inode_wait+0x60/0x1db
 [<ffffffff883fbe7b>] :nfs:nfs_release_page+0x2c/0x4d
 [<ffffffff800c7606>] shrink_inactive_list+0x4e1/0x7f9
 [<ffffffff80012d02>] shrink_zone+0xf6/0x11c
 [<ffffffff800c801b>] try_to_free_pages+0x197/0x2b9
 [<ffffffff8000f271>] __alloc_pages+0x1cb/0x2ce 
 [<ffffffff8804ff1d>] :ext3:ext3_ordered_commit_write+0xa1/0xc7
 [<ffffffff8000fb8a>] generic_file_buffered_write+0x1b0/0x6d3
 [<ffffffff80016196>] __generic_file_aio_write_nolock+0x36c/0x3b8
 [<ffffffff800c2c0a>] __generic_file_write_nolock+0x8f/0xa8
 [<ffffffff80063bb6>] mutex_lock+0xd/0x1d
 [<ffffffff800c2c6b>] generic_file_writev+0x48/0xa2
 [<ffffffff800dbae0>] do_readv_writev+0x176/0x295
 [<ffffffff884905f4>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
 [<ffffffff88490e68>] :nfsd:nfsd_write+0xb5/0xd5
 [<ffffffff88497986>] :nfsd:nfsd3_proc_write+0xea/0x109
 [<ffffffff8848d1db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
 [<ffffffff8834c48b>] :sunrpc:svc_process+0x454/0x71b
 [<ffffffff8848d746>] :nfsd:nfsd+0x1a5/0x2cb
Comment 6 Sachin Prabhu 2009-05-22 11:56:33 EDT
Skylar,

From the stack trace in c#5, nfsd tries to write pages to disk. The ext3 write path requests memory, which the system tries to free by syncing cached pages owned by the NFS share. Since those NFS pages can only be flushed by the same nfsd threads that are now blocked in reclaim, this could easily result in a deadlock.
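While waiting on a kernel fix, one possible mitigation (my suggestion, not from this report) is to throttle dirty page cache earlier, so that allocations inside nfsd are less likely to fall into direct reclaim in the first place; the values below are illustrative, not tuned recommendations:

```shell
# Illustrative mitigation (not from this report): make writers block on
# writeback sooner by lowering the dirty page cache thresholds, reducing
# the chance that an allocation inside nfsd enters direct reclaim and
# waits on NFS-owned pages.
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
```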

Do you have any stack traces from a similarly hung system which doesn't mount nfs shares over loopback?

Sachin Prabhu
Comment 11 Skylar Thompson 2009-06-16 18:39:43 EDT
I've attached information to https://bugzilla.redhat.com/show_bug.cgi?id=489889
