Red Hat Bugzilla – Bug 103015
kernel locks up under high NFS load
Last modified: 2007-11-30 17:06:57 EST
Description of problem:
Customer is having serious system hangs when putting RHEL 3 Beta 1 under high
NFS load. The choice of TCP vs. UDP does not seem to affect the problem, although
it takes longer to trigger with TCP.
The customer upgraded to the 405 kernel at my request. They got a kernel panic
within about an hour of testing. They had not yet set up netdump so they copied
down some of the panic output:
EIP: 0060:[<02144dc1>] Not tainted
EIP is at do_flushpage [kernel] 0x11 (2.4.21-1.1931.2.405.enthugemem)
Process rm (pid: 3564, stackpage=0e453000)
Call Trace: [<02144e60>] truncate_complete_page [kernel] 0x70 (0xe453ee8)
[<02144f57>] truncate_list_pages [kernel] 0xe7 (0xe453ef8)
[<0214511d>] truncate_inode_pages [kernel] 0x4d (0xe453f34)
[<f886ff80>] ext3_sops [ext3] 0x0 (0xe453f40)
[<0217cca2>] iput [kernel] 0x1a2 (0xe453f4c)
[<02171495>] vfs_unlink [kernel] 0x185 (0xe453f68)
[<02171709>] sys_unlink [kernel] 0x119 (0xe453f84)
Code: Bad EIP value.
Kernel panic: Fatal exception
Under the 405 kernel, even with NFS over TCP, we still get freezing under high
NFS load. If a vmstat command is already running when it occurs, we see:
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id
0 3 0 1058364 42524 6753164 0 0 0 1696 243 30 0 100
0 3 0 1058364 42524 6753140 0 0 0 1648 278 32 0 0
0 3 0 1058364 42524 6753140 0 0 0 1440 246 26 0 0
0 3 0 1058364 42524 6753132 0 0 0 1840 271 28 0 100
0 3 0 1058364 42524 6752748 0 0 0 1960 270 31 0 0
0 3 0 1058364 42524 6752652 0 0 0 1832 261 118 0 0
0 3 0 1058424 42524 6752480 0 0 0 1712 255 193 0 0
0 3 0 1058492 42524 6752408 0 0 0 1316 248 96 0 0
0 3 0 1058552 42524 6752352 0 0 0 1552 267 80 0 100
0 3 0 1058580 42524 6752324 0 0 0 1508 259 51 0 0
0 3 0 1058580 42524 6752324 0 0 0 1508 268 23 0 0
0 3 0 1058596 42524 6752308 0 0 0 1532 254 25 0 0
0 3 0 1058924 42524 6751976 0 0 0 2152 278 41 0 0
0 3 0 1058996 42524 6751908 0 0 0 1696 272 91 0 0
0 3 0 1059128 42524 6751772 0 0 0 1376 248 51 0 100
0 3 0 1059188 42524 6751716 0 0 0 1648 268 53 0 100
0 3 0 1059308 42524 6751596 0 0 0 1924 256 89 0 0
0 3 0 1059376 42524 6751524 0 0 0 1300 254 63 0 0
0 3 0 1059436 42524 6751468 0 0 0 1896 253 59 0 0
However, if we stop the vmstat, any new command typed will hang. Login from
console or ssh also hangs. From remote, we are able to run an rpcinfo -p
against it, but not anything that engages NFS daemons such as showmount -e.
The customer has been able to reproduce this on three different machines. I have
instructed them to try it with the 409 hugemem and SMP kernel.
What tests are they running to cause this hang/panic?
Also have them install the kernel-debuginfo rpm and
then run crash to see where the nfsd threads are hanging,
by typing the crash command "foreach nfsd bt".
do_flushpage looks like:
static int do_flushpage(struct page *page, unsigned long offset)
{
	int (*flushpage) (struct page *, unsigned long);

	flushpage = page->mapping->a_ops->flushpage;
... which comes out as:
0x2144db0 <do_flushpage>: sub $0xc,%esp
0x2144db3 <do_flushpage+3>: mov 0x10(%esp,1),%edx
0x2144db7 <do_flushpage+7>: mov 0x14(%esp,1),%ecx
0x2144dbb <do_flushpage+11>: mov 0x8(%edx),%eax
0x2144dbe <do_flushpage+14>: mov 0x1c(%eax),%eax
0x2144dc1 <do_flushpage+17>: mov 0x18(%eax),%eax
so we've oopsed dereferencing page->mapping->a_ops. Do we have mapping set and
a_ops clear here?
We really need the rest of the oops trace to be sure. Just copying down a bit
of it doesn't give us the whole picture. If this is reproducible, then getting
as many complete traces as possible will be a great help.
The customer has managed to get a full crash dump (all 6.5 GB of it), and ran
the command that Steve requested. The output of that command is attached as a
file, since it's ~1750 lines.
Created attachment 93967 [details]
Output from crash analysis
Created attachment 93974 [details]
updated "foreach bt" output
This attachment contains clean backtrace output for all tasks
when the Alt-Sysrq-C was entered.
On a related note, the customer had reported that the netdump
operation hung during the post-dump show_state() call while printing
all of the back traces. So I had him just do an Alt-Sysrq-T during
runtime, which he said would hang the same way. For clarification
I asked him:
>> Ok -- so after entering Alt-Sysrq-T, you get partial output, and
>> from then on you can no longer use the console?
> That is correct. The only way I can recover the system at that
> point is to use the hardware reset button.
I'm not sure how this relates to the bug at hand. But it's hard
to see where it would block; show_state() takes the tasklist lock
for reading, starts printing out task stuff, but then hangs somewhere.
Should he try to find out where it's hung with nmi_watchdog?
Or just make it the subject of another bugzilla?
Created attachment 94008 [details]
Stack Trace from Wedged System (ext3)
The system became wedged again today under high NFS load. I collected
a crash dump and have included a crash "foreach bt" as an attachment.
Created attachment 94009 [details]
Stack Trace from Wedged System (ext2)
Since the last two crash dumps have looked like things may be wedged on
ext3 journal updates, I remounted the storage as ext2 and re-tested. Under
high NFS load, I got a similar system hang as before. Attached is the crash
"foreach bt" from this test.
Created attachment 94010 [details]
netdump hang sequence of events
As noted earlier, when dumping a stack trace using alt-sysrq-t or dumping a
core with alt-sysrq-c, the system(s) hang during the stack dump, usually about
30-40 lines into it. I ran tcpdump on the netdump server. This attachment
shows the last several packets transmitted by the panic'ing system, then the
timeout logged on the netdump server, etc. The hostnames have been changed to
protect the innocent.
Created attachment 94037 [details]
Stack Trace from Wedged System (ext3, different system than other stack traces)
We ran into this problem again today. This dump is from a different server
with a like configuration as the other crash data I have submitted. The system
was under an NFS load while sending backups to the tape server. The system was
slow but usable up until I issued an rm command to delete 3-4 files, at which
time the system became quite wedged. I included both a "foreach bt" and
"foreach bt -l" in the attachment. Unlike the other dumps, several shell
commands are also shown in a hung state, such as ls, w, and rm.
More information about the hang which occurs when doing a crash dump or
alt-sysrq-t stack trace.
* When the stack trace is dumped in conjunction with a crash dump using
alt-sysrq-c, the dump works okay, but the stacktrace hangs after displaying
10-20 lines.
* When the stack trace is dumped by doing a alt-sysrq-t, about 10-20 lines of
backtrace are dumped, and I get "<6>NETDEV WATCHDOG: eth0: transmit timed out"
at which point it hangs as before.
* If I don't load the netdump module, the alt-sysrq-t sequence dumps a full
backtrace without problem.
Perhaps netdump doesn't like my NIC? I am using a 10/100/1000 Intel 82546EB
Gigabit Ethernet (e1000).
I believe this issue is separate from my NFS hanging problem; perhaps this
should be a different bug?
Created attachment 94059 [details]
Another Wedged System (crash foreach bt)
We had a few more system "hangs" today. If I booted the system with NFS
disabled, the system would function fine. Within 30 seconds of starting NFS
daemons, it would become wedged. I found that a single NFS client (Solaris 9,
NFS over TCP, 8k blocks) was the only NFS client generating traffic to this
system at the time of its problems. In the case of the attached backtrace
data, I waited almost 3 hours before getting the crash dump. I also collected
a network capture (snoop) of the NFS traffic, but right off hand nothing looks
unusual. Please advise if you want me to upload it (it's quite big).
Created attachment 94067 [details]
Backtrace from System Hung for 14 Hours
This system hung during NFS load as before. I left it in a wedged state
overnight and collected the core file this morning. In addition to the
behavior I have previously reported, the system also logged the following
errors on the console:
rpc-svc/tcp: nfsd: sent only -107 bytes of 124 - shutting down socket
rpc-svc/tcp: nfsd: sent only -107 bytes of 124 - shutting down socket
rpc-svc/tcp: nfsd: sent only -107 bytes of 1304 - shutting down socket
rpc-svc/tcp: nfsd: sent only -107 bytes of 1304 - shutting down socket
Attached is a "foreach bt" and "foreach bt -l" from this crash dump.
It appears that all the nfsds are hanging in the ext3 journal code.
To make this clearer, could you please have crash load the
nfsd.o and ext3.o modules before doing the stack trace. The
steps, on the machine that is hanging, would be:
1) enter crash
2) mod -S # this loads all the needed modules
3) If step 2 is successful, skip to step 6; if not, go to step 4
4) mod -s nfsd /lib/modules/kernel-version/kernel/fs/nfsd/nfsd.o
5) mod -s ext3 /lib/modules/kernel-version/kernel/fs/ext3/ext3.o
6) foreach bt
Created attachment 94091 [details]
Two Stack Traces with Loaded Modules
Attached are two backtraces where I ran "mod -S" as requested. Since I had
suspected the ext3 journal for the deadlock, I had mounted the storage using
ext2 (just for the NFS exported filesystems) and was also able to get a hang.
A backtrace from a system using ext3 and a different system using ext2 are both
included.
From the latest bt_withmods.txt, ext3 trace:
nfsd pid 1021 is stuck waiting for IO:
PID: 1021 TASK: f5bcc000 CPU: 0 COMMAND: "nfsd"
#0 [f5bcdd58] context_switch at 2125a04
#1 [f5bcdd70] schedule at 2123c02
#2 [f5bcddb4] io_schedule at 2124f45
#3 [f5bcddc0] __wait_on_buffer at 21623c9
#4 [f5bcddf8] journal_flushpage at f884a3c3
#5 [f5bcde2c] ext3_flushpage at f885ed18
kjournald pid 36 is waiting on IO:
PID: 36 TASK: f6408000 CPU: 1 COMMAND: "kjournald"
#0 [f6409dc8] context_switch at 2125a04
#1 [f6409de0] schedule at 2123c02
#2 [f6409e24] io_schedule at 2124f45
#3 [f6409e30] __wait_on_buffer at 21623c9
#4 [f6409e68] journal_commit_transaction at f884ba3b
#5 [f6409ff0] kernel_thread_helper at 210956b
as are pids 295, 297 and 299. The other kjournald threads are all idle except
for pid 298 which appears to be waiting for a transaction to complete. All the
other ext3 waiters may be blocked behind these ones.
In the ext2 trace, nfsd pid 722 is waiting for IO (it's waiting for a free
request, so the IO queue is already full); pids 725 and 729 are in the same
state, and the others are blocked on VFS locks (perhaps ones held by the
blocked nfsds). Sadc (pid 993) is blocked inside ext2 waiting for IO to
complete.
In both cases it looks like there are critical tasks stalled waiting on block
device IOs. Was the system 100% idle in each case? If so, the root cause looks
like a storage-layer problem, not a filesystem problem.
What storage is being used on the various filesystems here --- the root, and the
NFS-exported filesystems? What IO controllers and device drivers?
Controller seems to be a 3ware one according to the backtraces
The storage in each system is a pair of 3ware controllers. In one system, it
is a pair of the Escalade 8500 controllers; in the other two systems, it is a
pair of Escalade 7500 controllers. Each controller has 8 250 GB IDE drives.
From each 3ware controller, one drive is presented as JBOD, 7 drives as a RAID 5
LUN. The OS image is software mirrored across the two JBOD disks, and the
remaining 2 RAID 5 volumes are NFS exported. The device driver for these 3ware
devices is the release included with RHEL3. The firmware on the 3ware
controllers was upgraded to the latest release as one of the early
troubleshooting steps when the problems started occurring.
It had occurred to me last week that this may be a 3ware driver problem, so I
downloaded and tested the 2.4.22 kernel from www.kernel.org which includes the
most recent version of the 3ware driver. My test-case which reliably
reproduces the problem within 10-20 seconds with RHEL 3 was not able to
reproduce the problem with 2.4.22. However, when using that www.kernel.org
release under the systems typical load, I was able to get similar hangs. To
see if the newer release of the 3ware driver accounts for this improvement, I
installed the www.kernel.org release of the 2.4.21 kernel and re-ran my tests,
getting similar results. Therefore, it seems the problem exists in the
www.kernel.org release as well as in RHEL3 beta, but the problem seems
easier to reproduce in the RHEL3 beta.
To respond to the question about the system being idle: when a 'vmstat' is left
running during the problem, it usually continues to display information. See
the first section of this bugzilla report for an example of what vmstat shows
while the problem is occurring [the output is from RHEL3 beta 1; the vmstat
output from RHEL3 beta 2 is similar, however the "wa" column does not exceed %]
Created attachment 94145 [details]
profiling output from op_time
To supplement current information, I gathered some profiling information and
ran op_time -l. The system was impaired, but not completely wedged at this
time (otherwise, I would not have been able to run the profiling commands.)
Update from Barber, Michael (UNIX Force Inc):
I've been pursuing the hardware also. Since there are three systems
with similar symptoms, it seems that instead of a bad component,
perhaps I have a bad combination of hardware + firmware + drivers.
One of the first troubleshooting steps I did while running RH8
was to update the firmware on the 3ware cards to the newest
release. However, when looking at the driver version I can see
that RHEL does not have the most recent 3ware driver. Therefore,
I down-reved the 3ware firmware to a level which matches the
RHEL driver. This did not help.
I checked the BIOS levels and found that my system with a
tyan motherboard was up to date, but my other two systems with
supermicro motherboards were behind on the BIOS, so I updated
them. No help.
I've also gone through the BIOS settings and disabled any
"advanced" functions, such as hyper-threading, etc. No help.
I opened a ticket with 3ware, and their initial response is
also that I must have some bad hardware. They suggested that
it may be incompatible memory. I'll check into this tomorrow. I
perhaps made a bad assumption that the VAR we bought these from
knew what kind of memory was needed, and that the system would
not pass the memory test if the memory was bad. I'll see if
either of these are bad assumptions tomorrow.
Created attachment 94393 [details]
A simple script to test responsiveness and throughput
I have done further testing and have more information to share.
* Since there was concern that we may have an incompatible collection
of hardware, I did some research and found that the model of motherboard
we are using is not on 3ware's HCL. I purchased a different motherboard
that is on their list and re-tested, getting similar results.
* I borrowed a 3ware 7506 card from one of the local vendors to see if
the problem occurs with more recent hardware (we have 7500 and 8500 cards.)
* Heavy local IO seems to be able to temporarily wedge the system, but not
to the same extent NFS does. Perhaps many readers/writers influence
this behavior? Since local IO doesn't seem to wedge the system as
bad, I put together a scripted test which generates a lot of IO to
both 3ware controllers and times how long it takes to do misc tasks
like ps, ls, and creating, renaming, and deleting files. For the sake of comparison,
I tested this against the kernel.org release of the 2.4.21 and 2.4.22
kernels. To simulate an interactive user, the script sleeps between
some operations, the sleep time total of the script is 4 minutes,
so everything beyond 4 minutes is how long the script took to run the
other commands. Results (runtime listed in minutes, the adjusted
runtime is runtime - 4 minutes to remove deliberate "sleeps"):
Kernel Runtime Adjusted Runtime
ext3 + 2.4.21-1.1931.2.409.entsmp 16.92 12.92
ext3 + 2.4.21-1.1931.2.409.enthugemem 15.85 11.85
ext3 + 2.4.21 6.45 2.45
ext3 + 2.4.22 4.25 .25
So, it seems the open source 2.4.22 kernel running on this system
(same ext3 filesystem and no hardware changes or tuning) is
about 42 times more responsive than RHEL3b2 when under high IO load
with my collection of hardware.
I also put together a scripted throughput test that just does a
simple dd write to each of the two 3ware filesystems. The throughput
was nearly the same for each kernel, except that the system was not
very responsive when running the RHEL3 kernels during the throughput
test. Here are the results (with absolutely no tuning done):
ext3 + 2.4.21 34.75 MB/sec
ext3 + 2.4.22 34.64 MB/sec
ext3 + 2.4.21-1.1931.2.409.enthugemem 34.42 MB/sec
ext3 + 2.4.21-1.1931.2.409.entsmp 38.28 MB/sec
Since it seems we are not going to be able to solve this problem
with RHEL3 beta, and I have so far not been able to reproduce problems
with the kernel.org 2.4.22 kernel, we are exploring its use as an
alternative to RHEL. Performance with the XFS filesystem add-on also
seems to be quite a bit better than with ext3.
I suggest we leave this ticket open for the next week. If I
encounter any serious problems with the kernel.org 2.4.22 kernel
we will then re-evaluate if we want to continue pursuing this
RHEL 3 issue.
Please try:
elvtune -r 1 -w 1 -b 1 /dev/xxx
for each spindle and see if that helps -- it has helped in some of our
other cases.
Have you tried this with the RHEL 3 Release Candidate?
Have you tried the elvtune suggestion?
I have not tried the RHEL release candidate, nor have I tried the elvtune
suggestion. As I explained in my last post, I decided to try using the
kernel.org 2.4.22 kernel with XFS. Since converting, all three of these
systems have been 100% stable (just over one month now.) Unfortunately, I
could never keep RHEL stable for more than a day or two on these same systems.
I'd like to test the latest build out and see if the issue has been fixed, but
the window has closed on this project since we have finally landed on a
reasonably stable solution. Sorry.
Thanks for all the effort those involved put into trying to resolve this
problem. If none of your other customers have run into this issue, feel free
to close this bug.
No updates for one year. If there is further interest in this issue,
please re-test with the most recent RHEL 3 update.