Red Hat Bugzilla – Bug 103015
kernel locks up under high NFS load
Last modified: 2007-11-30 17:06:57 EST
Description of problem:
Customer is having serious system hangs when putting RHEL 3 Beta 1 under high
NFS load. The choice of TCP vs. UDP does not seem to affect the problem, although
it takes longer to trigger with TCP.
The customer upgraded to the 405 kernel at my request. They got a kernel panic
within about an hour of testing. They had not yet set up netdump so they copied
down some of the panic output:
EIP: 0060:[<02144dc1>] Not tainted
EIP is at do_flushpage [kernel] 0x11 (2.4.21-1.1931.2.405.enthugemem)
Process rm (pid: 3564, stackpage=0e453000)
Call Trace: [<02144e60>] truncate_complete_page [kernel] 0x70 (0xe453ee8)
[<02144f57>] truncate_list_pages [kernel] 0xe7 (0xe453ef8)
[<0214511d>] truncate_inode_pages [kernel] 0x4d (0xe453f34)
[<f886ff80>] ext3_sops [ext3] 0x0 (0xe453f40)
[<0217cca2>] iput [kernel] 0x1a2 (0xe453f4c)
[<02171495>] vfs_unlink [kernel] 0x185 (0xe453f68)
[<02171709>] sys_unlink [kernel] 0x119 (0xe453f84)
Code: Bad EIP value.
Kernel panic: Fatal exception
Under the 405 kernel, even with NFS over TCP, we still get freezing under high
NFS load. If a vmstat command is already running when it occurs, we see:
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id
0 3 0 1058364 42524 6753164 0 0 0 1696 243 30 0 100
0 3 0 1058364 42524 6753140 0 0 0 1648 278 32 0 0
0 3 0 1058364 42524 6753140 0 0 0 1440 246 26 0 0
0 3 0 1058364 42524 6753132 0 0 0 1840 271 28 0 100
0 3 0 1058364 42524 6752748 0 0 0 1960 270 31 0 0
0 3 0 1058364 42524 6752652 0 0 0 1832 261 118 0 0
0 3 0 1058424 42524 6752480 0 0 0 1712 255 193 0 0
0 3 0 1058492 42524 6752408 0 0 0 1316 248 96 0 0
0 3 0 1058552 42524 6752352 0 0 0 1552 267 80 0 100
0 3 0 1058580 42524 6752324 0 0 0 1508 259 51 0 0
0 3 0 1058580 42524 6752324 0 0 0 1508 268 23 0 0
0 3 0 1058596 42524 6752308 0 0 0 1532 254 25 0 0
0 3 0 1058924 42524 6751976 0 0 0 2152 278 41 0 0
0 3 0 1058996 42524 6751908 0 0 0 1696 272 91 0 0
0 3 0 1059128 42524 6751772 0 0 0 1376 248 51 0 100
0 3 0 1059188 42524 6751716 0 0 0 1648 268 53 0 100
0 3 0 1059308 42524 6751596 0 0 0 1924 256 89 0 0
0 3 0 1059376 42524 6751524 0 0 0 1300 254 63 0 0
0 3 0 1059436 42524 6751468 0 0 0 1896 253 59 0 0
However, if we stop the vmstat, any new command typed will hang. Login from
console or ssh also hangs. From remote, we are able to run an rpcinfo -p
against it, but not anything that engages NFS daemons such as showmount -e.
The customer has been able to reproduce this on three different machines. I have
instructed them to try it with the 409 hugemem and SMP kernel.
What tests are they running to cause this hang/panic?
Also have them install the kernel-debuginfo rpm and
then run crash to see where the nfsd threads are hanging,
by typing the crash command "foreach nfsd bt".
do_flushpage looks like:
static int do_flushpage(struct page *page, unsigned long offset)
{
	int (*flushpage) (struct page *, unsigned long);

	flushpage = page->mapping->a_ops->flushpage;
... which comes out as:
0x2144db0 <do_flushpage>: sub $0xc,%esp
0x2144db3 <do_flushpage+3>: mov 0x10(%esp,1),%edx
0x2144db7 <do_flushpage+7>: mov 0x14(%esp,1),%ecx
0x2144dbb <do_flushpage+11>: mov 0x8(%edx),%eax
0x2144dbe <do_flushpage+14>: mov 0x1c(%eax),%eax
0x2144dc1 <do_flushpage+17>: mov 0x18(%eax),%eax
so we've oopsed dereferencing page->mapping->a_ops. Do we have mapping set and
a_ops clear here?
We really need the rest of the oops trace to be sure. Just copying down a bit
of it doesn't give us the whole picture. If this is reproducible, then getting
as many complete traces as possible will be a great help.
The customer has managed to get a full crash dump (all 6.5 GB of it), and ran
the command that Steve requested. The output of that command is attached as a
file, since it's ~1750 lines.
Created attachment 93967 [details]
Output from crash analysis
Created attachment 93974 [details]
updated "foreach bt" output
This attachment contains clean backtrace output for all tasks
when the Alt-Sysrq-C was entered.
On a related note, the customer had reported that the netdump
operation hung during the post-dump show_state() call while printing
all of the back traces. So I had him just do an Alt-Sysrq-T during
runtime, which he said would hang the same way. For clarification
I asked him:
>> Ok -- so after entering Alt-Sysrq-T, you get partial output, and
>> from then on you can no longer use the console?
> That is correct. The only way I can recover the system at that
> point is to use the hardware reset button.
I'm not sure how this relates to the bug at hand. But it's hard
to see where it would block; show_state() takes the tasklist lock
for reading, starts printing out task stuff, but then hangs somewhere.
Should he try to find out where it's hung with nmi_watchdog?
Or just make it the subject of another bugzilla?
Created attachment 94008 [details]
Stack Trace from Wedged System (ext3)
The system became wedged again today under high NFS load. I collected
a crash dump and have included a crash "foreach bt" as an attachment.
Created attachment 94009 [details]
Stack Trace from Wedged System (ext2)
Since the last two crash dumps have looked like things may be wedged on
ext3 journal updates, I remounted the storage as ext2 and re-tested. Under
high NFS load, I got a similar system hang as before. Attached is the crash
"foreach bt" from this test.
Created attachment 94010 [details]
netdump hang sequence of events
As noted earlier, when dumping a stack trace using alt-sysrq-t or dumping a
core with alt-sysrq-c, the system(s) hang during the stack dump, usually about
30-40 lines into it. I ran tcpdump on the netdump server. This attachment
shows the last several packets transmitted by the panic'ing system, then the
timeout logged on the netdump server, etc. The hostnames have been changed to
protect the innocent.
Created attachment 94037 [details]
Stack Trace from Wedged System (ext3, different system than other stack traces)
We ran into this problem again today. This dump is from a different server
with a like configuration as the other crash data I have submitted. The system
was under an NFS load while sending backups to the tape server. The system was
slow but usable up until I issued an rm command to delete 3-4 files, at which
time the system became quite wedged. I included both a "foreach bt" and
"foreach bt -l" in the attachment. Unlike the other dumps, several shell
commands are also shown in a hung state, such as ls, w, and rm.
More information about the hang which occurs when doing a crash dump or
alt-sysrq-t stack trace.
* When the stack trace is dumped in conjunction with a crash dump using
alt-sysrq-c, the dump works okay, but the stacktrace hangs after displaying
10-20 lines.
* When the stack trace is dumped by doing a alt-sysrq-t, about 10-20 lines of
backtrace are dumped, and I get "<6>NETDEV WATCHDOG: eth0: transmit timed out"
at which point it hangs as before.
* If I don't load the netdump module, the alt-sysrq-t sequence dumps a full
backtrace without problem.
Perhaps netdump doesn't like my NIC? I am using a 10/100/1000 Intel 82546EB
Gigabit Ethernet (e1000).
I believe this issue is separate from my NFS hanging problem; perhaps this
should be a different bug?
Created attachment 94059 [details]
Another Wedged System (crash foreach bt)
We had a few more system "hangs" today. If I booted the system with NFS
disabled, the system would function fine. Within 30 seconds of starting NFS
daemons, it would become wedged. I found that a single NFS client (Solaris 9,
NFS over TCP, 8k blocks) was the only NFS client generating traffic to this
system at the time of its problems. In the case of the attached backtrace
data, I waited almost 3 hours before getting the crash dump. I also collected
a network capture (snoop) of the NFS traffic, but right off hand nothing looks
unusual. Please advise if you want me to upload it (it's quite big).
Created attachment 94067 [details]
Backtrace from System Hung for 14 Hours
This system hung during NFS load as before. I left it in a wedged state
overnight and collected the core file this morning. In addition to the
behavior I have previously reported, the system also logged the following
errors on the console:
rpc-svc/tcp: nfsd: sent only -107 bytes of 124 - shutting down socket
rpc-svc/tcp: nfsd: sent only -107 bytes of 124 - shutting down socket
rpc-svc/tcp: nfsd: sent only -107 bytes of 1304 - shutting down socket
rpc-svc/tcp: nfsd: sent only -107 bytes of 1304 - shutting down socket
Attached is a "foreach bt" and "foreach bt -l" from this crash dump.
It appears that all the nfsds are hanging in the ext3 journal code.
To make this clearer, could you please have crash load the
nfsd.o and ext3.o modules before doing the stack trace. The
steps, on the machine that is hanging, would be:
1) enter crash
2) mod -S # this loads all the needed modules
3) If step 2 is successful, skip to step 6; if not, go to step 4
4) mod -s nfsd /lib/modules/kernel-version/kernel/fs/nfsd/nfsd.o
5) mod -s ext3 /lib/modules/kernel-version/kernel/fs/ext3/ext3.o
6) foreach bt
Created attachment 94091 [details]
Two Stack Traces with Loaded Modules
Attached are two backtraces where I ran "mod -S" as requested. Since I had
suspected the ext3 journal for the deadlock, I had mounted the storage using
ext2 (just for the NFS exported filesystems) and was also able to get a hang.
A backtrace from a system using ext3 and a different system using ext2 are both
included.
From the latest bt_withmods.txt, ext3 trace:
nfsd pid 1021 is stuck waiting for IO:
PID: 1021 TASK: f5bcc000 CPU: 0 COMMAND: "nfsd"
#0 [f5bcdd58] context_switch at 2125a04
#1 [f5bcdd70] schedule at 2123c02
#2 [f5bcddb4] io_schedule at 2124f45
#3 [f5bcddc0] __wait_on_buffer at 21623c9
#4 [f5bcddf8] journal_flushpage at f884a3c3
#5 [f5bcde2c] ext3_flushpage at f885ed18
kjournald pid 36 is waiting on IO:
PID: 36 TASK: f6408000 CPU: 1 COMMAND: "kjournald"
#0 [f6409dc8] context_switch at 2125a04
#1 [f6409de0] schedule at 2123c02
#2 [f6409e24] io_schedule at 2124f45
#3 [f6409e30] __wait_on_buffer at 21623c9
#4 [f6409e68] journal_commit_transaction at f884ba3b
#5 [f6409ff0] kernel_thread_helper at 210956b
as are pids 295, 297 and 299. The other kjournald threads are all idle except
for pid 298 which appears to be waiting for a transaction to complete. All the
other ext3 waiters may be blocked behind these ones.
In the ext2 trace, nfsd pid 722 is waiting for IO (it's waiting for a free
request, so the IO queue is already full); pids 725 and 729 are in the same
state, and the others are blocked on VFS locks (perhaps ones held by the
blocked nfsds). Sadc (pid 993) is blocked inside ext2 waiting for IO to
complete.
In both cases it looks like there are critical tasks stalled waiting on block
device IOs. Was the system 100% idle in each case? If so, the root cause looks
like a storage-layer problem, not a filesystem problem.
What storage is being used on the various filesystems here --- the root, and the
NFS-exported filesystems? What IO controllers and device drivers?
Controller seems to be a 3ware one according to the backtraces
The storage in each system is a pair of 3ware controllers. In one system, it
is a pair of the Escalade 8500 controllers; in the other two systems, it is a
pair of Escalade 7500 controllers. Each controller has 8 250 GB IDE drives.
From each 3ware controller, one drive is presented as JBOD, 7 drives as a RAID 5
LUN. The OS image is software mirrored across the two JBOD disks, and the
remaining 2 RAID 5 volumes are NFS exported. The device driver for these 3ware
devices is the release included with RHEL3. The firmware on the 3ware
controllers was upgraded to the latest release as one of the early
troubleshooting steps when the problems started occurring.
It had occurred to me last week that this may be a 3ware driver problem, so I
downloaded and tested the 2.4.22 kernel from www.kernel.org which includes the
most recent version of the 3ware driver. My test-case which reliably
reproduces the problem within 10-20 seconds with RHEL 3 was not able to
reproduce the problem with 2.4.22. However, when using that www.kernel.org
release under the systems typical load, I was able to get similar hangs. To
see if the newer release of the 3ware driver accounts for this improvement, I
installed the www.kernel.org release of the 2.4.21 kernel and re-ran my tests,
getting similar results. Therefore, it seems the problem exists in the
www.kernel.org release as well as in RHEL3 beta, but the problem seems
easier to reproduce in the RHEL3 beta.
To respond to the question about the system being idle: when a 'vmstat' is left
running during the problem, it usually continues to display information. See
the first section of this bugzilla report for an example of what vmstat shows
while the problem is occurring [the output is from RHEL3 beta 1; the vmstat
output from RHEL3 beta 2 is similar, however the "wa" column does not exceed %]
Created attachment 94145 [details]
profiling output from op_time
To supplement current information, I gathered some profiling information and
ran op_time -l. The system was impaired, but not completely wedged at this
time (otherwise, I would not have been able to run the profiling commands.)
Update from Barber, Michael (UNIX Force Inc):
I've been pursuing the hardware also. Since there are three systems
with similar symptoms, it seems that instead of a bad component,
perhaps I have a bad combination of hardware + firmware + drivers.
One of the first troubleshooting steps I did while running RH8
was to update the firmware on the 3ware cards to the newest
release. However, when looking at the driver version I can see
that RHEL does not have the most recent 3ware driver. Therefore,
I down-reved the 3ware firmware to a level which matches the
RHEL driver. This did not help.
I checked the BIOS levels and found that my system with a
tyan motherboard was up to date, but my other two systems with
supermicro motherboards were behind on the BIOS, so I updated
them. No help.
I've also gone through the BIOS settings and disabled any
"advanced" functions, such as hyper-threading, etc. No help.
I opened a ticket with 3ware, and their initial response is
also that I must have some bad hardware. They suggested that
it may be incompatible memory. I'll check into this tomorrow. I
perhaps made a bad assumption that the VAR we bought these from
knew what kind of memory was needed, and that the system would
not pass the memory test if the memory was bad. I'll see if
either of these are bad assumptions tomorrow.
Created attachment 94393 [details]
A simple script to test responsiveness and throughput
I have done further testing and have more information to share.
* Since there was concern that we may have an incompatible collection
of hardware, I did some research and found that the model of motherboard
we are using is not on 3ware's HCL. I purchased a different motherboard
that is on their list and re-tested, getting similar results.
* I borrowed a 3ware 7506 card from one of the local vendors to see if
the problem occurs with more recent hardware (we have 7500 and 8500 cards.)
* Heavy local IO seems to be able to temporarily wedge the system, but not
to the same extent NFS does. Perhaps many readers/writers influence
this behavior? Since local IO doesn't seem to wedge the system as
bad, I put together a scripted test which generates a lot of IO to
both 3ware controllers and times how long it takes to do misc tasks
like ps, ls, and creating, renaming, and deleting files. For the sake of comparison,
I tested this against the kernel.org release of the 2.4.21 and 2.4.22
kernels. To simulate an interactive user, the script sleeps between
some operations, the sleep time total of the script is 4 minutes,
so everything beyond 4 minutes is how long the script took to run the
other commands. Results (runtime listed in minutes, the adjusted
runtime is runtime - 4 minutes to remove deliberate "sleeps"):
Kernel Runtime Adjusted Runtime
ext3 + 2.4.21-1.1931.2.409.entsmp 16.92 12.92
ext3 + 2.4.21-1.1931.2.409.enthugemem 15.85 11.85
ext3 + 2.4.21 6.45 2.45
ext3 + 2.4.22 4.25 .25
So, it seems the open source 2.4.22 kernel running on this system
(same ext3 filesystem and no hardware changes or tuning) is
about 42 times more responsive than RHEL3b2 when under high IO load
with my collection of hardware.
I also put together a scripted throughput test that just does a
simple dd write to each of the two 3ware filesystems. The throughput
was nearly the same for each kernel, except that the system was not
very responsive when running the RHEL3 kernels during the throughput
test. Here are the results (with absolutely no tuning done):
ext3 + 2.4.21 34.75 MB/sec
ext3 + 2.4.22 34.64 MB/sec
ext3 + 2.4.21-1.1931.2.409.enthugemem 34.42 MB/sec
ext3 + 2.4.21-1.1931.2.409.entsmp 38.28 MB/sec
Since it seems we are not going to be able to solve this problem
with RHEL3 beta, and I have so far not been able to reproduce problems
with the kernel.org 2.4.22 kernel, we are exploring its use as an
alternative to RHEL. Performance with the XFS filesystem add-on also
seems to be quite a bit better than with ext3.
I suggest we leave this ticket open for the next week. If I
encounter any serious problems with the kernel.org 2.4.22 kernel
we will then re-evaluate if we want to continue pursuing this
RHEL 3 issue.
Please try:
elvtune -r 1 -w 1 -b 1 /dev/xxx
for each spindle and see if that helps -- it has helped in some of our
other cases.
Have you tried this with the RHEL 3 Release Candidate?
Have you tried the elvtune suggestion?
I have not tried the RHEL release candidate, nor have I tried the elvtune
suggestion. As I explained in my last post, I decided to try using the
kernel.org 2.4.22 kernel with XFS. Since converting, all three of these
systems have been 100% stable (just over one month now.) Unfortunately, I
could never keep RHEL stable for more than a day or two on these same systems.
I'd like to test the latest build out and see if the issue has been fixed, but
the window has closed on this project since we have finally landed on a
reasonably stable solution. Sorry.
Thanks for all the effort those involved put into trying to resolve this
problem. If none of your other customers have run into this issue, feel free
to close this bug.
No updates for one year. If there is further interest in this issue,
please re-test with the most recent RHEL 3 update.