Red Hat Bugzilla – Bug 103015
kernel locks up under high NFS load
Last modified: 2007-11-30 17:06:57 EST
Description of problem:
Customer is having serious system hangs when putting RHEL 3 Beta 1 under high
NFS load. The choice of TCP vs. UDP does not seem to affect the problem, although
it takes longer to trigger with TCP.
The customer upgraded to the 405 kernel at my request. They got a kernel panic
within about an hour of testing. They had not yet set up netdump so they copied
down some of the panic output:
EIP: 0060:[<02144dc1>] Not tainted
EIP is at do_flushpage [kernel] 0x11 (2.4.21-1.1931.2.405.enthugemem)
Process rm (pid: 3564, stackpage=0e453000)
Call Trace: [<02144e60>] truncate_complete_page [kernel] 0x70 (0xe453ee8)
[<02144f57>] truncate_list_pages [kernel] 0xe7 (0xe453ef8)
[<0214511d>] truncate_inode_pages [kernel] 0x4d (0xe453f34)
[<f886ff80>] ext3_sops [ext3] 0x0 (0xe453f40)
[<0217cca2>] iput [kernel] 0x1a2 (0xe453f4c)
[<02171495>] vfs_unlink [kernel] 0x185 (0xe453f68)
[<02171709>] sys_unlink [kernel] 0x119 (0xe453f84)
Code: Bad EIP value.
Kernel panic: Fatal exception
Under the 405 kernel, even with NFS over TCP, we still get freezing under high
NFS load. If a vmstat command is already running when it occurs, we see:
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id
0 3 0 1058364 42524 6753164 0 0 0 1696 243 30 0 100
0 3 0 1058364 42524 6753140 0 0 0 1648 278 32 0 0
0 3 0 1058364 42524 6753140 0 0 0 1440 246 26 0 0
0 3 0 1058364 42524 6753132 0 0 0 1840 271 28 0 100
0 3 0 1058364 42524 6752748 0 0 0 1960 270 31 0 0
0 3 0 1058364 42524 6752652 0 0 0 1832 261 118 0 0
0 3 0 1058424 42524 6752480 0 0 0 1712 255 193 0 0
0 3 0 1058492 42524 6752408 0 0 0 1316 248 96 0 0
0 3 0 1058552 42524 6752352 0 0 0 1552 267 80 0 100
0 3 0 1058580 42524 6752324 0 0 0 1508 259 51 0 0
0 3 0 1058580 42524 6752324 0 0 0 1508 268 23 0 0
0 3 0 1058596 42524 6752308 0 0 0 1532 254 25 0 0
0 3 0 1058924 42524 6751976 0 0 0 2152 278 41 0 0
0 3 0 1058996 42524 6751908 0 0 0 1696 272 91 0 0
0 3 0 1059128 42524 6751772 0 0 0 1376 248 51 0 100
0 3 0 1059188 42524 6751716 0 0 0 1648 268 53 0 100
0 3 0 1059308 42524 6751596 0 0 0 1924 256 89 0 0
0 3 0 1059376 42524 6751524 0 0 0 1300 254 63 0 0
0 3 0 1059436 42524 6751468 0 0 0 1896 253 59 0 0
However, if we stop the vmstat, any new command typed will hang. Login from
console or ssh also hangs. From remote, we are able to run an rpcinfo -p
against it, but not anything that engages NFS daemons such as showmount -e.
The customer has been able to reproduce this on three different machines. I have
instructed them to try it with the 409 hugemem and SMP kernel.
What tests are they running to cause this hang/panic?
Also have them install the kernel-debuginfo rpm and
then run crash to see where the nfsd threads are hanging,
by typing the crash command "foreach nfsd bt".
do_flushpage looks like:
static int do_flushpage(struct page *page, unsigned long offset)
{
	int (*flushpage) (struct page *, unsigned long);

	flushpage = page->mapping->a_ops->flushpage;
... which comes out as:
0x2144db0 <do_flushpage>: sub $0xc,%esp
0x2144db3 <do_flushpage+3>: mov 0x10(%esp,1),%edx
0x2144db7 <do_flushpage+7>: mov 0x14(%esp,1),%ecx
0x2144dbb <do_flushpage+11>: mov 0x8(%edx),%eax
0x2144dbe <do_flushpage+14>: mov 0x1c(%eax),%eax
0x2144dc1 <do_flushpage+17>: mov 0x18(%eax),%eax
so we've oopsed dereferencing page->mapping->a_ops. Do we have mapping set and
a_ops clear here?
We really need the rest of the oops trace to be sure. Just copying down a bit
of it doesn't give us the whole picture. If this is reproducible, then getting
as many complete traces as possible will be a great help.
The customer has managed to get a full crash dump (all 6.5 GB of it), and ran
the command that Steve requested. The output of that command is attached as a
file, since it's ~1750 lines.
Created attachment 93967 [details]
Output from crash analysis
Created attachment 93974 [details]
updated "foreach bt" output
This attachment contains clean backtrace output for all tasks
when the Alt-Sysrq-C was entered.
On a related note, the customer had reported that the netdump
operation hung during the post-dump show_state() call while printing
all of the back traces. So I had him just do an Alt-Sysrq-T during
runtime, which he said would hang the same way. For clarification
I asked him:
>> Ok -- so after entering Alt-Sysrq-T, you get partial output, and
>> from then on you can no longer use the console?
> That is correct. The only way I can recover the system at that
> point is to use the hardware reset button.
I'm not sure how this relates to the bug at hand. But it's hard
to see where it would block; show_state() takes the tasklist lock
for reading, starts printing out task stuff, but then hangs somewhere.
Should he try to find out where it's hung with nmi_watchdog?
Or just make it the subject of another bugzilla?
Created attachment 94008 [details]
Stack Trace from Wedged System (ext3)
The system became wedged again today under high NFS load. I collected
a crash dump and have included a crash "foreach bt" as an attachment.
Created attachment 94009 [details]
Stack Trace from Wedged System (ext2)
Since the last two crash dumps have looked like things may be wedged on
ext3 journal updates, I remounted the storage as ext2 and re-tested. Under
high NFS load, I got a similar system hang as before. Attached is the crash
"foreach bt" from this test.
Created attachment 94010 [details]
netdump hang sequence of events
As noted earlier, when dumping a stack trace using alt-sysrq-t or dumping a
core with alt-sysrq-c, the system(s) hang during the stack dump, usually about
30-40 lines into it. I ran tcpdump on the netdump server. This attachment
shows the last several packets transmitted by the panic'ing system, then the
timeout logged on the netdump server, etc. The hostnames have been changed to
protect the innocent.
Created attachment 94037 [details]
Stack Trace from Wedged System (ext3, different system than other stack traces)
We ran into this problem again today. This dump is from a different server
with a like configuration as the other crash data I have submitted. The system
was under an NFS load while sending backups to the tape server. The system was
slow but usable up until I issued an rm command to delete 3-4 files, at which
time the system became quite wedged. I included both a "foreach bt" and
"foreach bt -l" in the attachment. Unlike the other dumps, several shell
commands are also shown in a hung state, such as ls, w, and rm.
More information about the hang which occurs when doing a crash dump or
alt-sysrq-t stack trace.
* When the stack trace is dumped in conjunction with a crash dump using
alt-sysrq-c, the dump works okay, but the stacktrace hangs after displaying
10-20 lines.
* When the stack trace is dumped by doing a alt-sysrq-t, about 10-20 lines of
backtrace are dumped, and I get "<6>NETDEV WATCHDOG: eth0: transmit timed out"
at which point it hangs as before.
* If I don't load the netdump module, the alt-sysrq-t sequence dumps a full
backtrace without problem.
Perhaps netdump doesn't like my NIC? I am using a 10/100/1000 Intel 82546EB
Gigabit Ethernet (e1000).
I believe this issue is separate from my NFS hanging problem; perhaps this
should be a different bug?
Created attachment 94059 [details]
Another Wedged System (crash foreach bt)
We had a few more system "hangs" today. If I booted the system with NFS
disabled, the system would function fine. Within 30 seconds of starting NFS
daemons, it would become wedged. I found that a single NFS client (Solaris 9,
NFS over TCP, 8k blocks) was the only NFS client generating traffic to this
system at the time of its problems. In the case of the attached backtrace
data, I waited almost 3 hours before getting the crash dump. I also collected
a network capture (snoop) of the NFS traffic, but right off hand nothing looks
unusual. Please advise if you want me to upload it (it's quite big).
Created attachment 94067 [details]
Backtrace from System Hung for 14 Hours
This system hung during NFS load as before. I left it in a wedged state
overnight and collected the core file this morning. In addition to the
behavior I have previously reported, the system also logged the following
errors on the console:
rpc-svc/tcp: nfsd: sent only -107 bytes of 124 - shutting down socket
rpc-svc/tcp: nfsd: sent only -107 bytes of 124 - shutting down socket
rpc-svc/tcp: nfsd: sent only -107 bytes of 1304 - shutting down socket
rpc-svc/tcp: nfsd: sent only -107 bytes of 1304 - shutting down socket
Attached is a "foreach bt" and "foreach bt -l" from this crash dump.
It appears that all the nfsds are hanging in the ext3 journal code.
To make this clearer, could you please have crash load the
nfsd.o and ext3.o modules before doing the stack trace. The
steps, on the machine that is hanging, would be:
1) enter crash
2) mod -S # this loads all the needed modules
3) If step 2 is successful, skip to step 6; if not, go to step 4
4) mod -s nfsd /lib/modules/kernel-version/kernel/fs/nfsd/nfsd.o
5) mod -s ext3 /lib/modules/kernel-version/kernel/fs/ext3/ext3.o
6) foreach bt
Created attachment 94091 [details]
Two Stack Traces with Loaded Modules
Attached are two backtraces where I ran "mod -S" as requested. Since I had
suspected the ext3 journal for the deadlock, I had mounted the storage using
ext2 (just for the NFS exported filesystems) and was also able to get a hang.
A backtrace from a system using ext3 and a different system using ext2 are both
included.
From the latest bt_withmods.txt, ext3 trace:
nfsd pid 1021 is stuck waiting for IO:
PID: 1021 TASK: f5bcc000 CPU: 0 COMMAND: "nfsd"
#0 [f5bcdd58] context_switch at 2125a04
#1 [f5bcdd70] schedule at 2123c02
#2 [f5bcddb4] io_schedule at 2124f45
#3 [f5bcddc0] __wait_on_buffer at 21623c9
#4 [f5bcddf8] journal_flushpage at f884a3c3
#5 [f5bcde2c] ext3_flushpage at f885ed18
kjournald pid 36 is waiting on IO:
PID: 36 TASK: f6408000 CPU: 1 COMMAND: "kjournald"
#0 [f6409dc8] context_switch at 2125a04
#1 [f6409de0] schedule at 2123c02
#2 [f6409e24] io_schedule at 2124f45
#3 [f6409e30] __wait_on_buffer at 21623c9
#4 [f6409e68] journal_commit_transaction at f884ba3b
#5 [f6409ff0] kernel_thread_helper at 210956b
as are pids 295, 297 and 299. The other kjournald threads are all idle except
for pid 298 which appears to be waiting for a transaction to complete. All the
other ext3 waiters may be blocked behind these ones.
In the ext2 trace, nfsd pid 722 is waiting for IO (it's waiting for a free
request, so the IO queue is already full); pids 725 and 729 are in the same
state, and the others are blocked on VFS locks (perhaps ones held by the
blocked nfsds). Sadc (pid 993) is blocked inside ext2 waiting for IO to
complete.
In both cases it looks like there are critical tasks stalled waiting on block
device IOs. Was the system 100% idle in each case? If so, the root cause looks
like a storage-layer problem, not a filesystem problem.
What storage is being used on the various filesystems here --- the root, and the
NFS-exported filesystems? What IO controllers and device drivers?
Controller seems to be a 3ware one according to the backtraces
The storage in each system is a pair of 3ware controllers. In one system, it
is a pair of the Escalade 8500 controllers; in the other two systems, it is a
pair of Escalade 7500 controllers. Each controller has 8 250 GB IDE drives.
From each 3ware controller, one drive is presented as JBOD, 7 drives as a RAID 5
LUN. The OS image is software mirrored across the two JBOD disks, and the
remaining 2 RAID 5 volumes are NFS exported. The device driver for these 3ware
devices is the release included with RHEL3. The firmware on the 3ware
controllers was upgraded to the latest release as one of the early
troubleshooting steps when the problems started occurring.
It had occurred to me last week that this may be a 3ware driver problem, so I
downloaded and tested the 2.4.22 kernel from www.kernel.org which includes the
most recent version of the 3ware driver. My test-case which reliably
reproduces the problem within 10-20 seconds with RHEL 3 was not able to
reproduce the problem with 2.4.22. However, when using that www.kernel.org
release under the systems typical load, I was able to get similar hangs. To
see if the newer release of the 3ware driver accounts for this improvement, I
installed the www.kernel.org release of the 2.4.21 kernel and re-ran my tests,
getting similar results. Therefore, it seems the problem exists in the
www.kernel.org release as well as in RHEL3 beta, but the problem seems
easier to reproduce in the RHEL3 beta.
To respond to the question about the system being idle: when a 'vmstat' is left
running during the problem, it usually continues to display information. See
the first section of this bugzilla report for an example of what vmstat shows
while the problem is occurring [the output is from RHEL3 beta 1; the vmstat
output from RHEL3 beta 2 is similar, however the "wa" column does not exceed %]
Created attachment 94145 [details]
profiling output from op_time
To supplement current information, I gathered some profiling information and
ran op_time -l. The system was impaired, but not completely wedged at this
time (otherwise, I would not have been able to run the profiling commands.)
Update from Barber, Michael (UNIX Force Inc):
I've been pursuing the hardware also. Since there are three systems
with similar symptoms, it seems that instead of a bad component,
perhaps I have a bad combination of hardware + firmware + drivers.
One of the first troubleshooting steps I did while running RH8
was to update the firmware on the 3ware cards to the newest
release. However, when looking at the driver version I can see
that RHEL does not have the most recent 3ware driver. Therefore,
I down-reved the 3ware firmware to a level which matches the
RHEL driver. This did not help.
I checked the BIOS levels and found that my system with a
tyan motherboard was up to date, but my other two systems with
supermicro motherboards were behind on the BIOS, so I updated
them. No help.
I've also gone through the BIOS settings and disabled any
"advanced" functions, such as hyper-threading, etc. No help.
I opened a ticket with 3ware, and their initial response is
also that I must have some bad hardware. They suggested that
it may be incompatible memory. I'll check into this tomorrow. I
perhaps made a bad assumption that the VAR we bought these from
knew what kind of memory was needed, and that the system would
not pass the memory test if the memory was bad. I'll see if
either of these are bad assumptions tomorrow.
Created attachment 94393 [details]
A simple script to test responsiveness and throughput
I have done further testing and have more information to share.
* Since there was concern that we may have an incompatible collection
of hardware, I did some research and found that the model of motherboard
we are using is not on 3ware's HCL. I purchased a different motherboard
that is on their list and re-tested, getting similar results.
* I borrowed a 3ware 7506 card from one of the local vendors to see if
the problem occurs with more recent hardware (we have 7500 and 8500 cards.)
* Heavy local IO seems to be able to temporarily wedge the system, but not
to the same extent NFS does. Perhaps many readers/writers influence
this behavior? Since local IO doesn't seem to wedge the system as
bad, I put together a scripted test which generates a lot of IO to
both 3ware controllers and times how long it takes to do misc tasks
like ps, ls, and creating, renaming, and deleting files. For the sake of comparison,
I tested this against the kernel.org release of the 2.4.21 and 2.4.22
kernels. To simulate an interactive user, the script sleeps between
some operations, the sleep time total of the script is 4 minutes,
so everything beyond 4 minutes is how long the script took to run the
other commands. Results (runtime listed in minutes, the adjusted
runtime is runtime - 4 minutes to remove deliberate "sleeps"):
Kernel Runtime Adjusted Runtime
ext3 + 2.4.21-1.1931.2.409.entsmp 16.92 12.92
ext3 + 2.4.21-1.1931.2.409.enthugemem 15.85 11.85
ext3 + 2.4.21 6.45 2.45
ext3 + 2.4.22 4.25 .25
So, it seems the open source 2.4.22 kernel running on this system
(same ext3 filesystem and no hardware changes or tuning) is
about 42 times more responsive than RHEL3b2 when under high IO load
with my collection of hardware.
I also put together a scripted throughput test that just does a
simple dd write to each of the two 3ware filesystems. The throughput
was nearly the same for each kernel, except that the system was not
very responsive when running the RHEL3 kernels during the throughput
test. Here are the results (with absolutely no tuning done):
ext3 + 2.4.21 34.75 MB/sec
ext3 + 2.4.22 34.64 MB/sec
ext3 + 2.4.21-1.1931.2.409.enthugemem 34.42 MB/sec
ext3 + 2.4.21-1.1931.2.409.entsmp 38.28 MB/sec
Since it seems we are not going to be able to solve this problem
with RHEL3 beta, and I have so far not been able to reproduce problems
with the kernel.org 2.4.22 kernel, we are exploring its use as an
alternative to RHEL. Performance with the XFS filesystem add-on also
seems to be quite a bit better than with ext3.
I suggest we leave this ticket open for the next week. If I
encounter any serious problems with the kernel.org 2.4.22 kernel
we will then re-evaluate if we want to continue pursuing this
RHEL 3 issue.
Please try:
elvtune -r 1 -w 1 -b 1 /dev/xxx
for each spindle and see if that helps -- it has helped in some of our
other cases.
Have you tried this with the RHEL 3 Release Candidate?
Have you tried the elvtune suggestion?
I have not tried the RHEL release candidate, nor have I tried the elvtune
suggestion. As I explained in my last post, I decided to try using the
kernel.org 2.4.22 kernel with XFS. Since converting, all three of these
systems have been 100% stable (just over one month now.) Unfortunately, I
could never keep RHEL stable for more than a day or two on these same systems.
I'd like to test the latest build out and see if the issue has been fixed, but
the window has closed on this project since we have finally landed on a
reasonably stable solution. Sorry.
Thanks for all the effort those involved put into trying to resolve this
problem. If none of your other customers have run into this issue, feel free
to close this bug.
No updates for one year. If there is further interest in this issue,
please re-test with the most recent RHEL 3 update.