Description of problem:
Machine hangs every few days. Sysrq shows a panic.

Version-Release number of selected component (if applicable):
2.4.21-27.0.2.ELsmp

How reproducible:
Always (if you wait long enough)

Steps to Reproduce:
1. Put regression test in loop
2. Wait a few days

Actual results:
System hangs

Expected results:
System does not hang

Additional info:
We filed a support ticket with RH (Service Request: 474076) on Jan 28 and have been going back and forth with them since then with little to show for it. At the suggestion of RH, we have turned on nmi_watchdog and turned off the APIC:

% cat /proc/cmdline
ro root=LABEL=/ console=tty0 console=ttyS0,9600n8 nmi_watchdog=2 noapic

(A sketch of the corresponding boot-loader entry appears after the traces below.)

Postmortem analysis of logs shows a variety of different activities at the time of the crash. In some cases a large number of Java processes were running. In at least one case, the only thing running was a large rm -rf. In several cases sysrq-t shows a panic in page-fault code. Three examples are included below.

One other interesting fact is that in several cases the machine started showing erratic behavior an hour or so before the crash, for example a crash of the Java virtual machine or a null "this" pointer.

The best guess would appear to be some sort of race condition in the paging code in the kernel. Any advice would be greatly appreciated.

Trace 1 (Feb 16):

java          R 00000002     0 16639  16638                     (NOTLB)
Call Trace:
[<c01d1aca>] generic_make_request [kernel] 0xea (0xd24a3c44)
[<c0123f14>] schedule [kernel] 0x2f4 (0xd24a3c58)
[<c01d1b69>] submit_bh_rsector [kernel] 0x49 (0xd24a3c6c)
[<c016593d>] write_some_buffers [kernel] 0x17d (0xd24a3c9c)
[<c0165975>] write_unlocked_buffers [kernel] 0x25 (0xd24a3d40)
[<c0165a9a>] sync_buffers [kernel] 0x1a (0xd24a3d4c)
[<c0165c47>] fsync_dev [kernel] 0x27 (0xd24a3d64)
[<c0165daf>] sys_sync [kernel] 0xf (0xd24a3d78)
[<c0128b97>] panic [kernel] 0x187 (0xd24a3d80)
[<c010c68c>] die [kernel] 0xac (0xd24a3d98)
[<c011fffe>] do_page_fault [kernel] 0x30e (0xd24a3dac)
[<c013ee35>] vm_set_pte [kernel] 0x75 (0xd24a3dd0)
[<c0142001>] do_wp_page [kernel] 0xa81 (0xd24a3dfc)
[<c011fcf0>] do_page_fault [kernel] 0x0 (0xd24a3e68)
[<c017e3e1>] d_lookup [kernel] 0x71 (0xd24a3ea4)
[<c017288b>] cached_lookup [kernel] 0x1b (0xd24a3ee0)
[<c0172fc8>] link_path_walk [kernel] 0x428 (0xd24a3ef0)
[<c010cf98>] default_do_nmi [kernel] 0x98 (0xd24a3f1c)
[<c0173549>] path_lookup [kernel] 0x39 (0xd24a3f30)
[<c0173b0e>] open_namei [kernel] 0x7e (0xd24a3f40)
[<c0123f14>] schedule [kernel] 0x2f4 (0xd24a3f4c)
[<c0162ff3>] filp_open [kernel] 0x43 (0xd24a3f70)
[<c0163423>] sys_open [kernel] 0x53 (0xd24a3fa8)

Trace 2 (Feb 19):

kswapd        D 00000001  4592    11      1    12    10        (L-TLB)
Call Trace:
[<c0123f14>] schedule [kernel] 0x2f4 (0xc667bda8)
[<c01247d2>] sleep_on [kernel] 0x52 (0xc667bdec)
[<f885db58>] log_wait_commit_Rsmp_c80020b3 [jbd] 0x68 (0xc667be1c)
[<f8875120>] ext3_sync_fs [ext3] 0x0 (0xc667be30)
[<f8875157>] ext3_sync_fs [ext3] 0x37 (0xc667be34)
[<c016b2e9>] sync_supers [kernel] 0x129 (0xc667be44)
[<c010c820>] do_invalid_op [kernel] 0x0 (0xc667be58)
[<c0165c87>] fsync_dev [kernel] 0x67 (0xc667be60)
[<c0165daf>] sys_sync [kernel] 0xf (0xc667be74)
[<c0128b97>] panic [kernel] 0x187 (0xc667be7c)
[<c010c68c>] die [kernel] 0xac (0xc667be94)
[<c010c887>] do_invalid_op [kernel] 0x67 (0xc667bea8)
[<c01566c4>] rebalance_dirty_zone [kernel] 0x164 (0xc667bed4)
[<c0125ed4>] context_switch [kernel] 0xa4 (0xc667bee4)
[<c01520f2>] free_block [kernel] 0x32 (0xc667bef0)
[<c01566c4>] rebalance_dirty_zone [kernel] 0x164 (0xc667bf7c)
[<c0156c0b>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc667bfac)
[<c0156d38>] kswapd [kernel] 0x68 (0xc667bfd0)
[<c0156cd0>] kswapd [kernel] 0x0 (0xc667bfe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xc667bff0)

Trace 3 (Feb 20):

rm            R 00000002     0 25118  24409                    (NOTLB)
Call Trace:
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093db18)
[<c0123f14>] schedule [kernel] 0x2f4 (0xd093db2c)
[<c01341fb>] del_timer_sync [kernel] 0x1b (0xd093db5c)
[<c0134f6d>] schedule_timeout [kernel] 0x6d (0xd093db70)
[<c010d04d>] do_nmi [kernel] 0x2d (0xd093db84)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dba4)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dbb4)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dbc8)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dbd8)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dbe0)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dc00)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dc10)
[<c010cf98>] default_do_nmi [kernel] 0x98 (0xd093dc30)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dc44)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dc4c)
[<c011d5f0>] smp_apic_timer_interrupt [kernel] 0x0 (0xd093dc60)
[<c010cf98>] default_do_nmi [kernel] 0x98 (0xd093dc64)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dc84)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dc94)
[<c010cf98>] default_do_nmi [kernel] 0x98 (0xd093dcb4)
[<c0134933>] update_process_time_intertick [kernel] 0x53 (0xd093dcd0)
[<c0134bcb>] update_process_times_statistical [kernel] 0x7b (0xd093dcf4)
[<c011d5f0>] smp_apic_timer_interrupt [kernel] 0x0 (0xd093dd20)
[<c011d5f0>] smp_apic_timer_interrupt [kernel] 0x0 (0xd093dd38)
[<c011d660>] smp_apic_timer_interrupt [kernel] 0x70 (0xd093dd3c)
[<c010d04d>] do_nmi [kernel] 0x2d (0xd093dd40)
[<c016a5c5>] .text.lock.buffer [kernel] 0x4d (0xd093dd84)
[<c0165daf>] sys_sync [kernel] 0xf (0xd093dda0)
[<c0128b97>] panic [kernel] 0x187 (0xd093dda8)
[<c010c68c>] die [kernel] 0xac (0xd093ddc0)
[<c011fffe>] do_page_fault [kernel] 0x30e (0xd093ddd4)
[<c01669f0>] getblk [kernel] 0x60 (0xd093dde4)
[<f886c71d>] ext3_getblk [ext3] 0xad (0xd093de04)
[<f8857cc8>] do_get_write_access [jbd] 0x328 (0xd093de38)
[<f8857cc8>] do_get_write_access [jbd] 0x328 (0xd093de48)
[<c011fcf0>] do_page_fault [kernel] 0x0 (0xd093de90)
[<f8860068>] .rodata.str1.1 [jbd] 0x5e8 (0xd093dec0)
[<c0147552>] __remove_inode_page [kernel] 0x12 (0xd093decc)
[<c01475fd>] remove_inode_page [kernel] 0x2d (0xd093dee4)
[<c01477ca>] truncate_complete_page [kernel] 0x3a (0xd093def0)
[<c01478e7>] truncate_list_pages [kernel] 0xd7 (0xd093df00)
[<f8858bbe>] journal_stop_Rsmp_74af6844 [jbd] 0x17e (0xd093df08)
[<c0147abd>] truncate_inode_pages [kernel] 0x4d (0xd093df34)
[<f887d980>] ext3_sops [ext3] 0x0 (0xd093df40)
[<c0180b94>] iput [kernel] 0x1a4 (0xd093df4c)
[<c0174ed5>] vfs_unlink [kernel] 0x185 (0xd093df68)
[<c0175149>] sys_unlink [kernel] 0x119 (0xd093df84)
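For reference, the nmi_watchdog=2 and noapic parameters shown in /proc/cmdline above would have been set on the kernel line in the boot loader. A minimal sketch of what the entry might look like, assuming GRUB legacy as shipped with RHEL 3 (the disk/partition names here are illustrative, not taken from the affected machines):

% cat /boot/grub/grub.conf
default=0
timeout=5
title Red Hat Enterprise Linux (2.4.21-27.0.2.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-27.0.2.ELsmp ro root=LABEL=/ console=tty0 console=ttyS0,9600n8 nmi_watchdog=2 noapic
        initrd /initrd-2.4.21-27.0.2.ELsmp.img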
Created attachment 111346 [details]
Full description of problem

The original bug report seems to be missing its first half. This is the whole thing.
Not sure if this is related, but I too have a group of machines doing heavy NFS I/O along with Java JVMs that panic every couple of days. Sometimes they get hosed before crapping out, but most of the time they just panic. Some of the panic call traces refer to kswapd, others to nfs_* and sys_* functions, others to do_page_fault and company. I'm running U3-patched systems with the U4 kernel. We only seem to experience this sort of instability on heavy NFS clients. I'm running a battery of 'cat "foo" > /proc/sysrq-trigger'-style commands every 2 seconds on these hosts to get more info about my panics (roughly the loop sketched below). Not sure if this is related, but I feel your pain. :-) -charles.
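The battery is nothing fancy, just a loop writing sysrq command characters to /proc/sysrq-trigger; a rough sketch (the command letters are the standard ones: t = task dump, m = memory info), run as root:

#!/bin/sh
# Sketch of the sysrq battery described above: dump state every
# 2 seconds so the serial console holds recent output at panic time.
echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
while true; do
    echo t > /proc/sysrq-trigger   # sysrq-t: dump task states
    echo m > /proc/sysrq-trigger   # sysrq-m: dump memory info
    sleep 2
done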
Thanks, I appreciate your sympathy. Misery loves company :-). Perhaps with a large enough critical mass of victims, we can make some progress on getting this thing solved.

Just for the record, we've pretty much ruled out NFS as being related to our problem. We are seeing these crashes on two machines running the same user workload. The workload does not touch any NFS file systems, and just to make sure, we configured one of the machines not to mount *any* NFS file systems. In fact, we were seeing NFS problems with the previous kernel, which is why we upgraded to 2.4.21-27.0.2.ELsmp; we have been seeing these crashes ever since.

Unfortunately, the crashes only happen once every few days, and when they do, no crash file is produced and sysrq often doesn't work.
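For anyone who wants to double-check the same thing on their own machines, confirming that no NFS file systems are mounted is a one-liner (ordinary commands, nothing specific to our setup):

% mount -t nfs            # lists currently mounted NFS file systems, if any
% grep nfs /proc/mounts   # should produce no output on the NFS-free machine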
Created attachment 111936 [details]
Transcript of a series of sysrq commands

This is the transcript of a series of sysrq commands entered on a hung system. More detailed comments are included in the next attachment.
Created attachment 111937 [details]
Discussion of attachment 111936 [details]

This is a discussion and analysis of the sysrq output contained in attachment 111936 [details].
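For anyone trying to collect similar data: with console=ttyS0,9600n8 as in our cmdline, one way (among several) to capture such a transcript is to log the serial line from a second machine and trigger sysrq with a serial BREAK. A sketch, assuming the cu utility from uucp; the transcript file name is made up:

% script -a sysrq-transcript.txt   # log everything that follows
% cu -s 9600 -l /dev/ttyS0         # attach to the hung machine's serial line
~%break                            # cu escape: send a BREAK to the hung box
t                                  # then the sysrq key (t = task dump) within a few seconds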
Our support services are really a far more appropriate place to deal with reports of this nature; please pursue the problem there. They will be more familiar with other problems that might be related to this one and will be able to help to capture the information necessary to route this to the correct place in engineering once a footprint has been identified. For now, there is not enough information in this report to identify the problem; full oops traces (not just sysrq-t) would be needed for a start. Thanks, Stephen.