Description of problem:
Machine hangs every few days. Sysrq shows a panic.

Version-Release number of selected component (if applicable):
2.4.21-27.0.2.ELsmp

How reproducible:
Always (if you wait long enough)

Steps to Reproduce:
1. Put regression test in loop
2. Wait a few days

Actual results:
System hangs

Expected results:
System does not hang

Additional info:
We filed a support ticket with RH (Service Request: 474076) on Jan 28 and have been going back and forth with them since then with little to show for it. At the suggestion of RH, we have turned on nmi_watchdog and turned off the APIC:

% cat /proc/cmdline
ro root=LABEL=/ console=tty0 console=ttyS0,9600n8 nmi_watchdog=2 noapic

(A sketch of the corresponding boot-loader entry appears after the traces below.)

Postmortem analysis of logs shows a variety of different activities at the time of the crash. In some cases a large number of Java processes were running. In at least one case, the only thing running was a large rm -rf. In several cases sysrq-t shows a panic in page-fault code. Three examples are included below.

One other interesting fact is that in several cases the machine started showing erratic behavior an hour or so before the crash, for example a crash of the Java virtual machine or a null "this" pointer.

The best guess would appear to be some sort of race condition in the paging code in the kernel. Any advice would be greatly appreciated.

Trace 1 (Feb 16):

java          R 00000002     0 16639  16638                     (NOTLB)
Call Trace:
[<c01d1aca>] generic_make_request [kernel] 0xea (0xd24a3c44)
[<c0123f14>] schedule [kernel] 0x2f4 (0xd24a3c58)
[<c01d1b69>] submit_bh_rsector [kernel] 0x49 (0xd24a3c6c)
[<c016593d>] write_some_buffers [kernel] 0x17d (0xd24a3c9c)
[<c0165975>] write_unlocked_buffers [kernel] 0x25 (0xd24a3d40)
[<c0165a9a>] sync_buffers [kernel] 0x1a (0xd24a3d4c)
[<c0165c47>] fsync_dev [kernel] 0x27 (0xd24a3d64)
[<c0165daf>] sys_sync [kernel] 0xf (0xd24a3d78)
[<c0128b97>] panic [kernel] 0x187 (0xd24a3d80)
[<c010c68c>] die [kernel] 0xac (0xd24a3d98)
[<c011fffe>] do_page_fault [kernel] 0x30e (0xd24a3dac)
[<c013ee35>] vm_set_pte [kernel] 0x75 (0xd24a3dd0)
[<c0142001>] do_wp_page [kernel] 0xa81 (0xd24a3dfc)
[<c011fcf0>] do_page_fault [kernel] 0x0 (0xd24a3e68)
[<c017e3e1>] d_lookup [kernel] 0x71 (0xd24a3ea4)
[<c017288b>] cached_lookup [kernel] 0x1b (0xd24a3ee0)
[<c0172fc8>] link_path_walk [kernel] 0x428 (0xd24a3ef0)
[<c010cf98>] default_do_nmi [kernel] 0x98 (0xd24a3f1c)
[<c0173549>] path_lookup [kernel] 0x39 (0xd24a3f30)
[<c0173b0e>] open_namei [kernel] 0x7e (0xd24a3f40)
[<c0123f14>] schedule [kernel] 0x2f4 (0xd24a3f4c)
[<c0162ff3>] filp_open [kernel] 0x43 (0xd24a3f70)
[<c0163423>] sys_open [kernel] 0x53 (0xd24a3fa8)

Trace 2 (Feb 19):

kswapd        D 00000001  4592    11      1    12    10        (L-TLB)
Call Trace:
[<c0123f14>] schedule [kernel] 0x2f4 (0xc667bda8)
[<c01247d2>] sleep_on [kernel] 0x52 (0xc667bdec)
[<f885db58>] log_wait_commit_Rsmp_c80020b3 [jbd] 0x68 (0xc667be1c)
[<f8875120>] ext3_sync_fs [ext3] 0x0 (0xc667be30)
[<f8875157>] ext3_sync_fs [ext3] 0x37 (0xc667be34)
[<c016b2e9>] sync_supers [kernel] 0x129 (0xc667be44)
[<c010c820>] do_invalid_op [kernel] 0x0 (0xc667be58)
[<c0165c87>] fsync_dev [kernel] 0x67 (0xc667be60)
[<c0165daf>] sys_sync [kernel] 0xf (0xc667be74)
[<c0128b97>] panic [kernel] 0x187 (0xc667be7c)
[<c010c68c>] die [kernel] 0xac (0xc667be94)
[<c010c887>] do_invalid_op [kernel] 0x67 (0xc667bea8)
[<c01566c4>] rebalance_dirty_zone [kernel] 0x164 (0xc667bed4)
[<c0125ed4>] context_switch [kernel] 0xa4 (0xc667bee4)
[<c01520f2>] free_block [kernel] 0x32 (0xc667bef0)
[<c01566c4>] rebalance_dirty_zone [kernel] 0x164 (0xc667bf7c)
[<c0156c0b>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc667bfac)
[<c0156d38>] kswapd [kernel] 0x68 (0xc667bfd0)
[<c0156cd0>] kswapd [kernel] 0x0 (0xc667bfe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xc667bff0)

Trace 3 (Feb 20):

rm            R 00000002     0 25118  24409                    (NOTLB)
Call Trace:
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093db18)
[<c0123f14>] schedule [kernel] 0x2f4 (0xd093db2c)
[<c01341fb>] del_timer_sync [kernel] 0x1b (0xd093db5c)
[<c0134f6d>] schedule_timeout [kernel] 0x6d (0xd093db70)
[<c010d04d>] do_nmi [kernel] 0x2d (0xd093db84)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dba4)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dbb4)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dbc8)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dbd8)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dbe0)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dc00)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dc10)
[<c010cf98>] default_do_nmi [kernel] 0x98 (0xd093dc30)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dc44)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dc4c)
[<c011d5f0>] smp_apic_timer_interrupt [kernel] 0x0 (0xd093dc60)
[<c010cf98>] default_do_nmi [kernel] 0x98 (0xd093dc64)
[<c0129f5e>] profile_hook [kernel] 0x2e (0xd093dc84)
[<c011dc90>] nmi_watchdog_tick [kernel] 0x30 (0xd093dc94)
[<c010cf98>] default_do_nmi [kernel] 0x98 (0xd093dcb4)
[<c0134933>] update_process_time_intertick [kernel] 0x53 (0xd093dcd0)
[<c0134bcb>] update_process_times_statistical [kernel] 0x7b (0xd093dcf4)
[<c011d5f0>] smp_apic_timer_interrupt [kernel] 0x0 (0xd093dd20)
[<c011d5f0>] smp_apic_timer_interrupt [kernel] 0x0 (0xd093dd38)
[<c011d660>] smp_apic_timer_interrupt [kernel] 0x70 (0xd093dd3c)
[<c010d04d>] do_nmi [kernel] 0x2d (0xd093dd40)
[<c016a5c5>] .text.lock.buffer [kernel] 0x4d (0xd093dd84)
[<c0165daf>] sys_sync [kernel] 0xf (0xd093dda0)
[<c0128b97>] panic [kernel] 0x187 (0xd093dda8)
[<c010c68c>] die [kernel] 0xac (0xd093ddc0)
[<c011fffe>] do_page_fault [kernel] 0x30e (0xd093ddd4)
[<c01669f0>] getblk [kernel] 0x60 (0xd093dde4)
[<f886c71d>] ext3_getblk [ext3] 0xad (0xd093de04)
[<f8857cc8>] do_get_write_access [jbd] 0x328 (0xd093de38)
[<f8857cc8>] do_get_write_access [jbd] 0x328 (0xd093de48)
[<c011fcf0>] do_page_fault [kernel] 0x0 (0xd093de90)
[<f8860068>] .rodata.str1.1 [jbd] 0x5e8 (0xd093dec0)
[<c0147552>] __remove_inode_page [kernel] 0x12 (0xd093decc)
[<c01475fd>] remove_inode_page [kernel] 0x2d (0xd093dee4)
[<c01477ca>] truncate_complete_page [kernel] 0x3a (0xd093def0)
[<c01478e7>] truncate_list_pages [kernel] 0xd7 (0xd093df00)
[<f8858bbe>] journal_stop_Rsmp_74af6844 [jbd] 0x17e (0xd093df08)
[<c0147abd>] truncate_inode_pages [kernel] 0x4d (0xd093df34)
[<f887d980>] ext3_sops [ext3] 0x0 (0xd093df40)
[<c0180b94>] iput [kernel] 0x1a4 (0xd093df4c)
[<c0174ed5>] vfs_unlink [kernel] 0x185 (0xd093df68)
[<c0175149>] sys_unlink [kernel] 0x119 (0xd093df84)
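For reference, the nmi_watchdog=2 and noapic parameters shown in /proc/cmdline above would have been set on the kernel line in the boot loader. A minimal sketch of what the entry might look like, assuming GRUB legacy as shipped with RHEL 3 (the disk/partition names here are illustrative, not taken from the affected machines):

% cat /boot/grub/grub.conf
default=0
timeout=5
title Red Hat Enterprise Linux (2.4.21-27.0.2.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-27.0.2.ELsmp ro root=LABEL=/ console=tty0 console=ttyS0,9600n8 nmi_watchdog=2 noapic
        initrd /initrd-2.4.21-27.0.2.ELsmp.img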
Created attachment 111346 [details]
Full description of problem

The original bug report seems to be missing its first half. This is the whole thing.
Not sure if this is related, but I too have a group of machines doing heavy NFS I/O along with Java JVMs that panic every couple of days. Sometimes they get hosed before crapping out, but most of the time they just panic. Some of the panic call traces refer to kswapd, others to nfs_* and sys_* functions, others to do_page_fault and company. I'm running U3-patched systems with the U4 kernel. We only seem to experience this sort of instability on heavy NFS clients. I'm running a battery of 'cat "foo" > /proc/sysrq-trigger'-style commands every 2 seconds on these hosts to get more info about my panics (roughly the loop sketched below). Not sure if this is related, but I feel your pain. :-) -charles.
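The battery is nothing fancy, just a loop writing sysrq command characters to /proc/sysrq-trigger; a rough sketch (the command letters are the standard ones: t = task dump, m = memory info), run as root:

#!/bin/sh
# Sketch of the sysrq battery described above: dump state every
# 2 seconds so the serial console holds recent output at panic time.
echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
while true; do
    echo t > /proc/sysrq-trigger   # sysrq-t: dump task states
    echo m > /proc/sysrq-trigger   # sysrq-m: dump memory info
    sleep 2
done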
Thanks, I appreciate your sympathy. Misery loves company :-). Perhaps with a large enough critical mass of victims, we can make some progress on getting this thing solved.

Just for the record, we've pretty much ruled out NFS as being related to our problem. We are seeing these crashes on two machines running the same user workload. The workload does not touch any NFS file systems, and just to make sure, we configured one of the machines not to mount *any* NFS file systems. In fact, we were seeing NFS problems with the previous kernel, which is why we upgraded to 2.4.21-27.0.2.ELsmp; we have been seeing these crashes ever since.

Unfortunately, the crashes only happen once every few days, and when they do, no crash file is produced and sysrq often doesn't work.
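For anyone who wants to double-check the same thing on their own machines, confirming that no NFS file systems are mounted is a one-liner (ordinary commands, nothing specific to our setup):

% mount -t nfs            # lists currently mounted NFS file systems, if any
% grep nfs /proc/mounts   # should produce no output on the NFS-free machine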
Created attachment 111936 [details]
Transcript of a series of sysrq commands

This is the transcript of a series of sysrq commands entered on a hung system. More detailed comments are included in the next attachment.
Created attachment 111937 [details]
Discussion of attachment 111936 [details]

This is a discussion and analysis of the sysrq output contained in attachment 111936 [details].
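For anyone trying to collect similar data: with console=ttyS0,9600n8 as in our cmdline, one way (among several) to capture such a transcript is to log the serial line from a second machine and trigger sysrq with a serial BREAK. A sketch, assuming the cu utility from uucp; the transcript file name is made up:

% script -a sysrq-transcript.txt   # log everything that follows
% cu -s 9600 -l /dev/ttyS0         # attach to the hung machine's serial line
~%break                            # cu escape: send a BREAK to the hung box
t                                  # then the sysrq key (t = task dump) within a few seconds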
Our support services are really a far more appropriate place to deal with reports of this nature; please pursue the problem there. They will be more familiar with other problems that might be related to this one and will be able to help to capture the information necessary to route this to the correct place in engineering once a footprint has been identified. For now, there is not enough information in this report to identify the problem; full oops traces (not just sysrq-t) would be needed for a start. Thanks, Stephen.