Bug 431909 - Oops: Kernel access of bad area, sig: 11 [#1] after a long string of oomkills
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: ppc64
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: David Howells
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-02-07 19:36 UTC by Mike Gahagan
Modified: 2009-11-11 20:00 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-11-11 20:00:29 UTC
Target Upstream Version:
Embargoed:



Description Mike Gahagan 2008-02-07 19:36:24 UTC
Description of problem:
While running the /kernel/storage/lvm/snapshot_remove test, I hit this oops:

Unable to handle kernel paging request for data at address 0x7a0000002e677a00
Faulting instruction address: 0xd000000000317dd0
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=128 NUMA 
Modules linked in: loop autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6
xfrm_nalgo crypto_api dm_multipath snd_powermac snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm
snd_page_alloc snd_timer snd soundcore i2c_core parport_pc lp parport sg ide_cd
cdrom e1000 shpchp dm_snapshot dm_zero dm_mirror dm_mod ipr libata sd_mod
scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
NIP: D000000000317DD0 LR: D000000000242008 CTR: D000000000317DB4
REGS: c0000000436b7500 TRAP: 0300   Not tainted  (2.6.18-78.el5)
MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 84004484  XER: 00000000
DAR: 7A0000002E677A00, DSISR: 0000000040000000
TASK = c00000002168b490[29123] 'lvcreate' THREAD: c0000000436b4000 CPU: 1
GPR00: D000000000242008 C0000000436B7780 D0000000003345A0 7A0000002E677A00 
GPR04: C00000002168B760 0000000000002534 0000000024004482 0000000000472900 
GPR08: 0000000000000000 D00000000024AF48 C0000000004325E8 D000000000317DB4 
GPR12: D0000000002431C0 C000000000465000 000000000FDF9614 000000000FDE5890 
GPR16: 000000000FDF9790 00000000C138FD09 00000000FF84A420 000000000FDE5864 
GPR20: 00000000FF84A438 000000001070B290 D000000000864000 C000000001927AF0 
GPR24: C000000001927B00 0000000000000050 FFFFFFFFFFFFFFF4 D000000004E50080 
GPR28: D000000000860160 7A0000002E677A00 D000000000252A80 C000000001927A80 
NIP [D000000000317DD0] .dm_io_client_destroy+0x1c/0x54 [dm_mod]
LR [D000000000242008] .persistent_destroy+0x24/0x5c [dm_snapshot]
Call Trace:
[C0000000436B7780] [C0000000436B7820] 0xc0000000436b7820 (unreliable)
[C0000000436B7810] [D000000000242008] .persistent_destroy+0x24/0x5c [dm_snapshot]
[C0000000436B78A0] [D0000000002416F8] .snapshot_ctr+0x484/0x5bc [dm_snapshot]
[C0000000436B7970] [D00000000031364C] .dm_table_add_target+0x1b4/0x394 [dm_mod]
[C0000000436B7A40] [D0000000003160D4] .table_load+0xfc/0x240 [dm_mod]
[C0000000436B7B10] [D000000000316F80] .ctl_ioctl+0x29c/0x318 [dm_mod]
[C0000000436B7D00] [D000000000317020] .dm_compat_ctl_ioctl+0x24/0x34 [dm_mod]
[C0000000436B7D70] [C00000000012DCBC] .compat_sys_ioctl+0x158/0x3b4
[C0000000436B7E30] [C0000000000086A4] syscall_exit+0x0/0x40
Instruction dump:
eb61ffd8 eb81ffe0 eba1ffe8 7c0803a6 4e800020 7c0802a6 fba1ffe8 7c7d1b78 
f8010010 f821ff71 60000000 60000000 <e8630000> 48002455 e8410028 e87d0008 
 <0>Kernel panic - not syncing: Fatal exception
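
The faulting instruction and address in the dump above can be cross-checked by hand. The Python sketch below is not part of the original report. It decodes the word marked <e8630000> in the instruction dump and compares the DAR with GPR03. The helper names are made up for illustration. The address-range check is an assumption based on the ppc64 layout visible in this trace, where core kernel addresses start with 0xC and module addresses with 0xD. The DSISR bit values are the ones the Linux powerpc headers call DSISR_NOHPTE and DSISR_ISSTORE.

```python
import struct

# Values copied from the oops above.
DAR = 0x7A0000002E677A00       # faulting data address
GPR3 = 0x7A0000002E677A00      # GPR03 at the time of the fault
FAULTING_WORD = 0xE8630000     # the word marked <...> in the instruction dump
DSISR = 0x40000000

DSISR_NOHPTE = 0x40000000      # no translation found for the address
DSISR_ISSTORE = 0x02000000     # set when the faulting access was a store


def decode_ds_form(word):
    """Split a 32-bit PowerPC DS-form word into (opcode, rt, ra, disp, xo)."""
    opcode = (word >> 26) & 0x3F
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    disp = word & 0xFFFC       # DS field with two zero bits appended
    if disp & 0x8000:          # sign-extend the 16-bit displacement
        disp -= 0x10000
    xo = word & 0x3
    return opcode, rt, ra, disp, xo


def describe(word):
    """Return assembly text for the opcode-58 loads, or None for anything else."""
    opcode, rt, ra, disp, xo = decode_ds_form(word)
    if opcode == 58:
        mnemonic = {0: "ld", 1: "ldu", 2: "lwa"}.get(xo)
        if mnemonic:
            return f"{mnemonic} r{rt},{disp}(r{ra})"
    return None


def looks_like_kernel_address(addr):
    """Assumption: kernel and module addresses in this trace start with 0xC or 0xD."""
    return (addr >> 60) in (0xC, 0xD)


if __name__ == "__main__":
    print("instruction:", describe(FAULTING_WORD))
    print("DAR equals r3:", DAR == GPR3)
    print("plausible kernel pointer:", looks_like_kernel_address(DAR))
    print("no translation:", bool(DSISR & DSISR_NOHPTE))
    print("was a store:", bool(DSISR & DSISR_ISSTORE))
    # ppc64 here is big-endian, so show the pointer bytes in memory order.
    print("pointer bytes:", struct.pack(">Q", DAR))
```

If the decoding is right, the fault is the first load in dm_io_client_destroy through its pointer argument in r3. The bytes in that pointer ("z", then ".gz") look like string data rather than a kernel address. That would be consistent with persistent_destroy passing an uninitialised or overwritten client pointer on the snapshot_ctr error path, but the dump alone does not prove it.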


Version-Release number of selected component (if applicable):
2.6.18-78.el5 kernel, 0206.nightly tree

How reproducible:
This test has produced oom kills before, but only on ppc. This is the first time
I've seen this panic. 

Steps to Reproduce:
1. Kick off an RHTS job that includes /kernel/storage/lvm/snapshot_remove.
2. Wait. (I'm not sure whether any of the preceding tests this system ran have
anything to do with this issue.)
3. The test can also be checked out of CVS and run manually if desired.
  
Actual results:
panic

Expected results:
test runs to completion and passes. 

Additional info:

No vmcore is available; however, the console log is available at:
http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=1870623

Comment 1 Mike Gahagan 2008-02-20 16:05:24 UTC
So far I have been unable to reproduce this with the 0212.0 tree and the -79
kernel. At this point it looks like either something got fixed in between, or
the panic depends on an interaction with other tests that were run as part of
the same job.


Comment 2 Mike Gahagan 2008-03-05 15:43:37 UTC
I have not seen the panic happen again, but I'm reproducing the OOM kills that
lead up to it pretty regularly. See the ppc recipe under:

http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=16951

The oom kills start with the snapshot_remove test.


Comment 3 David Howells 2008-03-09 11:50:36 UTC
That RHTS job also shows x86_64 failed.  Is there any way to find out if that 
OOM'd also?

Comment 4 Mike Gahagan 2008-03-11 20:23:31 UTC
The failure you are seeing for x86_64 was for a different test, libhugetlbfs.
It looked like that test failed because it could not allocate enough huge pages,
and I don't think it has anything to do with what I was seeing on the
snapshot_remove test.

By the way, I opened BZ 436494 to address the OOM kills.

So far the only systems I have seen have any difficulty running the
snapshot_remove test are powerpc systems with approximately 2 GB of RAM or less.



