179605 – journal_get_undo_access: No memory for committed data

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.3
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Larry Woodman
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-02-01 16:55 UTC by Jeff Burke
Modified:	2007-11-30 22:07 UTC (History)
Fixed In Version:	RHEL4-U5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-07-10 18:55:06 UTC
Target Upstream Version:
Embargoed:

Attachments
/var/log/messages file for issue reported (21.02 KB, text/plain) 2006-02-01 16:57 UTC, Jeff Burke
/var/log/messages (38.24 KB, application/octet-stream) 2006-02-17 10:18 UTC, Tom G. Christensen

Description Jeff Burke 2006-02-01 16:55:23 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
* This is not a regression. I have seen it before but could not reproduce it; it happens very infrequently. *

While running the stress testing suite, the system gets into a state where it cannot recover from OOM kills. The file system is remounted read-only and the system becomes unresponsive.

Once the system gets into this state, the only thing I could do was power it off.
On power-on, the system goes into single-user mode and forces the user to run a manual fsck.

journal_get_undo_access: No memory for committed data
ext3_try_to_allocate_with_rsv: aborting transaction: Out of memory in __ext3_journal_get_undo_access
EXT3-fs error (device md1) in ext3_new_block: Out of memory
Aborting journal on device md1.
ext3_abort called.
EXT3-fs error (device md1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
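
When triaging a /var/log/messages file, the sequence above can be picked out automatically. The following is a minimal sketch in Python, assuming syslog-style text like the excerpt in this report. The function name `find_journal_aborts` and the marker list are illustrative choices, not part of any kernel or Red Hat tool.

```python
import re

# Substrings copied from the log excerpt in this report. The list is
# illustrative and not an exhaustive set of ext3 failure messages.
MARKERS = (
    "journal_get_undo_access: No memory for committed data",
    "Aborting journal on device",
    "Remounting filesystem read-only",
)

# Captures the device name from "Aborting journal on device md1."
DEVICE_RE = re.compile(r"Aborting journal on device (\S+?)\.?$")


def find_journal_aborts(lines):
    """Return (matched_lines, devices) for ext3 journal-abort messages.

    `lines` is any iterable of log lines, for example an open file.
    """
    matched = []
    devices = set()
    for line in lines:
        text = line.rstrip()
        if any(marker in text for marker in MARKERS):
            matched.append(text)
        m = DEVICE_RE.search(text)
        if m:
            devices.add(m.group(1))
    return matched, sorted(devices)
```

It can be called with an open log file, for example `find_journal_aborts(open("/var/log/messages"))`, and returns the matching lines together with the affected device names.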

Version-Release number of selected component (if applicable):
kernel-2.6.9-29.EL.smp

How reproducible:
Sometimes

Steps to Reproduce:
1. Using pe2850he, run the stress kernel rpm test suite.
2. After a period of time this may or may not happen.

Actual Results: * See attached log *

Expected Results: system _should_ be able to recover.

Additional info:

I have several systems with about the same configuration. I have never seen this issue on the other two systems. The big difference on this system is that we are using software RAID level 1.

The other systems are not using RAID.

Comment 1 Jeff Burke 2006-02-01 16:57:36 UTC

Created attachment 123973
/var/log/messages file for issue reported

Comment 2 Larry Woodman 2006-02-01 18:16:24 UTC

Strange, but when this happens it appears that kswapd and callers to
try_to_free_pages() do not run.  No progress reclaiming memory appears to be made.

Larry

Comment 3 Tom G. Christensen 2006-02-17 10:15:34 UTC

We've seen this several times as well, but with the U2 kernel (2.6.9-22.0.2smp).
The machine config is similar to the initial report, but we're using a PERC4/Di
hardware RAID controller.
The problem showed itself during some very heavy filesystem activity.

Comment 4 Tom G. Christensen 2006-02-17 10:18:13 UTC

Created attachment 124804
/var/log/messages

/var/log/messages for my report

Comment 5 Larry Woodman 2007-07-10 17:47:10 UTC

Jeff and Tom, is either of you still seeing this problem on RHEL4?

Larry Woodman

Comment 6 Jeff Burke 2007-07-10 18:43:55 UTC

Larry,
   I have not seen this in quite some time.
Jeff

Comment 7 Larry Woodman 2007-07-10 18:55:06 UTC

Fixes for the old "kswapd0: page allocation failure. order:0, mode:0x0" were
committed to RHEL4 between U3, U4 and U5.  Since these changes were committed, I
don't think we've seen this problem again.

Larry Woodman

Comment 8 Tom G. Christensen 2007-07-11 06:18:48 UTC

If I remember correctly, the problem went away when we turned off dir_index on
the filesystem that caused the problem.
This also gave us a vast performance gain for our test case, which consisted of
millions of small files managed by the Fedora Object Management system.
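
To confirm whether dir_index is still enabled on a filesystem, the feature list printed by `tune2fs -l` can be inspected. The following is a small sketch in Python that only parses text already captured from that command. It assumes the output contains a `Filesystem features:` line, and `has_dir_index` is a hypothetical helper name.

```python
def has_dir_index(tune2fs_output):
    """Return True if 'dir_index' is listed on the 'Filesystem features:' line.

    `tune2fs_output` is the text printed by `tune2fs -l <device>`,
    captured beforehand (assumption: the caller supplies it as a string).
    """
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            return "dir_index" in value.split()
    # No feature line found; treat the feature as not present.
    return False
```

The helper deliberately does not run any command itself, so it can be tried on saved output without root access.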
