Bug 156937 - kernel: ENOMEM in journal_alloc_journal_head, retrying Dell 6650
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Brian Brock
Reported: 2005-05-05 11:16 EDT by Kaushik Lad
Modified: 2007-11-30 17:06 EST
Last Closed: 2005-05-05 11:39:59 EDT
Description Kaushik Lad 2005-05-05 11:16:40 EDT
+++ This bug was initially created as a clone of Bug #99025 +++

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
Under heavy NFS load (from several rsyncs) on a large 556GB filesystem the
following messages appear in /var/log/messages:

Jul 11 14:44:00 srs kernel: ENOMEM in journal_get_undo_access_Rsmp_767cdac6, retrying.
Jul 11 14:44:07 srs kernel: ENOMEM in journal_alloc_journal_head, retrying.
Jul 11 14:44:10 srs kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again.
Jul 11 14:44:18 srs kernel: ENOMEM in journal_get_undo_access_Rsmp_767cdac6, retrying.
Jul 11 14:44:28 srs kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again.
Jul 11 14:45:30 srs kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again.
Jul 11 14:45:38 srs kernel: ENOMEM in do_get_write_access, retrying.

etc...
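
To gauge how often notices like the ones above occur, a log file can be scanned for them. The following is a minimal sketch in C, not an existing tool: is_journal_enomem and count_journal_enomem are hypothetical helper names, and the match rule (a "kernel:" tag plus the word "ENOMEM") is an assumption based only on the excerpt shown here.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if a syslog line looks like one of the ENOMEM retry notices
 * quoted above: it carries the "kernel:" tag and mentions ENOMEM.
 * The rule is an assumption drawn from the excerpt, not a kernel contract. */
int is_journal_enomem(const char *line)
{
    return strstr(line, "kernel:") != NULL && strstr(line, "ENOMEM") != NULL;
}

/* Counts matching lines on an already opened stream.
 * Lines longer than the buffer are read in pieces, which is good enough
 * for ordinary syslog output. */
long count_journal_enomem(FILE *in)
{
    char buf[1024];
    long n = 0;

    while (fgets(buf, sizeof buf, in) != NULL) {
        if (is_journal_enomem(buf))
            n++;
    }
    return n;
}
```

For example, opening /var/log/messages with fopen() and passing the stream to count_journal_enomem() gives a rough count of these notices, which can then be compared against the times of heavy load.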

At the end of an hour or more of these messages the machine hangs. It can still
respond to ping, but it isn't possible to login on the console or via ssh. The
only resolution appears to be switching it off and on again.

I see a similar bug has been logged concerning the tg3 driver for the Broadcom
NetXtreme Ethernet card, but I am not using the tg3 driver - I am using the
bcm5700 driver...




Version-Release number of selected component (if applicable):
Linux 2.4.9-e.25enterprise

How reproducible:
Sometimes

Steps to Reproduce:
1. A heavy network load, typically produced by running several rsyncs.

Actual Results:  Machine hangs indefinitely

Expected Results:  Other possibly relevant information: the machine has 4
processors with hyperthreading enabled and 32 GB of memory.

Additional info:


Hi, my name is Kaushik.
I am using a Dell PowerEdge 2600 with 12 GB of memory. This server was recently
upgraded to kernel "Linux findb2 2.4.9-e.57enterprise #1 SMP Thu Dec 2 20:45:51
EST 2004 i686 unknown". It is the master server for Veritas NetBackup 4.5.
This morning my production backup failed at 4:26 AM with Veritas error
status 41, "network connection timed out". When I checked the
/var/log/messages file on the master server, I saw the entry "kernel: ENOMEM
in journal_get_undo_access_Rsmp_767cdac6, retrying".
I checked bug 99025, which says that the problem was resolved in the
2.4.9-e.57enterprise kernel, but I received the message again.
Please help us investigate this in more depth so that the message does not
appear again. I can't reproduce this.
Comment 1 Stephen Tweedie 2005-05-05 11:39:59 EDT
To repeat the information in bug 99025:

> The journal messages are symptoms of heavy memory pressure but do not, by
> themselves, signal any failures.  There are plenty of other places in the 
> kernel where similar memory allocators simply retry silently under that 
> pressure: the journal code is just one of the very few places that logs 
> this condition.

> These messages are entirely harmless in themselves.  The hang will need more
> information to diagnose.  Can you capture an alt-sysrq-t or alt-sysrq-p
> backtrace when the hang occurs?

So the presence of that "ENOMEM ..., retrying" message is an indication that
we're under memory pressure, but it is NOT a bug in and of itself.  It's just a
status indication.  Serious memory pressure can occur from time to time; this
message indicates that ext3 is *surviving* that pressure, not that it is
failing.  The ext3 filesystem that is giving this message retries automatically
until the memory allocation succeeds.
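
As an illustration only, the retry behaviour described above can be sketched in userspace C. This is not the ext3 or journal source: pressured_alloc, simulated_failures_left and alloc_with_retry are hypothetical names, and the failure counter merely stands in for real memory pressure.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical knob: how many upcoming allocations should fail,
 * standing in for a period of memory pressure. */
int simulated_failures_left = 0;

/* Hypothetical allocator that fails while the counter is non-zero. */
void *pressured_alloc(size_t size)
{
    if (simulated_failures_left > 0) {
        simulated_failures_left--;
        return NULL;            /* pretend the allocation hit ENOMEM */
    }
    return malloc(size);
}

/* Keep retrying until the allocation succeeds, logging each failure.
 * The number of retries is reported through *retries. */
void *alloc_with_retry(size_t size, const char *where, int *retries)
{
    void *p;

    assert(size > 0);
    *retries = 0;
    while ((p = pressured_alloc(size)) == NULL) {
        /* The log line is a status report, not an error return. */
        fprintf(stderr, "ENOMEM in %s, retrying.\n", where);
        (*retries)++;
        /* A real implementation would be expected to let other work run
         * here so memory can be freed; this sketch simply loops. */
    }
    return p;
}
```

With simulated_failures_left set to 3, a call such as alloc_with_retry(64, "journal_alloc_journal_head", &retries) prints three notices and then returns a valid pointer. The caller never sees a failure, which is the sense in which the message reports survival under pressure.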

The core of your bug report is that the message is occurring; that's not a bug.
You seem to be mentioning that you are having other problems at the same time;
that _may_ be a bug, but will need opening as a separate bug report, with proper
information about those other symptoms, before we can deal with it.
