Bug 123226

Summary: System hangs with error RPC: sendmsg returned error 12
Product: Red Hat Enterprise Linux 3
Reporter: Michael K. Ellis <mkellis>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Version: 3.0
CC: allykken, ara.t.howard, bchacko, boone_dave, dpa.clt, dwa, gfrankliu, jmtt, k.georgiou, konradr, krafthef, krzeszut, mikem, nhorman, nick, normelton, p.c.m.chiu, perbu, petrides, riel, steved, tao
Target Milestone: ---
Target Release: ---
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-07-12 20:26:43 UTC
Bug Blocks: 170417, 185486, 186960
Attachments:
  * sysrq-m from a hang (flags: none)
  * meminfo & slabinfo trace (flags: none)
  * Log from serial console while machine is running pounder2-01-24 (flags: none)
  * Very simple patch I used to trigger the dump. (flags: none)

Description Michael K. Ellis 2004-05-13 23:53:34 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5)
Gecko/20031007 Firebird/0.7

Description of problem:
We have multiple systems crashing regularly with the error message
'RPC: sendmsg returned error 12', assuming we get any messages at all.
 Systems are dual-processor P4 2.8 Xeon compute servers running heavy
I/O against Sun Solaris 8 NFS servers over gigabit ethernet.  Crashes
happen on both kinds of compute servers: Sun v60x 1u Xeons and IBM
Blades.  The Solaris NFS servers are under heavy strain, but none of
the Solaris clients are crashing, and crashes are intermittent on the
Intel-based gear.

Version-Release number of selected component (if applicable):
kernel-2.4.21-9.0.3.EL

How reproducible:
Didn't try

Steps to Reproduce:
1. Can't reproduce, seems to happen semi-randomly depending on user
compute patterns, and possibly on overall system load.


Additional info:

System logs aren't capturing any of the output, so it's only what
remains on console.

Comment 1 Ernie Petrides 2004-05-14 19:44:40 UTC
Could you please attach console output containing the oops output
or at least a panic message?  If you can't get this with a graphics
console, could you please arrange to use a serial console to collect
this info?

Thanks.  -ernie


Comment 2 Michael K. Ellis 2004-05-18 19:05:46 UTC
We're having trouble getting serial output: the problem primarily
manifests itself on an IBM blade server, which isn't easy to get
serial console from (no serial ports save an aux USB port on the
front, which tends to move with the KVM).

In the mean time, we're seeing more issues.  I have a live window on a
machine that has a hung NFS mount.  All other NFS mounts (including to
the same server) are fine.  NFS server is a SunFire 280R, occasionally
under heavy load.  It isn't right now, but the mount point is hung. 
With the live window I can pull any info from /proc that you might want.

Comment 3 Ian Holsman 2004-06-08 07:25:56 UTC
We're getting the same thing.
I was just on the console when it happened.

hundreds of these messages, and then the ASR kicked in and rebooted
the box.

we also have this machine mounting an NFS drive (as well as exporting
an NFS mount). The NFS drive it is mounting is running RH7.3; this box
is running EL3.

our reboots are happening twice a day (and we have a console), so
I'll try to get some more info about it.
BUT I didn't see a panic or an Oops.

Comment 4 Antonio Cruz 2004-06-14 14:39:00 UTC
I have the same information in /var/log/messages after the system
restarts. I was copying several 200M files over NFS to a Sun Solaris
6 box and the system crashed with many RPC error 12 messages on the console.

Jun 11 18:37:56 vega login(pam_unix)[26560]: session opened for user 
root by LOGIN(uid=0)
Jun 11 18:37:56 vega -- root[26560]: ROOT LOGIN ON tty2
Jun 11 18:51:16 vega kernel: RPC: sendmsg returned error 12
Jun 11 18:51:16 vega last message repeated 119 times
Jun 11 18:51:16 vega kernel: RPC: sendmsg returned error 12
Jun 11 18:51:16 vega kernel: RPC: sendmsg returned error 12
Jun 11 18:56:38 vega syslogd 1.4.1: restart.
Jun 11 18:56:38 vega syslog: syslogd startup succeeded
Jun 11 18:56:38 vega kernel: klogd 1.4.1, log source = /proc/kmsg 
started.
Jun 11 18:56:38 vega kernel: Linux version 2.4.21-15.ELsmp 
(bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red 
Hat Linux 3.2.3-34)) #1 SMP Thu Apr 22 00:18:24 EDT 2004

Antonio

Comment 5 Per Andreas Buer 2004-07-30 10:05:22 UTC
I just saw this happen on a system of ours: kernel 2.4.21-15.ELsmp
running some J2EE app (IBM JVM 1.4.1 SR2) with moderate NFS load.

No relevant entries in syslog. The message repeated tens of times per
second, so I am unable to say whether there were any messages preceding it.


Comment 6 Al Lykken 2004-09-05 15:03:34 UTC
We've seen this same problem on our Dual Intel 2.8 Ghz Xeon Systems. 
We also are using gigabit ethernet and NFS.  We have multiple NFS
mounts to a remote HP system.  One or more of the NFS mounts will lock
up or the system itself will lock up with the message "RPC: sendmsg
returned error 12", or sometimes error 101.  We are using the 2.4.21-4
kernel.  The ethernet card is an Intel 82540 Gigabit Ethernet
Controller.  

The following was from the system log when one of our systems hung.
The error messages continued.

Aug 30 08:20:39 db2 kernel: RPC: sendmsg returned error 101
Aug 30 08:20:39 db2 kernel: nfs: RPC call returned error 101
Aug 30 08:20:39 db2 kernel: RPC: sendmsg returned error 101
Aug 30 08:20:39 db2 kernel: nfs: RPC call returned error 101
Aug 30 08:20:39 db2 kernel: RPC: sendmsg returned error 101
Aug 30 08:20:39 db2 kernel: nfs: RPC call returned error 101

Comment 7 jason andrade 2004-09-15 07:33:44 UTC
i'm also seeing this behaviour - so far only on a server that is
writing to the nfs filesystem.

RPC: sendmsg returned error 12

and the system has to be physically reset.

2.4.21-20 here, the nfs client is a single processor P4/2.8 with
hyperthreading turned on, smp kernel, 2G of ram and gigE nfs.

Comment 8 jason andrade 2004-09-15 07:35:25 UTC
sigh, i should have been more precise - a 2RU server that is an NFS
client is the one that is regularly crashing.  this hasn't happened to
us previously with the 2.4.21-15.0.4, 15.0.3, 15, 9.0.3, 9.0.1, 9 and 4
kernels, so i'm not sure if it is something recently introduced.

-jason

Comment 9 Damian Noordink 2004-09-20 02:46:32 UTC
We've had this error occur for the first time on 2 servers running
2.4.21-20 over the weekend.  Servers are dual-processor 2.66 GHz Xeons.

There is no indication in syslog, the message just repeats on the
console and servers are unresponsive - had to power cycle.

There would have been heavy NFS traffic at gigabit speeds when each
server experienced the error.


Comment 10 Damian Noordink 2004-09-23 01:31:56 UTC
Just adding to this, as the crash has repeated overnight.

Two of our servers are crashing with a known bug (bugzilla #123226,
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=123226).

The message "RPC: sendmsg returned error 12" scrolls up the console, and
the server is pingable but unresponsive.  The server has to be power
cycled to recover, ctrl-alt-del doesn't do anything.

kernel 2.4.21-20.EL #1 SMP Wed Aug 18 20:46:40 EDT 2004
We think it is an NFS fault, it has happened twice while an NFS
copy is in progress.  It isn't happening all the time though.
All updates for the channels rhel-i386-as-3, rhel-i386-as-3-extras and
rhel-i386-as-3-devsuite have been applied as of today (Thu 23 Sep AEDT
GMT+10).
Syslog has nothing in it about any crash or error, the last log entry
is from named: update forwarding denied



Comment 11 Steve Dickson 2004-10-14 21:16:07 UTC
I believe we've found some page caching problems in recent RHEL3
kernels. I wonder if this issue is a result of those problems. I'm
going to reassign this to our VM guy to see what he thinks.

Comment 12 Georg Melzer 2004-10-27 06:30:37 UTC
We have the same problem!

We are running an NFS client on Red Hat Enterprise 3 on a dual Xeon
3.2 GHz (patched) (Sun V65x). The NFS server is a Sun Fire 280R with
Solaris 9 (patched).

The NFS client is a web server that mounts the data from the NFS
server (it should be a clustered environment soon).

The NFS client crashed reporting: RPC: sendmsg returned error 12
(running in a loop on the console).
I was able to ping the NFS client web server, but no logins or
responses seem to be possible.

I need an immediate fix for this problem !!! 


Comment 13 Peter Chiu 2004-12-22 09:20:06 UTC
I wonder what the current status of this problem is.

This morning I found one of our dual-Opteron systems hung with
identical symptoms: the system responds to ping, but nothing else.
The console is full of repeated messages of
RPC: sendmsg returned error 12.

Thanks.
Peter

Comment 14 Larry Woodman 2005-02-14 20:55:51 UTC
There were changes in RHEL3-U4 to address ENOMEM failures.  Can you
please try to reproduce this problem with RHEL3-U4 and let me know
if that fixes the problem you are seeing?

Larry Woodman


Comment 15 Norman Elton 2005-02-28 15:26:17 UTC
(In reply to comment #14)
> 
> There were changes in RHEL3-U4 to address ENOMEM failures.  Can you
> please try to reproduce this problem with RHEL3-U4 and let me know
> if that fixes the problem you are seeing?

We are experiencing the same problem on a RHEL3 box with all the
latest patches. Both times, the box has been under heavy I/O to two
SATA drives striped using software RAID. NFS is used, but I don't
believe there was any I/O at the time the system locked up. I could
be wrong on this one though.

Any advice?

Norman

Comment 16 ara howard 2005-03-07 21:15:23 UTC
we too are seeing this problem.  i can trigger it by running a simple
compilation in an nfs-mounted directory.  sysinfo:

  * Red Hat Enterprise Linux WS release 3 (Taroon Update 4)
  * Linux 2.4.21-27.0.2.ELsmp i686

-a

Comment 17 Kevin Krafthefer 2005-03-14 11:44:45 UTC
A customer is seeing this with RHEL3-U4 on an IBM x445

Comment 18 Neil Horman 2005-03-14 14:13:02 UTC
Can anyone attach a copy of /proc/meminfo and /proc/slabinfo when this failure
occurs?  Also, is TCP or UDP being used as the transport here?  Lastly, are
ACLs generally in use here?  Starting with RHEL3-U4, I believe, there are some
ACL tunables which allow you to reduce the amount of memory used when
conducting ACL transactions.  I don't think they are directly related to the
problem, but their use may well reduce memory usage on a loaded system, which
could help avoid or work around the problem.  I believe setting
nfs3_max_acl=255 modifies ACL transactions such that they use an order-0
allocation rather than an order-2 allocation.
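
A minimal sketch of setting such a module option on a RHEL3-era (2.4
kernel, modutils) system; the exact parameter name is uncertain in this
thread (nfs3_max_acl above, nfs3_acl_max_entries in comment 31):

  # Sketch only -- the exact option name is an assumption (see comment 31).
  # Load the nfs module with the reduced ACL entry limit:
  modprobe nfs nfs3_acl_max_entries=255

  # RHEL3 uses modutils, so persist the option in /etc/modules.conf:
  echo "options nfs nfs3_acl_max_entries=255" >> /etc/modules.conf

  # Verify the active value (comment 31 reports it under /proc/sys/fs/nfs):
  cat /proc/sys/fs/nfs/nfs3_acl_max_entries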


Comment 20 Larry Woodman 2005-03-14 15:04:32 UTC

Can someone who is experiencing this problem please get several
AltSysrq-M outputs when the system is incurring these errors?  I need
this data to determine what is going on when this happens.

Thanks, Larry Woodman
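
For anyone collecting this, a minimal sketch (assuming magic SysRq
support is compiled in and, for the remote case, that this kernel
provides /proc/sysrq-trigger):

  # Enable the magic SysRq key:
  echo 1 > /proc/sys/kernel/sysrq

  # On the console keyboard press Alt+SysRq+M; or, if the kernel
  # has /proc/sysrq-trigger, from a shell:
  echo m > /proc/sysrq-trigger

  # The memory report lands on the console and in the kernel log:
  dmesg | tail -60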


Comment 24 Steve Dickson 2005-04-01 13:52:07 UTC
ara,

Have you tried turning down the memory consumption
by setting nfs3_max_acl? This should help...

Comment 25 Frank Liu 2005-04-26 21:46:10 UTC
Same problem here.
dual PIII Intel. 2.4.21-27.0.2.ELsmp
NFS client of some Sun NFS servers, with automount.

system hangs with rapid scrolling of "RPC: sendmsg returned error 12"
on the console. system is still pingable but I can't log in. had to reset.

Comment 26 James Takahashi 2005-06-14 21:27:51 UTC
(In reply to comment #18)
> Can anyone attach a copy of /proc/meminfo and /proc/slabinfo when this failure
> occurs?  

I've been intermittently hitting this problem during testing of RHEL3 Update 5
on an HS40 (8839) blade.  I captured the output of 'while [ 1 ]; do date;
cat /proc/meminfo; cat /proc/slabinfo; sleep 30; done' leading up to the hang,
which I'll attach separately.  Here's what 'top' said at the time of the hang:

 13:09:15  up  2:29,  3 users,  load average: 23.80, 22.75, 20.87
137 processes: 128 sleeping, 9 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   87.4%    0.0%   12.3%   0.1%     0.0%    0.0%    0.0%
           cpu00   82.8%    0.0%   16.9%   0.1%     0.0%    0.0%    0.0%
           cpu01   92.0%    0.0%    7.7%   0.1%     0.0%    0.0%    0.0%
Mem:  2055444k av, 2025760k used,   29684k free,       0k shrd,   17996k buff
                   1549664k actv,  299828k in_d,   36340k in_c
Swap: 2047744k av,   32212k used, 2015532k free                 1829724k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 3425 root      19   0  125M 125M  125M R    30.8  6.2   6:00   0 snake
 3332 root      19   0  125M 125M  125M R    26.4  6.2   6:19   0 snake
12915 root      15   0   556  556   408 D    18.1  0.0   0:34   0 growfiles
 3462 root      24   0  125M 125M  125M R     3.9  6.2   6:00   1 snake
12583 root      25   0  2136 2136   716 R     3.9  0.1   0:41   0 lame
12588 root      25   0  2136 2136   716 R     3.9  0.1   0:42   1 lame
12962 root      25   0   252  252   208 R     3.9  0.0   0:12   0 inconsistency
13062 root      25   0  125M  63M   332 R     3.9  3.1   0:04   1 memxfer5b
12582 root      16   0  1504 1504   420 S     2.9  0.0   0:18   0 dd
   15 root      15   0     0    0     0 DW    0.6  0.0   0:51   0 kjournald
15306 root      15   0    80   80    20 D     0.3  0.0   0:39   0 dd
 8668 root      15   0   744  712   380 R     0.2  0.0   0:21   1 top
12380 root      15   0   440  440   320 D     0.2  0.0   0:00   0 bonnie++
    7 root      15   0     0    0     0 SW    0.1  0.0   0:41   1 kswapd
15303 root      15   0    80   80    20 D     0.1  0.0   0:40   1 dd
    1 root      15   0   108   80    52 S     0.0  0.0   0:07   1 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1


Comment 27 James Takahashi 2005-06-14 21:30:33 UTC
Created attachment 115433 [details]
meminfo & slabinfo trace

Here's the promised meminfo/slabinfo trace

Comment 28 Kevin Taylor 2005-09-19 14:30:21 UTC
Has anyone come up with a resolution for this? We're seeing it also.
Dell PE 6650 server with 4 x 2.7 GHz CPUs + 2 GB RAM. RHEL 3 with all
current updates.


Comment 29 Evan Rempel 2005-09-27 21:19:54 UTC
I made the following changes on camel.comp.uvic.ca

Changed these
% sudo /sbin/sysctl -a | egrep 'core.*[rw]mem'
net.core.rmem_default = 65535
net.core.wmem_default = 65535
net.core.rmem_max = 131071
net.core.wmem_max = 131071

to

% sudo /sbin/sysctl -a | egrep 'core.*[rw]mem'
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.rmem_max = 1048576
net.core.wmem_max = 1048576

which set the default and maximum buffer sizes, in bytes, for reading
from and writing to the network.

This only delays when the issue occurs, but so far the machines seem
to be able to recover fast enough not to reach this limit. I have not
had a recurrence of the problem with these settings.
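
A sketch of making those values persistent across reboots (standard
sysctl mechanics; the values are the ones quoted above):

  # Apply at runtime:
  sysctl -w net.core.rmem_default=524288
  sysctl -w net.core.wmem_default=524288
  sysctl -w net.core.rmem_max=1048576
  sysctl -w net.core.wmem_max=1048576

  # Persist across reboots by appending the same keys to /etc/sysctl.conf:
  echo "net.core.rmem_default = 524288" >> /etc/sysctl.conf
  echo "net.core.wmem_default = 524288" >> /etc/sysctl.conf
  echo "net.core.rmem_max = 1048576" >> /etc/sysctl.conf
  echo "net.core.wmem_max = 1048576" >> /etc/sysctl.conf

  # Reload and confirm:
  sysctl -p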

Comment 30 Neil Horman 2005-10-10 19:13:44 UTC
As Steve and I have asked in comment #18 and comment #24, does this problem
persist when the nfs_max_acl parameter is set to 255 when loading the nfs
module?

Comment 31 James Takahashi 2005-10-21 02:42:17 UTC
(In reply to comment #30)
> As steve and I have asked in comment #18 and comment #24, does this problem
> persist when the nfs_max_acl parameter is set to 255 when insmodding the nfs
module?

I tried modprobing nfs3_acl_max_entries to both 512 and 255 (and verified
in /proc/sys/fs/nfs that the values were actually active), but the same
failure occurs as when the parameter is at its default value of 1024.

Comment 34 Nick 2005-11-14 03:34:53 UTC
I've just seen the same problem twice in the space of 2 days. Console full of
"RPC: sendmsg returned error 12" on the NFS client, ctrl-alt-del not
responding, and the only way to recover the systems is with a power cycle.
The CPU on the machine must have been busy, as the variable-speed fans were
cranked right up.

We were running the NFS server on a 3.5 box and we recently moved it to a 4.2
OS. We didn't see the problem once with the 3.5 server, and it has only
started now that the server is 4.2. The client is 3.5.

The only messages I can find in the syslogs that *might* be related are:
Nov 11 13:12:20 bill kernel: VFS: find_free_dqentry(): Data block full but it
shouldn't.
Nov 11 13:12:20 bill kernel: VFS: Error -5 occured while creating quota.

I presume people are only seeing this problem on NFS clients? And if so, what
kernel, OS, etc. is the NFS server running? I only ask as it seems to have
triggered the problem in our setup.

Comment 36 Dave Boone 2005-12-16 13:19:58 UTC
Has anyone experienced this problem on RHEL3 update 6?  We're experiencing 
this on RHEL3u5 and I'm wondering if upgrading to 6 will alleviate the problem.

Thanks.


Comment 37 Tudor Panaitescu 2005-12-28 17:09:59 UTC
I've just hit the same problem on update 6. Had to manually reset the server.
This happened on an HP blade server, equivalent in hardware to the DL380 G3
(I don't have the spec available right now); I don't think it is
hardware-related, though.

Comment 41 Larry Woodman 2006-02-08 19:57:06 UTC
I cannot reproduce this problem locally.  Can someone please reproduce this
problem with the latest RHEL3 kernel and get me an AltSysrq-M output when this
happens, so I can see whether the problem is memory exhaustion or memory
fragmentation.

Thanks, Larry Woodman.


Comment 43 Joseph L. Hill 2006-02-27 15:25:41 UTC
This bug just occurred on my ProLiant server running Red Hat Linux AS
release 3 (Taroon Update 6). This box NFS-mounts a directory from a Sun
server.

The kernel is 2.4.21-32.0.1.ELsmp.

I see there is a kernel update, 2.4.21-37.0.1, and I am scheduling an update
to apply the new kernel. Hope this helps.

Any new developments on this bug?

If it happens again I'll try the AltSysrq-M for you.

Thanks,
-joe

Comment 44 Dave Boone 2006-02-27 15:36:35 UTC
We were seeing this on Dell PowerEdge 1850s running RHEL 3.0 update 5, until 
we upgraded the Intel GbE NIC driver (e1000).

Can anyone else share which NICs this occurs on?

Thanks,
Dave


Comment 45 Konrad Rzeszutek 2006-03-09 19:41:05 UTC
I narrowed the "RPC: sendmsg returned error 12" message down to the sunrpc
module.

I added a bunch of show_stack() calls to dump the stack when the routine
prints that message, and got these traces:

CPU1:
e1bf7c88 f8a4898f f8a54750 00000001 f8a4b1c1 00000000 00000000 00000009
       e1bf6000 fffffff4 c0000000 d4eee614 e81dc8b4 e808a800 000000ac e2739000
       00001000 e2738000 00001000 eeae7000 00001000 eeae5000 00001000 eeae4000
Call Trace:   [<f8a4898f>] showacpu3 [sunrpc] 0x2f (0xe1bf7c8c)
[<f8a54750>] .rodata.str1.1 [sunrpc] 0xec (0xe1bf7c90)
[<f8a4b1c1>] xprt_sendmsg [sunrpc] 0x221 (0xe1bf7c98)
[<c0159bf2>] __alloc_pages [kernel] 0x132 (0xe1bf7d04)
[<f8a4a060>] do_xprt_transmit [sunrpc] 0x30 (0xe1bf7d4c)
[<f8a47e01>] call_transmit [sunrpc] 0x31 (0xe1bf7d78)
[<f8a4c3f9>] __rpc_execute [sunrpc] 0x109 (0xe1bf7d90)
[<f8a79b7e>] nfs_flush_one [nfs] 0x16e (0xe1bf7dd0)
[<f8a79dc6>] nfs_flush_list [nfs] 0x66 (0xe1bf7e1c)
[<f8a7ac47>] nfs_flush_file [nfs] 0x97 (0xe1bf7e48)
[<f8a79561>] nfs_strategy [nfs] 0x71 (0xe1bf7e74)
[<f8a79865>] nfs_updatepage [nfs] 0x235 (0xe1bf7e90)
[<f8a70a7c>] nfs_commit_write [nfs] 0x4c (0xe1bf7ec4)
[<c014c708>] do_generic_file_write [kernel] 0x268 (0xe1bf7edc)
[<c014cc1f>] generic_file_write [kernel] 0x18f (0xe1bf7f30)
[<f8a70bb9>] nfs_file_write [nfs] 0xa9 (0xe1bf7f64)
[<c0165247>] sys_write [kernel] 0x97 (0xe1bf7f94)
[<c02af06f>] no_timing [kernel] 0x7 (0xe1bf7fc0)

and in another instance:
show_stack()
CPU1:
e1bf7c88 f8a4898f f8a54750 00000001 f8a4b1c1 00000000 00000000 00000009
       e81dc8b4 fffffff4 c0000000 d4eee144 e81dc8b4 f1d02000 000000ac ff1a4000
       00001000 ff1a5000 00001000 ff1a6000 00001000 ff1a7000 00001000 ff1a8000
Call Trace:   [<f8a4898f>] showacpu3 [sunrpc] 0x2f (0xe1bf7c8c)
[<f8a54750>] .rodata.str1.1 [sunrpc] 0xec (0xe1bf7c90)
[<f8a4b1c1>] xprt_sendmsg [sunrpc] 0x221 (0xe1bf7c98)
[<f8a4a060>] do_xprt_transmit [sunrpc] 0x30 (0xe1bf7d4c)
[<f8a47e01>] call_transmit [sunrpc] 0x31 (0xe1bf7d78)
[<f8a4c3f9>] __rpc_execute [sunrpc] 0x109 (0xe1bf7d90)
[<f8a79b7e>] nfs_flush_one [nfs] 0x16e (0xe1bf7dd0)
[<f8a79dc6>] nfs_flush_list [nfs] 0x66 (0xe1bf7e1c)
[<f8a7ac47>] nfs_flush_file [nfs] 0x97 (0xe1bf7e48)
[<f8a79561>] nfs_strategy [nfs] 0x71 (0xe1bf7e74)
[<f8a79865>] nfs_updatepage [nfs] 0x235 (0xe1bf7e90)
[<f8a70a7c>] nfs_commit_write [nfs] 0x4c (0xe1bf7ec4)
[<c014c708>] do_generic_file_write [kernel] 0x268 (0xe1bf7edc)
[<c014cc1f>] generic_file_write [kernel] 0x18f (0xe1bf7f30)
[<f8a70bb9>] nfs_file_write [nfs] 0xa9 (0xe1bf7f64)
[<c0165247>] sys_write [kernel] 0x97 (0xe1bf7f94)
[<c016f849>] sys_fstat64 [kernel] 0x49 (0xe1bf7fa8)
[<c02af06f>] no_timing [kernel] 0x7 (0xe1bf7fc0)

I am going to attach the full serial console output and the patch for sunrpc.

Comment 46 Konrad Rzeszutek 2006-03-09 19:47:41 UTC
Created attachment 125899 [details]
Log from serial console while machine is running pounder2-01-24

Comment 47 Konrad Rzeszutek 2006-03-09 19:49:50 UTC
Created attachment 125900 [details]
Very simple patch I used to trigger the dump.

Comment 49 Larry Woodman 2006-05-08 20:43:32 UTC
The problem here is that there is plenty of free memory, but it is fragmented
badly enough that there is no contiguous 64KB block, so the allocation fails.
The following line from the AltSysrq-M output shows the normal zone has
plenty of free RAM but nothing larger than 32KB.

>>> 20*4kB 125*8kB 221*16kB 22*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5320kB
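
(Checking that line: 20*4 + 125*8 + 221*16 + 22*32 = 80 + 1000 + 3536 + 704
= 5320 kB free, all of it in chunks of 32kB or smaller, while this
allocation needs a contiguous 64kB block.)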

I added a new tuning parameter, /proc/sys/vm/vm-defragment, to RHEL3-U8; it
controls how aggressively the page allocator tries to defragment memory when
an allocation fails due to memory fragmentation.  Please try out the latest
RHEL3-U8 beta kernel and set /proc/sys/vm/vm-defragment to 100.  This should
prevent the RPC code from getting ENOMEM (error 12) failures.

Larry Woodman
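
A minimal sketch of applying that setting, assuming a RHEL3-U8 beta
kernel that actually exposes the tunable described above:

  # Confirm the kernel provides the tunable (RHEL3-U8 beta or later):
  ls /proc/sys/vm/vm-defragment

  # Make the page allocator defragment aggressively when a
  # higher-order allocation fails:
  echo 100 > /proc/sys/vm/vm-defragment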


Comment 56 Ernie Petrides 2006-07-12 20:26:43 UTC
Closing as NOTABUG since the problem is not reproducible.

Comment 60 Michael Martinez 2006-10-11 13:18:55 UTC
Hi,

We have an AS update 3 server that has been experiencing this problem. 
Identical symptoms to what several people describe above. 

It was running kernel 2.4.21-15.EL. We downloaded the rpm for kernel
2.4.21-47.EL, installed it and booted. We set vm-defragment to 100.

But the problem recurred about half a day later.

I'd like to reopen this bug report because this is an important issue for us 
(university campus) and we need to get it resolved.

I'd also like to hear from other users to see if they're still having the 
problem or if they've been able to fix it.

Thanks -

Michael Martinez

Comment 61 Steve Dickson 2006-10-24 14:02:21 UTC
If you're still seeing this with RHEL3 U8, then I would say either
the problem is not fixed or we are seeing a different problem
with the same footprint...

Comment 62 Michael Martinez 2006-10-24 19:23:09 UTC
I'm seeing it with kernel 2.4.21-47.EL under RHEL3 U3. That's the U8 kernel on 
a U3 system. Does that make a difference?

Michael

Comment 63 Konrad Rzeszutek 2006-10-24 20:07:52 UTC
It never got fixed and has been in RHEL3 since its beginning.

Comment 67 Deepa 2009-06-24 07:20:06 UTC
A Red Hat Linux WS release 3 (Taroon Update 9) host was hung, and on the
console we found "RPC: sendmsg returned error 12". Can I know how to fix it?
In /var/log/messages we found errors like these before the hang:
nfs_proc_symlink: 22/00 already exists??
nfs_proc_symlink: 93/79 already exists??
nfs_proc_symlink: 88/39 already exists??
Can I know the root cause and how I can fix it? This system often hangs
throwing this error.

Deepa