Bug 123226
Summary: System hangs with error RPC: sendmsg returned error 12
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Michael K. Ellis <mkellis>
Assignee: Larry Woodman <lwoodman>
CC: allykken, ara.t.howard, bchacko, boone_dave, dpa.clt, dwa, gfrankliu, jmtt, k.georgiou, konradr, krafthef, krzeszut, mikem, nhorman, nick, normelton, p.c.m.chiu, perbu, petrides, riel, steved, tao
Doc Type: Bug Fix
Last Closed: 2006-07-12 20:26:43 UTC
Bug Blocks: 170417, 185486, 186960
Description
Michael K. Ellis
2004-05-13 23:53:34 UTC
Could you please attach console output containing the oops output or at least a panic message? If you can't get this with a graphics console, could you please arrange to use a serial console to collect this info? Thanks. -ernie

We're having trouble getting serial output: the problem primarily manifests itself on an IBM blade server, which isn't easy to get a serial console from (no serial ports save an aux USB port on the front, which tends to move with the KVM). In the meantime, we're seeing more issues. I have a live window on a machine that has a hung NFS mount. All other NFS mounts (including to the same server) are fine. The NFS server is a SunFire 280R, occasionally under heavy load. It isn't right now, but the mount point is hung. With the live window I can pull any info from /proc that you might want.

We're getting the same thing. I was just on the console when it happened: hundreds of these messages, and then the ASR kicked in and rebooted the box. We also have this machine mounting an NFS drive (as well as exporting an NFS mount). The NFS drive it is mounting is running RH7.3; this box is running EL3. Our reboots are happening twice a day, and we have a console, so I'll try to get some more info about it. But I didn't see a panic or an oops.

I have the same information in /var/log/messages after the system restarts. I was copying several 200M files over NFS to a Sun Solaris 6 box and the system crashed with many RPC error 12 messages on the console:

Jun 11 18:37:56 vega login(pam_unix)[26560]: session opened for user root by LOGIN(uid=0)
Jun 11 18:37:56 vega -- root[26560]: ROOT LOGIN ON tty2
Jun 11 18:51:16 vega kernel: RPC: sendmsg returned error 12
Jun 11 18:51:16 vega last message repeated 119 times
Jun 11 18:51:16 vega kernel: RPC: sendmsg returned error 12
Jun 11 18:51:16 vega kernel: RPC: sendmsg returned error 12
Jun 11 18:56:38 vega syslogd 1.4.1: restart.
Jun 11 18:56:38 vega syslog: syslogd startup succeeded
Jun 11 18:56:38 vega kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jun 11 18:56:38 vega kernel: Linux version 2.4.21-15.ELsmp (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-34)) #1 SMP Thu Apr 22 00:18:24 EDT 2004

Antonio

I just saw this happen on a system of ours. Kernel 2.4.21-15.ELsmp running some J2EE app (IBM JVM 1.4.1 SR2) with moderate NFS load. No relevant entries in syslog. The message appeared tens of times per second, so I am unable to say if there were any messages preceding it.

We've seen this same problem on our dual Intel 2.8 GHz Xeon systems. We are also using gigabit ethernet and NFS. We have multiple NFS mounts to a remote HP system. One or more of the NFS mounts will lock up, or the system itself will lock up, with the message "RPC: sendmsg returned error 12" or sometimes error 101. We are using the 2.4.21-4 kernel. The ethernet card is an Intel 82540 Gigabit Ethernet Controller. The following is from the system log when one of our systems hung; the error messages continued:

Aug 30 08:20:39 db2 kernel: RPC: sendmsg returned error 101
Aug 30 08:20:39 db2 kernel: nfs: RPC call returned error 101
Aug 30 08:20:39 db2 kernel: RPC: sendmsg returned error 101
Aug 30 08:20:39 db2 kernel: nfs: RPC call returned error 101
Aug 30 08:20:39 db2 kernel: RPC: sendmsg returned error 101
Aug 30 08:20:39 db2 kernel: nfs: RPC call returned error 101

I'm also seeing this behaviour, so far only on a server that is writing to the NFS filesystem: "RPC: sendmsg returned error 12", and the system has to be physically reset. 2.4.21-20 here; the NFS client is a single-processor P4/2.8 with hyperthreading turned on, SMP kernel, 2G of RAM, and gigE NFS. Sigh, I should have been more precise: a 2RU server that is an NFS client is the one that is regularly crashing.
This hasn't happened to us previously with the 2.4.21-15.0.4, 15.0.3, 15, 9.0.3, 9.0.1, 9 and 4 kernels, so I'm not sure if it is something recently introduced. -jason

We've had this error occur for the first time on two servers running 2.4.21-20 over the weekend. The servers are dual-processor Xeon 2.66s. There is no indication in syslog; the message just repeats on the console and the servers are unresponsive. We had to power cycle them. There would have been heavy NFS traffic at gigabit speeds when each server experienced the error.

Just re-adding this, as the crash has repeated overnight. Two of our servers are crashing with a known bug (bugzilla #123226, https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=123226). The message "RPC: sendmsg returned error 12" scrolls up the console, and the server is pingable but unresponsive. The server has to be power cycled to recover; ctrl-alt-del doesn't do anything. Kernel 2.4.21-20.EL #1 SMP Wed Aug 18 20:46:40 EDT 2004. We think it is an NFS fault; it has happened twice while an NFS copy was in progress. It isn't happening all the time, though. All updates for the channels rhel-i386-as-3, rhel-i386-as-3-extras and rhel-i386-as-3-devsuite have been applied as of today (Thu 23 Sep AEDT GMT+10). Syslog has nothing in it about any crash or error; the last log entry is from named: update forwarding denied.

I believe we've found some recent page-caching problems in recent RHEL3 kernels. I wonder if this issue is a result of those problems. I'm going to reassign to our VM guy to see what he thinks.

We have the same problem! We are running an NFS client on Red Hat Enterprise 3 on a dual Xeon 3.2GHz (patched) (Sun V65x). The NFS server is a Sun Fire 280R with Solaris 9 (patched). The NFS client is a webserver that mounts the data from the NFS server (should be a clustered environment soon). The NFS server crashed reporting "RPC: sendmsg returned error 12" (running in a loop on the console). I was able to ping the NFS client webserver but no logins or responses seemed to be possible. I need an immediate fix for this problem!!!

I wonder what the current status on this problem is. This morning I found one of our dual Opteron systems hung with identical symptoms: the system responds to ping, but nothing else. The console is full of repeated "RPC: sendmsg returned error 12" messages. Thanks, Peter

There were changes in RHEL3-U4 to address ENOMEM failures. Can you please try to reproduce this problem with RHEL3-U4 and let me know if that fixes the problem you are seeing? Larry Woodman

(In reply to comment #14)
> There were changes in RHEL3-U4 to address ENOMEM failures. Can you
> please try to reproduce this problem with RHEL3-U4 and let me know
> if that fixes the problem you are seeing?

We are experiencing the same problem on a RHEL3 box with all the latest patches. Both times, the box has been under heavy I/O to two SATA drives striped using software RAID. NFS is used, but I don't believe there was any I/O at the time the system locked up. I could be wrong on this one, though. Any advice? Norman

We too are seeing this problem. I can trigger it by running a simple compilation in an NFS-mounted directory. sysinfo:
* Red Hat Enterprise Linux WS release 3 (Taroon Update 4)
* Linux 2.4.21-27.0.2.ELsmp i686
-a

A customer is seeing this with RHEL3-U4 on an IBM x445.

Can anyone attach a copy of /proc/meminfo and /proc/slabinfo when this failure occurs? Also, is TCP or UDP being used as the transport here? Lastly, are ACLs generally in use here? Starting with RHEL3-U4, I believe, there are some ACL tunables which allow you to reduce the amount of memory used when conducting ACL transactions. I don't think they are directly related to the problem, but their use may well reduce memory usage on a loaded system, which could help avoid or work around the problem. I believe setting nfs3_max_acl=255 modifies ACL transactions such that they use an order-0 allocation rather than an order-2 allocation.

Can someone who is experiencing this problem please get several AltSysrq-M outputs when the system is incurring these errors? I need this data to determine what is going on when this happens. Thanks, Larry Woodman

ara, have you tried turning down the memory consumption by setting nfs3_max_acl? This should help...

Same problem here. Dual PIII Intel, 2.4.21-27.0.2.ELsmp, NFS client of some Sun NFS servers, with automount. The system hangs with rapid scrolling of "RPC: sendmsg returned error 12" on the console. The system is still pingable but I can't log in. Had to reset.

(In reply to comment #18)
> Can anyone attach a copy of /proc/meminfo and /proc/slabinfo when this failure
> occurs?

I've been intermittently hitting this problem during testing of RHEL3 Update 5 on an HS40 (8839) blade. I captured the output of 'while [ 1 ]; do date; cat /proc/meminfo; cat /proc/slabinfo; sleep 30; done' leading up to the hang, which I'll attach separately.
Here's what 'top' said at the time of the hang:

13:09:15 up 2:29, 3 users, load average: 23.80, 22.75, 20.87
137 processes: 128 sleeping, 9 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   87.4%    0.0%   12.3%   0.1%     0.0%    0.0%    0.0%
           cpu00   82.8%    0.0%   16.9%   0.1%     0.0%    0.0%    0.0%
           cpu01   92.0%    0.0%    7.7%   0.1%     0.0%    0.0%    0.0%
Mem:  2055444k av, 2025760k used,   29684k free,      0k shrd,  17996k buff
      1549664k actv,  299828k in_d,   36340k in_c
Swap: 2047744k av,   32212k used, 2015532k free                1829724k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 3425 root      19   0  125M 125M  125M R    30.8  6.2   6:00   0 snake
 3332 root      19   0  125M 125M  125M R    26.4  6.2   6:19   0 snake
12915 root      15   0   556  556   408 D    18.1  0.0   0:34   0 growfiles
 3462 root      24   0  125M 125M  125M R     3.9  6.2   6:00   1 snake
12583 root      25   0  2136 2136   716 R     3.9  0.1   0:41   0 lame
12588 root      25   0  2136 2136   716 R     3.9  0.1   0:42   1 lame
12962 root      25   0   252  252   208 R     3.9  0.0   0:12   0 inconsistency
13062 root      25   0  125M  63M   332 R     3.9  3.1   0:04   1 memxfer5b
12582 root      16   0  1504 1504   420 S     2.9  0.0   0:18   0 dd
   15 root      15   0     0    0     0 DW    0.6  0.0   0:51   0 kjournald
15306 root      15   0    80   80    20 D     0.3  0.0   0:39   0 dd
 8668 root      15   0   744  712   380 R     0.2  0.0   0:21   1 top
12380 root      15   0   440  440   320 D     0.2  0.0   0:00   0 bonnie++
    7 root      15   0     0    0     0 SW    0.1  0.0   0:41   1 kswapd
15303 root      15   0    80   80    20 D     0.1  0.0   0:40   1 dd
    1 root      15   0   108   80    52 S     0.0  0.0   0:07   1 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1

Created attachment 115433 [details]
meminfo & slabinfo trace
Here's the promised meminfo/slabinfo trace
Has anyone come up with a resolution for this? We're seeing it also. Dell PE 6650 server with 4x2.7GHz CPUs + 2GB RAM, RHEL 3 with all current updates.

I made the following changes on camel.comp.uvic.ca. Changed these:

% sudo /sbin/sysctl -a | egrep 'core.*[rw]mem'
net.core.rmem_default = 65535
net.core.wmem_default = 65535
net.core.rmem_max = 131071
net.core.wmem_max = 131071

to:

% sudo /sbin/sysctl -a | egrep 'core.*[rw]mem'
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.rmem_max = 1048576
net.core.wmem_max = 1048576

which is the amount of RAM allocated in bytes for reading from and writing to the network. This only delays when the issue occurs, but so far the machines seem to be able to recover fast enough not to reach this limit. I have not had a recurrence of the problem with these settings.

As Steve and I have asked in comment #18 and comment #24, does this problem persist when the nfs_max_acl parameter is set to 255 when insmodding the nfs module?

(In reply to comment #30)
> As steve and I have asked in comment #18 and comment #24, does this problem
> persist when the nfs_max_acl parameter is set to 255 when insmodding the nfs module?

I tried modprobing with nfs3_acl_max_entries set to both 512 and 255 (and verified in /proc/sys/fs/nfs that the values were actually active), but the same failure occurs as when the parameter is at its default value of 1024.

I've just seen the same problem twice in the space of 2 days. Console full of "RPC: sendmsg returned error 12" on the NFS client, ctrl-alt-del not responding, and the only way to recover the systems is with a power cycle. The CPU on the machine must have been busy, as the variable-speed fans were cranked right up. We were running the NFS server on a 3.5 box and we recently moved it to a 4.2 OS. We didn't see the problem once on the 3.5 server and it's only started now the server is 4.2. Client is 3.5.
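The buffer-size workaround quoted above can be scripted. This is a sketch of that commenter's mitigation, not an official fix; the values are the ones reported in the comment, and whether they help depends on workload:

```shell
# Raise default/max socket buffer sizes (values from the comment above).
# Requires root; these are one reporter's numbers, not a Red Hat recommendation.
sysctl -w net.core.rmem_default=524288
sysctl -w net.core.wmem_default=524288
sysctl -w net.core.rmem_max=1048576
sysctl -w net.core.wmem_max=1048576
```

To make the change persist across reboots, the same keys can be added to /etc/sysctl.conf (e.g. `net.core.rmem_max = 1048576`), which is applied at boot via `sysctl -p`.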
The only random messages I can find in the syslogs that *might* be related are:

Nov 11 13:12:20 bill kernel: VFS: find_free_dqentry(): Data block full but it shouldn't.
Nov 11 13:12:20 bill kernel: VFS: Error -5 occured while creating quota.

I presume people are only seeing this problem on NFS clients? And if so, what kernel, OS, etc. is the NFS server running? I only ask as it seems to have stimulated the problem in our setup.

Has anyone experienced this problem on RHEL3 Update 6? We're experiencing this on RHEL3u5 and I'm wondering if upgrading to 6 will alleviate the problem. Thanks.

I've just got the same problem in Update 6. Had to manually reset the server. This happened on an HP blade server, equivalent in hardware to the DL380 G3; I don't have the spec available right now. I don't think it is hardware related, though.

I cannot reproduce this problem locally. Can someone please reproduce this problem with the latest RHEL3 kernel and get me an AltSysrq-M output when this happens, so I can see if the problem is memory exhaustion or memory fragmentation? Thanks, Larry Woodman.

This bug just occurred on my ProLiant server running Red Hat Linux AS release 3 (Taroon Update 6). This box NFS-mounts a directory from a Sun server. The kernel is 2.4.21-32.0.1.ELsmp. I see there is a kernel update 2.4.21-37.0.1 and I am scheduling an update to apply the new kernel. Hope this helps. Any new developments on this bug? If it happens again I'll try the AltSysrq-M for you. Thanks, -joe

We were seeing this on Dell PowerEdge 1850s running RHEL 3.0 Update 5, until we upgraded the Intel GbE NIC driver (e1000). Can anyone else share which NICs this occurs on? Thanks, Dave

I narrowed down the "RPC: sendmsg returned error 12" message to be in the sunrpc module.
I added a bunch of show_stack() calls to dump the stack when the routine prints that message, and got these messages:

CPU1: e1bf7c88 f8a4898f f8a54750 00000001 f8a4b1c1 00000000 00000000 00000009
      e1bf6000 fffffff4 c0000000 d4eee614 e81dc8b4 e808a800 000000ac e2739000
      00001000 e2738000 00001000 eeae7000 00001000 eeae5000 00001000 eeae4000
Call Trace: [<f8a4898f>] showacpu3 [sunrpc] 0x2f (0xe1bf7c8c)
[<f8a54750>] .rodata.str1.1 [sunrpc] 0xec (0xe1bf7c90)
[<f8a4b1c1>] xprt_sendmsg [sunrpc] 0x221 (0xe1bf7c98)
[<c0159bf2>] __alloc_pages [kernel] 0x132 (0xe1bf7d04)
[<f8a4a060>] do_xprt_transmit [sunrpc] 0x30 (0xe1bf7d4c)
[<f8a47e01>] call_transmit [sunrpc] 0x31 (0xe1bf7d78)
[<f8a4c3f9>] __rpc_execute [sunrpc] 0x109 (0xe1bf7d90)
[<f8a79b7e>] nfs_flush_one [nfs] 0x16e (0xe1bf7dd0)
[<f8a79dc6>] nfs_flush_list [nfs] 0x66 (0xe1bf7e1c)
[<f8a7ac47>] nfs_flush_file [nfs] 0x97 (0xe1bf7e48)
[<f8a79561>] nfs_strategy [nfs] 0x71 (0xe1bf7e74)
[<f8a79865>] nfs_updatepage [nfs] 0x235 (0xe1bf7e90)
[<f8a70a7c>] nfs_commit_write [nfs] 0x4c (0xe1bf7ec4)
[<c014c708>] do_generic_file_write [kernel] 0x268 (0xe1bf7edc)
[<c014cc1f>] generic_file_write [kernel] 0x18f (0xe1bf7f30)
[<f8a70bb9>] nfs_file_write [nfs] 0xa9 (0xe1bf7f64)
[<c0165247>] sys_write [kernel] 0x97 (0xe1bf7f94)
[<c02af06f>] no_timing [kernel] 0x7 (0xe1bf7fc0)

and in another instance:

show_stack()
CPU1: e1bf7c88 f8a4898f f8a54750 00000001 f8a4b1c1 00000000 00000000 00000009
      e81dc8b4 fffffff4 c0000000 d4eee144 e81dc8b4 f1d02000 000000ac ff1a4000
      00001000 ff1a5000 00001000 ff1a6000 00001000 ff1a7000 00001000 ff1a8000
Call Trace: [<f8a4898f>] showacpu3 [sunrpc] 0x2f (0xe1bf7c8c)
[<f8a54750>] .rodata.str1.1 [sunrpc] 0xec (0xe1bf7c90)
[<f8a4b1c1>] xprt_sendmsg [sunrpc] 0x221 (0xe1bf7c98)
[<f8a4a060>] do_xprt_transmit [sunrpc] 0x30 (0xe1bf7d4c)
[<f8a47e01>] call_transmit [sunrpc] 0x31 (0xe1bf7d78)
[<f8a4c3f9>] __rpc_execute [sunrpc] 0x109 (0xe1bf7d90)
[<f8a79b7e>] nfs_flush_one [nfs] 0x16e (0xe1bf7dd0)
[<f8a79dc6>] nfs_flush_list [nfs] 0x66 (0xe1bf7e1c)
[<f8a7ac47>] nfs_flush_file [nfs] 0x97 (0xe1bf7e48)
[<f8a79561>] nfs_strategy [nfs] 0x71 (0xe1bf7e74)
[<f8a79865>] nfs_updatepage [nfs] 0x235 (0xe1bf7e90)
[<f8a70a7c>] nfs_commit_write [nfs] 0x4c (0xe1bf7ec4)
[<c014c708>] do_generic_file_write [kernel] 0x268 (0xe1bf7edc)
[<c014cc1f>] generic_file_write [kernel] 0x18f (0xe1bf7f30)
[<f8a70bb9>] nfs_file_write [nfs] 0xa9 (0xe1bf7f64)
[<c0165247>] sys_write [kernel] 0x97 (0xe1bf7f94)
[<c016f849>] sys_fstat64 [kernel] 0x49 (0xe1bf7fa8)
[<c02af06f>] no_timing [kernel] 0x7 (0xe1bf7fc0)

I am going to attach the full serial console output and the patch for sunrpc.

Created attachment 125899 [details]
Log from serial console while machine is running pounder2-01-24
Created attachment 125900 [details]
Very simple patch I used to trigger the dump.
The problem here is that there is plenty of free memory, but it is fragmented
badly enough that there is no 64KB contiguous block, and therefore the allocation
fails. The following line from the AltSysrq-M output shows the normal zone has
plenty of free RAM but nothing larger than 32KB:

20*4kB 125*8kB 221*16kB 22*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5320kB
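The arithmetic behind that buddy-allocator line can be checked directly: summing blocks-per-order times block size gives the total free memory, while the largest order with a nonzero count bounds the biggest contiguous allocation that can succeed. A minimal sketch, with the counts copied from the AltSysrq-M line above:

```shell
# Free-block counts per order, from the AltSysrq-M line above:
# 20*4kB 125*8kB 221*16kB 22*32kB 0*64kB ... 0*4096kB
counts="20 125 221 22 0 0 0 0 0 0 0"
size=4       # order-0 block size in kB; doubles each order
total=0
largest=0
for n in $counts; do
    total=$(( total + n * size ))      # add this order's contribution
    if [ "$n" -gt 0 ]; then largest=$size; fi
    size=$(( size * 2 ))
done
echo "free: ${total}kB, largest contiguous block: ${largest}kB"
```

This reproduces the kernel's 5320kB total and shows why a 64KB (order-4) request fails with ENOMEM here: no free block larger than 32KB exists, however much total memory is free.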
I added a new tuning parameter to RHEL3-U8, /proc/sys/vm/vm-defragment, that
controls how aggressively the page allocator tries to defragment memory when an
allocation fails due to memory fragmentation. Please try out the latest
RHEL3-U8 beta kernel and set /proc/sys/vm/vm-defragment to 100. This should
prevent the RPC code from getting ENOMEM(12) failures.
Larry Woodman
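The suggestion above amounts to writing the tunable through procfs. A minimal sketch, assuming a RHEL3-U8 or later kernel that actually exposes /proc/sys/vm/vm-defragment (earlier kernels do not have it, so the script checks first):

```shell
# Enable aggressive defragmentation on allocation failure (RHEL3-U8+ only).
# Requires root; the tunable name and value come from the comment above.
if [ -w /proc/sys/vm/vm-defragment ]; then
    echo 100 > /proc/sys/vm/vm-defragment
    cat /proc/sys/vm/vm-defragment    # confirm the new value took effect
else
    echo "vm-defragment tunable not present in this kernel" >&2
fi
```

Like any /proc/sys write, the setting does not survive a reboot; it would need to be reapplied from an init script or a sysctl configuration entry.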
Closing as NOTABUG since the problem is not reproducible.

Hi, we have an AS Update 3 server that has been experiencing this problem, with identical symptoms to what several people describe above. It was running kernel 2.4.21-15.EL. We downloaded the rpm for kernel 2.4.21-47.EL, installed it and booted. We set vm-defragment to 100, but the problem recurred about half a day later. I'd like to reopen this bug report because this is an important issue for us (university campus) and we need to get it resolved. I'd also like to hear from other users to see if they're still having the problem or if they've been able to fix it. Thanks, Michael Martinez

If you're still seeing this with RHEL3 U8, then I would say either the problem is not fixed or we are seeing a different problem with the same footprint...

I'm seeing it with kernel 2.4.21-47.EL under RHEL3 U3. That's the U8 kernel on a U3 system. Does that make a difference? Michael

It never got fixed and has been in RHEL3 since its beginning.

A Red Hat Linux WS release 3 (Taroon Update 9) host was hung, and on the console we found "RPC: sendmsg returned error 12". Can I know how to fix it? In /var/log/messages we found errors like this before the hang:

nfs_proc_symlink: 22/00 already exists??
nfs_proc_symlink: 93/79 already exists??
nfs_proc_symlink: 88/39 already exists??

Can I know the root cause and how I can fix it? This system often hangs throwing this error. Deepa