Bug 99025
Summary: | kernel: ENOMEM in journal_alloc_journal_head, retrying Dell 6650
---|---
Product: | Red Hat Enterprise Linux 2.1
Component: | kernel
Version: | 2.1
Hardware: | i686
OS: | Linux
Status: | CLOSED NOTABUG
Severity: | high
Priority: | medium
Reporter: | Geoff Dolman <geoff.dolman>
Assignee: | Larry Woodman <lwoodman>
CC: | kalyanj80, riel, sct, tao, tommy.yan
Doc Type: | Bug Fix
Last Closed: | 2004-12-13 20:06:28 UTC
Description
Geoff Dolman
2003-07-11 21:58:31 UTC
The journal messages are symptoms of heavy memory pressure but do not, by themselves, signal any failures. There are plenty of other places in the kernel where similar memory allocators simply retry silently under that pressure; the journal code is just one of the very few places that logs this condition. The messages are entirely harmless in themselves. The hang will need more information to diagnose. Can you capture an alt-sysrq-t or alt-sysrq-p backtrace when the hang occurs?

The machine does not display *anything* on its monitor after this hang has occurred. The sysrq key combinations are working (sysrq-b works), but owing to the lack of display nothing can be seen from sysrq-p etc. I also now have a way of reproducing the error: on six machines, cat /dev/zero into a file on a filesystem exported from the machine that suffers the hang. The load average rises gradually, then suddenly jumps to 60 or more, and the machine hangs. (This is not because the filesystem has filled, though.) When the /dev/zero writes are stopped the machine does not recover: all terminal logins are unresponsive from then on. The machine still responds to ping, and the LVS it runs redirecting HTTP requests continues to work, but you cannot rsh or ssh in, get a login on the console, or mount the exported filesystem. There is no change after several hours.

Larry, this looks like another VM exhaustion problem. We really need the sysrq output to get much further with this. Serial console is the best way to capture it if /var/log/messages is wedged.

...I think I've got to the bottom of it. It turns out to be contention between a tape drive and disks on the same bus. Could this be right?

It depends: if it's SCSI, it shouldn't really matter unless the tape drive is doing something really broken. If it's an ATAPI tape drive on the same controller as a heavily used disk, then yes, it could easily upset the VM because we can't swap things out fast enough. What bus and devices exactly were you using?

Is this an mm-1.1.3-1 problem?

Okay, the SCSI thing was a red herring. The machine still exhibits the same behaviour as before, except that, with the tape drive moved to a different bus, it understandably takes more machines writing to the NFS export to tip it over (10 instead of 6). I have now captured some sysrq information (second URL); it should contain a sysrq-t, sysrq-p and a sysrq-m. The first URL is lspci -vv.

http://www-gene.cimr.cam.ac.uk/~geoff/lspci.tgz
http://www-gene.cimr.cam.ac.uk/~geoff/sysrq-output.tgz

The version of mm on the system is mm-1.1.3-8. I have not upgraded to the e.27 enterprise kernel yet.

All the important tasks are stuck in the VM page reclaim code. It looks like you're just getting critically short of low memory. We don't officially support 32GB on AS2.1, precisely because of the low-memory shortage that you get on such systems. You really need the 4G/4G split from recent 2.6 and RHEL beta kernels to make them work properly.

Out on a limb here, but are there any scsi_malloc failures in the log?

The problem here is that there is 550MB in the mem_map page array for the 32GB and 350MB in the slabcache used for kernel data structures. Since both of these are allocated out of lowmem, it is exhausted and subsequent allocations are failing. Please attach /proc/slabinfo output and I'll see if we can do any tuning magic to prevent lowmem exhaustion. Other than that, I would suggest booting with 16GB of memory; 16GB is the maximum AS2.1 supports for this very reason.
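A quick back-of-the-envelope check on those numbers (a sketch only; ~68 bytes per struct page is an assumption for this rmap-based 2.4 kernel, and the exact size varies by build):

    # 32 GB / 4 KB pages             = 8,388,608 struct page entries
    # 8,388,608 entries x ~68 bytes ~= 545 MB of mem_map - close to the
    #                                  ~550 MB quoted above, all in lowmem
    # Watch lowmem drain in real time while reproducing:
    watch -n 3 'grep -E "^Low(Total|Free):" /proc/meminfo'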
Larry Woodman

Many thanks. I have posted the /proc/slabinfo to http://www-gene.cimr.cam.ac.uk/~geoff/slabinfo.tgz

I'm having the same problem on a pair of Dell 6650s with 16GB of memory each, running 2.4.9-e.40enterprise #1 SMP. Reproducible by copying a 9GB directory structure from one filesystem to another: around 7GB is successfully copied before the system locks. Glad I found this before purchasing another 16GB for both servers. I have since upgraded everything via up2date and am now running kernel 2.4.9-e.43enterprise #1 SMP, and I still have this problem. Does upgrading to RHEL3 fix this one?

I found the problem was fixed by mirroring the memory, i.e. reducing from 32GB to 16GB. There has not been a crash since then. I tried upgrading to RHEL 3.0 and everything seemed fine except the alleged 'ISV support': Veritas NetBackup 4.5 does not work under RHEL 3.0. With a lot of fiddling and wrapper scripts it will work, but not in any way you can imagine Veritas supporting, so I reverted to 2.1. There's no officially supported server version of NetBackup for RHEL 3.0 until 5.1 is released, so we're just twiddling our thumbs waiting... There were no problems with the RAM under 3, though.

Thanks Geoff. So did you have 32GB running under RHEL3?

Yes, I tried it, but I didn't put it into production for the reasons outlined re Veritas :(

Similar issue here with Legato Networker - version 7 does support RHEL3, but we're on 6. Arghh!

I'm also facing this problem on HP DL380/6GB, ML570/32GB and DL760/32GB machines. Just dd-ing a large file (over 10GB) will encounter the out-of-memory error, with 'ENOMEM in journal_alloc_journal_head' on the server screen. Switching the filesystem between ext2/ext3/reiserfs did not solve the problem.

Hmm, the system runs out of buffers in this case. You can reproduce it with some simple dd or cp. For me, on a 30GB memory configuration:

    # dd if=/dev/zero of=/file1 bs=4096k count=2000 &
    # dd if=/dev/zero of=/file2 bs=4096k count=2000 &
    # dd if=/dev/zero of=/file3 bs=4096k count=2000 &
    # vmstat 3
    procs                    memory      swap          io     system        cpu
     r  b swpd     free   buff    cache  si so  bi    bo  in   cs us sy id wa
     4  1    0 20456412  40108 11790712   0  0   3    26  13    5  0  0  4  0
     0  8    0 20223332  33984 12021504   0  0  11 76564 303 2987  0 30 31 39
     0  7    0 20108120  32844 12136352   0  0  19 38220 302 2007  0 14 33 53
     4  2    0 19831936  24152 12412832   0  0  15 91988 306 3947  0 38 33 29
     0  7    0 19634008  18396 12610232   0  0   3 65229 282 2759  0 27 17 56
     0  7    0 19471276  14028 12771068   0  0   1 55196 284 2565  0 22 12 66
     4  1    0 19309040   8736 12937064   0  0   1 54031 276 2075  0 25 14 61
     4  1    0 19126688   5452 13117664   0  0   4 61431 296 2788  0 25 19 56
     0  6    0 19065608   3440 13179820   0  0   1 19595 287 1286  0  8  8 84
     4  1    0 18854072   1404 13390412   0  0   4 72199 451 2500  0 30 34 36

    [root@ml570-1 root]# vmstat 3
    Connection to ml570-1.test closed by remote host.

On the server screen, the 'ENOMEM in journal_alloc_journal_head' error appears.
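The same reproduction can be wrapped in a monitoring loop so the lowmem exhaustion is visible before the box wedges; a sketch (the file names and counts are just the placeholders from the dd lines above, and this is best run from a serial console in case the network dies first):

    # Start three parallel streaming writers, as above.
    for i in 1 2 3; do
        dd if=/dev/zero of=/file$i bs=4096k count=2000 &
    done
    # Log low-memory headroom and buffer_head slab growth every 3 seconds.
    while true; do
        grep '^LowFree:' /proc/meminfo
        grep '^buffer_head ' /proc/slabinfo
        sleep 3
    done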
    [root@ml570-1 root]# cat /proc/slabinfo
    slabinfo - version: 1.1 (SMP)
    kmem_cache 128 128 244 8 8 1 : 1008 252
    nfs_write_data 0 0 384 0 0 1 : 496 124
    nfs_read_data 0 0 384 0 0 1 : 496 124
    nfs_page 0 0 128 0 0 1 : 1008 252
    ip_fib_hash 11 448 32 4 4 1 : 1008 252
    ext3_xattr 0 0 44 0 0 1 : 1008 252
    journal_head 27874 27874 48 362 362 1 : 1008 252
    revoke_table 1 250 12 1 1 1 : 1008 252
    revoke_record 0 0 32 0 0 1 : 1008 252
    clip_arp_cache 0 0 256 0 0 1 : 1008 252
    ip_mrt_cache 0 0 128 0 0 1 : 1008 252
    tcp_tw_bucket 0 0 128 0 0 1 : 1008 252
    tcp_bind_bucket 7 336 32 3 3 1 : 1008 252
    tcp_open_request 0 0 128 0 0 1 : 1008 252
    inet_peer_cache 0 0 64 0 0 1 : 1008 252
    secpath_cache 0 0 128 0 0 1 : 1008 252
    xfrm_dst_cache 0 0 256 0 0 1 : 1008 252
    ip_dst_cache 10 45 256 3 3 1 : 1008 252
    arp_cache 2 30 256 2 2 1 : 1008 252
    flow_cache 0 0 128 0 0 1 : 1008 252
    blkdev_requests 12288 12600 128 420 420 1 : 1008 252
    kioctx 0 0 128 0 0 1 : 1008 252
    kiocb 0 0 128 0 0 1 : 1008 252
    dnotify_cache 0 0 20 0 0 1 : 1008 252
    file_lock_cache 3 120 96 3 3 1 : 1008 252
    async_poll_table 0 0 140 0 0 1 : 1008 252
    fasync_cache 0 0 16 0 0 1 : 1008 252
    uid_cache 2 112 32 1 1 1 : 1008 252
    skbuff_head_cache 453 598 168 26 26 1 : 1008 252
    sock 57 100 1408 20 20 2 : 240 60
    sigqueue 1 29 132 1 1 1 : 1008 252
    kiobuf 1 30 128 1 1 1 : 1008 252
    cdev_cache 13 406 64 7 7 1 : 1008 252
    bdev_cache 7 290 64 5 5 1 : 1008 252
    mnt_cache 15 232 64 4 4 1 : 1008 252
    inode_cache 755 1064 512 152 152 1 : 496 124
    dentry_cache 457 1200 128 40 40 1 : 1008 252
    dquot 0 0 128 0 0 1 : 1008 252
    filp 360 510 128 17 17 1 : 1008 252
    names_cache 1 1 4096 1 1 1 : 240 60
    buffer_head 3210480 3210480 108 91728 91728 1 : 1008 252
    mm_struct 30 90 384 9 9 1 : 496 124
    vm_area_struct 908 1512 68 27 27 1 : 1008 252
    fs_cache 36 348 64 6 6 1 : 1008 252
    files_cache 37 70 512 10 10 1 : 496 124
    signal_cache 72 638 64 11 11 1 : 1008 252
    sighand_cache 52 100 1408 20 20 2 : 240 60
    pte_chain 1302 3000 128 100 100 1 : 1008 252
    pae_pgd 30 464 64 8 8 1 : 1008 252
    size-131072(DMA) 0 0 131072 0 0 32 : 0 0
    size-131072 2 2 131072 2 2 32 : 0 0
    size-65536(DMA) 0 0 65536 0 0 16 : 0 0
    size-65536 0 0 65536 0 0 16 : 0 0
    size-32768(DMA) 0 0 32768 0 0 8 : 0 0
    size-32768 0 0 32768 0 0 8 : 0 0
    size-16384(DMA) 0 0 16384 0 0 4 : 0 0
    size-16384 23 23 16384 23 23 4 : 0 0
    size-8192(DMA) 0 0 8192 0 0 2 : 0 0
    size-8192 6 6 8192 6 6 2 : 0 0
    size-4096(DMA) 0 0 4096 0 0 1 : 240 60
    size-4096 425 425 4096 425 425 1 : 240 60
    size-2048(DMA) 0 0 2048 0 0 1 : 240 60
    size-2048 157 184 2048 92 92 1 : 240 60
    size-1024(DMA) 0 0 1024 0 0 1 : 496 124
    size-1024 132 132 1024 33 33 1 : 496 124
    size-512(DMA) 0 0 512 0 0 1 : 496 124
    size-512 344 344 512 43 43 1 : 496 124
    size-256(DMA) 0 0 256 0 0 1 : 1008 252
    size-256 4245 4245 256 283 283 1 : 1008 252
    size-128(DMA) 0 0 128 0 0 1 : 1008 252
    size-128 1991 2040 128 68 68 1 : 1008 252
    size-64(DMA) 0 0 128 0 0 1 : 1008 252
    size-64 330 330 128 11 11 1 : 1008 252
    size-32(DMA) 0 0 64 0 0 1 : 1008 252
    size-32 966 1218 64 21 21 1 : 1008 252

Can somebody help explain?

Created attachment 101956 [details]
/proc/slabcache
slabcache when machine is unresponsive
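One way to read slabinfo dumps like the ones above (a sketch; in the 1.1 format the first four columns are cache name, active objects, total objects, and object size in bytes, so total objects times object size gives each cache's footprint):

    # Per-cache memory consumption in MB, biggest first:
    awk '$3 ~ /^[0-9]+$/ { printf "%-20s %10.1f MB\n", $1, $3 * $4 / 1048576 }' \
        /proc/slabinfo | sort -rn -k2 | head
    # For the dump above:
    #   buffer_head   3210480 x 108 bytes ~= 330 MB
    #   journal_head    27874 x  48 bytes ~=   1.3 MB
    # so buffer_head alone accounts for nearly all of the slab pressure
    # in lowmem, not the journal_head cache named in the log messages.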
re #20 thriller - this is on RHEL2.1, right? I want to be sure that RHEL3 doesn't have this bug as well before I upgrade...

I've tried this on RHEL3 U1 and U2; both of them have this error.

Tommy, can you attach "cat /proc/meminfo" output from the RHEL3 system that is experiencing this problem? AS2.1 does not support more than 16GB of memory, and RHEL3 supports up to 16GB with the smp kernel and up to 32GB with the hugemem kernel. Larry

This is the smp kernel; meminfo just before the process killer kicked out my login is:

    [root@ml570-1 root]# cat /proc/meminfo
            total:       used:        free:  shared: buffers:      cached:
    Mem:  33448136704 14163038208 19285098496       0  4214784 13736730624
    Swap:  2097434624           0   2097434624
    MemTotal:       32664196 kB
    MemFree:        18833104 kB
    MemShared:             0 kB
    Buffers:            4116 kB
    Cached:         13414776 kB
    SwapCached:            0 kB
    Active:            23716 kB
    ActiveAnon:         6112 kB
    ActiveCache:       17604 kB
    Inact_dirty:    12763024 kB
    Inact_laundry:    318872 kB
    Inact_clean:      319384 kB
    Inact_target:    2684996 kB
    HighTotal:      32243676 kB
    HighFree:       18817196 kB
    LowTotal:         420520 kB
    LowFree:           15908 kB
    SwapTotal:       2048276 kB
    SwapFree:        2048276 kB
    HugePages_Total:       0
    HugePages_Free:        0
    Hugepagesize:       2048 kB

Tommy - have you tried the hugemem kernel? As Larry says, that one should be OK for 32GB.

Yeah, we've tried the hugemem kernel and the issue did not happen. I've read the RHEL3 release notes; they do say that with 16GB of memory or more you should use the hugemem kernel. But they also indicate support for 64GB of memory - where does 'up to 32GB' come from? Thanks, Tommy.

Why didn't you stick with the hugemem kernel?

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-505.html

We have a 4-CPU dual-core DL580 G3 with 32GB of memory that is having similar issues on and off. I don't have a reliable way of reproducing the error. I am running the 2.4.9-e.57enterprise kernel. This is our Oracle DB server. Currently we are running without BigPages enabled. I have read a great deal online suggesting that the memory needed to store the PTE entries will shrink by a factor of 1024 as the page size goes from 4KB to 4MB (a worked sketch of this arithmetic follows below). To test this theory and put this box into production, I need to determine two things:

1. What is a precise way of reproducing the error? Currently the server hangs with ssh and the console unavailable, and the 'ENOMEM in journal_alloc_journal_head' messages appear on the console. This happens when I create a high load by running 10 instances of dd if=/dev/zero of=test1file21 bs=4096k count=2000 &, and with network copies like scp or writes to an NFS-mounted file. However, I have been able to crash the server only 2 out of some 40 attempts this way.

2. Once I have a precise way of crashing it, I need to see whether enabling BigPages makes it go away.

Thanks in advance. We are stuck with the 2.4 kernel because of Oracle 9i.

    # cat /proc/meminfo
            total:       used:       free:     shared: buffers:      cached:
    Mem:  33773588480 31810072576 1963515904 18690871296 30732288 12716052480
    Swap:  2147426304           0 2147426304
    MemTotal:      32982020 kB
    MemFree:        1917496 kB
    MemShared:     18252804 kB
    Buffers:          30012 kB
    Cached:        12418020 kB
    SwapCached:           0 kB
    Active:           31560 kB
    Inact_dirty:   18272676 kB
    Inact_clean:   12396600 kB
    Inact_target:   8245292 kB
    HighTotal:     32635748 kB
    HighFree:       1817460 kB
    LowTotal:        346272 kB
    LowFree:         100036 kB
    SwapTotal:      2097096 kB
    SwapFree:       2097096 kB
    BigPagesFree:         0 kB
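For what it's worth, a sketch of the page-table arithmetic behind that hope, assuming a hypothetical 10 GB Oracle SGA mapped by 100 connected processes (note that with PAE, which addressing 32GB requires, x86 large pages are 2 MB rather than 4 MB, so the reduction is 512x rather than 1024x):

    # 4 KB pages:  10 GB / 4 KB          =  2,621,440 PTEs per process
    #              x 8 bytes per PAE PTE =  20 MB of page tables/process
    #              x 100 processes       = ~2 GB of page tables
    # 2 MB pages:  the same mappings cost 512x less, roughly 4 MB total.
    # Page tables live in lowmem on these kernels, so BigPages can
    # plausibly relieve exactly the pressure described in this bug.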
(In reply to comment #0)
> From Bugzilla Helper:
> User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
>
> Description of problem:
> Under heavy nfs load (from several rsyncs) on a large 556GB filesystem the
> following message appears in /var/log/messages:
>
> Jul 11 14:44:00 srs kernel: ENOMEM in journal_get_undo_access_Rsmp_767cdac6, retrying.
> Jul 11 14:44:07 srs kernel: ENOMEM in journal_alloc_journal_head, retrying.
> Jul 11 14:44:10 srs kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again.
> Jul 11 14:44:18 srs kernel: ENOMEM in journal_get_undo_access_Rsmp_767cdac6, retrying.
> Jul 11 14:44:28 srs kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again.
> Jul 11 14:45:30 srs kernel: journal_write_metadata_buffer: ENOMEM at get_unused_buffer_head, trying again.
> Jul 11 14:45:38 srs kernel: ENOMEM in do_get_write_access, retrying.
>
> etc...
>
> At the end of an hour or more of these messages the machine hangs. It can
> still respond to ping, but it isn't possible to login on the console or via
> ssh. The only resolution appears to be switching it off and on again.
>
> I see a similar bug has been logged concerning the tg3 driver for the
> broadcom NetXtreme ethernet card, but I am not using the tg3 driver - I am
> using the bcm5700 driver...
>
> Version-Release number of selected component (if applicable):
> Linux 2.4.9-e.25enterprise
>
> How reproducible:
> Sometimes
>
> Steps to Reproduce:
> 1. A heavy network load, typically produced by running several rsyncs.
>
> Actual Results: Machine hangs indefinitely
>
> Expected Results: Other possibly relevant information: the machine is 4
> processor with hyperthreading enabled and has 32 GB of memory.
>
> Additional info: