Bug 156030

Summary: 100% repeatable oom-killer crash during heavy RAID/external-journal ext3 usage
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Reporter: Eli Stair <eli.stair>
Assignee: Larry Woodman <lwoodman>
QA Contact: Brian Brock <bbrock>
CC: davej, jasonb, mvoelker, riel
URL: http://www.chpc.utah.edu/~eli/ISSUES/oom-killer/
Doc Type: Bug Fix
Last Closed: 2012-06-20 16:56:42 UTC
Attachments:
tar.bz2 containing current bonnie output

Description Eli Stair 2005-04-26 19:17:02 UTC
Description of problem:

While testing a soon-to-be production box, I've confirmed and reproduced a bug
that causes the oom-killer to rear its ugly head.

Hardware:
Arima HDAMA-F (1.89 BIOS)
3W-9500-8 (w/ BBU)
8 Seagate 160GB NCQ drives
  2-drive RAID1
  5-drive RAID5
  1-hotspare
2GB RAM

Drive layout:
[root@log logs]# fdisk -l

Disk /dev/sda: 159.9 GB, 159988580352 bytes
255 heads, 63 sectors/track, 19450 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          16      128488+  83  Linux
/dev/sda2              17        1060     8385930   83  Linux
/dev/sda3            1061        1321     2096482+  82  Linux swap
/dev/sda4            1322       19450   145621192+  83  Linux

Disk /dev/sdb: 639.9 GB, 639954321408 bytes
255 heads, 63 sectors/track, 77803 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1       77803   624952566   83  Linux





Version-Release number of selected component (if applicable):

RHEL4-AS-x86_64
kernel-smp-2.6.9-5.0.3.EL





How reproducible:

100%

Steps to Reproduce:
1. Create filesystems on the 3Ware controller:

journal:  mke2fs -R stride=16 -L JOURNAL -b 4096 -O journal_dev /dev/sda2
data:     mke2fs -b 4096 -J device=/dev/sda2 -O dir_index,sparse_super -R stride=80 -T news /dev/sdb1

(NOTE: occurs with/without stride setting being present)


2. Run bonnie against ext3 data array:

sync && while true ; do /root/pkg/bonnie/Bonnie -d /tmp/TEST/ -s 4096 -y -u >>
/root/logs/log.chpc-bonnie_R5_5d_tuneFS-extJ-WCE-ext3-ordered_bdv-8192.txt 2>&1
; done &

3.  Wait for crash.

With stride options enabled, the system never completes even a single test; it
appears to hang and becomes unresponsive after several minutes.  The oom-killer
creeps up after about ten minutes, hosing all processes on the machine (sorry,
its output doesn't get saved to the logs).

(log.chpc-bonnie_R5_5d_tuneFS-WCE-ext3-extJ_bdv-8192.txt)

With stride disabled on the journal and data space, bonnie completes 51 runs
before oom-killer again hoses everything.

(log.chpc-bonnie_R5_5d_tuneFS-WCE-ext3-SAFEextJ_bdv-8192.txt)

  
Actual results:

Crash.

Expected results:

No Crash.

Additional info:

See the URL for logs of the bonnie runs and, if I manage to capture it, the oom-killer output.
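
A minimal sketch for capturing the oom-killer/show_mem() output even if the box
dies before syslog is flushed to disk: mirror /var/log/messages to a second
machine as it is written.  The hostname and output file below are placeholders.

# Run from a second, healthy machine; 'testbox' is a placeholder hostname.
# Anything the kernel logs on the test box (including the oom-killer and
# show_mem() lines) is copied to a local file as it appears.
ssh root@testbox 'tail -F /var/log/messages' | tee -a oom-capture.log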

Comment 1 Eli Stair 2005-04-26 19:17:02 UTC
Created attachment 113683 [details]
tar.bz2 containing current bonnie output

Comment 3 Larry Woodman 2005-04-27 02:22:43 UTC
Please grab the latest RHEL4-U1 kernel and try to reproduce this problem.  We've
fixed several OOM kill problems in that kernel.  Also, if you still see the same
problem, please attach the show_mem() output that's written to /var/log/messages
when the OOM kill occurs.  We need that to debug the problem.

Larry Woodman
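
A minimal sketch, assuming the default RHEL4 syslog layout, for pulling the
oom-killer/show_mem() block out of /var/log/messages so it can be attached here
(the 40-line window and the output path are arbitrary choices):

# Grab each oom-killer header line plus the Mem-info/show_mem() lines after it.
grep -A 40 'oom-killer: gfp_mask' /var/log/messages > /tmp/oom-showmem.txt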



Comment 4 Eli Stair 2005-04-27 20:35:26 UTC
The most recent release (non-beta U1) kernel (2.6.9-5.0.5.ELsmp) has so far
proven stable.  It's been under test load for 14 hours / 134 runs without issue.

Is this kernel known to contain backports of the same relevant OOM fixes as the
2.6.9-6.37 beta-U1 kernel?  If this proves stable over a week of testing, I'll
likely consider it sane enough to not use the beta.

Thanks,

/eli

Comment 5 Dave Jones 2005-04-28 21:11:07 UTC
No, 5.0.5 has no VM changes that could affect OOM killing.
It's entirely a security-related update.


Comment 6 Larry Woodman 2005-05-03 13:12:41 UTC
Please let me know ASAP if you consider the OOM kill problem fixed in the
RHEL4-U1 kernel so I can close this bug.

Thanks, Larry Woodman


Comment 7 Eli Stair 2005-05-05 13:46:12 UTC
2.6.9-5.0.5 has finally, after almost a week of success, been slaughtered by the
oom-killer.  I've booted the U1 beta kernel and will report back on whether it
resolves the problem.

/eli

Comment 8 Larry Woodman 2005-05-05 14:22:10 UTC
We are waiting for results of RHEL4-U1 testing.  The latest OOM killer
enhancements were made to that kernel and were not in the 2.6.9-5.0.5 kernel. 
Please let me know as soon as you can confirm success or failure.

Larry Woodman


Comment 9 Mark T. Voelker 2006-02-14 16:43:17 UTC
I was able to engage oom-killer on a Penguin Altus 1400 running RHEL4U2 with
kernel 2.6.9-22, although it didn't completely crash the box.  I had two boxes
with dual Opteron 250's, but one was running a non-smp kernel.  I was actually
doing some network testing at the time, so I'll go ahead and include that info
in the setup description just in case it's relevant.  The NICs on these machines
are a pair of BCM5721's, each occupying one PCIe lane.  The boxes also have two
Fujitsu MAU3073C SCSI hard drives on the LSI Logic 53c1030 U320 SCSI
controller, which is on the PCI-X bus.  The SMP machine had 4GB of RAM and the
non-SMP machine had 2GB (probably both actually had 4GB, but 2GB of it sat on
the second CPU's memory controller and therefore wasn't available under the
non-SMP kernel).  The non-SMP machine was not using LVM; the SMP machine was.
Both were using ext3 partitions.  To reproduce:

1.)  On each box, do "nc -l -p 8080 -o /dev/null".
2.)  On boxA, do "cat /dev/zero | nc -p 8090 <ip of boxB> 8090".
3.)  On boxB, do "cat /dev/zero | nc -p 8090 <ip of boxA> 8090".  
4.)  On each box, do "for i in `seq 1 1000`;do bonnie++ -d /tmp/bonnie -u 500;done" (a consolidated sketch of steps 1-4 appears at the end of this comment).
5.)  Wait for oom-killer messages to appear in the syslog on the box running the
SMP kernel and LVM.  I saw about 11 in a span of 13 hours or so on this machine.
 Here's one of them:

As of Tue Feb 14 10:11:30 EST 2006 (less than a minute before oom-killer struck):

             total       used       free     shared    buffers     cached
Mem:       4139176    3830372     308804          0      11160    3551696
-/+ buffers/cache:     267516    3871660
Swap:      2031608       4780    2026828

As of Tue Feb 14 10:13:12 EST 2006 (a few seconds after oom-killer)

             total       used       free     shared    buffers     cached
Mem:       4139176    4122724      16452          0       4332    3887380
-/+ buffers/cache:     231012    3908164
Swap:      2031608       4288    2027320

The error from the syslog:

Feb 14 10:12:05 dc-altrus3 kernel: oom-killer: gfp_mask=0xd0
Feb 14 10:12:05 dc-altrus3 kernel: Mem-info:
Feb 14 10:12:05 dc-altrus3 kernel: DMA per-cpu:
Feb 14 10:12:05 dc-altrus3 kernel: cpu 0 hot: low 2, high 6, batch 1
Feb 14 10:12:05 dc-altrus3 kernel: cpu 0 cold: low 0, high 2, batch 1
Feb 14 10:12:05 dc-altrus3 kernel: cpu 1 hot: low 2, high 6, batch 1
Feb 14 10:12:05 dc-altrus3 kernel: cpu 1 cold: low 0, high 2, batch 1
Feb 14 10:12:05 dc-altrus3 kernel: Normal per-cpu:
Feb 14 10:12:05 dc-altrus3 kernel: cpu 0 hot: low 32, high 96, batch 16
Feb 14 10:12:05 dc-altrus3 kernel: cpu 0 cold: low 0, high 32, batch 16
Feb 14 10:12:05 dc-altrus3 kernel: cpu 1 hot: low 32, high 96, batch 16
Feb 14 10:12:05 dc-altrus3 kernel: cpu 1 cold: low 0, high 32, batch 16
Feb 14 10:12:05 dc-altrus3 kernel: HighMem per-cpu:
Feb 14 10:12:20 dc-altrus3 kernel: cpu 0 hot: low 32, high 96, batch 16
Feb 14 10:12:35 dc-altrus3 kernel: cpu 0 cold: low 0, high 32, batch 16
Feb 14 10:12:51 dc-altrus3 kernel: cpu 1 hot: low 32, high 96, batch 16
Feb 14 10:12:58 dc-altrus3 kernel: cpu 1 cold: low 0, high 32, batch 16
Feb 14 10:13:02 dc-altrus3 kernel:
Feb 14 10:13:02 dc-altrus3 kernel: Free pages:       14724kB (1664kB HighMem)
Feb 14 10:13:02 dc-altrus3 kernel: Active:43584 inactive:941648 dirty:325494
writeback:2847 unstable:0 free:3681 slab:41675 mapped:43194 pagetables:1110
Feb 14 10:13:02 dc-altrus3 kernel: DMA free:12588kB min:16kB low:32kB high:48kB
active:0kB inactive:0kB present:16384kB pages_scanned:324209 all_unreclaimable? yes
Feb 14 10:13:02 dc-altrus3 kernel: protections[]: 0 0 0
Feb 14 10:13:02 dc-altrus3 kernel: Normal free:472kB min:928kB low:1856kB
high:2784kB active:1528kB inactive:672604kB present:901120kB
pages_scanned:1130052 all_unreclaimable? yes
Feb 14 10:13:02 dc-altrus3 kernel: protections[]: 0 0 0
Feb 14 10:13:02 dc-altrus3 kernel: HighMem free:1664kB min:512kB low:1024kB
high:1536kB active:172808kB inactive:3093988kB present:5373952kB pages_scanned:0
all_unreclaimable? no
Feb 14 10:13:02 dc-altrus3 kernel: protections[]: 0 0 0
Feb 14 10:13:02 dc-altrus3 kernel: DMA: 3*4kB 4*8kB 4*16kB 2*32kB 4*64kB 1*128kB
1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12588kB
Feb 14 10:13:02 dc-altrus3 kernel: Normal: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB
1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 472kB
Feb 14 10:13:02 dc-altrus3 kernel: HighMem: 0*4kB 4*8kB 0*16kB 3*32kB 2*64kB
1*128kB 3*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1664kB
Feb 14 10:13:02 dc-altrus3 kernel: Swap cache: add 9963, delete 9653, find
3907/5053, race 0+0
Feb 14 10:13:02 dc-altrus3 kernel: 0 bounce buffer pages
Feb 14 10:13:02 dc-altrus3 kernel: Free swap:       2026800kB
Feb 14 10:13:02 dc-altrus3 kernel: 1572864 pages of RAM
Feb 14 10:13:02 dc-altrus3 kernel: 819184 pages of HIGHMEM
Feb 14 10:13:02 dc-altrus3 kernel: 538072 reserved pages
Feb 14 10:13:02 dc-altrus3 kernel: 1005107 pages shared
Feb 14 10:13:02 dc-altrus3 kernel: 310 pages swap cached
Feb 14 10:13:02 dc-altrus3 kernel: Out of Memory: Killed process 22272 (eggcups)

[root@dc-altrus3 ~]# uname -a
Linux dc-altrus3.cisco.com 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT 2005
i686 athlon i386 GNU/Linux
[root@dc-147n34 ~]# uname -a
Linux dc-147n34.cisco.com 2.6.9-22.EL #1 Mon Sep 19 18:20:28 EDT 2005 i686
athlon i386 GNU/Linux
[root@dc-altrus3 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 2)
[root@dc-147n34 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 2)
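
A consolidated sketch of steps 1-4 above, as run on one box of the pair.  The
peer hostname is a placeholder; the ports, scratch directory, and uid are copied
verbatim from the steps, and nc and bonnie++ are assumed to be installed.

#!/bin/sh
PEER=boxB                      # placeholder: the other machine in the pair

# Step 1 (as given above): listen on port 8080, dumping traffic to /dev/null.
nc -l -p 8080 -o /dev/null &

# Steps 2/3: push an endless zero stream at the peer (ports as in the steps).
cat /dev/zero | nc -p 8090 $PEER 8090 &

# Step 4: hammer the local ext3 filesystem with bonnie++ as uid 500.
mkdir -p /tmp/bonnie
for i in `seq 1 1000`; do bonnie++ -d /tmp/bonnie -u 500; done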


Comment 10 Jos VanWezel 2006-07-03 22:23:58 UTC
I see the same happening with iozone.  I can repeatedly crash the program when it
is testing with a huge (20 GB) file on an ext3 file system.  Will try release
2.6.9-34 tomorrow.

# uname -a
Linux f01-110-127 2.6.9-22.0.2.ELsmp #1 SMP Thu Jan 5 17:13:01 EST 2006 i686
athlon i386 GNU/Linux

Free swap:       2096168kB
1064960 pages of RAM
819056 pages of HIGHMEM
26316 reserved pages
1004450 pages shared
1 pages swap cached
Out of Memory: Killed process 14444 (iozone).
oom-killer: gfp_mask=0xd0
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16

Free pages:       14376kB (1664kB HighMem)
Active:7732 inactive:997712 dirty:318651 writeback:19911 unstable:0 free:3594
slab:25373 mapped:4437 pagetables:287
DMA free:12568kB min:16kB low:32kB high:48kB active:0kB inactive:0kB
present:16384kB pages_scanned:956193 all_unreclaimable? yes
protections[]: 0 0 0
Normal free:144kB min:928kB low:1856kB high:2784kB active:13180kB
inactive:741660kB present:901120kB pages_scanned:1127181 all_unreclaimable? yes
protections[]: 0 0 0
HighMem free:1664kB min:512kB low:1024kB high:1536kB active:17748kB
inactive:3249188kB present:3342336kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 2*4kB 4*8kB 3*16kB 2*32kB 4*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB
2*4096kB = 12568kB
Normal: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 144kB
HighMem: 34*4kB 35*8kB 12*16kB 5*32kB 2*64kB 4*128kB 1*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 1664kB
Swap cache: add 1182, delete 1180, find 289/456, race 0+0
0 bounce buffer pages
Free swap:       2096140kB
1064960 pages of RAM
819056 pages of HIGHMEM
26316 reserved pages
1007846 pages shared
2 pages swap cached
Out of Memory: Killed process 14657 (iozone).

Comment 11 Eli Stair 2006-09-21 16:40:52 UTC
Pinging this issue, as it's _still_ present in 2.6.18-rc7.

Most recently confirmed hardware:

2.6.18-rc7
Dual opteron 248
8GB RAM
36GB external journal on RAID1 U320 SCSI disks (15k RPM)
1.7T RAID5 mdadm data filesystem on dual-port 2GB (multipathing) FCAL JBOD tray
mounted RAID5 with 'commit=5' 'commit=30' 'commit=300'
mounted RAID5 with data=journal
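
For reference, a sketch of the mount invocations the two "mounted RAID5 with"
lines above describe.  The md device and mount point are placeholders; only the
commit= and data=journal ext3 options are taken from this report.

# Placeholder device/mount point; vary commit= (5/30/300) as described above.
mount -t ext3 -o commit=30 /dev/md0 /data
mount -t ext3 -o data=journal /dev/md0 /data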

Any combination of tuning options at my disposal results in an eventual CRASH
after the oom-killer starts up.  Heavy loads are induced with iozone, bonnie++,
NFS serving to multiple test clients, etc.

Recreating the RAID5 without an external journal file resolves the issue 100%.

I don't know offhand how to confirm this, but it appears that writes are bound
for the journal first before being read back and written to the main
filesystem, and that writes are not committing to the journal fast enough, thus
filling up main memory and triggering the oom-killer.

Write performance on the journal is about 75% of what is possible on the RAID5,
so I can see inbound data writes being backed up by journal bottlenecks in some
situations.  /IF/ that is the case, I don't see any way to make this a usable
scenario, short of having a journal device at least 2x faster than the main
filesystem AND capable of handling reads and writes almost simultaneously.
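
A rough, destructive sketch for checking the journal-bottleneck theory: compare
raw sequential write throughput of the external journal device against the
RAID5 data array before the filesystems are created.  Device paths are
placeholders, and the commands below will wipe whatever is on them.

# Time a 4GB streaming write to each block device (placeholder paths).
# Run only on scratch devices, before mke2fs.
time sh -c 'dd if=/dev/zero of=/dev/sdX2 bs=1M count=4096; sync'   # journal device
time sh -c 'dd if=/dev/zero of=/dev/md0 bs=1M count=4096; sync'    # RAID5 data array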

Is this ONLY feasible with a BB-RAM device as a journal blockdev?  I have been
able to use this successfully in only a few circumstances, when data writes are
sporadic/low-bandwidth and/or a slow RAID device is used for the main data
portion of the ext3 filesystem.

/eli

PS - as usual, stride/blocksize options on the ext3 volume make no difference in
performance.


Comment 12 Jiri Pallich 2012-06-20 16:56:42 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release you requested us to review has now reached End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.