Red Hat Bugzilla – Bug 59748
VM problem cripples shm performance (Oracle)
Last modified: 2007-03-26 23:51:19 EDT
Description of Problem:
We ran into performance problems running Oracle on the Red Hat 2.4.9 kernel (-13
but -21 has the same problem). We have managed to isolate the issue to a very
small test program that can reproduce the problem. In a nutshell, once the
buffer/page cache is filled (e.g. use dd to copy a file larger than memory), I/O
to shm memory from a raw partition always incurs a swap I/O for the shm page. If
the memory is obtained via malloc instead, this problem does not occur.
Version-Release number of selected component (if applicable):
Red Hat Linux 7.2, kernel 2.4.9-13 or -21. I'm sure that the 2.4.9-7 kernel has
similar issues since they all use the 2.4.9-ac10 patch.
How reproducible:
100%. While there is free memory, performance is identical. Once there is not,
shm performance tanks.
Steps to Reproduce:
I have a small test program that reproduces the problem.
1. Attach a raw partition.
2. Use dd to copy a file larger than memory to pollute/fill the page/buffer
cache. Wait for the buffers to be flushed and the system to quiesce.
3. Run the program, which performs random reads from the raw partition into
memory that is either malloced or SysV shm. Even though the program was changed
to do I/O to the same page every time, performance with shm is 1/2 or worse
than performance with malloc.
Actual Results:
The I/O rate to the shm memory is 1/2 that of the malloced memory. If the raw
partition is on a high-speed, low-latency drive (in our case an Imperial
Technology Megaram FC ramdisk), performance drops to 1/7!
Expected Results:
Performance should be the same regardless of whether the memory is obtained via
SysV shm or malloc. This is the case with the Linus 2.4.9 kernel.org kernel.
I will add the program as an attachment.
Created attachment 45480
The test program should be invoked on a raw partition set up using the raw
command. A blocksize of 4096 (1 page) is fine. The number of blocks in the file
depends on the size of the partition. We've used 32MB/512 for a RAID, which
makes the workload effectively cached. The number of preads needs to be a
decent size (50,000).
The MALLOC numbers remain steady, the SHM numbers drop off once the page/buffer
cache is populated.
I haven't looked at your program yet (that's next ;) but I just finished
building a kernel with a "slightly differently tuned vm".
Have you had a chance to test the test kernels?