Red Hat Bugzilla – Bug 59748
VM problem cripples shm performance (Oracle)
Last modified: 2007-03-26 23:51:19 EDT
Description of Problem:
We ran into performance problems running Oracle on the Red Hat 2.4.9 kernel (-13
but -21 has the same problem). We have managed to isolate the issue to a very
small test program that can reproduce the problem. In a nutshell, once the
buffer/page cache is filled (e.g. use dd to copy a file larger than memory), I/O
to shm memory from a raw partition always incurs a swap I/O for the shm page. If
the memory is obtained via malloc instead, this problem does not occur.
Version-Release number of selected component (if applicable):
Red Hat Linux 7.2, kernel 2.4.9-13 or -21. I'm sure that the 2.4.9-7 kernel has
similar issues since they all use the 2.4.9-ac10 patch.
How reproducible:
100%. While there is free memory, performance is identical. Once there is not,
shm performance tanks.
Steps to Reproduce:
I have a small test program that reproduces the problem.
1. Attach a raw partition.
2. Use dd to copy a file larger than memory to pollute/fill the page/buffer
cache. Wait for the buffers to be flushed and the system to quiesce.
3. Run the program, which performs random reads from the raw partition into
memory that is either malloced or SysV shm. Even though the program was changed
to do I/O to the same page every time, performance with shm is 1/2 or worse
than performance with malloc.
Actual Results:
The I/O rate to the shm memory is 1/2 that of the malloced memory. If the raw
partition is on a high-speed, low-latency drive (in our case an Imperial
Technology Megaram FC ramdisk), performance drops to 1/7!
Expected Results:
Performance should be the same regardless of whether the memory is obtained via
SysV shm or malloc. This is the case with the Linus 2.4.9 kernel.org kernel.
I will add the program as an attachment.
Created attachment 45480
The test program should be invoked on a raw partition set up using the raw
command. A blocksize of 4096 (1 page) is fine. The number of blocks in the file
depends on the size of the partition. We've used 32MB/512 for a RAID, which
makes the workload effectively cached. The number of preads needs to be a
decent size (50,000).
The MALLOC numbers remain steady, the SHM numbers drop off once the page/buffer
cache is populated.
I haven't looked at your program yet (that's next ;) but I just finished
building a kernel with a "slightly differently tuned vm".
Have you had a chance to test the test kernels?