From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
Our application guys are complaining about performance issues with the latest RHEL3 kernel, 2.4.21-5EL. They first noticed the problem about a week ago with the 2.4.21-4.0.1 kernel, and I hoped the U1 kernel, with all its VM fixes for Oracle, might help us out, so I gave them the U1 kernel.

They have narrowed the problem down to two separate issues and provided me with a simple reproducer that illustrates the problems some of the simulation codes are running into. The test program simply builds a very long doubly linked list. While building the list it never traverses it; it always appends at the tail. In theory this should make the beginning of the list the oldest pages in the machine. The list is deliberately sized to be bigger than the available RAM of the machine, so it does hit swap; the pages holding the head of the list should therefore be the ones swapped out. Once the list reaches the specified length, the program traverses it from tail to head, so the pages it accesses first are the newest ones. (A rough sketch of the program's structure follows the report below.)

#1 There seems to be a problem with the performance of a single run. On a 7.3-based distro still using a 2.4.18-based kernel, they see consistent run times of around 2 minutes 45 seconds on a 2.2 GHz box. On a RHEL3 machine running the 2.4.21-5EL kernel, the best-case runs take 5 minutes 45 seconds.

#2 With the old 2.4.18-based kernel they could run the program over and over and get about the same performance. With the RHEL3 kernel, however, performance on the second and subsequent runs drops to around 8 minutes.

The reason they caught this problem in early testing, and the reason we are still running the last 2.4.18 errata kernel in production, is that we first saw this problem when we were testing RHL9. Then, when the errata kernels for 7.3 moved to the 2.4.20 series, we saw the problem appear there as well. This was some of the leverage I used in convincing them to move to RHEL3. The fact that I'm now seeing the same problem with RHEL3's kernel is not making me look very good, and it has put our plans to move RHEL3 into production on hold.

The reason this second problem is such an important issue is that the people running the simulation software tend to write their codes in such a way that they seldom if ever touch swap. They want to use every last bit of available memory without touching swap and paying the performance penalty. If that threshold changes from run to run, they are rather upset. It also appears that on the second and subsequent runs the program hits swap much sooner. On the first run it gets to about 1.8 GB before it starts swapping, which is comparable to what we have been seeing with 2.4.18. The subsequent runs, however, seem to begin swapping at around 1.2 GB. This upsets the application developers, who feel they have lost 600 MB of available RAM.

Currently, the problem appears to happen only on ia32. ia64 either does not have the problem or takes more to provoke it.

I have attached the little test program. Could someone with a deeper understanding of the kernel's VM subsystem please run it? You may have to comment out the printing and tweak how many nodes it puts on the linked list based on how much memory is in your box.
Version-Release number of selected component (if applicable): 2.4.21-5EL

How reproducible: Always

Steps to Reproduce:
1. Compile the attached program. (The program is designed to trigger the bug on a 2GB machine; if you have more or less RAM you may need to tweak the number of items it puts on the linked list. I also found removing the printf's helpful.)
2. time ./a.out
3. Repeat several times.

Actual Results:
1) In the best case, performance about half as good as seen with the last 2.4.18 errata kernel for 7.3.
2) Inconsistent performance between runs.
3) A sharp decrease, between runs, in the amount of memory that can be used before there is a notable performance degradation.

Expected Results:
1) Performance on par with 7.3.
2) Consistent performance between runs of the program.
3) The same amount of memory available between runs before a performance degradation kicks in.

Additional info:
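For anyone who wants to try this without pulling the attachment, here is a minimal sketch of the reproducer's structure. This is a reconstruction from the description above, not the attached source; NNODES and PAYLOAD are assumptions sized for a ~2GB ia32 box and will need tuning so the list exceeds your physical RAM.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Guesses, not values from the attachment: tune NNODES so the
 * list comfortably exceeds physical RAM on your machine. */
#define NNODES  (20L * 1024 * 1024)
#define PAYLOAD 96

struct node {
    struct node *prev;
    struct node *next;
    char pad[PAYLOAD];          /* filler to inflate the memory footprint */
};

int main(void)
{
    struct node *head = NULL, *tail = NULL, *n;
    long i;
    volatile char sink = 0;     /* keeps the traversal from being optimized out */

    printf("build %ld\n", (long)time(NULL));

    /* Build phase: always append at the tail and never revisit old
     * nodes, so the head of the list holds the oldest pages. */
    for (i = 0; i < NNODES; i++) {
        n = malloc(sizeof(*n));
        if (!n) {
            perror("malloc");
            return 1;
        }
        n->prev = tail;
        n->next = NULL;
        n->pad[0] = (char)i;    /* touch the page so it is really committed */
        if (tail)
            tail->next = n;
        else
            head = n;
        tail = n;
    }

    printf("traverse %ld\n", (long)time(NULL));

    /* Traverse phase: walk tail to head, so the most recently touched
     * (in-core) pages are hit first and the swapped-out head pages last. */
    for (n = tail; n != NULL; n = n->prev)
        sink ^= n->pad[0];

    printf("done %ld\n", (long)time(NULL));
    (void)head;
    return 0;
}

Build it with something like "gcc -O2 reproducer.c" and run it under /usr/bin/time -v as in the steps above; the build/traverse/done timestamps match the output format shown in the timing results later in this report.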
Created attachment 96415 [details] test program that illustrates the problem.
Additionally, under Expected Results we would like to add: 4) roughly the same amount of available RAM before performance is impacted as we see with the later 2.4.18 errata kernels.
Created attachment 96436 [details] new version of program that reproduces the problem
These are results from 2.4.21-4.0.1EL:

toad5@ben: /usr/bin/time -v ./a.out
build 1071007855
traverse 1071007883
done 1071008120
build=28 sec traverse 237 sec total=265 sec
Command being timed: "./a.out"
User time (seconds): 1.74
System time (seconds): 16.43
Percent of CPU this job got: 6%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:26.82
<snip>
Major (requiring I/O) page faults: 43291
Minor (reclaiming a frame) page faults: 1042095
<snip>

These are results from 2.4.21-5EL:

toad6@ben: /usr/bin/time -v ./a.out
build 1071007394
traverse 1071007423
done 1071007653
build=29 sec traverse 230 sec total=259 sec
Command being timed: "./a.out"
User time (seconds): 1.40
System time (seconds): 18.79
Percent of CPU this job got: 7%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:20.44
<snip>
Major (requiring I/O) page faults: 264199
Minor (reclaiming a frame) page faults: 822589
<snip>

These are the results with a 2.4.18-27 kernel:

Command being timed: "./a.out"
User time (seconds): 1.73
System time (seconds): 25.35
Percent of CPU this job got: 16%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:48.36
<snip>
Major (requiring I/O) page faults: 309483
Minor (reclaiming a frame) page faults: 819566
<snip>
I have to eat crow on this one. We determined that a non-obvious difference in the node configuration (different drive speeds and partitioning) led to the vast majority of the performance difference between the two runs. Once we corrected for this, the performance difference dropped from a 120% slowdown to a 10% slowdown, which may or may not be caused by the VM. However, the customer is still spooked that there may be gremlins hiding in the VM subsystem and that our reproducer may simply have failed to replicate the issue. We have seen some anomalous performance variations between the two kernels, but we have yet to isolate them to a reproducible state.