From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070312 Red Hat/1.5.0.10-2.el5 Firefox/1.5.0.10 pango-text

Description of problem:

Vijay Trehan wrote:
> Larry,
>
> Please see attached.
>
> They are running into problems - unusual behaviour - similar to what
> they ran into with RHEL4 some time ago.

OK. Have them get me:
1.) "vmstat 1" output, and
2.) AltSysrq-M/W outputs when the system is at 100% system time.

>
> Vijay
>
> ------------------------------------------------------------------------
>
> Subject: Update: Cache performance characterization on Xeon
> From: "Yefim Somin" <Yefim.Somin>
> Date: Mon, 23 Jul 2007 19:23:03 -0400
> To: <vtrehan>
>
> I have done a series of runs on the 8-core Clovertown under RHEL5. I
> can report that I had no problem using telnet for several thousand
> users. It turned out that some telnet config files differ slightly
> between RHEL4 and RHEL5 (some parameters relied on were not included
> previously, hence taken at default values, but were included now and had
> to be changed/removed - thanks to Sandeep for a tip), but that was easy
> to fix.
>
> The important part is that I ran up to 5000 users and at that level
> encountered the same problem that I reported on RHEL4 quite some time
> ago. Namely, the run includes "reasonable" periods when CPU
> utilization is mostly user and the level is well below saturation, but
> also periods with 100% sys CPU when no useful work is done. This results
> in extremely long and essentially invalid completion times. As I just
> implied, once the problem is fixed, the actual rating of the system
> should be much higher than 5000, based on the good stretches observed. I
> think we need to check again what could be done about the support case
> that has been opened (along with a lot of collected data attached to
> it).
>
> Thanks,
> ys

==========================================

Hi Shak,

Thanks for your response.
I believe this is a continuation of the problem described in support case 1056326, where quite a few sets of collected diagnostic data are attached. I think I had a brief discussion of this topic with you at the Red Hat Summit.

I am also attaching the latest couple of items collected on RHEL5. This was done on a system provided by Intel that we currently have in-house; however, it was also observed at the Intel Phoenix lab by Sandeep.

I have some heavy engagements right now, so I am not clear when I may have a window for a synchronized conversation. In particular, I will be inaccessible on Monday.

Regards,
Fima

=======================================

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Simulate a benchmark load of 4000-5000 users

Actual Results:

Expected Results:

Additional info:
Created attachment 160264 [details]
vmstat 1 and sysrq command outputs -- around 100% sys cpu util

Vijay,

Collected items are attached (vmstat shows a short 100% interval along with getting into it and out of it; sometimes those intervals are much longer).

Say hi to Larry.

Thanks,
ys
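The collection Larry asked for ("vmstat 1" plus AltSysrq-M/W dumps) can be scripted so it is ready to run when a 100% sys stretch appears. A minimal sketch, assuming a Linux system where SysRq is enabled; the sample counts and file names are illustrative choices, not from this bug:

```shell
#!/bin/sh
# Collect the diagnostics requested in this bug: vmstat samples plus
# SysRq-M (memory info) and SysRq-W dumps taken while the stall happens.
OUT=diag_out
mkdir -p "$OUT"

# One-second vmstat samples; watch the "sy" column for ~100% system time.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 2 > "$OUT/vmstat.log"
fi

# SysRq output lands in the kernel ring buffer; this part needs root and
# kernel.sysrq enabled, so it is skipped gracefully otherwise.
if [ -w /proc/sysrq-trigger ]; then
    echo m > /proc/sysrq-trigger
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200 > "$OUT/sysrq.log"
fi
```

Run it (as root, for the SysRq part) during a bad stretch and attach the contents of diag_out to the bug.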
Created attachment 160270 [details] Some more description of the problem
Created attachment 160315 [details] proc/meminfo when problem occurs
Created attachment 160920 [details] Output from Sandeep after using Larry's fix
FYI, the cause of this problem is that one large file was mapped and faulted into the pagecache before the remaining RAM was mapped into anonymous regions. When the system finally does run out of RAM, the first several thousand pages on the inactive list are from that single mapped file. This causes every CPU to enter try_to_free_pages() and eventually get stuck on the same mapping->i_mmap_lock in page_referenced_file(). Since the system has tons of anonymous pages, it shouldn't be reclaiming mapped file pages from the same file on every CPU, so I just skip mapped file pages if the system has mostly anonymous memory pages.

--- linux-2.6.18.noarch/mm/page_alloc.c.orig
+++ linux-2.6.18.noarch/mm/page_alloc.c
@@ -1289,7 +1289,7 @@ void show_free_areas(void)
 		K(nr_free_highpages()));
 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u slab:%lu mapped-file:%lu mapped-anon:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		global_page_state(NR_FILE_DIRTY),
@@ -1298,6 +1298,7 @@ void show_free_areas(void)
 		nr_free_pages(),
 		global_page_state(NR_SLAB),
 		global_page_state(NR_FILE_MAPPED),
+		global_page_state(NR_ANON_PAGES),
 		global_page_state(NR_PAGETABLE));
 	for_each_zone(zone) {
--- linux-2.6.18.noarch/mm/vmscan.c.orig
+++ linux-2.6.18.noarch/mm/vmscan.c
@@ -808,6 +808,8 @@ force_reclaim_mapped:
 		if (page_mapped(page)) {
 			if (!reclaim_mapped ||
 			    (total_swap_pages == 0 && PageAnon(page)) ||
+			    ((global_page_state(NR_FILE_MAPPED) < global_page_state(NR_ANON_PAGES)) &&
+			     !PageAnon(page)) ||
 			    page_referenced(page, 0)) {
 				list_add(&page->lru, &l_active);
 				continue;
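The patch keys off whether mapped file pages are outnumbered by anonymous pages (NR_FILE_MAPPED < NR_ANON_PAGES). A rough userspace view of the same ratio can be read from /proc/meminfo; this is a sketch, assuming the RHEL5-era field names where "AnonPages" reports NR_ANON_PAGES and "Mapped" reports NR_FILE_MAPPED, both in kB:

```shell
# Compare mapped-file vs anonymous memory the way the patched check does.
awk '/^AnonPages:/ { anon = $2 }
     /^Mapped:/    { mapped = $2 }
     END {
         printf "mapped-file: %d kB, anon: %d kB\n", mapped, anon
         if (mapped < anon)
             print "anon-dominated: the patched kernel would skip mapped file pages"
     }' /proc/meminfo
```

On a system loaded the way this benchmark loads it (RAM mostly in anonymous regions), the anon figure should dwarf the mapped-file figure.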
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Does the patched kernel resolve this issue well enough to satisfy the customer?

Larry Woodman
Larry,

Fima (InterSystems) and Sandeep (Intel) report the benchmark runs a little longer but then runs into higher sys cpu util. Sandeep's output is in the last attachment, dated 08/08/2007.

==============================================

Yefim Somin wrote:
> I did try it and the results were also better, but I still saw stretches
> of elevated sys CPU (not 100% though). Unfortunately, I am extremely
> stretched right now and until 8/20 to do hands-on tooling beyond an
> occasional run in the background.
>
> I also did some runs with fatter users requiring fewer of them (about
> half) for a similar load. I did not see special stretches of bad
> behavior there, but the overall sys CPU was higher in relation to user
> CPU (about 10:8 user to system) than what I would normally expect. This
> also resulted in the response time curve going up much earlier than the
> overall CPU utilization warranted. It would be useful to do something
> like oprofiling and other things here, but as I mentioned, I don't have
> the cycles to do hands-on work now.
>
> Thanks,
> ys
>
> P.S. For Sandeep's info: more intensive load is prepared by running a
> script (similar to password change scripts) which replaces T2 with T1 in
> RTE scripts, i.e., 2 sec think time with 1 sec think time. The elapsed
> time yardstick is also halved (250 sec vs. 500 sec).
>
> -----Original Message-----
> From: Gupta, Sandeep R [sandeep.r.gupta]
> Sent: Tuesday, August 07, 2007 1:58 PM
> To: Larry Woodman; Yefim Somin
> Cc: vtrehan; Ed Rudack; John Shakshober; Kaditz, Barry A;
> Gupta, Sandeep R
> Subject: RE: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd:
> Update: Cache performance characterization on Xeon]]]]]]
>
> Hi Larry
>
> I reran the benchmark and this time the average response time for 5000
> users with the kernel appeared to be better than before. I am going to
> run a couple more runs to make sure the results are consistent.
>
> In the meantime, attached is the /var/log/messages file with AltSysrq-M
> and AltSysrq-T outputs.
>
> One question on the kernel you provided:
>
> Will the changes in this kernel be available in a subsequent general
> release of RHEL5?
>
> Fima
>
> Could you try the kernel Larry provided in your environment to see if
> the problems you saw the year before are being addressed by this?
>
> Sandeep
Larry,

Some more feedback from Fima (InterSystems). He reports getting 4500 users on a 4-core Xeon. That is what they are currently getting from an 8-core Xeon using RHEL. Sandeep (Intel) gets a little further because he has a higher clock rate Xeon.

=================================

Yefim Somin wrote:
> Vijay,
>
> We do not publish platform comparisons, but based on previous
> experiences I would say that this rating is low for this type of
> platform. We have seen ratings in the vicinity of 4500 users on a 4-core
> Xeon platform. I think we need to do more work here first to fix this
> outstanding problem, and then possibly also to investigate the causes of
> generally elevated sys CPU times and the response time curve growth out
> of proportion to the CPU utilization growth (i.e., despite relatively
> low overall utilization) that I mentioned in my recent email. This would
> be my suggestion.
>
> Regards,
> Fima
>
> -----Original Message-----
> From: Vijay Trehan [vtrehan]
> Sent: Tuesday, August 07, 2007 5:16 PM
> To: Yefim Somin; Ed Rudack; Gupta, Sandeep R
> Cc: Kaditz, Barry A; Richard Li; Larry Woodman; John Shakshober; Vijay Trehan
> Subject: What is the target # Ontario users for the Xeon SUT
>
> Fima / Ed / Sandeep,
>
> Given that we would like to publish a white paper, have a draft by Sept
> 1 if possible.
>
> After putting in some of the fixes that Larry Woodman has provided and
> any future fixes, let's say we can handle 5000 Ontario users without any
> perf anomaly.
>
> What I was curious about is the target rating for this class of
> machine. Is it 5500? 7500? 10000? Ballpark.
>
> The reason I ask is that if it's 5500, we can decide to write the white
> paper while we tune the performance further, while if the target is
> 10000 and we are getting only 5000, then we must wait to fix the problem.
>
> Please send me your inputs.
>
> Thanks,
>
> Vijay
Yefim Somin wrote:

Vijay,

Here is a more detailed description of the problems, of which there seem to be more than one level.

1. Spontaneous elevated sys CPU

This has been observed in two flavors. The initial one, reported last year, included long stretches with 100% sys CPU utilization. It extended the execution time very significantly. With a patch from Larry, this phenomenon was significantly reduced.

What has been observed with the patch was smaller stretches of elevated sys CPU not rising to the level of 100%. I assume that sysrq outputs for those stretches are requested. Given my work situation right now and the need to observe the test to produce that data, I am unable to do it. I believe that Sandeep should be able to do it by running with a higher number of users and collecting sysrq outputs when something like frequent vmstat shows this behavior.

2. High response time at low CPU utilization

As I have mentioned, I did some runs on Clovertown and Caneland with a slightly modified version of the benchmark, with the think time between transactions reduced from 2 sec to 1 sec. This allows us to put a higher load on the system with a smaller number of users. Roughly speaking, the rating on this version is half of the number of users of the traditional benchmark version. The execution time yardstick is also about half. I would stress that these are rough correspondences.

When run on Clovertown (8 cores, 2.13GHz) and Caneland (16 cores, 2.9GHz), both RHEL5, this flavor of the benchmark produced ratings of 3000 and 4000 users respectively. For these runs, reported CPU utilizations were 55% (39u + 16s) and 39% (22u + 17s) respectively.

I make the following observations:

- we take our rating at the point when the response time curve has gone up significantly from the initial no-contention level; in all of the well-behaved cases on various platforms this happens when CPU utilization is nearing 100%; in this case the CPU is only approximately half utilized or less

- the share of system CPU here is notably higher than normally observed for the Ontario benchmark, especially on the 16-core system

- the increase in capacity between the two platforms is much less than would be expected based on the number of cores and the clock rate (nobody expects perfect scaling, of course)

The described phenomenon of high elapsed times at low CPU utilizations is not like anything we have seen before. As a result, we clearly can demonstrate only a fraction of the capacity of these systems. Given that Sandeep has Clovertown, he could probably reproduce the behavior by running the 1 sec think time version at a slightly higher number of users than I did and take sysrq outputs, if that is the first thing desired. If a more detailed study is then warranted, I suppose that somebody from RH may visit either Intel or ISC and make more detailed observations. In my case, however, I am not only saturated now, but will be traveling next week for my engagements, and this may spill over into the week after as well.

Note that there were no especially elevated sys CPU stretches like the ones described in the first section, but the numbers of users were also lower. Looking back on the traditional benchmark with some elevated sys CPU stretches, I believe those stretches alone could not have accounted for the observed increase of the response time, and we are dealing with a similar phenomenon with the traditional benchmark as well.

I hope this is a clearer explanation of the situation.

Regards,
Fima
Vijay,

A couple of clarifications here:

1) Short stretches of elevated sys CPU

Sandeep should run the old benchmark here with a higher number of users (above 5000 and on, maybe) to observe the appearance of such stretches and take sysrq etc. at those points. This is to continue on the follow-up path of the patch from Larry.

2) High response times at low CPU utilization

This is a different and apparently anomalous phenomenon. This needs to be run with the fat benchmark (Sandeep should run a script to convert T2 to T1 in the RTE scripts; it could be reversed back to T2 with little effort when needed). This is not stretches of high sys CPU; rather, the overall CPU utilization is steady and low but the response time is high (reaching around 250 sec and up, which is roughly equivalent to 500 sec and up for the original benchmark). I noted a suspiciously high share of sys CPU in the overall utilization (say a third or more of the total CPU utilization, while normally it's under 20% of the total) as just an indicator of what may need further study, like profiling etc.

Hope this clarifies the steps,
Thanks,
Fima

-----Original Message-----
From: Vijay Trehan [vtrehan]
Sent: Friday, August 10, 2007 11:10 AM
To: Yefim Somin
Cc: Gupta, Sandeep R; Larry Woodman; Ed Rudack; John Shakshober; Kaditz, Barry A; Richard Li; Vijay Trehan
Subject: Re: Cache performance characterization on Xeon - technical description

Hi Fima,

Thanks for taking the time to write the attached description.

I was on the phone with Sandeep this morning. Sandeep wanted to know what specific benchmark runs he should perform to reproduce the problem. I would like your input on this ASAP, so Sandeep can collect the data for Larry.

Here is my understanding:
1. Sandeep sticks to using the original Ontario benchmark
2. You ran 3000 fat Ontario users on the Clovertown and observed short stretches of elevated sys cpu util, which is roughly equal to 6000 regular Ontario users
3.
Since Sandeep's Clovertown machine has a higher clock rate, he may need to run ~7000 regular Ontario users
4. Sandeep should run a ~7000 user benchmark and collect sysrq and other outputs for the stretches with elevated CPU
5. In the event there are no such stretches, we may need to try a higher number of users

Does this represent a correct understanding of your message below?

Thanks

Vijay

Yefim Somin wrote:
> [Fima's detailed description of the two problems, quoted in full in an earlier comment]
>
> P.S. I tried to get added to the Bugzilla cc list for the case entered by
> Vijay, but it would not let me login with my credentials. Alternatively,
> I could login at the support site, but did not know how to get to
> Bugzilla from there.
>
> -----Original Message-----
> From: Vijay Trehan [vtrehan]
> Sent: Thursday, August 09, 2007 11:52 AM
> To: Yefim Somin
> Cc: Gupta, Sandeep R; Larry Woodman; Ed Rudack; John Shakshober; Kaditz, Barry A; Vijay Trehan
> Subject: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd:
> Update: Cache performance characterization on Xeon]]]]]]
>
> Fima,
>
> I know you are tied up fighting some fire. However, whenever you have a
> chance can you run more users until you get ~100% sys cpu util and send
> the outputs for Larry.
> Thanks.
>
> Vijay
>
> [Yefim's and Sandeep's replies, quoted in full in an earlier comment]
>
> -----Original Message-----
> From: Larry Woodman [lwoodman]
> Sent: Friday, August 03, 2007 12:07 PM
> To: Yefim Somin
> Cc: Gupta, Sandeep R; vtrehan; Ed Rudack; John Shakshober; Kaditz, Barry A
> Subject: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd:
> Update: Cache performance characterization on Xeon]]]]]]
>
> Yefim Somin wrote:
>> Just to clarify: there were hang-ups with similar CPU manifestations on
>> some other systems for some runs, but not in this round. This time it's
>> high sys CPU utilization, which could of course delay execution of a
>> command while it's going on.
>>
>> ys
>
> I already verified that the high CPU time was SMP spinlock contention and
> I am certain that the kernel I sent you eliminated that specific spinlock
> from the picture (because I removed it from the kernel entirely). I need
> the next debug outputs (the AltSysrq-M and AltSysrq-W) so I can see what
> else is causing the problem after the first was removed.
>
> Larry
I was hoping not to see the original problem with fewer users and to be able to load the system more, but ran into this next-level problem.

ys

-----Original Message-----
From: Vijay Trehan [vtrehan]
Sent: Friday, August 10, 2007 11:39 AM
To: Yefim Somin
Subject: Re: Cache performance characterization on Xeon - technical description

Fima,

Got it. Just one question. Why do you have to run the fat benchmark? Wouldn't the regular benchmark with twice the number of users be the same?

Vijay
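The "fat benchmark" conversion Fima describes is a plain text substitution: replace the T2 think-time token (2 sec) with T1 (1 sec) in the RTE scripts. A sketch using a made-up demo file, since the real RTE script names and format are not shown in this bug:

```shell
# Demo stand-in for an RTE script (hypothetical format and file name).
mkdir -p rte_scripts
printf 'LOGIN\nTHINK T2\nQUERY\nTHINK T2\nLOGOUT\n' > rte_scripts/demo.rte

# Replace 2-sec think time (T2) with 1-sec (T1); keep .bak copies so the
# change can be reversed back to T2 with little effort, as Fima notes.
for f in rte_scripts/*.rte; do
    [ -f "$f" ] || continue
    sed -i.bak 's/T2/T1/g' "$f"
done
cat rte_scripts/demo.rte
```

Remember that the elapsed-time yardstick is halved along with the think time (250 sec vs. 500 sec), so results are compared against the shorter target.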
Larry Woodman wrote: > On Fri, 2007-08-10 at 11:29 -0400, Yefim Somin wrote: >> Vijay, >> >> A couple of clarifications here: >> >> 1) Short stretches of elevated sys CPU > Once again I need the AltSysrq-M and AltSysrq-W outputs when the system > gets slow with the kernel I sent you running. >> Sandeep should run the old benchmark here with a higher number of users (above 5000 and on maybe) to observe the appearance of such stretches and take sysrq etc. at those points. This is to continue in the follow up path of the patch from Larry >> >> 2) High response times at low CPU utilization >> >> This is a different and apparently anomalous phenomenon. This needs to be run with a fat benchmark (Sandeep should run a script to convert T2 to T1 in the RTE scripts; it could be reversed back to T2 with little effort when needed). This is not stretches of high sys CPU, rather the overall CPU utilization is steady and low but the response time is high (reaching around 250 sec and up which is roughly equivalent to 500 sec and up for the original benchmark). >> I noted a suspicious higher share of sys CPU in the overall utilization (say a third or more of the total CPU utilization, while normally it's under 20% of the total) as just an indicator of what may need further study, like profiling etc. >> >> Hopes this clarifies the steps, >> Thanks, >> Fima >> >> -----Original Message----- >> From: Vijay Trehan [vtrehan] >> Sent: Friday, August 10, 2007 11:10 AM >> To: Yefim Somin >> Cc: Gupta, Sandeep R; Larry Woodman; Ed Rudack; John Shakshober; Kaditz, Barry A; Richard Li; Vijay Trehan >> Subject: Re: Cache performance characterization on Xeon - technical description >> >> Hi Fima, >> >> Thanks for taking the time to write the attached description. >> >> I was on the phone with Sandeep this morning. Sandeep wanted to know >> what specific benchmark runs he should perform to reproduce the problem. 
>> >> I would like your input on this ASAP, so Sandeep can collect the data >> for Larry. >> >> Here is my understanding: >> 1. Sandeep stick to using the original Ontario benchmark >> 2. You ran 3000 fat Ontario users on the Clovertown and observed short >> stretches of elevated sys cpu util == which is roughly equal to 6000 >> regular Ontario users >> 3. Since Sandeep's Clovertown machine has a higher clock rate, he may >> need to run ~7000 regular Ontario users >> 4. Sandeep should run a ~7000 user benchmark and collect sysrq and other >> outputs for the stretches with elevated CPU >> 5. In the event there are no such stretches, we may need to try a higer >> number of users >> >> Does this represent a correct understanding of your message below? >> >> Thanks >> >> Vijay >> >> Yefim Somin wrote: >>> >>> >>> Vijay, >>> >>> >>> >>> Here is a more detailed description of the problems, of which there >>> seem to be more than one level. >>> >>> >>> >>> 1. Spontaneous elevated sys CPU >>> >>> >>> >>> This has been observed in two flavors. The initial one, reported last >>> year, included long stretches with 100% sys CPU utilization. It >>> extended the execution time very significantly. With a patch from >>> Larry, this phenomenon was significantly reduced. >>> >>> >>> >>> What has been observed with the patch, was smaller stretches of >>> elevated sys CPU not rising to the level of 100%. I assume that sysrq >>> outputs for those stretches are requested. Given my work situation >>> right now and the need to observe the test to produce that data, I am >>> unable to do it. I believe that Sandeep should be able to do it by >>> running with a higher number of users and collecting sysrq outputs >>> when something like frequent vmstat shows this behavior. >>> >>> >>> >>> 2. 
High response time at low CPU utilization >>> >>> >>> >>> As I have mentioned, I did some runs on Clovertown and Caneland with a >>> slightly modified version of the benchmark with a think time between >>> transactions reduced from 2 sec to 1 sec. This allows to put a higher >>> load on the system with a smaller number of users. Roughly speaking, >>> the rating on this version is half of the number of users of the >>> traditional benchmark version. The execution time yardstick is also >>> about ½. I would stress, that these are rough correspondences. >>> >>> >>> >>> When run on Clovertown (8 cores, 2.13GHz) and Caneland (16 cores, >>> 2.9Ghz), both RHEL5, this flavor of the benchmark produced ratings of >>> 3000 users and 4000 correspondingly. For these runs, reported CPU >>> utilizations were 55% (39u + 16s) and 39% (22u + 17s) respectively. >>> >>> I make the following observations: >>> >>> >>> >>> - we take our rating at the point when the response time >>> curve has gone up significantly from the initial no contention level; >>> in all of the well-behaved cases on various platforms this happens >>> when CPU utilization is nearing 100%; in this case CPU is only >>> approximately half utilized or less >>> >>> - the share of system CPU here is notably higher than >>> normally observed for Ontario benchmark, especially on the 16 core system >>> >>> - the increase in capacity between the two platforms is much >>> less than would be expected based on the number of cores and the clock >>> rate (nobody expects perfect scaling of course) >>> >>> >>> >>> The described phenomenon of high elapsed times at low CPU utilizations >>> is not like anything we have seen before. As a result, we clearly can >>> demonstrate only a fraction of the capacity of these systems. 
Given >>> that Sandeep has Clovertown, he could probably reproduce the behavior >>> by running the 1 sec think time version at a slightly higher number of >>> users than I did and take sysrq outputs if that is the first thing >>> desired. If a more detailed study is then warranted, I suppose that >>> somebody from RH may visit either Intel or ISC and make more detailed >>> observations. In my case, however, I am not only saturated now, but >>> will be traveling next week for my engagements and this may spill over >>> the week after as well. >>> >>> >>> >>> Note, that there were no especially elevated sys CPU stretches like >>> the ones described in the first section, but the numbers of users were >>> also lower. Looking back on the traditional benchmark with some >>> elevated sys CPU stretches I believe those stretches alone could not >>> have accounted for the observed increase of the response time and we >>> are dealing with a similar phenomenon with the traditional benchmark >>> as well. >>> >>> >>> >>> I hope this is a clearer explanation of the situation. >>> >>> Regards, >>> >>> Fima >>> >>> >>> >>> P.S. I tried to get added to Bugzilla cc list for the case entered by >>> Vijay, but it would not let me login with my credentials. >>> Alternatively, I could login at the support site, but did not know how >>> to get to Bugzilla from there. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> -----Original Message----- >>> >>> From: Vijay Trehan [vtrehan] >>> >>> Sent: Thursday, August 09, 2007 11:52 AM >>> >>> To: Yefim Somin >>> >>> Cc: Gupta, Sandeep R; Larry Woodman; Ed Rudack; John Shakshober; >>> Kaditz, Barry A; Vijay Trehan >>> >>> Subject: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: >>> Update: Cache performance characterization on Xeon]]]]]] >>> >>> >>> >>> Fima, >>> >>> >>> >>> I know you are tied up fighting some fire. 
>>> However, whenever you have a chance, can you run more users until you get ~100% sys CPU utilization and send the outputs to Larry?
>>> Thanks.
>>>
>>> Vijay
>>>
>>> Yefim Somin wrote:
>>>
>>>> I did try it and the results were also better, but I still saw stretches of elevated sys CPU (not 100%, though). Unfortunately, I am extremely stretched right now, and until 8/20, to do hands-on tooling beyond an occasional run in the background.
>>>> I also did some runs with fatter users, requiring fewer of them (about half) for a similar load. I did not see particular stretches of bad behavior there, but overall sys CPU was higher relative to user CPU (about 10:8 user to system) than I would normally expect. This also resulted in the response time curve going up much earlier than the overall CPU utilization warranted. It would be useful to do something like oprofiling and other things here, but as I mentioned, I don't have the cycles to do hands-on work now.
>>>> Thanks,
>>>> ys
>>>> P.S. For Sandeep's info: the more intensive load is prepared by running a script (similar to the password change scripts) which replaces T2 with T1 in the RTE scripts, i.e., the 2 sec think time with a 1 sec think time. The elapsed-time yardstick is also halved (250 sec vs. 500 sec).
>>>> -----Original Message-----
>>>> From: Gupta, Sandeep R [sandeep.r.gupta]
>>>> Sent: Tuesday, August 07, 2007 1:58 PM
>>>> To: Larry Woodman; Yefim Somin
>>>> Cc: vtrehan; Ed Rudack; John Shakshober; Kaditz, Barry A; Gupta, Sandeep R
>>>> Subject: RE: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Update: Cache performance characterization on Xeon]]]]]]
>>>> Hi Larry,
>>>> I reran the benchmark and this time the average response time for 5000 users with the kernel appeared to be better than before. I am going to do a couple more runs to make sure the results are consistent.
>>>> In the meantime, attached is the /var/log/messages file with the AltSysrq-M and AltSysrq-T outputs.
>>>> One question on the kernel you provided: will the changes in this kernel be available in a subsequent general release of RHEL5?
>>>> Fima,
>>>> Could you try the kernel Larry provided in your environment, to see if the problems you saw a year before are addressed by it?
>>>> Sandeep
>>>> -----Original Message-----
>>>> From: Larry Woodman [lwoodman]
>>>> Sent: Friday, August 03, 2007 12:07 PM
>>>> To: Yefim Somin
>>>> Cc: Gupta, Sandeep R; vtrehan; Ed Rudack; John Shakshober; Kaditz, Barry A
>>>> Subject: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Update: Cache performance characterization on Xeon]]]]]]
>>>> Yefim Somin wrote:
>>>>
>>>>> Just to clarify: there were hang-ups with similar CPU manifestations on some other systems for some runs, but not in this round. This time it's high sys CPU utilization, which could of course delay execution of a command while it's going on.
>>>>> ys
>>>>
>>>> I already verified that the high CPU time was SMP spinlock contention, and I am certain that the kernel I sent you eliminated that specific spinlock from the picture (because I removed it from the kernel entirely). I need the next debug outputs (the AltSysrq-M and AltSysrq-W) so I can see what else is causing the problem now that the first cause has been removed.
>>>> Larry
Created attachment 176881 [details] vmstat for 4000 fat users on 8-core, 2.66 GHz, 16GB Clovertown
Created attachment 176901 [details] Alt_sysrq for 4000 fat users on 8-core, 2.66 GHz, 16GB Clovertown
Created attachment 176921 [details] vmstat for 5000 fat users on 8-core, 2.66 GHz, 16GB Clovertown
Created attachment 176941 [details] Alt_sysrq for 5000 fat users on 8-core, 2.66 GHz, 16GB Clovertown
Created attachment 176961 [details] vmstat for 4000 fat users on 16-core, 2.9 GHz, 32GB Tigerton
Created attachment 176981 [details] oprofile output for 4000 fat users on 16-core, 2.9 GHz, 32GB Tigerton
Yefim Somin wrote on 08/24/2007:

> I agree with Sandeep's observations. I am attaching a vmstat from a run on the 16-core (Tigerton, 32GB mem) system with 4000 users, where the total execution time is also elevated to the knee of the curve (even longer than Sandeep's run), while:
>
> - the CPU utilization is low
> - sys CPU is a high proportion of it (higher than for Sandeep's 5000-user run)
> - context switches are even higher (180000/sec), even though the amount of useful work per unit of time is lower than in Sandeep's 5000-user run
>
> I do not have sysrq outputs for it, but that is what I would get when I have a chance to rerun. I would assume for now that the nature of the problem is similar, however. So the high CS rate is a question in itself, but so is whether and how it would lead to extended elapsed time at the low overall CPU utilization.
>
> ys
>
> From: Gupta, Sandeep R [sandeep.r.gupta]
> Sent: Friday, August 24, 2007 3:24 PM
> To: Kaditz, Barry A; Richard Li
> Cc: vtrehan; Yefim Somin; Larry Woodman; shak; Gupta, Sandeep R
> Subject: RE: Fat Client Diagnostic Info (4000 and 5000 User Scenarios) and Some observations
>
> From: Gupta, Sandeep R
> Sent: Friday, August 24, 2007 12:22 PM
> To: vtrehan; Yefim Somin; Larry Woodman; shak
> Cc: Gupta, Sandeep R
> Subject: Fat Client Diagnostic Info (4000 and 5000 User Scenarios) and Some observations
>
> Attached are the vmstat and sysrq outputs for the 4000 and 5000 user scenarios with the Fat Client and Larry's patch. The response times are:
>
> 4000 Users - 199.128
> 5000 Users - 230.520
>
> A few observations:
>
> - The standard Ontario benchmark (RHEL5-XEON-CACHE2007) shows a context switching rate of 45000 (4000 Users) and 55000 to 88000 (5000 Users).
> - The Fat Client benchmark (RHEL5-XEON-CACHE2007) shows a context switching rate of 88000 (4000 Users) and 155000+ (5000 Users); that is, the elevated CPU usage symptom is more prominent in this case.
> - As a comparative data point, RHEL4 (AS4-Itanium-CACHE2007) shows a context switch rate of 16000 for 5000 Users.
>
> - I think this excessive context switching as more concurrent load is added (either by adding more users in the standard benchmark or by reducing think time in the fat client benchmark) is consuming significant clock cycles, which shows up as the elevated CPU usage.
> - So the questions are:
>   - What is causing the excessive context switching?
>   - Why do the processes have to context switch, and why are they not able to keep up?
>
> Sandeep Gupta
> Intel Corporation
> Software Solutions Group
> Digital Health Enabling
> Email: Sandeep.r.Gupta
> Phone: 480 554 4003 (Work)
The problem here is the sparse address space associated with the Cache address space. At the end of this run, 1773288 out of 4110010 pages, or 43% of RAM, was consumed by pagetables. At that point the pagetables consumed more pages of memory than the data memory they mapped! This is caused by the application creating very large and sparse virtual address spaces. Over time the system swaps out the application data pages and reuses those pages for pagetable pages. This is what is causing the substandard performance.

At the first AltSysrq-M the total Active+Inactive was 3086149 pages. At the last AltSysrq-M the total Active+Inactive was 1685867 pages. The total managed memory shrank by 1400282 pages while the pagetable pages grew from 588045 to 1773288 pages, an increase of 1185243 pagetable pages.

(First AltSysrq-M)
Active:2181591 inactive:904558 dirty:62 writeback:0 unstable:0 free:21953 slab:189797 mapped-file:88115 mapped-anon:2150368 pagetables:588045

(Last AltSysrq-M)
Active:1659849 inactive:260158 dirty:21665 writeback:361 unstable:0 free:21111 slab:165598 mapped-file:241949 mapped-anon:1536562 pagetables:1773288

There is nothing the operating system can do about this problem; it is up to the application to create and maintain smaller and less sparse virtual address spaces in order to achieve higher performance. As the system reclaims data pages and reuses them for pagetables, it will swap more aggressively and the application will perform more poorly until it finally completes.

Larry Woodman
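The pagetable arithmetic in this comment can be double-checked directly from the quoted AltSysrq-M counters. Below is a minimal sketch (Python used purely as a calculator, not part of the original report); note that the last snapshot's Active+Inactive sum differs slightly from the 1685867 total quoted in the prose, so only the figures that match the snapshots verbatim are computed here.

```python
# Sanity check of the pagetable growth using the counters quoted from
# the two AltSysrq-M dumps above (all values are counts of 4KB pages).
first = {"active": 2181591, "inactive": 904558, "pagetables": 588045}
last  = {"active": 1659849, "inactive": 260158, "pagetables": 1773288}
total_ram_pages = 4110010   # total pages of RAM quoted in the comment

def managed(snap):
    """Active + inactive pages, i.e. reclaimable data memory."""
    return snap["active"] + snap["inactive"]

# Growth of wired pagetable pages over the run: 1773288 - 588045
pt_growth = last["pagetables"] - first["pagetables"]

# Fraction of all RAM held in pagetables at the end of the run (~43%)
pt_fraction = last["pagetables"] / total_ram_pages
```

This reproduces the 1185243-page pagetable increase and the 43%-of-RAM figure stated in the comment.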
Created attachment 210711 [details] Some more oprofile data from Fima
After more investigation, there are fundamental differences in the way this test runs on AMD x86_64 systems versus Intel x86_64 systems.

In both cases, the system exhausts all of the memory in the pagecache and the 2GB System V shared memory database region. As the ~5000 individual processes run and touch the virtual pages of that shared region, they incur pagefaults and create the page tables needed to map it. Over time this reclaims the memory from the pagecache, swaps out the System V shared memory region, and reuses that memory for pagetables. Since pagetables are wired/not reclaimable, the system ends up spending excessive time in the page reclaim code.

Simple arithmetic illustrates the problem: the 2GB (2^31) shared region contains 524288 4KB pages, therefore each process requires 524288 8-byte page table entries (PTEs), or 4MB (2^22) of private pagetables, just to map that shared region. ~5000 processes times 4MB results in 20GB of wired pagetables on a system with 20GB of RAM. This is a known issue that can only be solved by:

1.) Adding more RAM.
2.) Using hugepages, so that pagetables are not needed to map the System V shared region.

Any changes that could be made to the page reclaim code to address this would be of no practical use compared to 1.) more RAM or 2.) using hugepages.

What is different between the AMD and Intel x86_64 systems is the size of the caches and TLBs: the AMD systems have 1024 TLB entries where the Intel processors have 128 TLB entries. When the system exhausts memory it ends up in page_check_address(), where it walks the pgd(s), pud(s), pmd(s), and pte(s) in order to determine whether the pages are referenced or not. The Intel processors incur many more TLB misses walking these pagetables than the AMD systems, simply because the Intel TLB is 8 times smaller than the AMD TLB.
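The "simple arithmetic" above can be reproduced as a quick back-of-the-envelope sketch (Python used only as a calculator, not part of the original analysis). The 2MB hugepage size in the second half is an illustrative assumption; the comment does not state which hugepage size would be used.

```python
# Per-process pagetable cost of mapping a 2GB System V shared segment
# with ordinary 4KB pages, reproducing the arithmetic in the comment.
SHM_BYTES = 2 ** 31        # 2GB shared region
PAGE_4K   = 4096           # base x86_64 page size
PTE_SIZE  = 8              # bytes per 64-bit page table entry
NPROC     = 5000           # approximate number of benchmark processes

ptes_per_proc     = SHM_BYTES // PAGE_4K        # 524288 PTEs
pt_bytes_per_proc = ptes_per_proc * PTE_SIZE    # 4MB of private pagetables
total_pt_bytes    = pt_bytes_per_proc * NPROC   # ~20GB of wired pagetables

# With 2MB hugepages (an assumed size, for illustration) each entry maps
# 512 times more memory, so the per-process footprint collapses to 8KB.
PAGE_2M = 2 * 1024 * 1024
pt_bytes_per_proc_huge = (SHM_BYTES // PAGE_2M) * PTE_SIZE
```

Since the pagetables are private per process, the 4MB cost is paid once per process rather than once per segment, which is why the total scales to roughly the entire 20GB of RAM, and why hugepages (or fewer/larger processes) are the only real remedies.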
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
Comment on attachment 176901 [details]
Alt_sysrq for 4000 fat users on 8-core, 2.66 GHz, 16GB Clovertown

Yep, works very well.
This bug/component is not included in scope for RHEL-5.11.0, which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of the RHEL 5.11 development phase, Apr 22, 2014). Please contact your account manager or support representative if you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).