From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070312 Red Hat/1.5.0.10-2.el5 Firefox/1.5.0.10 pango-text

Description of problem:

Vijay Trehan wrote:
> Larry,
>
> Please see attached.
>
> They are running into problems - unusual behaviour - similar to what
> they ran into with RHEL4 some time ago.

OK. Have them get me:
1.) "vmstat 1" output, and
2.) AltSysrq-M/W outputs when the system is at 100% system time.

>
> Vijay
>
> ------------------------------------------------------------------------
>
> Subject: Update: Cache performance characterization on Xeon
> From: "Yefim Somin" <Yefim.Somin>
> Date: Mon, 23 Jul 2007 19:23:03 -0400
> To: <vtrehan>
>
> I have done a series of runs on the 8-core Clovertown under RHEL5. I
> can report that I had no problem using telnet for several thousand
> users. It turned out that some telnet config files differ slightly
> between RHEL4 and RHEL5 (some parameters relied on were not included
> previously, hence taken at default values, but were included now and had
> to be changed/removed - thanks to Sandeep for a tip), but that was easy
> to fix.
>
> The important part is that I ran up to 5000 users and at that level
> encountered the same problem that I reported on RHEL4 quite some time
> ago. Namely, the run includes "reasonable" periods when CPU
> utilization is mostly user and the level is well below saturation, but
> also periods with 100% sys CPU when no useful work is done. This results
> in extremely long and essentially invalid completion times. As I just
> implied, once the problem is fixed, the actual rating of the system
> should be much higher than 5000, based on the good stretches observed. I
> think we need to check again what could be done about the support case
> that has been opened (along with a lot of collected data attached to
> it).
>
> Thanks,
> ys

==========================================

Hi Shak,

Thanks for your response.
I believe this is a continuation of the problem described in support case 1056326, where quite a few sets of collected diagnostic data are attached. I think I had a brief discussion of this topic with you at the Red Hat Summit.

I am also attaching the latest couple of items collected on RHEL5. This was done on a system provided by Intel that we currently have in-house; however, it was also observed at the Intel Phoenix lab by Sandeep.

I have some heavy engagements right now, so I am not clear when I may have a window for a synchronized conversation. In particular, I will be inaccessible on Monday.

Regards,
Fima

=======================================

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Simulate a benchmark load of 4000-5000 users

Actual Results:

Expected Results:

Additional info:
Created attachment 160264 [details]
vmstat 1 and sysrq command outputs -- around 100% sys cpu util

Vijay,

Collected items are attached (vmstat shows a short 100% interval along with getting into it and out of it; sometimes those intervals are much longer).

Say hi to Larry.

Thanks,
ys
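The collection Larry asked for ("vmstat 1" plus AltSysrq-M/W dumps) can be scripted so it is ready to run when a 100% sys stretch appears. A minimal sketch, assuming a Linux system where SysRq is enabled; the sample counts and file names are illustrative choices, not from this bug:

```shell
#!/bin/sh
# Collect the diagnostics requested in this bug: vmstat samples plus
# SysRq-M (memory info) and SysRq-W dumps taken while the stall happens.
OUT=diag_out
mkdir -p "$OUT"

# One-second vmstat samples; watch the "sy" column for ~100% system time.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 2 > "$OUT/vmstat.log"
fi

# SysRq output lands in the kernel ring buffer; this part needs root and
# kernel.sysrq enabled, so it is skipped gracefully otherwise.
if [ -w /proc/sysrq-trigger ]; then
    echo m > /proc/sysrq-trigger
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200 > "$OUT/sysrq.log"
fi
```

Run it (as root, for the SysRq part) during a bad stretch and attach the contents of diag_out to the bug.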
Created attachment 160270 [details] Some more description of the problem
Created attachment 160315 [details] proc/meminfo when problem occurs
Created attachment 160920 [details] Output from Sandeep after using Larry's fix
FYI, the cause of this problem is that one large file was mapped and faulted into the pagecache before the remaining RAM was mapped into anonymous regions. When the system finally does run out of RAM, the first several thousand pages on the inactive list are from that single mapped file. This causes every CPU to enter try_to_free_pages() and eventually get stuck on the same mapping->i_mmap_lock in page_referenced_file(). Since the system has tons of anonymous pages, it shouldn't be reclaiming mapped file pages from the same file on every CPU, so I just skip mapped file pages if the system has mostly anonymous memory pages.

--- linux-2.6.18.noarch/mm/page_alloc.c.orig
+++ linux-2.6.18.noarch/mm/page_alloc.c
@@ -1289,7 +1289,7 @@ void show_free_areas(void)
 		K(nr_free_highpages()));
 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u slab:%lu mapped-file:%lu mapped-anon:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		global_page_state(NR_FILE_DIRTY),
@@ -1298,6 +1298,7 @@ void show_free_areas(void)
 		nr_free_pages(),
 		global_page_state(NR_SLAB),
 		global_page_state(NR_FILE_MAPPED),
+		global_page_state(NR_ANON_PAGES),
 		global_page_state(NR_PAGETABLE));
 	for_each_zone(zone) {
--- linux-2.6.18.noarch/mm/vmscan.c.orig
+++ linux-2.6.18.noarch/mm/vmscan.c
@@ -808,6 +808,8 @@ force_reclaim_mapped:
 		if (page_mapped(page)) {
 			if (!reclaim_mapped ||
 			    (total_swap_pages == 0 && PageAnon(page)) ||
+			    ((global_page_state(NR_FILE_MAPPED) < global_page_state(NR_ANON_PAGES)) &&
+			     !PageAnon(page)) ||
 			    page_referenced(page, 0)) {
 				list_add(&page->lru, &l_active);
 				continue;
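The patch keys off whether mapped file pages are outnumbered by anonymous pages (NR_FILE_MAPPED < NR_ANON_PAGES). A rough userspace view of the same ratio can be read from /proc/meminfo; this is a sketch, assuming the RHEL5-era field names where "AnonPages" reports NR_ANON_PAGES and "Mapped" reports NR_FILE_MAPPED, both in kB:

```shell
# Compare mapped-file vs anonymous memory the way the patched check does.
awk '/^AnonPages:/ { anon = $2 }
     /^Mapped:/    { mapped = $2 }
     END {
         printf "mapped-file: %d kB, anon: %d kB\n", mapped, anon
         if (mapped < anon)
             print "anon-dominated: the patched kernel would skip mapped file pages"
     }' /proc/meminfo
```

On a system loaded the way this benchmark loads it (RAM mostly in anonymous regions), the anon figure should dwarf the mapped-file figure.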
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Does the patched kernel resolve this issue well enough to satisfy the customer?

Larry Woodman
Larry,

Fima (InterSystems) and Sandeep (Intel) report the benchmark runs a little longer but then runs into higher sys cpu util. Sandeep's output is in the last attachment, dated 08/08/2007.

==============================================

Yefim Somin wrote:
> I did try it and the results were also better, but I still saw stretches
> of elevated sys CPU (not 100% though). Unfortunately, I am extremely
> stretched right now and until 8/20 to do hands-on tooling beyond an
> occasional run in the background.
>
> I also did some runs with fatter users requiring fewer of them (about
> half) for a similar load. I did not see special stretches of bad
> behavior there, but the overall sys CPU was higher in relation to user
> CPU (about 10:8 user to system) than what I would normally expect. This
> also resulted in the response time curve going up much earlier than the
> overall CPU utilization warranted. It would be useful to do something
> like oprofiling and other things here, but as I mentioned, I don't have
> the cycles to do hands-on work now.
>
> Thanks,
> ys
>
> P.S. For Sandeep's info: more intensive load is prepared by running a
> script (similar to password change scripts) which replaces T2 with T1 in
> RTE scripts, i.e., 2 sec think time with 1 sec think time. The elapsed
> time yardstick is also halved (250 sec vs. 500 sec).
>
> -----Original Message-----
> From: Gupta, Sandeep R [sandeep.r.gupta]
> Sent: Tuesday, August 07, 2007 1:58 PM
> To: Larry Woodman; Yefim Somin
> Cc: vtrehan; Ed Rudack; John Shakshober; Kaditz, Barry A;
> Gupta, Sandeep R
> Subject: RE: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd:
> Update: Cache performance characterization on Xeon]]]]]]
>
> Hi Larry
>
> I reran the benchmark and this time the average response time for 5000
> users with the kernel appeared to be better than before. I am going to
> run a couple more runs to make sure the results are consistent.
>
> In the meantime, attached is the /var/log/messages file with AltSysrq-M
> and AltSysrq-T outputs.
>
> One question on the kernel you provided:
>
> Will the changes in this kernel be available in a subsequent general
> release of RHEL5?
>
> Fima
>
> Could you try the kernel Larry provided in your environment to see if
> the problems you saw the year before are being addressed by this?
>
> Sandeep
Larry,

Some more feedback from Fima (InterSystems). He reports getting 4500 users on a 4-core Xeon. That is what they are currently getting from an 8-core Xeon using RHEL. Sandeep (Intel) gets a little further because he has a higher clock rate Xeon.

=================================

Yefim Somin wrote:
> Vijay,
>
> We do not publish platform comparisons, but based on previous
> experiences I would say that this rating is low for this type of
> platform. We have seen ratings in the vicinity of 4500 users on a 4-core
> Xeon platform. I think we need to do more work here first to fix this
> outstanding problem, and then possibly also to investigate the causes of
> generally elevated sys CPU times and the response time curve growth out
> of proportion to the CPU utilization growth (i.e., despite relatively
> low overall utilization) that I mentioned in my recent email. This would
> be my suggestion.
>
> Regards,
> Fima
>
> -----Original Message-----
> From: Vijay Trehan [vtrehan]
> Sent: Tuesday, August 07, 2007 5:16 PM
> To: Yefim Somin; Ed Rudack; Gupta, Sandeep R
> Cc: Kaditz, Barry A; Richard Li; Larry Woodman; John Shakshober; Vijay Trehan
> Subject: What is the target # Ontario users for the Xeon SUT
>
> Fima / Ed / Sandeep,
>
> Given that we would like to publish a white paper, have a draft by Sept
> 1 if possible.
>
> After putting in some of the fixes that Larry Woodman has provided and
> any future fixes, let's say we can handle 5000 Ontario users without any
> perf anomaly.
>
> What I was curious about is the target rating for this class of
> machine. Is it 5500? 7500? 10000? Ballpark.
>
> The reason I ask is that if it's 5500, we can decide to write the white
> paper while we tune the performance further, while if the target is
> 10000 and we are getting only 5000, then we must wait to fix the problem.
>
> Please send me your inputs.
>
> Thanks,
>
> Vijay
Yefim Somin wrote:

Vijay,

Here is a more detailed description of the problems, of which there seem to be more than one level.

1. Spontaneous elevated sys CPU

This has been observed in two flavors. The initial one, reported last year, included long stretches with 100% sys CPU utilization. It extended the execution time very significantly. With a patch from Larry, this phenomenon was significantly reduced.

What has been observed with the patch was smaller stretches of elevated sys CPU not rising to the level of 100%. I assume that sysrq outputs for those stretches are requested. Given my work situation right now and the need to observe the test to produce that data, I am unable to do it. I believe that Sandeep should be able to do it by running with a higher number of users and collecting sysrq outputs when something like frequent vmstat shows this behavior.

2. High response time at low CPU utilization

As I have mentioned, I did some runs on Clovertown and Caneland with a slightly modified version of the benchmark, with the think time between transactions reduced from 2 sec to 1 sec. This allows us to put a higher load on the system with a smaller number of users. Roughly speaking, the rating on this version is half of the number of users of the traditional benchmark version. The execution time yardstick is also about half. I would stress that these are rough correspondences.

When run on Clovertown (8 cores, 2.13GHz) and Caneland (16 cores, 2.9GHz), both RHEL5, this flavor of the benchmark produced ratings of 3000 and 4000 users respectively. For these runs, reported CPU utilizations were 55% (39u + 16s) and 39% (22u + 17s) respectively.

I make the following observations:

- we take our rating at the point when the response time curve has gone up significantly from the initial no-contention level; in all of the well-behaved cases on various platforms this happens when CPU utilization is nearing 100%; in this case the CPU is only approximately half utilized or less

- the share of system CPU here is notably higher than normally observed for the Ontario benchmark, especially on the 16-core system

- the increase in capacity between the two platforms is much less than would be expected based on the number of cores and the clock rate (nobody expects perfect scaling, of course)

The described phenomenon of high elapsed times at low CPU utilizations is not like anything we have seen before. As a result, we clearly can demonstrate only a fraction of the capacity of these systems. Given that Sandeep has Clovertown, he could probably reproduce the behavior by running the 1 sec think time version at a slightly higher number of users than I did and take sysrq outputs, if that is the first thing desired. If a more detailed study is then warranted, I suppose that somebody from RH may visit either Intel or ISC and make more detailed observations. In my case, however, I am not only saturated now, but will be traveling next week for my engagements, and this may spill over into the week after as well.

Note that there were no especially elevated sys CPU stretches like the ones described in the first section, but the numbers of users were also lower. Looking back on the traditional benchmark with some elevated sys CPU stretches, I believe those stretches alone could not have accounted for the observed increase of the response time, and we are dealing with a similar phenomenon with the traditional benchmark as well.

I hope this is a clearer explanation of the situation.

Regards,
Fima
Vijay,

A couple of clarifications here:

1) Short stretches of elevated sys CPU

Sandeep should run the old benchmark here with a higher number of users (above 5000 and on, maybe) to observe the appearance of such stretches and take sysrq etc. at those points. This is to continue on the follow-up path of the patch from Larry.

2) High response times at low CPU utilization

This is a different and apparently anomalous phenomenon. This needs to be run with the fat benchmark (Sandeep should run a script to convert T2 to T1 in the RTE scripts; it could be reversed back to T2 with little effort when needed). This is not stretches of high sys CPU; rather, the overall CPU utilization is steady and low but the response time is high (reaching around 250 sec and up, which is roughly equivalent to 500 sec and up for the original benchmark). I noted a suspiciously high share of sys CPU in the overall utilization (say a third or more of the total CPU utilization, while normally it's under 20% of the total) as just an indicator of what may need further study, like profiling etc.

Hope this clarifies the steps,
Thanks,
Fima

-----Original Message-----
From: Vijay Trehan [vtrehan]
Sent: Friday, August 10, 2007 11:10 AM
To: Yefim Somin
Cc: Gupta, Sandeep R; Larry Woodman; Ed Rudack; John Shakshober; Kaditz, Barry A; Richard Li; Vijay Trehan
Subject: Re: Cache performance characterization on Xeon - technical description

Hi Fima,

Thanks for taking the time to write the attached description.

I was on the phone with Sandeep this morning. Sandeep wanted to know what specific benchmark runs he should perform to reproduce the problem. I would like your input on this ASAP, so Sandeep can collect the data for Larry.

Here is my understanding:
1. Sandeep sticks to using the original Ontario benchmark
2. You ran 3000 fat Ontario users on the Clovertown and observed short stretches of elevated sys cpu util, which is roughly equal to 6000 regular Ontario users
3.
Since Sandeep's Clovertown machine has a higher clock rate, he may need to run ~7000 regular Ontario users
4. Sandeep should run a ~7000 user benchmark and collect sysrq and other outputs for the stretches with elevated CPU
5. In the event there are no such stretches, we may need to try a higher number of users

Does this represent a correct understanding of your message below?

Thanks

Vijay

Yefim Somin wrote:
> [Fima's detailed description of the two problems, quoted in full in an earlier comment]
>
> P.S. I tried to get added to the Bugzilla cc list for the case entered by
> Vijay, but it would not let me login with my credentials. Alternatively,
> I could login at the support site, but did not know how to get to
> Bugzilla from there.
>
> -----Original Message-----
> From: Vijay Trehan [vtrehan]
> Sent: Thursday, August 09, 2007 11:52 AM
> To: Yefim Somin
> Cc: Gupta, Sandeep R; Larry Woodman; Ed Rudack; John Shakshober; Kaditz, Barry A; Vijay Trehan
> Subject: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd:
> Update: Cache performance characterization on Xeon]]]]]]
>
> Fima,
>
> I know you are tied up fighting some fire. However, whenever you have a
> chance can you run more users until you get ~100% sys cpu util and send
> the outputs for Larry.
> Thanks.
>
> Vijay
>
> [Yefim's and Sandeep's replies, quoted in full in an earlier comment]
>
> -----Original Message-----
> From: Larry Woodman [lwoodman]
> Sent: Friday, August 03, 2007 12:07 PM
> To: Yefim Somin
> Cc: Gupta, Sandeep R; vtrehan; Ed Rudack; John Shakshober; Kaditz, Barry A
> Subject: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd:
> Update: Cache performance characterization on Xeon]]]]]]
>
> Yefim Somin wrote:
>> Just to clarify: there were hang-ups with similar CPU manifestations on
>> some other systems for some runs, but not in this round. This time it's
>> high sys CPU utilization, which could of course delay execution of a
>> command while it's going on.
>>
>> ys
>
> I already verified that the high CPU time was SMP spinlock contention and
> I am certain that the kernel I sent you eliminated that specific spinlock
> from the picture (because I removed it from the kernel entirely). I need
> the next debug outputs (the AltSysrq-M and AltSysrq-W) so I can see what
> else is causing the problem after the first was removed.
>
> Larry
I was hoping not to see the original problem with fewer users and to be able to load the system more, but ran into this next-level problem.

ys

-----Original Message-----
From: Vijay Trehan [vtrehan]
Sent: Friday, August 10, 2007 11:39 AM
To: Yefim Somin
Subject: Re: Cache performance characterization on Xeon - technical description

Fima,

Got it. Just one question. Why do you have to run the fat benchmark? Wouldn't the regular benchmark with twice the number of users be the same?

Vijay
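The "fat benchmark" conversion Fima describes is a plain text substitution: replace the T2 think-time token (2 sec) with T1 (1 sec) in the RTE scripts. A sketch using a made-up demo file, since the real RTE script names and format are not shown in this bug:

```shell
# Demo stand-in for an RTE script (hypothetical format and file name).
mkdir -p rte_scripts
printf 'LOGIN\nTHINK T2\nQUERY\nTHINK T2\nLOGOUT\n' > rte_scripts/demo.rte

# Replace 2-sec think time (T2) with 1-sec (T1); keep .bak copies so the
# change can be reversed back to T2 with little effort, as Fima notes.
for f in rte_scripts/*.rte; do
    [ -f "$f" ] || continue
    sed -i.bak 's/T2/T1/g' "$f"
done
cat rte_scripts/demo.rte
```

Remember that the elapsed-time yardstick is halved along with the think time (250 sec vs. 500 sec), so results are compared against the shorter target.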
Larry Woodman wrote: > On Fri, 2007-08-10 at 11:29 -0400, Yefim Somin wrote: >> Vijay, >> >> A couple of clarifications here: >> >> 1) Short stretches of elevated sys CPU > Once again I need the AltSysrq-M and AltSysrq-W outputs when the system > gets slow with the kernel I sent you running. >> Sandeep should run the old benchmark here with a higher number of users (above 5000 and on maybe) to observe the appearance of such stretches and take sysrq etc. at those points. This is to continue in the follow up path of the patch from Larry >> >> 2) High response times at low CPU utilization >> >> This is a different and apparently anomalous phenomenon. This needs to be run with a fat benchmark (Sandeep should run a script to convert T2 to T1 in the RTE scripts; it could be reversed back to T2 with little effort when needed). This is not stretches of high sys CPU, rather the overall CPU utilization is steady and low but the response time is high (reaching around 250 sec and up which is roughly equivalent to 500 sec and up for the original benchmark). >> I noted a suspicious higher share of sys CPU in the overall utilization (say a third or more of the total CPU utilization, while normally it's under 20% of the total) as just an indicator of what may need further study, like profiling etc. >> >> Hopes this clarifies the steps, >> Thanks, >> Fima >> >> -----Original Message----- >> From: Vijay Trehan [vtrehan] >> Sent: Friday, August 10, 2007 11:10 AM >> To: Yefim Somin >> Cc: Gupta, Sandeep R; Larry Woodman; Ed Rudack; John Shakshober; Kaditz, Barry A; Richard Li; Vijay Trehan >> Subject: Re: Cache performance characterization on Xeon - technical description >> >> Hi Fima, >> >> Thanks for taking the time to write the attached description. >> >> I was on the phone with Sandeep this morning. Sandeep wanted to know >> what specific benchmark runs he should perform to reproduce the problem. 
>> >> I would like your input on this ASAP, so Sandeep can collect the data >> for Larry. >> >> Here is my understanding: >> 1. Sandeep stick to using the original Ontario benchmark >> 2. You ran 3000 fat Ontario users on the Clovertown and observed short >> stretches of elevated sys cpu util == which is roughly equal to 6000 >> regular Ontario users >> 3. Since Sandeep's Clovertown machine has a higher clock rate, he may >> need to run ~7000 regular Ontario users >> 4. Sandeep should run a ~7000 user benchmark and collect sysrq and other >> outputs for the stretches with elevated CPU >> 5. In the event there are no such stretches, we may need to try a higer >> number of users >> >> Does this represent a correct understanding of your message below? >> >> Thanks >> >> Vijay >> >> Yefim Somin wrote: >>> >>> >>> Vijay, >>> >>> >>> >>> Here is a more detailed description of the problems, of which there >>> seem to be more than one level. >>> >>> >>> >>> 1. Spontaneous elevated sys CPU >>> >>> >>> >>> This has been observed in two flavors. The initial one, reported last >>> year, included long stretches with 100% sys CPU utilization. It >>> extended the execution time very significantly. With a patch from >>> Larry, this phenomenon was significantly reduced. >>> >>> >>> >>> What has been observed with the patch, was smaller stretches of >>> elevated sys CPU not rising to the level of 100%. I assume that sysrq >>> outputs for those stretches are requested. Given my work situation >>> right now and the need to observe the test to produce that data, I am >>> unable to do it. I believe that Sandeep should be able to do it by >>> running with a higher number of users and collecting sysrq outputs >>> when something like frequent vmstat shows this behavior. >>> >>> >>> >>> 2. 
High response time at low CPU utilization >>> >>> >>> >>> As I have mentioned, I did some runs on Clovertown and Caneland with a >>> slightly modified version of the benchmark with a think time between >>> transactions reduced from 2 sec to 1 sec. This allows to put a higher >>> load on the system with a smaller number of users. Roughly speaking, >>> the rating on this version is half of the number of users of the >>> traditional benchmark version. The execution time yardstick is also >>> about ½. I would stress, that these are rough correspondences. >>> >>> >>> >>> When run on Clovertown (8 cores, 2.13GHz) and Caneland (16 cores, >>> 2.9Ghz), both RHEL5, this flavor of the benchmark produced ratings of >>> 3000 users and 4000 correspondingly. For these runs, reported CPU >>> utilizations were 55% (39u + 16s) and 39% (22u + 17s) respectively. >>> >>> I make the following observations: >>> >>> >>> >>> - we take our rating at the point when the response time >>> curve has gone up significantly from the initial no contention level; >>> in all of the well-behaved cases on various platforms this happens >>> when CPU utilization is nearing 100%; in this case CPU is only >>> approximately half utilized or less >>> >>> - the share of system CPU here is notably higher than >>> normally observed for Ontario benchmark, especially on the 16 core system >>> >>> - the increase in capacity between the two platforms is much >>> less than would be expected based on the number of cores and the clock >>> rate (nobody expects perfect scaling of course) >>> >>> >>> >>> The described phenomenon of high elapsed times at low CPU utilizations >>> is not like anything we have seen before. As a result, we clearly can >>> demonstrate only a fraction of the capacity of these systems. 
Given >>> that Sandeep has Clovertown, he could probably reproduce the behavior >>> by running the 1 sec think time version at a slightly higher number of >>> users than I did and take sysrq outputs if that is the first thing >>> desired. If a more detailed study is then warranted, I suppose that >>> somebody from RH may visit either Intel or ISC and make more detailed >>> observations. In my case, however, I am not only saturated now, but >>> will be traveling next week for my engagements and this may spill over >>> the week after as well. >>> >>> >>> >>> Note, that there were no especially elevated sys CPU stretches like >>> the ones described in the first section, but the numbers of users were >>> also lower. Looking back on the traditional benchmark with some >>> elevated sys CPU stretches I believe those stretches alone could not >>> have accounted for the observed increase of the response time and we >>> are dealing with a similar phenomenon with the traditional benchmark >>> as well. >>> >>> >>> >>> I hope this is a clearer explanation of the situation. >>> >>> Regards, >>> >>> Fima >>> >>> >>> >>> P.S. I tried to get added to Bugzilla cc list for the case entered by >>> Vijay, but it would not let me login with my credentials. >>> Alternatively, I could login at the support site, but did not know how >>> to get to Bugzilla from there. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> -----Original Message----- >>> >>> From: Vijay Trehan [vtrehan] >>> >>> Sent: Thursday, August 09, 2007 11:52 AM >>> >>> To: Yefim Somin >>> >>> Cc: Gupta, Sandeep R; Larry Woodman; Ed Rudack; John Shakshober; >>> Kaditz, Barry A; Vijay Trehan >>> >>> Subject: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: >>> Update: Cache performance characterization on Xeon]]]]]] >>> >>> >>> >>> Fima, >>> >>> >>> >>> I know you are tied up fighting some fire. 
>>> However, whenever you have a chance, can you run more users until you get ~100% sys CPU utilization and send the outputs to Larry?
>>> Thanks.
>>>
>>> Vijay
>>>
>>> Yefim Somin wrote:
>>>
>>>> I did try it and the results were also better, but I still saw stretches of elevated sys CPU (not 100%, though). Unfortunately, I am extremely stretched right now, and until 8/20, to do hands-on tooling beyond an occasional run in the background.
>>>> I also did some runs with fatter users, requiring fewer of them (about half) for a similar load. I did not see particular stretches of bad behavior there, but overall sys CPU was higher relative to user CPU (about 10:8 user to system) than I would normally expect. This also resulted in the response time curve going up much earlier than the overall CPU utilization warranted. It would be useful to do something like oprofiling and other things here, but as I mentioned, I don't have the cycles to do hands-on work now.
>>>> Thanks,
>>>> ys
>>>> P.S. For Sandeep's info: the more intensive load is prepared by running a script (similar to the password change scripts) which replaces T2 with T1 in the RTE scripts, i.e., the 2 sec think time with a 1 sec think time. The elapsed-time yardstick is also halved (250 sec vs. 500 sec).
>>>> -----Original Message-----
>>>> From: Gupta, Sandeep R [sandeep.r.gupta]
>>>> Sent: Tuesday, August 07, 2007 1:58 PM
>>>> To: Larry Woodman; Yefim Somin
>>>> Cc: vtrehan; Ed Rudack; John Shakshober; Kaditz, Barry A; Gupta, Sandeep R
>>>> Subject: RE: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Update: Cache performance characterization on Xeon]]]]]]
>>>> Hi Larry,
>>>> I reran the benchmark and this time the average response time for 5000 users with the kernel appeared to be better than before. I am going to do a couple more runs to make sure the results are consistent.
>>>> In the meantime, attached is the /var/log/messages file with the AltSysrq-M and AltSysrq-T outputs.
>>>> One question on the kernel you provided: will the changes in this kernel be available in a subsequent general release of RHEL5?
>>>> Fima,
>>>> Could you try the kernel Larry provided in your environment, to see if the problems you saw a year before are addressed by it?
>>>> Sandeep
>>>> -----Original Message-----
>>>> From: Larry Woodman [lwoodman]
>>>> Sent: Friday, August 03, 2007 12:07 PM
>>>> To: Yefim Somin
>>>> Cc: Gupta, Sandeep R; vtrehan; Ed Rudack; John Shakshober; Kaditz, Barry A
>>>> Subject: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Re: [Fwd: Update: Cache performance characterization on Xeon]]]]]]
>>>> Yefim Somin wrote:
>>>>
>>>>> Just to clarify: there were hang-ups with similar CPU manifestations on some other systems for some runs, but not in this round. This time it's high sys CPU utilization, which could of course delay execution of a command while it's going on.
>>>>> ys
>>>>
>>>> I already verified that the high CPU time was SMP spinlock contention, and I am certain that the kernel I sent you eliminated that specific spinlock from the picture (because I removed it from the kernel entirely). I need the next debug outputs (the AltSysrq-M and AltSysrq-W) so I can see what else is causing the problem now that the first cause has been removed.
>>>> Larry
Created attachment 176881 [details] vmstat for 4000 fat users on 8-core, 2.66 GHz, 16GB Clovertown
Created attachment 176901 [details] Alt_sysrq for 4000 fat users on 8-core, 2.66 GHz, 16GB Clovertown
Created attachment 176921 [details] vmstat for 5000 fat users on 8-core, 2.66 GHz, 16GB Clovertown
Created attachment 176941 [details] Alt_sysrq for 5000 fat users on 8-core, 2.66 GHz, 16GB Clovertown
Created attachment 176961 [details] vmstat for 4000 fat users on 16-core, 2.9 GHz, 32GB Tigerton
Created attachment 176981 [details] oprofile output for 4000 fat users on 16-core, 2.9 GHz, 32GB Tigerton
Yefim Somin wrote on 08/24/2007:

> I agree with Sandeep's observations. I am attaching a vmstat from a run on the 16-core (Tigerton, 32GB mem) system with 4000 users, where the total execution time is also elevated to the knee of the curve (even longer than Sandeep's run), while:
>
> - the CPU utilization is low
> - sys CPU is a high proportion of it (higher than for Sandeep's 5000-user run)
> - context switches are even higher (180000/sec), even though the amount of useful work per unit of time is lower than in Sandeep's 5000-user run
>
> I do not have sysrq outputs for it, but that is what I would get when I have a chance to rerun. I would assume for now that the nature of the problem is similar, however. So the high CS rate is a question in itself, but so is whether and how it would lead to extended elapsed time at the low overall CPU utilization.
>
> ys
>
> From: Gupta, Sandeep R [sandeep.r.gupta]
> Sent: Friday, August 24, 2007 3:24 PM
> To: Kaditz, Barry A; Richard Li
> Cc: vtrehan; Yefim Somin; Larry Woodman; shak; Gupta, Sandeep R
> Subject: RE: Fat Client Diagnostic Info (4000 and 5000 User Scenarios) and Some observations
>
> From: Gupta, Sandeep R
> Sent: Friday, August 24, 2007 12:22 PM
> To: vtrehan; Yefim Somin; Larry Woodman; shak
> Cc: Gupta, Sandeep R
> Subject: Fat Client Diagnostic Info (4000 and 5000 User Scenarios) and Some observations
>
> Attached are the vmstat and sysrq outputs for the 4000 and 5000 user scenarios with the Fat Client and Larry's patch. The response times are:
>
> 4000 Users - 199.128
> 5000 Users - 230.520
>
> A few observations:
>
> - The standard Ontario benchmark (RHEL5-XEON-CACHE2007) shows a context switching rate of 45000 (4000 Users) and 55000 to 88000 (5000 Users).
> - The Fat Client benchmark (RHEL5-XEON-CACHE2007) shows a context switching rate of 88000 (4000 Users) and 155000+ (5000 Users); that is, the elevated CPU usage symptom is more prominent in this case.
> - As a comparative data point, RHEL4 (AS4-Itanium-CACHE2007) shows a context switch rate of 16000 for 5000 Users.
>
> - I think this excessive context switching as more concurrent load is added (either by adding more users in the standard benchmark or by reducing think time in the fat client benchmark) is consuming significant clock cycles, which shows up as the elevated CPU usage.
> - So the questions are:
>   - What is causing the excessive context switching?
>   - Why do the processes have to context switch, and why are they not able to keep up?
>
> Sandeep Gupta
> Intel Corporation
> Software Solutions Group
> Digital Health Enabling
> Email: Sandeep.r.Gupta
> Phone: 480 554 4003 (Work)
The problem here is the sparse address space associated with the Cache address space. At the end of this run, 1773288 out of 4110010 pages, or 43% of RAM, was consumed by pagetables. At that point the pagetables consumed more pages of memory than the data memory they mapped! This is caused by the application creating very large and sparse virtual address spaces. Over time the system swaps out the application data pages and reuses those pages for pagetable pages. This is what is causing the substandard performance.

At the first AltSysrq-M the total Active+Inactive was 3086149 pages. At the last AltSysrq-M the total Active+Inactive was 1685867 pages. The total managed memory shrank by 1400282 pages while the pagetable pages grew from 588045 to 1773288 pages, an increase of 1185243 pagetable pages.

(First AltSysrq-M)
Active:2181591 inactive:904558 dirty:62 writeback:0 unstable:0 free:21953 slab:189797 mapped-file:88115 mapped-anon:2150368 pagetables:588045

(Last AltSysrq-M)
Active:1659849 inactive:260158 dirty:21665 writeback:361 unstable:0 free:21111 slab:165598 mapped-file:241949 mapped-anon:1536562 pagetables:1773288

There is nothing the operating system can do about this problem; it is up to the application to create and maintain smaller and less sparse virtual address spaces in order to achieve higher performance. As the system reclaims data pages and reuses them for pagetables, it will swap more aggressively and the application will perform more poorly until it finally completes.

Larry Woodman
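The pagetable arithmetic in this comment can be double-checked directly from the quoted AltSysrq-M counters. Below is a minimal sketch (Python used purely as a calculator, not part of the original report); note that the last snapshot's Active+Inactive sum differs slightly from the 1685867 total quoted in the prose, so only the figures that match the snapshots verbatim are computed here.

```python
# Sanity check of the pagetable growth using the counters quoted from
# the two AltSysrq-M dumps above (all values are counts of 4KB pages).
first = {"active": 2181591, "inactive": 904558, "pagetables": 588045}
last  = {"active": 1659849, "inactive": 260158, "pagetables": 1773288}
total_ram_pages = 4110010   # total pages of RAM quoted in the comment

def managed(snap):
    """Active + inactive pages, i.e. reclaimable data memory."""
    return snap["active"] + snap["inactive"]

# Growth of wired pagetable pages over the run: 1773288 - 588045
pt_growth = last["pagetables"] - first["pagetables"]

# Fraction of all RAM held in pagetables at the end of the run (~43%)
pt_fraction = last["pagetables"] / total_ram_pages
```

This reproduces the 1185243-page pagetable increase and the 43%-of-RAM figure stated in the comment.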
Created attachment 210711 [details] Some more oprofile data from Fima
After more investigation, there are fundamental differences in the way this test runs on AMD x86_64 systems versus Intel x86_64 systems.

In both cases, the system exhausts all of the memory in the pagecache and the 2GB System V shared memory database region. As the ~5000 individual processes run and touch the virtual pages of that shared region, they incur pagefaults and create the page tables needed to map it. Over time this reclaims the memory from the pagecache, swaps out the System V shared memory region, and reuses that memory for pagetables. Since pagetables are wired/not reclaimable, the system ends up spending excessive time in the page reclaim code.

Simple arithmetic illustrates the problem: the 2GB (2^31) shared region contains 524288 4KB pages, therefore each process requires 524288 8-byte page table entries (PTEs), or 4MB (2^22) of private pagetables, just to map that shared region. ~5000 processes times 4MB results in 20GB of wired pagetables on a system with 20GB of RAM. This is a known issue that can only be solved by:

1.) Adding more RAM.
2.) Using hugepages, so that pagetables are not needed to map the System V shared region.

Any changes that could be made to the page reclaim code to address this would be of no practical use compared to 1.) more RAM or 2.) using hugepages.

What is different between the AMD and Intel x86_64 systems is the size of the caches and TLBs: the AMD systems have 1024 TLB entries where the Intel processors have 128 TLB entries. When the system exhausts memory it ends up in page_check_address(), where it walks the pgd(s), pud(s), pmd(s), and pte(s) in order to determine whether the pages are referenced or not. The Intel processors incur many more TLB misses walking these pagetables than the AMD systems, simply because the Intel TLB is 8 times smaller than the AMD TLB.
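The "simple arithmetic" above can be reproduced as a quick back-of-the-envelope sketch (Python used only as a calculator, not part of the original analysis). The 2MB hugepage size in the second half is an illustrative assumption; the comment does not state which hugepage size would be used.

```python
# Per-process pagetable cost of mapping a 2GB System V shared segment
# with ordinary 4KB pages, reproducing the arithmetic in the comment.
SHM_BYTES = 2 ** 31        # 2GB shared region
PAGE_4K   = 4096           # base x86_64 page size
PTE_SIZE  = 8              # bytes per 64-bit page table entry
NPROC     = 5000           # approximate number of benchmark processes

ptes_per_proc     = SHM_BYTES // PAGE_4K        # 524288 PTEs
pt_bytes_per_proc = ptes_per_proc * PTE_SIZE    # 4MB of private pagetables
total_pt_bytes    = pt_bytes_per_proc * NPROC   # ~20GB of wired pagetables

# With 2MB hugepages (an assumed size, for illustration) each entry maps
# 512 times more memory, so the per-process footprint collapses to 8KB.
PAGE_2M = 2 * 1024 * 1024
pt_bytes_per_proc_huge = (SHM_BYTES // PAGE_2M) * PTE_SIZE
```

Since the pagetables are private per process, the 4MB cost is paid once per process rather than once per segment, which is why the total scales to roughly the entire 20GB of RAM, and why hugepages (or fewer/larger processes) are the only real remedies.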
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
Comment on attachment 176901 [details]
Alt_sysrq for 4000 fat users on 8-core, 2.66 GHz, 16GB Clovertown

Yep, works very well.
This bug/component is not included in scope for RHEL-5.11.0, which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of the RHEL 5.11 development phase, Apr 22, 2014). Please contact your account manager or support representative if you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).