From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.11 (X11; Linux i686; U;) Gecko/20030314

Description of problem:
A simple MTA benchmark for postfix exhibits a ~60% performance drop when run
on a system with high memory (total RAM >896MB), compared to a low-memory
system. We have traced this down to two causes:
a) most drivers for hardware RAID don't support high-memory IO properly;
b) the "2.4.7-highmemspeed" patch included in RedHat's 2.4.9-e.xy and
   2.4.18-[78] kernel series slows down bounce buffer allocation.

Version-Release number of selected component (if applicable):
kernel <= 2.4.9-e.25, 2.4.18-27.7.x

How reproducible:
Always

Steps to Reproduce:
1. Set up a network (ethernet 100MBit/s) with 3 hosts ("source", "mta",
   "sink").
2. On "sink", run the postfix "smtp-sink" program.
3. On "mta", set up the postfix MTA. Specify "sink" as the relay host. Put
   the postfix queue directory on an ext3 partition mounted with
   "data=journal,noatime". This partition must be on a block device whose
   driver lacks high-memory IO support, such as DAC960, aacraid, or dpt_i2o
   (those are the ones we tested). The system must have >896MB of memory
   installed. To improve performance, set up asynchronous mail logging in
   /etc/syslog.conf:
       mail.*    -/var/log/maillog
4. On "source", run the postal benchmark
   (http://www.coker.com.au/postal/postal-0.62.tgz) with the options
   "-m 150 -p 5 -c 30 -r 6000" (i.e. max. message size 150kB, 5 parallel
   processes, 30 msgs/connection, max. 6000 msgs/min). I am only specifying
   the options exactly to make sure the numbers are comparable; other options
   will show the same performance-degrading effect.

Actual Results:
The postal benchmark will show ~1500 mails/min, as opposed to ~3500 if less
than 896MB of memory is installed (or if memory is limited with the mem=XXXM
option).

Expected Results:
~3500 mails/min, as in the low-memory case. RedHat's 2.4.20-xy.7 series, as
well as vanilla kernels, do not show this performance drop.

Additional info:
The 'bottleneck' in the simple MTA benchmark described above is synchronous
disk IO, mostly in the journal of the queue partition. If the driver doesn't
support high-memory IO (as currently very few hardware RAID controller
drivers appear to do), bounce buffers must be allocated for most IO
operations.

Normally the alloc_bounce_page() and alloc_bounce_bh() functions work as
follows (pseudocode):

    page = get_lowmem_page ();
    if (page)
        return page;
    wakeup_bdflush ();
    for (;;) {
        page = get_emergency_page ();
        if (page)
            return page;
        run_task_queue (tq_disk);
        yield ();
    }

With the "highmemspeed" patch, the code tries to allocate from the
"emergency pool" first unless it is more than 1/2 empty:

    for (n = 0; ; n++) {
        if ((n == 0) && (free_emergency_pages < emergency_pool_size/2))
            page = get_lowmem_page ();
        if (page)
            return page;
        wakeup_bdflush ();
        page = get_emergency_page ();
        if (page)
            return page;
        run_task_queue (tq_disk);
        yield ();
    }

The problem with this approach is that unless the emergency pool is more
than 1/2 empty, wakeup_bdflush() will be called every time
alloc_bounce_page() or alloc_bounce_bh() is called. This is pointless,
because calling wakeup_bdflush() only makes sense if the allocation from low
memory wasn't successful. I have collected some statistics and found that in
our case this happens about 40% of the time alloc_bounce_...() is called.

Simply changing the above code to

    [...]
        if ((n == 0) && (free_emergency_pages < emergency_pool_size/2)) {
            page = get_lowmem_page ();
            if (page)
                return page;
            wakeup_bdflush ();
        }
    [...]

(i.e. calling wakeup_bdflush() only if the low-memory allocation was tried
and wasn't successful) solves the problem. I will soon attach a patch fixing
the problem.

[ I have simplified things a little bit here. Actually 2.4.9-e.25 contains
several more patches (in particular "vmfixes9") which change these
allocation routines again. The basic argument - wakeup_bdflush() called for
no reason - remains valid, though. ]

To illustrate my point, here are the postal results with 2.4.9-e.25 (1GB
memory, queue partition on a DAC960 partition):

    time,messages,data(K),errors,connections
    15:12,1428,110108,0,89,0
    15:13,1471,108634,0,84,0

And here the same with my patch applied:

    time,messages,data(K),errors,connections
    15:38,3858,286641,0,252,0
    15:39,3607,275018,0,242,0

And here without my patch, but with only 512MB of memory:

    time,messages,data(K),errors,connections
    17:58,3644,268730,0,242,0
    17:59,3348,250146,0,222,0

The most relevant figure is the first one after the time of day
("messages", i.e. delivered messages per minute). It can be seen to be lower
by a factor of ~2.5 with the 2.4.9-e.25 kernel and 1GB of memory.
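To make the effect easy to see outside the kernel, here is a small
stand-alone C sketch (purely illustrative - get_lowmem_page(),
get_emergency_page(), the pool size, and the trial count are stand-ins, not
kernel code) that counts how often bdflush would be woken under the
highmemspeed allocation order versus the braced variant, in the no-pressure
case where low-memory allocations always succeed and the emergency pool
stays full:

    /* bounce_order_sim.c - illustrative only, not kernel code */
    #include <stdio.h>

    #define POOL_SIZE 32    /* stand-in for emergency_pool_size */

    /* the pool never drains in this sketch: bounce pages are assumed to be
     * returned immediately, i.e. there is no memory pressure at all */
    static int free_emergency_pages = POOL_SIZE;
    static long bdflush_wakeups;

    static int get_lowmem_page (void)    { return 1; }  /* always succeeds */
    static int get_emergency_page (void) { return free_emergency_pages > 0; }
    static void wakeup_bdflush (void)    { bdflush_wakeups++; }

    /* allocation order of the "highmemspeed" patch as described above
     * (the retry/IO-wait path is omitted since it is never reached here) */
    static int alloc_bounce_highmemspeed (void)
    {
        int page = 0, n;

        for (n = 0; ; n++) {
            if ((n == 0) && (free_emergency_pages < POOL_SIZE/2))
                page = get_lowmem_page ();
            if (page)
                return page;
            wakeup_bdflush ();  /* fires even though lowmem was never tried */
            page = get_emergency_page ();
            if (page)
                return page;
        }
    }

    /* same order, but bdflush is only woken after a failed lowmem attempt */
    static int alloc_bounce_braced (void)
    {
        int page = 0, n;

        for (n = 0; ; n++) {
            if ((n == 0) && (free_emergency_pages < POOL_SIZE/2)) {
                page = get_lowmem_page ();
                if (page)
                    return page;
                wakeup_bdflush ();
            }
            page = get_emergency_page ();
            if (page)
                return page;
        }
    }

    int main (void)
    {
        long i;

        for (i = 0; i < 100000; i++)
            alloc_bounce_highmemspeed ();
        printf ("highmemspeed order: %ld bdflush wakeups\n", bdflush_wakeups);

        bdflush_wakeups = 0;
        for (i = 0; i < 100000; i++)
            alloc_bounce_braced ();
        printf ("braced order:       %ld bdflush wakeups\n", bdflush_wakeups);
        return 0;
    }

Built with any C compiler, the highmemspeed order wakes bdflush once per
call while the braced variant never does; the real workload sees a mix of
cases, hence the ~40% figure quoted above rather than 100%.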
Thank you for this excellent bug report; I wish more bug reports were like this. Looking at the recent code, it has an "if (!iteration) wakeup_bdflush(0);", where iteration starts at 0 and never gets to 1 for the "first half" of the pool; this makes me wonder if this problem is still there for the recent kernels...
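For reference, the loop being discussed presumably has roughly this shape -
a sketch pieced together from the pseudocode in the report above plus the
"if (!iteration)" test mentioned here, not verbatim Red Hat kernel source:

    for (iteration = 0; ; iteration++) {
        /* lowmem allocation is only attempted on the first pass, and only
         * once the emergency pool is more than half depleted */
        if ((iteration == 0) &&
            (free_emergency_pages < emergency_pool_size/2))
            page = get_lowmem_page ();
        if (page)
            return page;

        /* runs on the very first pass even when the lowmem allocation
         * above was skipped entirely */
        if (!iteration)
            wakeup_bdflush (0);

        page = get_emergency_page ();
        if (page)
            return page;

        run_task_queue (tq_disk);
        yield ();
    }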
Created attachment 92936 [details] Patch that soves this performance issue I made this patch as small as possible; in principle it simply adds a pair of braces in both functions (and some whitespace). I am not saying that I'd propose this patch as the final solution. IMO alloc_bounce_page() and alloc_bounce_bh() need some general cleanup.
>this makes me wonder if this problem is still there for the recent kernels...

Yes - because (!iteration) is true in the first iteration. In our case there is no real memory pressure, so the pages are always allocated in the first iteration. As a result, every time alloc_bounce_...() is called while the emergency pool has more than half of its entries free, wakeup_bdflush() is called once.
Never mind my comment... the minimal fix should be "if (iteration) wake_up(...)" instead of "(!iteration)".
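Against the loop shape sketched above, that minimal change would look
roughly like this (wakeup_bdflush() standing in for the wake_up(...)
shorthand; again a sketch, not a tested patch):

    /* only kick bdflush once the first pass has already failed,
     * i.e. when the allocation is being retried */
    if (iteration)
        wakeup_bdflush (0);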
I still like the braces better ... but that's up to you :-) Thank you for the quick response.
Oops - just found a typo in my original report.

>I have collected some statistics and found that in our case this happens ab

should read "... this happens about 40% of the time alloc_bounce_...() is called".
Any reason why this pretty obvious & easy-to-fix issue isn't fixed in 2.4.9-e.27?
Just to follow up: the U3 kernel is due out shortly. If you want to grab a preview with this fix, grab it from: http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.27.28.test/