From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.11 (X11; Linux i686; U;) Gecko/20030314

Description of problem:
A simple MTA benchmark for postfix exhibits a ~60% performance drop when run
on a system with high memory (total RAM >896MB), compared to a low-memory
system. We have traced this down to two causes:
a) most drivers for hardware RAID don't support high-memory IO properly;
b) the "2.4.7-highmemspeed" patch included in RedHat's 2.4.9-e.xy and
   2.4.18-[78] kernel series slows down bounce buffer allocation.

Version-Release number of selected component (if applicable):
kernel <= 2.4.9-e.25, 2.4.18-27.7.x

How reproducible:
Always

Steps to Reproduce:
1. Set up a network (ethernet 100MBit/s) with 3 hosts ("source", "mta",
   "sink").
2. On "sink", run the postfix "smtp-sink" program.
3. On "mta", set up the postfix MTA. Specify "sink" as the relay host. Put
   the postfix queue directory on an ext3 partition mounted with
   "data=journal,noatime". This partition must be on a block device whose
   driver lacks high-memory IO support, such as DAC960, aacraid, or dpt_i2o
   (those are the ones we tested). The system must have >896MB of memory
   installed. To improve performance, set up asynchronous mail logging in
   /etc/syslog.conf:
       mail.*    -/var/log/maillog
4. On "source", run the postal benchmark
   (http://www.coker.com.au/postal/postal-0.62.tgz) with the options
   "-m 150 -p 5 -c 30 -r 6000" (i.e. max. message size 150kB, 5 parallel
   processes, 30 msgs/connection, max. 6000 msgs/min). I am only specifying
   the options exactly to make sure the numbers are comparable; other options
   will show the same performance-degrading effect.

Actual Results:
The postal benchmark will show ~1500 mails/min, as opposed to ~3500 if less
than 896MB of memory is installed (or if memory is limited with the mem=XXXM
option).

Expected Results:
~3500 mails/min, as in the low-memory case. RedHat's 2.4.20-xy.7 series, as
well as vanilla kernels, do not show this performance drop.

Additional info:
The 'bottleneck' in the simple MTA benchmark described above is synchronous
disk IO, mostly in the journal of the queue partition. If the driver doesn't
support high-memory IO (as currently very few hardware RAID controller
drivers appear to do), bounce buffers must be allocated for most IO
operations.

Normally the alloc_bounce_page() and alloc_bounce_bh() functions work as
follows (pseudocode):

    page = get_lowmem_page ();
    if (page)
        return page;
    wakeup_bdflush ();
    for (;;) {
        page = get_emergency_page ();
        if (page)
            return page;
        run_task_queue (tq_disk);
        yield ();
    }

With the "highmemspeed" patch, the code tries to allocate from the
"emergency pool" first unless it is more than 1/2 empty:

    for (n = 0; ; n++) {
        if ((n == 0) && (free_emergency_pages < emergency_pool_size/2))
            page = get_lowmem_page ();
        if (page)
            return page;
        wakeup_bdflush ();
        page = get_emergency_page ();
        if (page)
            return page;
        run_task_queue (tq_disk);
        yield ();
    }

The problem with this approach is that unless the emergency pool is more
than 1/2 empty, wakeup_bdflush() will be called every time
alloc_bounce_page() or alloc_bounce_bh() is called. This is pointless,
because calling wakeup_bdflush() only makes sense if the allocation from low
memory wasn't successful. I have collected some statistics and found that in
our case this happens about 40% of the time alloc_bounce_...() is called.

Simply changing the above code to

    [...]
        if ((n == 0) && (free_emergency_pages < emergency_pool_size/2)) {
            page = get_lowmem_page ();
            if (page)
                return page;
            wakeup_bdflush ();
        }
    [...]

(i.e. calling wakeup_bdflush() only if the low-memory allocation was tried
and wasn't successful) solves the problem. I will soon attach a patch fixing
the problem.

[ I have simplified things a little bit here. Actually 2.4.9-e.25 contains
several more patches (in particular "vmfixes9") which change these
allocation routines again. The basic argument - wakeup_bdflush() called for
no reason - remains valid, though. ]

To illustrate my point, here are the postal results with 2.4.9-e.25 (1GB
memory, queue partition on a DAC960 partition):

    time,messages,data(K),errors,connections
    15:12,1428,110108,0,89,0
    15:13,1471,108634,0,84,0

And here the same with my patch applied:

    time,messages,data(K),errors,connections
    15:38,3858,286641,0,252,0
    15:39,3607,275018,0,242,0

And here without my patch, but with only 512MB of memory:

    time,messages,data(K),errors,connections
    17:58,3644,268730,0,242,0
    17:59,3348,250146,0,222,0

The most relevant figure is the first one after the time of day
("messages", i.e. delivered messages per minute). It can be seen to be lower
by a factor of ~2.5 with the 2.4.9-e.25 kernel and 1GB of memory.
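To make the effect easy to see outside the kernel, here is a small
stand-alone C sketch (purely illustrative - get_lowmem_page(),
get_emergency_page(), the pool size, and the trial count are stand-ins, not
kernel code) that counts how often bdflush would be woken under the
highmemspeed allocation order versus the braced variant, in the no-pressure
case where low-memory allocations always succeed and the emergency pool
stays full:

    /* bounce_order_sim.c - illustrative only, not kernel code */
    #include <stdio.h>

    #define POOL_SIZE 32    /* stand-in for emergency_pool_size */

    /* the pool never drains in this sketch: bounce pages are assumed to be
     * returned immediately, i.e. there is no memory pressure at all */
    static int free_emergency_pages = POOL_SIZE;
    static long bdflush_wakeups;

    static int get_lowmem_page (void)    { return 1; }  /* always succeeds */
    static int get_emergency_page (void) { return free_emergency_pages > 0; }
    static void wakeup_bdflush (void)    { bdflush_wakeups++; }

    /* allocation order of the "highmemspeed" patch as described above
     * (the retry/IO-wait path is omitted since it is never reached here) */
    static int alloc_bounce_highmemspeed (void)
    {
        int page = 0, n;

        for (n = 0; ; n++) {
            if ((n == 0) && (free_emergency_pages < POOL_SIZE/2))
                page = get_lowmem_page ();
            if (page)
                return page;
            wakeup_bdflush ();  /* fires even though lowmem was never tried */
            page = get_emergency_page ();
            if (page)
                return page;
        }
    }

    /* same order, but bdflush is only woken after a failed lowmem attempt */
    static int alloc_bounce_braced (void)
    {
        int page = 0, n;

        for (n = 0; ; n++) {
            if ((n == 0) && (free_emergency_pages < POOL_SIZE/2)) {
                page = get_lowmem_page ();
                if (page)
                    return page;
                wakeup_bdflush ();
            }
            page = get_emergency_page ();
            if (page)
                return page;
        }
    }

    int main (void)
    {
        long i;

        for (i = 0; i < 100000; i++)
            alloc_bounce_highmemspeed ();
        printf ("highmemspeed order: %ld bdflush wakeups\n", bdflush_wakeups);

        bdflush_wakeups = 0;
        for (i = 0; i < 100000; i++)
            alloc_bounce_braced ();
        printf ("braced order:       %ld bdflush wakeups\n", bdflush_wakeups);
        return 0;
    }

Built with any C compiler, the highmemspeed order wakes bdflush once per
call while the braced variant never does; the real workload sees a mix of
cases, hence the ~40% figure quoted above rather than 100%.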
Thank you for this excellent bug report; I wish more bug reports were like this. Looking at the recent code, it has an "if (!iteration) wakeup_bdflush(0);", where iteration starts at 0 and never gets to 1 for the "first half" of the pool; this makes me wonder if this problem is still there for the recent kernels...
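For reference, the loop being discussed presumably has roughly this shape -
a sketch pieced together from the pseudocode in the report above plus the
"if (!iteration)" test mentioned here, not verbatim Red Hat kernel source:

    for (iteration = 0; ; iteration++) {
        /* lowmem allocation is only attempted on the first pass, and only
         * once the emergency pool is more than half depleted */
        if ((iteration == 0) &&
            (free_emergency_pages < emergency_pool_size/2))
            page = get_lowmem_page ();
        if (page)
            return page;

        /* runs on the very first pass even when the lowmem allocation
         * above was skipped entirely */
        if (!iteration)
            wakeup_bdflush (0);

        page = get_emergency_page ();
        if (page)
            return page;

        run_task_queue (tq_disk);
        yield ();
    }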
Created attachment 92936 [details] Patch that soves this performance issue I made this patch as small as possible; in principle it simply adds a pair of braces in both functions (and some whitespace). I am not saying that I'd propose this patch as the final solution. IMO alloc_bounce_page() and alloc_bounce_bh() need some general cleanup.
>this makes me wonder if this problem is still there for the recent kernels...

Yes - because (!iteration) is true in the first iteration. In our case there is no real memory pressure, so the pages are always allocated in the first iteration. As a result, every time alloc_bounce_...() is called while the emergency pool has more than half of its entries free, wakeup_bdflush() is called once.
Never mind my comment... the minimal fix should be "if (iteration) wake_up(...)" instead of "(!iteration)".
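Against the loop shape sketched above, that minimal change would look
roughly like this (wakeup_bdflush() standing in for the wake_up(...)
shorthand; again a sketch, not a tested patch):

    /* only kick bdflush once the first pass has already failed,
     * i.e. when the allocation is being retried */
    if (iteration)
        wakeup_bdflush (0);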
I still like the braces better ... but that's up to you :-) Thank you for the quick response.
Oops - just found a typo in my original report.

>I have collected some statistics and found that in our case this happens ab

should read "... this happens about 40% of the time alloc_bounce_...() is called".
Any reason why this pretty obvious & easy-to-fix issue isn't fixed in 2.4.9-e.27?
Just to follow up: the U3 kernel is due out shortly. If you want to grab a preview with this fix, grab it from: http://people.redhat.com/~jbaron/.private/testing/2.4.9-e.27.28.test/