Red Hat Bugzilla – Bug 155032
IMAP server performance issue (high iowait)
Last modified: 2007-11-30 17:07:07 EST
Description of problem:
The problem was first reported as a (qla2300) HBA (SAN setup on an IBM DS4400)
performance issue on a RHEL 3.0 imap server running the 2.4.21-27.0.1 kernel. The
733GB disk was formatted as ext2 on top of LVM, and the server was (and still is)
in production serving around 29,700 inboxes. The load averages periodically shot
up into the 700+ range, and eventually the imap processes would stop taking
requests and a reboot was required.
-- Previous Tuning Attempt --
Tried relevant VM (pagecache), IO (elvtune), and qlogic specific tunings but
nothing helped. Subsequent debugging sessions concluded that the system
1. Had a superblock lock contention issue (the s_lock semaphore in lock_super()).
2. Ran out of io request descriptors.
For 1), RHEL 4 was suggested since it has the s_lock logic removed, along with
other visible ext2/3 performance improvements.
For 2), sent out a RHEL 3 test kernel with the io request descriptor count
increased to 8K (up from the 1024 default) to match AS2.1.
Also encouraged the customer to spread the workload into smaller filesystems.
-- Current Issue --
The customer came back and stated that the RHEL 3 test kernel behaved similarly
to RHEL 4. Though some improvements were observed, the imap threads still got
locked up when run on top of larger LUNs. They would like to stay on RHEL 3
because RHEL 4 didn't prove to help the issue and IBM hadn't certified RHEL 4
(with their HW) yet. Also, the problem only showed up in the SAN environment;
test runs on aic7xxx scsi on top of JBOD didn't exhibit this issue.
Version-Release number of selected component (if applicable):
A test system with a smaller disk (136GB) has been configured at the customer's
site that can re-create the issue by simulating the production (mail) server
workload. A python script is used to pump IO requests (open/read/write) into
the system. Will upload the test utility shortly.
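For reference, a minimal sketch of the kind of IO pump such a test utility performs. This is not the attached script; the function names, thread count, and file sizes here are illustrative assumptions, and the real utility presumably targets the SAN-backed filesystem rather than a temp directory.

```python
# Hypothetical sketch of an open/read/write IO pump: N worker threads each
# repeatedly open, write, read back, and close a file, mimicking many IMAP
# processes touching inboxes. Names and parameters are illustrative only.
import os
import tempfile
import threading

def io_worker(directory, iterations, size=4096):
    """Open, write, read back, and close one file per iteration."""
    payload = b"x" * size
    path = os.path.join(directory, "worker-%d" % threading.get_ident())
    for _ in range(iterations):
        with open(path, "wb") as f:
            f.write(payload)        # dirty a page, forcing eventual writeback
        with open(path, "rb") as f:
            f.read()                # read it back

def run_load(num_threads=32, iterations=100):
    """Run num_threads concurrent workers against a scratch directory."""
    with tempfile.TemporaryDirectory() as d:
        threads = [threading.Thread(target=io_worker, args=(d, iterations))
                   for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

Scaling num_threads up is what drives the descriptor exhaustion discussed later in this report.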
1. The customer found a workaround for the s_lock issue (on RHEL 3), so this
sub-issue did not show up in the simulation environment.
2. Via instrumented kernels, we found a large number of threads blocked in
__lock_page, and it took longer than 0.65 seconds for them to get out of the
io_schedule polling loop when this problem showed up.
Threads complete io requests within human-tolerable latency (mails are
sent/received without much delay).
Suspending incoming requests and umounting the filesystem makes the problem go away.
Created attachment 113251 [details]
Created attachment 113252 [details]
Data set that goes with the python script
(bugzilla doesn't allow editing so add some additional info here)
1. To get out of the iowait loop, 0.65 seconds is the *lower* bound - it takes *much*
longer than that to get out of the loop (I would need to refine my instrumented
kernel if this data is required).
2. I logged about 100 entries of the waited-on locked pages (in a wrap-around
buffer) - the threads are waiting for different pages, so it doesn't look like
contention on a single hot page.
3. The iostat output shows the io (into disk) still going strong in the simulation
environment. But in the real production environment (where we can't do
instrumentation), the io seems to be slower (though we don't have definitive data
yet). I have placed a request for iostat output (from the real production server) but
haven't received it yet.
We also have an in-depth study about the io request descriptor problem that
seems to have thundering herd effect when free descriptors are available. Will
upload as an attachment to avoid making this bugzilla too long to read.
Examining the issue further (and roughly going through the 2.6 kernel code), I
decided to describe the issue (of io request descriptors) here for visibility
(since it seems to be fixable). The customer, Chris Siebenmann, deserves kudos
for most of this analysis.
The experiment was done on an instrumented kernel using the default 1k io request
descriptors. It recorded the number of times threads had waited in
__get_request_wait and the maximum wait interval. A multi-threaded program
created a constant load of N simultaneous read IOs on the disk. Each thread
issued a read() and, as soon as it completed, another read() was immediately
dispatched. At an N larger than 512 plus the qla queue length, it was observed
that some unlucky threads were starving almost completely in __get_request_wait.
They could only get out of the loop (for the first time) when the whole set of
threads from the test program exited.
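In userspace terms, the load pattern above can be sketched as follows. This is not the customer's actual test program; function names and parameters are illustrative, and the real test presumably ran against the SAN LUN rather than an arbitrary file.

```python
# Sketch of the "constant N simultaneous reads" load: each thread keeps
# exactly one read() outstanding, reissuing immediately on completion,
# so the block layer sees a steady demand for N request descriptors.
import os
import random
import threading
import time

def reader(fd, file_size, block_size, counts, idx, stop):
    """Reissue a read at a random offset the moment the previous one completes."""
    while not stop.is_set():
        offset = random.randrange(0, file_size - block_size + 1)
        os.pread(fd, block_size, offset)
        counts[idx] += 1            # read completed; the next is dispatched at once

def constant_read_load(path, num_threads=8, duration_s=1.0, block_size=4096):
    """Hold a constant load of num_threads simultaneous reads for duration_s."""
    fd = os.open(path, os.O_RDONLY)
    file_size = os.fstat(fd).st_size
    stop = threading.Event()
    counts = [0] * num_threads
    threads = [threading.Thread(target=reader,
                                args=(fd, file_size, block_size, counts, i, stop))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    os.close(fd)
    return sum(counts)              # total reads completed across all threads
```

With num_threads pushed past 512 plus the HBA queue depth, this is the shape of workload that exposed the starvation.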
The intuitive conclusion was that the threads were sleeping in
__get_request_wait and were only woken up when there were at least 32 free
request descriptors (the RHEL 3 default batch size). On the other hand, an
already-running thread could allocate new requests without delay even when
fewer than 32 were free. This created a potential for starvation of the
sleeping threads. The customer explained that, though a worst-case scenario,
this actually matched some POP/IMAP daemons' behavior: fast-running processes
had their IO completed, woke up immediately, did their minimal processing, and
submitted their next IO, which used up the requests they had just freed. The
free request pool never rose to 32 entries or higher, and the only chance a
waiting thread had was that it woke up every second regardless, and so might be
lucky enough to be scheduled while some io request descriptors were free.
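The starvation dynamic described above can be shown with a toy model. This is not kernel code: it just counts descriptors, using the numbers from this report (1024 descriptors, batch size 32) as defaults, with the runner behavior collapsed to free-then-immediately-reallocate.

```python
# Toy model of batch-wakeup starvation: runners hold descriptors, free one
# on IO completion, and immediately re-allocate it; a sleeper is only woken
# when the free pool reaches `batch`. Defaults mirror RHEL 3 (1024/32).
def sleeper_ever_woken(total=1024, runners=None, batch=32, steps=100000):
    """Return True if the free pool ever reaches the wakeup threshold."""
    if runners is None:
        runners = total             # worst case: runners hold every descriptor
    free = total - runners
    for _ in range(steps):
        free += 1                   # a runner completes its IO, freeing a slot
        if free >= batch:
            return True             # the sleeper would be woken here
        free -= 1                   # ...but the runner re-allocates first
    return False
```

With a saturating workload the free pool just oscillates between 0 and 1 and never reaches the threshold, which is exactly the reported behavior; only slack of at least the batch size lets the sleeper through.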
1. Setting the io request batch size to 1 (from 32) does *not* completely solve
the problem.
2. A "umount" would make the problem go away.
3. A sync command normally helps (makes the problem go away) too.
Modifying 2) above: "quiesce the system (disable user login) + umount" would make
the problem go away.
3) The effectiveness of "sync" is not well tested yet.
Hello, did this go anywhere? We're seeing behavior that could be similar on a
RHEL3 u4 (27.0.1) kernel with QLA2300 driver version 7.0.3.
We are seeing a similar issue using AMD 64-bit, RHEL 3.0 connected to a
Clariion frame: high IO wait doing file copies, decompression, etc. across
to the SAN. Eventually the system degrades to the point of hanging.
For comment #14, when the system appears to hang, could you umount the
partition and re-mount it to see whether the problem goes away?
Is there any news?
We have a similar issue using IBM blade system with FASt600 connected via
Since IT is closed, I'll close this bugzilla as well.