Description of problem:
The problem was first reported as a qla2300 HBA (SAN setup on an IBM DS4400) performance issue on a RHEL 3.0 imap server running the 2.4.21-27.0.1 kernel. The 733GB disk was formatted as ext2 on top of LVM, and the server was (and still is) in production serving around 29,700 inboxes. The load averages periodically shot up into the 700+ range, and eventually the imap processes would stop taking requests and a reboot was required.

-- Previous Tuning Attempt --
Tried the relevant VM (pagecache), IO (elvtune), and qlogic-specific tunings, but nothing helped. Subsequent debugging sessions concluded that the system:
1. Had a superblock lock contention issue (the s_lock semaphore in lock_super()).
2. Ran out of io request descriptors.

For 1), RHEL 4 was suggested since it has the s_lock logic removed, along with other visible ext2/3 performance improvements. For 2), sent out a RHEL 3 test kernel with the io request descriptor count increased to 8K (up from the 1024 default) to match AS2.1. Also encouraged the customer to spread the workload across smaller filesystems.

-- Current Issue --
The customer came back and stated that the RHEL 3 test kernel behaved similarly to RHEL 4. Though some improvements were observed, the imap threads still got locked up when run on top of larger LUNs. They would like to stay on RHEL 3 because RHEL 4 didn't prove to help the issue and IBM hadn't certified RHEL 4 (with their HW) yet. Also, the problem only shows up in the SAN environment; test runs on aic7xxx scsi on top of JBOD didn't exhibit this issue.

Version-Release number of selected component (if applicable):

How reproducible:
A test system with a smaller disk (136GB) has been configured at the customer's site that can re-create the issue by simulating the production (mail) server workload. A python script is used to pump IO requests (open/read/write) into the system. Will upload the test utility shortly.

Actual results:
1. The customer found a workaround for the s_lock issue (on RHEL 3), so this sub-issue did not show up in the simulation environment.
2. Via instrumented kernels, we found a large number of threads blocked in __lock_page, and it could take longer than 0.65 seconds to get out of the io_schedule polling loop when this problem showed up.

Expected results:
Threads complete io requests within human-tolerable latency (send/recv mails without much delay).

Additional info:
Suspending incoming requests and umounting the filesystem makes the problem go away.
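As an illustration only, a minimal sketch of the kind of IO pump described under "How reproducible" might look like the following. This is a hypothetical reconstruction, not the attached script (attachment 113251); the directory layout, thread count, and chunk size are all assumptions.

```python
#!/usr/bin/env python
# Hypothetical sketch of an open/read/write IO pump; the actual attached
# test utility likely differs. Each thread repeatedly reads its files in
# full and then appends a chunk, keeping a constant request stream
# against the filesystem under test.
import os
import threading

CHUNK = 8192  # per-operation read/write size (assumed)

def pump(paths, rounds):
    """Read each file fully, then append CHUNK bytes, `rounds` times."""
    for _ in range(rounds):
        for p in paths:
            with open(p, "rb") as f:
                while f.read(CHUNK):
                    pass
            with open(p, "ab") as f:
                f.write(b"x" * CHUNK)

def run_load(data_dir, num_threads, rounds):
    """Partition the files in data_dir across num_threads pump threads."""
    names = sorted(os.listdir(data_dir))
    paths = [os.path.join(data_dir, n) for n in names]
    threads = [
        threading.Thread(target=pump, args=(paths[i::num_threads], rounds))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

On a large LUN, running this with a thread count comparable to the production imap concurrency is the scenario the rest of this report analyzes.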
Created attachment 113251 [details] python script
Created attachment 113252 [details] Data set that goes with the python script
(bugzilla doesn't allow editing, so adding some additional info here) 1. To get out of the iowait loop, 0.65 seconds is the *lower* bound - it takes *much* longer than that to get out of the loop (I would need to refine my instrumented kernel if this data is required). 2. I logged about 100 entries of the waited-on locked pages (in a wrap-around buffer) - the threads are waiting on different pages, so it doesn't look like a deadlock. 3. iostat shows the io (to disk) still going strong in the simulation environment. But in the real production environment (where we can't do instrumentation), the io seems to be slower (we don't have definite data yet). I have requested the iostat output from the real production server but haven't received it yet.
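The wrap-around buffer mentioned in point 2 above is kernel C instrumentation; purely as an illustration of the logging scheme, a toy Python equivalent (all names hypothetical):

```python
# Toy equivalent of the wrap-around instrumentation log from point 2
# above (the real instrumentation is C code in the kernel). Once `size`
# entries have been logged, new entries overwrite the oldest, so the
# buffer always holds the most recent `size` events.
class WrapLog:
    def __init__(self, size=100):  # ~100 entries, as mentioned above
        self.entries = [None] * size
        self.count = 0

    def log(self, entry):
        # write position wraps around modulo the buffer size
        self.entries[self.count % len(self.entries)] = entry
        self.count += 1

    def last(self):
        """Return the retained entries, oldest first."""
        n = len(self.entries)
        if self.count <= n:
            return self.entries[:self.count]
        start = self.count % n
        return self.entries[start:] + self.entries[:start]
```

The point of the scheme is bounded memory with no locking on the hot path, at the cost of losing everything but the last N events.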
We also have an in-depth study of the io request descriptor problem, which seems to exhibit a thundering-herd effect when free descriptors become available. Will upload it as an attachment to avoid making this bugzilla too long to read.
Examining the issue further (and roughly going through the 2.6 kernel code), I've decided to describe the io request descriptor issue here for visibility (since it seems to be fixable). The customer, Chris Siebenmann, deserves kudos for most of the work.

The experiment was done with an instrumented kernel using the default 1k io request descriptors. It recorded the number of times threads had waited in __get_request_wait and the maximum wait-time interval. A multi-threaded program created a constant load of N simultaneous read IOs on the disk: each thread issued a read() and, as soon as it completed, immediately dispatched another read(). At an N larger than 512 plus the qla queue length, some unlucky threads were observed starving almost completely in __get_request_wait. They could only get out of the loop (for the first time) when the whole set of threads from the test program exited.

The intuitive conclusion was that threads sleeping in __get_request_wait are only woken up when there are at least 32 free request descriptors (the RHEL 3 default batch size). An already-running thread, on the other hand, can allocate new requests without delay even when fewer than 32 are free. This creates a potential for starvation of the sleeping threads. As the customer explained, though a worst-case scenario, this is actually how some POP/IMAP daemons behave: fast-running processes have their IO complete, wake up immediately, do their minimal processing, and submit their next IO, using up the request descriptors they just freed. The free request pool never rises to 32 entries or higher, and the only chance a waiting thread has is that it wakes up every second regardless, and so might be lucky enough to be scheduled while some io request descriptors are free.
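The starvation dynamic above can be shown with a toy discrete-event model (this is a sketch of the mechanism, not kernel code; the 1024-descriptor pool and batch size of 32 are the values from this report):

```python
# Toy model of the batched wakeup described above: running threads may
# allocate whenever free > 0, but a sleeping thread is only woken once
# free >= BATCH. Steady recyclers therefore keep the free pool
# oscillating near zero and the sleeper is never woken.
BATCH = 32  # RHEL 3 default batch size, per the report

def simulate(descriptors, runners, steps):
    """Count how many times a single sleeper gets served in `steps`
    completion/resubmit cycles. Simplified: runners beyond the
    descriptor pool are ignored rather than queued."""
    free = descriptors - min(runners, descriptors)
    sleeper_served = 0
    for _ in range(steps):
        free += 1          # a runner's IO completes, freeing a descriptor
        if free >= BATCH:  # only now would the sleeper be woken
            sleeper_served += 1
            free -= 1      # sleeper takes a descriptor
        free -= 1          # runner resubmits immediately, without sleeping
    return sleeper_served

# With enough runners to exhaust the pool, free oscillates between 0 and
# 1 and never reaches BATCH, so the sleeper starves:
print(simulate(1024, 1024, 10000))  # -> 0
# With slack in the pool, the sleeper is served normally:
print(simulate(1024, 512, 100) > 0)  # -> True
```

Setting the batch size to 1 removes the threshold in this toy model, which matches the observation below that batch size 1 helps but (in the real system, where other effects are in play) does not completely solve the issue.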
Critical info: 1. Setting the io request batch size to 1 (from 32) does *not* completely solve the issue. 2. A "umount" makes the problem go away. 3. A sync command normally helps (makes the problem go away) too.
Correction to 2) above: "quiesce the system (disable user logins) + umount" makes the problem go away. 3) The effectiveness of "sync" has not been well tested yet.
Hello, did this go anywhere? We're seeing behavior that could be similar on a RHEL3 U4 (27.0.1) kernel with the QLA2300 driver, version 7.0.3.
We are seeing a similar issue using 64-bit AMD, RHEL 3.0 connected to a Clariion frame: high IO wait doing file copies, decompression, etc. across to the SAN. Eventually the system degrades to the point of hanging.
For comment #14: when the system appears to hang, could you umount the partition and re-mount it to see whether the problem goes away?
Is there any news? We have a similar issue using an IBM blade system with a FASt600 connected via a qla2300 HBA.
Since IT is closed, I'll close this bugzilla as well.