Bug 241540
Summary: NBD module in RHEL5 deadlocks (regression)
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.0
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Iain Wade <iwade>
Assignee: Jarod Wilson <jarod>
QA Contact: Martin Jenner <mjenner>
CC: davem, dzickus, esandeen, jmoyer, nhorman, snitzer
Hardware: All
OS: Linux
Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 14:43:21 UTC
Description (Iain Wade, 2007-05-27 18:38:23 UTC)
How is it that this nbd issue is only perceived as a "medium" severity and priority? Any application that relies on the kernel's nbd support is dead in the water on RHEL5. This is a show-stopper for me. Please strongly consider raising the priority/severity.

Can either of you also reproduce the issue with the latest upstream kernel? I see a few possibly relevant patches upstream, and I could throw them atop a RHEL5-based kernel build, but it should be possible to check functionality by simply installing the latest Fedora rawhide kernel on a RHEL5 box. Some interesting reference material: http://lwn.net/Articles/194569/

The comment#3 reference to the "Network receive deadlock prevention for NBD" has absolutely nothing to do with NBD working on a system that is not even remotely loaded. RHEL5's nbd can't even negotiate with the nbd-server. It locks up. This has nothing to do with low-memory conditions. The RHEL5 nbd problem is _much_ more basic. But don't let me dissuade anyone at Red Hat from embracing PeterZ's more comprehensive network deadlock avoidance patchset (it is a very real VM problem with Linux when writeout depends on a networked resource, not just nbd).

That said, NBD also has a serious issue in the nbd-server under low-memory conditions. I'm currently researching/testing setting PF_MEMALLOC from within the userspace nbd-server to address issues associated with the following scenario: http://marc.info/?l=linux-mm&m=118981112030719&w=2 But even this has no bearing on _basic_ NBD functionality. I agree with Mike's comments that although the low-memory deadlock situation needs fixing at some point, it is not relevant in this case.

Ubuntu 7.04's 2.6.20 kernel works. mke2fs, mount, read/write, and sync all work on RHEL5 with rawhide's 2.6.23-0.181.rc6.git4.
Another problem crops up for me on 2.6.23: during a software raid build on top of NBD, NBD issues block requests which are not multiples of my requested NBD_SET_BLKSIZE parameter (I set a block size of 4k, but get request sizes which are multiples of 512b, like 3072b, 7168b, etc.), which causes problems for my custom NBD server for various reasons. It can be worked around, though.

Sorry guys, didn't mean to suggest PeterZ's stuff was the fix to this particular bug, just wanted to list it for future reference.

I was actually hoping that the locking changes already in the upstream kernel would remedy the basic problem, which it sounds like they may well do, based on Iain's testing. I'll try to get a RHEL5 test kernel with backported upstream changes posted on people.redhat.com by the end of the day. I wasn't actively looking, but I don't recall seeing anything in the nbd code between 2.6.20 and 2.6.23 that appeared related to the block size issue, so perhaps that's another problem from the software raid code that we'll not inherit.

Created attachment 197531 [details] nbd updates backported from upstream kernel

The attached patch backports the following upstream changes to the RHEL5 kernel:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6b39bb6548d60b9a18826134b5ccd5c3cef85fe2
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=84963048ca8093e0aa71ac90c2a5fe7af5f617c3
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=be0ef957c9eed4ebae873ee3fbcfb9dfde486dec

Test kernels with the above set of patches included can be found here: http://people.redhat.com/jwilson/test_kernels/2.6.18-47.el5.xi10/ Please give them a go and report back with results!

Still broken here: sync blocks.

Okay, in that case, I'm thinking the bug may lie in the software raid layer.
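On the block-size mismatch described above (a negotiated 4 KiB NBD_SET_BLKSIZE but request sizes that are only 512-byte multiples, like 3072 or 7168 bytes): a custom server has to defend against such requests. Here is a minimal Python sketch of one possible workaround, checking alignment and widening an unaligned request to the covering block boundaries. The function names and the 4096-byte block size are illustrative assumptions, not taken from the original sc101-nbd server:

```python
BLKSIZE = 4096  # block size the client asked for via NBD_SET_BLKSIZE (assumed)

def aligned(offset, length, blksize=BLKSIZE):
    """True if a request falls entirely on negotiated block boundaries."""
    return offset % blksize == 0 and length % blksize == 0

def split_unaligned(offset, length, blksize=BLKSIZE):
    """Widen an unaligned request to the smallest covering aligned region.
    The server can then read whole blocks and discard the excess bytes."""
    start = (offset // blksize) * blksize          # round offset down
    end = -(-(offset + length) // blksize) * blksize  # round end up
    return start, end - start
```

For example, a 3072-byte read at offset 0 would be widened to one full 4096-byte block, and the server would return only the first 3072 bytes to the client.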
I've already spotted a few possibly relevant changes in the software raid code; will get another test kernel together soonish. I'm starting to think perhaps I should also try to reproduce the problem with some systems in my office...

Updated test kernel with some additional software raid deadlock prevention patches included is slowly making its way out to people.redhat.com: http://people.redhat.com/jwilson/test_kernels/2.6.18-48.el5.xi11/ However, re-reading the problem description, it doesn't actually mention that any software raid is involved, so perhaps looking at the software raid code was an exercise in futility... I'll have a test setup of my own together shortly.

Thanks for your continued efforts on this NBD issue! I now have 3 RHEL5 systems to test with and will be able to help give your kernels a spin as they are released. BTW, I'd recommend using nbd-2.8.8 rather than nbd-2.9.x. But yes, the original issue is just the fact that NBD isn't happy. At this point I can't recall why/how that was the case; I'll revisit this shortly. I've been fighting with MD+NBD, so my comment#4 that said as much obviously clouded the issue.

Heh, my prior knowledge of the ways in which Linux Networx uses nbd clouded the issue as well... ;) I've now got a rawhide box serving up a file-backed nbd, more or less like so:

# dd if=/dev/zero of=/root/nbd-file bs=1k count=1M
# losetup -f /root/nbd-file
# nbd-server 12345 /dev/loop0

Then I've got a rhel5 box running the client. Thus far, it's been able to connect to the server, format /dev/nb0 and sync, all without a problem. This was all done with nbd 2.9.7, built with debugging enabled (and all the debug output looks sane) and the previously mentioned ~1GB file-backed storage. I'll give it a go with 2.8.8 and see what transpires, but I'd like to hear more details on the specific setups that are having problems, such as:

1) nbd server version
2) nbd client version
3) backing store type and size on server

Okay, been poking more...
For some reason, nbd 2.8.8 doesn't build on the rawhide box (it complains about nbd.h, despite it being there and working with 2.9.7), and mixing 2.8.8 on the rhel5 box with 2.9.7 on the rawhide box leads to a protocol error. However, using 2.9.7 on both sides, I've now got things locking up. I just needed to start pushing more actual data across. I'll keep poking at this particular setup and see what I can figure out...

Just for fun, can you try reverting: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ea817398e68dfa25612229fda7fc74580cf915fb and see if the problem persists? Thanks, -Eric

For the benefit of those playing along at home: I pinged Eric after taking a look at alt-sysrq-t output after inducing a lock-up, and seeing that nbd-client was actually *not* deadlocked; it's kjournald where the actual deadlock appears to be occurring. I'll revert the above changeset in the morning and see what I can see then, as well as doing some testing with ext2 and possibly xfs to help verify the source of the issue. One more possibly relevant reference Eric dropped on me: http://thread.gmane.org/gmane.linux.file-systems/7374

Actually, the patch I referred to may not do it, but the patch in the previous comment is probably more applicable... See also http://thread.gmane.org/gmane.linux.file-systems/7374 "The problem we're trying to solve here is how do implement network block devices (nbd, iscsi) efficiently. The zero copy codepath in the networking layer does need to grab additional references to pages. So to use sendpage we need a refcountable page. pages used by the slab allocator are not normally refcounted so try to do get_page/put_page on them will break." I'd still try reverting the git commit above first, though, and then I'll study up and make sure I know what's going on here. :)

In response to comment#13: I'm pretty sure Linux Networx used SteelEye's MD+NBD based product.
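On the protocol error between mismatched nbd versions above: a negotiation failure generally points at the handshake phase. The server greeting starts with the fixed 8-byte password "NBDMAGIC" followed by an 8-byte magic identifying the negotiation style. A Python sketch of classifying a greeting; the magic constants come from the NBD protocol description, but whether this particular version mix actually tripped on negotiation style is an assumption:

```python
import struct

INIT_PASSWD = b"NBDMAGIC"          # fixed 8-byte greeting password
CLISERV_MAGIC = 0x00420281861253   # "oldstyle" negotiation magic
OPTS_MAGIC = 0x49484156454F5054    # "newstyle" magic, ASCII "IHAVEOPT"

def classify_greeting(data):
    """Classify the first 16 bytes a server sends:
    oldstyle, newstyle, or not NBD at all."""
    if len(data) < 16 or data[:8] != INIT_PASSWD:
        return "not-nbd"
    (magic,) = struct.unpack(">Q", data[8:16])
    if magic == CLISERV_MAGIC:
        return "oldstyle"
    if magic == OPTS_MAGIC:
        return "newstyle"
    return "unknown"
```

A client expecting one style and receiving the other typically reports exactly this kind of "protocol error" and aborts before any I/O happens.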
I worked for LNXI but heard from an ex-co-worker that they embraced the SteelEye stuff just after I left.

Also, using small loop devices isn't the best test of NBD; larger block devices cause mke2fs to allocate many more dirty pages and increase memory pressure considerably. So mke2fs on something like a 500GB or 750GB disk really is an entirely different experience. Not sure if you have multiple disks like that kicking around. For those inclined to really ask for deadlocks, try setting up the MD+NBD config from the linux-mm post referenced in comment#3... I know; I know... we need RHEL5's nbd to crawl before it can walk and/or run! But the same deadlock may be easily reproducible without MD in the picture. Import largediskA on serverB and largediskB on serverA; simultaneously run mke2fs on serverA's and serverB's respective /dev/nbdX and see what happens.

I'm trying hacks such as exposing PF_LESS_THROTTLE and PF_MEMALLOC to userspace (via prctl) and having the nbd-server set those flags in the child nbd-server process. I should have results on whether that'll make a difference within the next 2 days.

Created attachment 200431 [details]
patch to use page allocations for anything sent to block layer
Jarod, forget about reverting that patch I pointed to, and give this one a
shot, please. This matches what's currently pending for upstream.
Thanks,
-Eric
Hm, though, Jarod... ok, maybe I'd better talk to you offline to get up to speed. :) I heard "kjournald deadlock nbd" and made some assumptions, but now looking at the first comment, if just mkfs to the block device causes the lockup, how does kjournald come into it.... sorry if I jumped the gun. Let's talk tomorrow am.

(In reply to comment #18)
> I'm pretty sure Linux Networx used SteelEye's MD+NBD based product. I worked
> for LNXI but heard from an ex-co-worker that they embraced the SteelEye stuff
> just after I left.

I left LNXI before you did. :) At the time, Nate's hacky stuff for failover between cluster head nodes was still more prominent than SteelEye's stuff, at least where I was -- onsite at Boeing in Bellevue, WA. I think the last cluster I helped bring up there just before I left was one of the earliest SteelEye deployments.

(In reply to comment #20)
> Hm, though, Jarod... ok, maybe I'd better talk to you offline to get up to
> speed. :) I heard "kjournald deadlock nbd" and made some assumptions, but now
> looking at the first comment, if just mkfs to the block device causes the
> lockup, how does kjournald come into it....

In my test setup, mkfs didn't deadlock; it wasn't until I started firing decent chunks of data at nbd that it locked up, but then it was only a 1GB test partition, so maybe mkfs would lock on a larger partition... In the OP's case, it sounded like mkfs was successful, but a subsequent sync call failed. At least, that was what I was thinking... Iain, can you confirm exactly where things lock up in your case?

> sorry if I jumped the gun. Let's talk tomorrow am.

No worries, I shouldn't have prodded ya right at the end of the day. ;) Tomorrow morning it is.

I've had both situations: mke2fs hanging near the end (last output line is "Writing inode tables: done") as well as everything seeming OK, doing a small amount of IO, typing sync to flush, and then sync blocking. The device is 400GB in size.
In either case it's very reproducible and quick to happen. I have to admit my environment is not like the one you are, or would be, testing. I wrote a custom NBD server which translates NBD block IO requests received over a unix domain socket into network packets for a Netgear SC101 device. I had to reverse-engineer the packet format from tcpdump captures. http://code.google.com/p/sc101-nbd/ I have personally run my code successfully (creating, initialising and using a 4x400GB software RAID5) on RHEL4 (2.6.9xxx) and Ubuntu 7.04 (2.6.20xxx). Users have had success on a number of other distros/kernels. If you can recreate the hang in a vanilla environment, that's even better. If you'd like access to the machine, that's no problem. If there is anything you'd like collected I would be happy to do so. Here is my alt+sysrq+t output: http://members.optusnet.com.au/iwade/alt-sysrq-t.txt

Somehow, I missed the deadlocked pdflush processes in my sysrq-t output, but *that* is actually what seems to be the common thread between Iain's output, mine, and also what Eric is seeing doing nothing more than dd'ing /dev/zero to an nbd. I'm going to collect a few more data points with assorted kernel versions, then I think it's time to start in on a git bisection. Iain, btw, very neat stuff there w/the Netgear!

Also seeing blk_congestion_wait popping up in the trace, which is what Daniel Phillips says is biting Mike... http://lkml.org/lkml/2007/9/18/12 Dunno yet whether or not this patch is potentially relevant (or if it's already in our 2.6.18), but don't want to lose track of it: http://www.linux-nfs.org/Linux-2.6.x/2.6.18-rc4.2/linux-2.6.18-064-add_fixes_for_the_congestion_wait_crap.dif (gotta go work on a few other things for a spell).

The patch from Trond does indeed look to be highly relevant, and is not in our 2.6.18 tree. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=275a082fe9308e710324e26ccb5363c53d8fd45f Will tackle that after lunch...
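For context on the custom sc101-nbd server mentioned above: a userspace NBD server's inner loop decodes a fixed-size request header off the socket before touching its backing store. A Python sketch of that decode step, using the classic wire format (a 0x25609513 magic, then command type, an opaque 8-byte handle, offset, and length, all big-endian, per the NBD protocol description). The dict-shaped return value is an illustrative choice, not how the actual sc101-nbd code is structured:

```python
import struct

NBD_REQUEST_MAGIC = 0x25609513
NBD_CMD_READ, NBD_CMD_WRITE, NBD_CMD_DISC = 0, 1, 2

def parse_request(buf):
    """Decode one 28-byte classic NBD request header:
    u32 magic, u32 type, 8-byte handle, u64 offset, u32 length."""
    magic, cmd, handle, offset, length = struct.unpack(">II8sQI", buf[:28])
    if magic != NBD_REQUEST_MAGIC:
        raise ValueError("bad request magic: 0x%x" % magic)
    return {"cmd": cmd, "handle": handle, "offset": offset, "length": length}
```

The handle is echoed back verbatim in the reply so the kernel can match completions to outstanding requests; the server never interprets it.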
It doesn't appear as though _any_ released kernel has that patch from Trond. Andrew Morton later obsoleted Trond's changes: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3fcfab16c5b86eaa3db3a9a31adba550c5b67141 To be clear, I still get the blk_congestion_wait() hang with 2.6.22.6.

Yeah, noticed it was later obsoleted. Looked so promising at first. :) I've done a bit of further poking, using a simple "dd if=/dev/zero of=/dev/nbd0" test:

- All 2.6.18 renditions thus far have failed, deadlocking almost immediately when data starts being written
- Ditto for a 2.6.20 kernel build (2.6.20-1.2962.fc6)
- A 2.6.22.6 kernel (2.6.22.6-55.fc6) made it through the entire dd (1GB worth of data), a subsequent sync call, mke2fs -j, mounting the partition and copying a kernel git tree onto it. Doing a kernel build in that git tree on the nbd now. So far, so good...

Oops, ran out of space on the device, time to create a bigger nbd backing store... I'm thinking a git-bisect between 2.6.20 and 2.6.22 is probably in order.

Nb: still working on the git-bisect, about 7 bisections and test cycles in. Likely to be Monday before isolating the offending changeset and attempting a back-port to 2.6.18. Then once we get the basic functionality actually working, I'd suggest a new bug to track the additional deadlock Mike is seeing under higher traffic loads.

Responding to comment#30: sounds like a plan. BTW, I'm also seeing some extremely disturbing behavior where NBD appears to be buffering writes (which should be synchronous via MD raid1). PeterZ addressed this in a way that Jens Axboe didn't approve.
But a "correct" fix hasn't ever been developed, AFAIK. Anyway, here are some useful links:
http://lkml.org/lkml/2006/9/12/204
http://lkml.org/lkml/2007/4/29/283

I even tried to engage peterz and paul clements (nbd maintainer) on this issue but never got a response: http://lkml.org/lkml/2007/6/26/212

Just about done with a second bisection run; I think I screwed up somewhere along the line on the first one[*]. Should have something definitive by lunch tomorrow...

[*]: git-bisect doesn't like it when HEAD is good and an earlier version is bad, so you have to reverse the meaning of good and bad to make it actually do anything, and I think I probably accidentally didn't reverse at one of the bisection points the first time through. Ugh.

Okay, git-bisect claims it's this changeset:

--
commit 498d3aa2b4f791059acd8c942ee8fa15c2ce36c2
Author: Jens Axboe <jens.axboe>
Date: Thu Apr 26 12:54:48 2007 +0200

    [PATCH] cfq-iosched: style cleanups and comments

    Signed-off-by: Jens Axboe <jens.axboe>

:040000 040000 098f17624a0007b506367b4df9fc42580dcefe8a 8bf593a2d6aab9f0f8d1603e9d29656f56b1fd0c M block
--

However, it may actually be the commit just after that; not entirely sure yet, because of the reversed logic and it getting late... For reference, the next commit:

    cfq-iosched: slice offset should take ioprio into account
    Jens Axboe [Fri, 20 Apr 2007 12:18:00 +0000 (14:18 +0200)]
    Use the max_slice-cur_slice as the multipler for the insertion offset.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=67e6b49e39e9b9bf5ce1351ef21dad391856183f

More fun come morning... I musta screwed up somewhere in the bisection again; manually reverting to the version just before the commit git-bisect identified still gives me a working nbd, which makes sense after looking at that changeset, which contains zero functional changes. More after lunch...

Okay, figured out what went wrong in the last bisection, and manually stepped through a few things to double-check.
The commit that finally fixed things is this one:

--
cfq-iosched: development update
Jens Axboe [Wed, 25 Apr 2007 10:44:27 +0000 (12:44 +0200)]

- Implement logic for detecting cooperating processes, so we choose the best available queue whenever possible.
- Improve residual slice time accounting.
- Remove dead code: we no longer see async requests coming in on sync queues. That part was removed a long time ago. That means that we can also remove the difference between cfq_cfqq_sync() and cfq_cfqq_class_sync(), they are now indentical. And we can kill the on_dispatch array, just make it a counter.
- Allow a process to go into the current list, if it hasn't been serviced in this scheduler tick yet.

Possible future improvements including caching the cfqq lookup in cfq_close_cooperator(), so we don't have to look it up twice. cfq_get_best_queue() should just use that last decision instead of doing it again.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6d048f5310aa2dda2b5acd947eab3598c25e269f
--

However, the very next commit cleans up a few things in this commit, and the two after that also look worth taking if we're going to take this much. Not to mention I've not yet determined how difficult this will be to wedge into the rhel5 kernel. Meanwhile, in my own testing, PeterZ's patch here...

http://lkml.org/lkml/2006/7/7/164

...applies quite easily to the rhel5 kernel and also appears to remedy the problem. Also, given that what PeterZ's patch largely does is simply force nbd to use the noop scheduler, it should be possible to simply:

echo noop > /sys/block/nbd0/queue/scheduler

and/or boot with elevator=noop, and have working nbd with the current kernel. Will try that myself shortly. Other schedulers may work as well...

So this issue is two-fold:

1) Red Hat's default IO scheduler is cfq; hence the fact that reverting/fixing the area affected by Jens' commit eliminates the problem.
2) At the heart of it, nbd devices shouldn't have a scheduler at all.

I've been using the very peterz patch you referenced. It does two things: 1) pins nbd devices to the noop scheduler, and 2) reduces the nbd devices' associated queue to a single scatter-gather segment accommodating only a single page (in flight, AFAIK). The noop pinning is good; the excessive throttling of nbd devices is quite bad! I actually deployed a custom kernel that pins nbd to noop, but I had to leave out the following because the system would be nearly unusable (almost deadlocking -- the very thing the throttle was meant to avoid) when doing mke2fs on an MD with a single nbd member. So yeah, I left off these blk_queue_* calls:

+ blk_queue_max_segment_size(disk->queue, PAGE_SIZE);
+ blk_queue_max_hw_segments(disk->queue, 1);
+ blk_queue_max_phys_segments(disk->queue, 1);

I forced the scheduler to noop and mke2fs completed (a very rudimentary test), so I tried to build a RAID5 array across a bunch of NBD devices. It stalled part way through the resync. If I try to directly access any of the md member disks once the resync is stalled, the reading process hangs in D state and no NBD requests are generated. Noop by itself does not solve all the problems I'm seeing. http://members.optusnet.com.au/iwade/alt-sysrq-t.noop.txt

I would say I disagree with the assertion that nbd shouldn't have a block I/O scheduler at all. Just like any other block device, nbd will benefit from the accumulation of block I/O chunks in the same range so that they can be sent out as one linear I/O request. If you pin nbd to no-op and this makes the problem go away, we'll never fix whatever the real issue is, and that bug might have other implications that will just show up elsewhere.

Fixing cfq definitely seems like the right thing to do here, and I've got a working back-port, but I'm trying to trim down the diff a bit, as there are multiple changes in the single changeset, and I'd like to keep the changes to the necessary minimum.
Iain, curious to know: does your raid5 scenario in comment #38 deadlock under 2.6.22 as well?

In response to Dave's comment#39: yes, I agree it would be best to reduce the amount of overhead related to NBD IO requests. I lost sight of that aspect and was focused on the fact that the nbd-server's block device would have its own IO scheduler to optimize the writeout. PeterZ stated that _any_ IO scheduler would cause NBD to deadlock. I myself haven't seen noop _really_ help NBD; I was merely following supposed best practice.

I think I have finally isolated the bits within the upstream changeset that actually remedy this particular problem... I'm patching that atop a RHEL5 kernel build for testing right now, and assuming all goes well, will make the kernel available for others to try out as well.

Test kernels for i686 and x86_64 available here: http://people.redhat.com/jwilson/test_kernels/2.6.18-52.el5.xi12/ Please test and report back with results. The basic 'dd if=/dev/zero of=/dev/nbd0' test passes with flying colors for me.

Basically, so far as I can tell, we're improperly checking the return value of cfq_arm_slice_timer() to determine whether or not to schedule block I/O -- it's possible we still need to schedule block I/O no matter what that return value is; I just don't completely understand everything that is going on in cfq_arm_slice_timer() yet. The problem is remedied by adjusting the check for whether or not to call cfq_schedule_dispatch() to look at cfqd->rq_in_driver instead -- dispatch if the driver has pending requests queued.
--- linux-2.6/block/cfq-iosched.c	2007-10-01 15:42:03.000000000 -0400
+++ linux-2.6.cfq/block/cfq-iosched.c	2007-10-01 15:42:33.000000000 -0400
@@ -1866,11 +1866,12 @@ static void cfq_completed_request(reques
 	if (cfqd->active_queue == cfqq) {
 		if (time_after(now, cfqq->slice_end))
 			cfq_slice_expired(cfqd, 0);
-		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list)) {
-			if (!cfq_arm_slice_timer(cfqd, cfqq))
-				cfq_schedule_dispatch(cfqd);
-		}
+		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
+			cfq_arm_slice_timer(cfqd, cfqq);
 	}
+
+	if (!cfqd->rq_in_driver)
+		cfq_schedule_dispatch(cfqd);
 }
 
 static struct request *

Sorry for not getting back to you sooner. I have pushed a couple of TBs through NBD on the new kernel and it has not had any problems so far:

* created a RAID5 across 4x400GB disks, resync
* created a RAID1 across 2x1TB disks, resync
* created filesystems on each
* copied a hundred GB around
* sync many times
* checksummed the data OK

I used the noop IO scheduler at first but then switched back to cfq, as that seems to be where the change was made. Both have been perfect. Looks like a winner. Thanks.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

In 2.6.18-58.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Given comment#48, and the fact that Dick's latest kernel is now 2.6.18-92.el5, what is the status of this CFQ fix? Has it made it into a kernel that Red Hat is (or will be) releasing?

Yes, it'll be in the 5.2 GA kernel.

Awesome, thanks Jarod.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html

Prior to that kernel update, I wasn't able to run any bonnie++ against my nbd server. Since that update, I'm able to complete 3 bonnie runs, then it deadlocks again. Is anyone able to stress this nbd driver without deadlocking?

Erwan, are you able to unlock nbd by switching schedulers? (See comment #36.) It's entirely possible there's another issue somewhere in the cfq code we need to address, but it's also possible you're hitting something else, such as the memory pressure deadlock referenced in comment #3... Really though, since this bug has been closed and the original problem reported here is indeed confirmed as fixed, we should get another bug opened for this.

Sorry to dig up an old thread, but this fix does not look valid to me.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 99e492a..a8ac140 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1866,11 +1866,12 @@ static void cfq_completed_request(request_queue_t *q, struct request *rq)
 	if (cfqd->active_queue == cfqq) {
 		if (time_after(now, cfqq->slice_end))
 			cfq_slice_expired(cfqd, 0);
-		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list)) {
-			if (!cfq_arm_slice_timer(cfqd, cfqq))
-				cfq_schedule_dispatch(cfqd);
-		}
+		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
+			cfq_arm_slice_timer(cfqd, cfqq);
 	}
+
+	if (!cfqd->rq_in_driver)
+		cfq_schedule_dispatch(cfqd);

If cfq_arm_slice_timer() returns 0, it means it did *not* arm the timer. In this case, schedule a queue dispatch immediately. If it did arm the timer, the timer should fire after the idle slice timeout has expired (8ms by default). In no case should the I/O remain blocked. It looks to me as though you've papered over a problem.