241540 – NBD module in RHEL5 deadlocks (regression)

Summary: NBD module in RHEL5 deadlocks (regression)

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Jarod Wilson
QA Contact:	Martin Jenner

Reported:	2007-05-27 18:38 UTC by Iain Wade
Modified:	2009-04-28 13:17 UTC
CC List:	6 users
Fixed In Version:	RHBA-2008-0314
Doc Type:	Bug Fix
Last Closed:	2008-05-21 14:43:21 UTC

Attachments
nbd updates backported from upstream kernel (3.67 KB, patch) 2007-09-17 15:19 UTC, Jarod Wilson
patch to use page allocations for anything sent to block layer (7.76 KB, patch) 2007-09-20 03:12 UTC, Eric Sandeen

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0314	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5.2	2008-05-20 18:43:34 UTC

Description Iain Wade 2007-05-27 18:38:23 UTC

When using the RHEL5 kernel NBD module, the block devices often and repeatedly deadlock.

Downgrading the OS on the same client to RHEL4.5 works with the same NBD server.

Steps to Reproduce:
1. connect to an NBD server
2. run mke2fs on the NBD device
3. run sync
  
Actual results:

sync never completes.
the machine is otherwise responsive.
the NBD server is idle, waiting for requests.

Expected results:

the dirty blocks should be flushed out, sync should complete.
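
A hang like this can be detected without tying up the terminal. The following is a minimal sketch and not part of the original report: finishes_within is a hypothetical helper that runs a command in the background and polls it, since a process stuck in uninterruptible sleep may not respond to being killed by a timeout wrapper. The device name and the 60-second deadline in the usage comment are assumptions.

```shell
# Hypothetical helper (not from the report): run a command in the
# background and report whether it finished within a deadline.
# Usage: finishes_within SECONDS COMMAND [ARGS...]
# Returns the command's own exit status if it finished in time,
# or 1 if it is still running when the deadline passes.
finishes_within() {
    deadline=$1
    shift
    "$@" &
    pid=$!
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    wait "$pid"
}

# Example use after "mke2fs /dev/nbd0" (device name assumed):
#   finishes_within 60 sync || echo "sync did not complete: possible NBD deadlock"
```

The helper leaves a hung command running rather than waiting on it, so the shell stays usable for collecting traces.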

Comment 1 Mike Snitzer 2007-07-24 13:38:28 UTC

How is it that this nbd issue is only perceived as a "medium" severity and priority?

Any application that relies on the kernel's nbd support is dead in the water on
RHEL5.  This is a show stopper for me.

Please strongly consider raising priority/severity.

Comment 2 Jarod Wilson 2007-09-14 15:04:21 UTC

Can either of you also reproduce the issue with the latest upstream kernel? I see a few possibly relevant 
patches upstream, and I could throw them atop a RHEL5-based kernel build, but it should be possible to 
check functionality by simply installing the latest fedora rawhide kernel on a RHEL5 box.

Comment 3 Jarod Wilson 2007-09-14 17:54:29 UTC

Some interesting reference material:

http://lwn.net/Articles/194569/

Comment 4 Mike Snitzer 2007-09-15 18:17:25 UTC

The comment#3 reference to the "Network receive deadlock prevention for NBD" has
absolutely nothing to do with NBD working on a system that is not even remotely
loaded.

RHEL5's nbd can't even negotiate with the nbd-server.  It locks up.  This has
nothing to do with low memory conditions.  The RHEL5 nbd problem is _much_ more
basic.

But don't let me dissuade anyone at redhat from embracing PeterZ's more
comprehensive network deadlock avoidance patchset (it is a very real VM problem
with Linux when writeout depends on a networked resource, not just nbd).

That said, NBD also has a serious issue in the nbd-server under low memory
conditions.  I'm currently researching/testing setting PF_MEMALLOC from within
the userspace nbd-server to address issues associated with the following scenario:
http://marc.info/?l=linux-mm&m=118981112030719&w=2

But even this has no bearing on _basic_ NBD functionality.

Comment 5 Iain Wade 2007-09-16 08:39:19 UTC

I agree with Mike's comments that although the low-memory deadlock situation needs fixing at some 
point, it is not relevant in this case.

Ubuntu 7.04's 2.6.20 kernel works.

mke2fs, mount, read/write, sync works on RHEL5 w/ rawhide's 2.6.23-0.181.rc6.git4.

Another problem crops up for me on 2.6.23: during a software raid build on top of NBD, NBD 
issues block requests which are not multiples of my requested NBD_SET_BLKSIZE parameter (I set a block 
size of 4k, but get request sizes which are multiples of 512 bytes, such as 3072 and 7168 bytes). This causes 
problems for my custom NBD server for various reasons. It can be worked around though.
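
To make the size mismatch concrete, here is a small sketch (the helper name is_multiple_of is made up for illustration) that checks the two request sizes quoted above against both the 512-byte granularity and the 4096-byte block size requested via NBD_SET_BLKSIZE.

```shell
# Hypothetical helper: succeed if LENGTH is a whole multiple of UNIT.
# Usage: is_multiple_of LENGTH UNIT
is_multiple_of() {
    [ $(( $1 % $2 )) -eq 0 ]
}

# Request sizes reported above, checked against both granularities.
for len in 3072 7168; do
    if is_multiple_of "$len" 512 && ! is_multiple_of "$len" 4096; then
        echo "$len bytes: a multiple of 512, but not of the 4k block size"
    fi
done
```

A server that assumes whole 4k blocks would have to split or pad such requests, which is presumably the kind of workaround meant above.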

Comment 6 Jarod Wilson 2007-09-17 15:05:47 UTC

Sorry guys, didn't mean to suggest PeterZ's stuff was the fix to this particular
bug, just wanted to list it for future reference. I was actually hoping that the
locking changes already in the upstream kernel would remedy the basic problem,
which it sounds like they may well do, based on Iain's testing.

I'll try to get a RHEL5 test kernel with backported upstream changes posted on
people.redhat.com by the end of the day. I wasn't actively looking, but I don't
recall seeing anything in the nbd code between 2.6.20 and 2.6.23 that appeared
related to the block size issue, so perhaps that's another problem from the
software raid code that we'll not inherit.

Comment 7 Jarod Wilson 2007-09-17 15:19:21 UTC

Created attachment 197531 [details]
nbd updates backported from upstream kernel

The attached patch backports the following upstream changes to the RHEL5
kernel:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6b39bb6548d60b9a18826134b5ccd5c3cef85fe2

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=84963048ca8093e0aa71ac90c2a5fe7af5f617c3

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=be0ef957c9eed4ebae873ee3fbcfb9dfde486dec

Comment 8 Jarod Wilson 2007-09-17 15:56:29 UTC

Test kernels with the above set of patches included can be found here:

http://people.redhat.com/jwilson/test_kernels/2.6.18-47.el5.xi10/

Please give them a go and report back with results!

Comment 9 Iain Wade 2007-09-18 03:47:37 UTC

still broken here: sync blocks.

Comment 10 Jarod Wilson 2007-09-18 15:39:14 UTC

Okay, in that case, I'm thinking the bug may lie in the software raid layer.
I've already spotted a few possibly relevant changes in the software raid code,
will get another test kernel together soonish.

I'm starting to think perhaps I should also try to reproduce the problem with
some systems in my office...

Comment 11 Jarod Wilson 2007-09-19 19:16:03 UTC

An updated test kernel with some additional software raid deadlock prevention
patches included is slowly making its way out to people.redhat.com:

http://people.redhat.com/jwilson/test_kernels/2.6.18-48.el5.xi11/

However, in re-reading the problem description, it doesn't actually mention that
any software raid is involved, so perhaps looking at the software raid code was
an exercise in futility...

I'll have a test setup of my own together shortly.

Comment 12 Mike Snitzer 2007-09-19 19:29:37 UTC

Thanks for your continued efforts on this NBD issue!  I now have 3 RHEL5 systems
to test with and will be able to help give your kernels a spin as they are
released.  BTW, I'd recommend using nbd-2.8.8 rather than nbd-2.9.x.

But yes, the original issue is just the fact that NBD isn't happy.  At this
point I can't recall why/how that was the case; I'll revisit this shortly.

I've been fighting with MD+NBD so my comment#4 that said as much obviously
clouded the issue.

Comment 13 Jarod Wilson 2007-09-19 20:16:28 UTC

Heh, my prior knowledge of the ways in which Linux Networx uses nbd clouded the
issue as well... ;)

I've now got a rawhide box serving up a file-backed nbd, more or less like so:

# dd if=/dev/zero of=/root/nbd-file bs=1k count=1M
# losetup -f /root/nbd-file
# nbd-server 12345 /dev/loop0

Then I've got a rhel5 box running the client. Thus far, it's been able to connect
to the server, format /dev/nb0 and sync, all without a problem. This was all
done with nbd 2.9.7, built with debugging enabled (and all the debug output
looks sane) and the previously mentioned ~1GB file-backed storage.

I'll give it a go with 2.8.8 and see what transpires, but I'd like to hear more
details on the specific setups that are having problems, such as:

1) nbd server version

2) nbd client version

3) backing store type and size on server

Comment 14 Jarod Wilson 2007-09-19 21:51:01 UTC

Okay, been poking more... For some reason, nbd 2.8.8 doesn't build on the
rawhide box (complains about nbd.h, despite it being there and working with
2.9.7), and mixing 2.8.8 on the rhel5 box and 2.9.7 on the rawhide box leads to
a protocol error.

However, using 2.9.7 on both sides, I've now got things locking up. Just needed
to start pushing more actual data across. I'll keep poking at this particular
setup and see what I can figure out...

Comment 15 Eric Sandeen 2007-09-19 22:05:55 UTC

Just for fun, can you try reverting:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ea817398e68dfa25612229fda7fc74580cf915fb

and see if the problem persists?

Thanks,

-Eric

Comment 16 Jarod Wilson 2007-09-19 22:15:57 UTC

For the benefit of those playing along at home, I pinged Eric after taking a
look at alt-sysrq-t output after inducing a lock-up, and seeing that nbd-client
was actually *not* deadlocked; it's kjournald where the actual deadlock appears
to be occurring. I'll revert the above changeset in the morning and see what I
can see then, as well as doing some testing with ext2 and possibly xfs to help
verify the source of the issue.

One more possibly relevant reference Eric dropped on me:

http://thread.gmane.org/gmane.linux.file-systems/7374

Comment 17 Eric Sandeen 2007-09-19 22:21:31 UTC

Actually, the patch I referred to may not do it, but the patch in the previous
comment is probably more applicable...

See also http://thread.gmane.org/gmane.linux.file-systems/7374

"The problem we're trying to solve here is how do implement network block
devices (nbd, iscsi) efficiently.  The zero copy codepath in the networking
layer does need to grab additional references to pages.  So to use sendpage
we need a refcountable page.  pages used by the slab allocator are not
normally refcounted so try to do get_page/put_page on them will break."

I'd still try reverting the git commit above first, though, and then I'll study
up and make sure I know what's going on here.  :)

Comment 18 Mike Snitzer 2007-09-20 01:48:30 UTC

In response to comment#13 

I'm pretty sure Linux Networx used SteelEye's MD+NBD based product.  I worked
for LNXI but heard from an ex-co-worker that they embraced the SteelEye stuff
just after I left.

Also, using small loop devices isn't the best test of NBD; larger block devices
cause mke2fs to allocate many more dirty pages and increase memory pressure
considerably.  So mke2fs on something like a 500GB or 750GB disk really is an
entirely different experience.  Not sure if you have multiple disks like that
kicking around.


For those inclined to really ask for deadlocks, try setting up the MD+NBD config
from the linux-mm post referenced in comment#4... I know; I know... we need
RHEL5's nbd to crawl before it can walk and/or run!

But the same deadlock may be easily reproducible without MD in the picture. 
Import largediskA on serverB and largediskB on serverA; simultaneously run
mke2fs on serverA and serverB's respective /dev/nbdX and see what happens.

I'm trying hacks such as exposing PF_LESS_THROTTLE and PF_MEMALLOC to userspace
(via prctl) and having the nbd-server set those flags in the child nbd-server
process.  I should have results on if that'll make a difference within the next
2 days.

Comment 19 Eric Sandeen 2007-09-20 03:12:58 UTC

Created attachment 200431 [details]
patch to use page allocations for anything sent to block layer

Jarod, forget about reverting that patch I pointed to, and give this one a
shot, please.  This matches what's currently pending for upstream.

Thanks,

-Eric

Comment 20 Eric Sandeen 2007-09-20 03:26:12 UTC

Hm, though, Jarod... ok, maybe I'd better talk to you offline to get up to
speed.  :)  I heard "kjournald deadlock nbd" and made some assumptions, but now
looking at the first comment, if just mkfs to the block device causes the
lockup, how does kjournald come into it....  sorry if I jumped the gun.  Let's
talk tomorrow am.

Comment 21 Jarod Wilson 2007-09-20 05:53:34 UTC

(In reply to comment #18)
> I'm pretty sure Linux Networx used SteelEye's MD+NBD based product.  I worked
> for LNXI but heard from an ex-co-worker that they embraced the SteelEye stuff
> just after I left.

I left LNXI before you did. :) At the time, Nate's hacky stuff for failover between cluster head nodes was 
still more prominent than SteelEye's stuff, at least where I was -- onsite at Boeing in Bellevue, WA. I think 
the last cluster I helped bring up there just before I left was one of the earliest SteelEye deployments.

Comment 22 Jarod Wilson 2007-09-20 06:02:02 UTC

(In reply to comment #20)
> Hm, though, Jarod... ok, maybe I'd better talk to you offline to get up to
> speed.  :)  I heard "kjournald deadlock nbd" and made some assumptions, but now
> looking at the first comment, if just mkfs to the block device causes the
> lockup, how does kjournald come into it....

In my test setup, mkfs didn't deadlock, it wasn't until I started firing decent chunks of data at nbd that 
it locked, but then it was only a 1GB test partition, maybe mkfs would lock on a larger partition... In the 
OP's case, it sounded like mkfs was successful, but a subsequent sync call failed. At least, that was 
what I was thinking... Iain, can you confirm exactly where things lock up in your case?

> sorry if I jumped the gun.  Let's talk tomorrow am.

No worries, I shouldn't have prodded ya right at the end of the day. ;) Tomorrow morning it is.

Comment 23 Iain Wade 2007-09-20 06:36:43 UTC

I've had both situations; mke2fs hanging near the end (last output line is Writing inode tables: done) as 
well as everything seeming ok, doing a small amount of IO, typing sync to flush and then sync blocking. 
The device is 400GB in size.

In either case it's very reproducible and quick to happen.

I have to admit my environment is not like the one you are, or would be, testing.

I wrote a custom NBD server which translates NBD block IO requests received over a unix domain 
socket into network packets against a Netgear SC101 device. I had to reverse engineer the packet 
format from tcpdump captures.

http://code.google.com/p/sc101-nbd/

I have personally run my code successfully (creating, initialising and using a 4x400GB software RAID5) 
on RHEL4 (2.6.9xxx) and Ubuntu 7.04 (2.6.20xxx). Users have had success on a number of other 
distros/kernels.

If you can recreate the hang in a vanilla environment, that's even better.

If you'd like access to the machine that's no problem. If there is anything you'd like collected I would be 
happy to do so.

Here is my alt+sysrq+t output: http://members.optusnet.com.au/iwade/alt-sysrq-t.txt

Comment 24 Jarod Wilson 2007-09-20 14:46:36 UTC

Somehow, I missed the deadlocked pdflush processes in my sysrq-t output, but *that* is actually what 
seems to be the common thread between Iain's output, mine and also what Eric is seeing doing nothing 
more than dd'ing /dev/zero to an nbd. I'm going to collect a few more data points with assorted kernel 
versions, then I think it's time to start in on a git bisection.

Iain, btw, very neat stuff there w/the Netgear!

Comment 25 Jarod Wilson 2007-09-20 15:16:23 UTC

Also seeing blk_congestion_wait popping up in the trace, which is what Daniel Phillips says is biting 
Mike...

http://lkml.org/lkml/2007/9/18/12

Dunno yet whether or not this patch is potentially relevant (or if it's already in our 2.6.18), but don't 
want to lose track of it:

http://www.linux-nfs.org/Linux-2.6.x/2.6.18-rc4.2/linux-2.6.18-064-add_fixes_for_the_congestion_wait_crap.dif

(gotta go work on a few other things for a spell).

Comment 26 Jarod Wilson 2007-09-20 15:29:00 UTC

The patch from Trond does indeed look to be highly relevant, and is not in our 2.6.18 tree.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=275a082fe9308e710324e26ccb5363c53d8fd45f

Will tackle that after lunch...

Comment 27 Mike Snitzer 2007-09-20 19:48:23 UTC

It doesn't appear as though _any_ released kernel has that patch from Trond.

Comment 28 Mike Snitzer 2007-09-20 19:57:39 UTC

Andrew Morton later obsoleted Trond's changes:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3fcfab16c5b86eaa3db3a9a31adba550c5b67141

To be clear, I still get the blk_congestion_wait() hang with 2.6.22.6

Comment 29 Jarod Wilson 2007-09-20 21:27:19 UTC

Yeah, noticed it was later obsoleted. Looked so promising at first. :)

I've done a bit of further poking, using a simple "dd if=/dev/zero of=/dev/nbd0"
test:

- All 2.6.18 renditions thus far have failed, deadlocking almost immediately
when data starts being written

- Ditto for a 2.6.20 kernel build (2.6.20-1.2962.fc6)

- A 2.6.22.6 kernel (2.6.22.6-55.fc6) made it through the entire dd (1GB worth
of data), subsequent sync call, mke2fs -j, mounting the partition and copying a
kernel git tree onto it. Doing a kernel build in that git tree on the nbd now.
So far, so good... Oops, ran out of space on the device, time to create a bigger
nbd backing store...

I'm thinking a git-bisect between 2.6.22 and 2.6.20 is probably in order.

Comment 30 Jarod Wilson 2007-09-21 19:29:52 UTC

Nb: still working on the git-bisect, about 7 bisections and test cycles in.
Likely to be Monday before isolating the offending changeset and attempting
back-port to 2.6.18.

Then once we get the basic functionality actually working, I'd suggest a new bug
to track the additional deadlock Mike is seeing under higher traffic loads.

Comment 31 Mike Snitzer 2007-09-21 20:57:30 UTC

responding to comment#30: sounds like a plan

btw, I'm also seeing some extremely disturbing behavior where NBD appears to be
buffering writes (that should be synchronous via MD raid1). PeterZ addressed
this in a way that Jens Axboe didn't approve of, but a "correct" fix hasn't ever
been developed AFAIK. Anyway, here are some useful links:
http://lkml.org/lkml/2006/9/12/204
http://lkml.org/lkml/2007/4/29/283

I even tried to engage peterz and paul clements (nbd maintainer) on this issue
but never got a response:
http://lkml.org/lkml/2007/6/26/212

Comment 32 Jarod Wilson 2007-09-25 02:07:31 UTC

Just about done with a second bisection run; I think I screwed up somewhere along the line the first time[*]. 
Should have something definitive by lunch tomorrow...

[*]: git-bisect doesn't like it when HEAD is good, and an earlier version is bad, so you have to reverse the 
meaning of good and bad to make it actually do anything, and I think I probably accidentally didn't reverse 
at one of the bisection points the first time through. Ugh.
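
The label inversion described in the footnote is easy to get wrong, so it can help to let a small wrapper do the translation. This is only a sketch: bisect_label is a hypothetical helper, and the two outcome names are assumptions chosen for illustration. Because the search is for the commit that fixed nbd, a kernel that works already contains the fix and has to be marked bad, while a kernel that still deadlocks is marked good.

```shell
# Hypothetical helper: translate a test outcome into the label that
# git-bisect expects when hunting for the commit that FIXED a bug.
# Usage: bisect_label works|deadlocks
bisect_label() {
    case "$1" in
        works)     echo bad ;;   # fix is present: "bad" side of the search
        deadlocks) echo good ;;  # fix is absent: "good" side of the search
        *)         echo "unknown outcome: $1" >&2; return 2 ;;
    esac
}

# After booting and testing the kernel built at the current bisection point:
#   git bisect "$(bisect_label deadlocks)"
```

Later versions of git can avoid the mental flip with custom terms, for example "git bisect start --term-old=broken --term-new=fixed"; that option was added long after this comment was written.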

Comment 33 Jarod Wilson 2007-09-25 05:41:51 UTC

Okay, git-bisect claims it's this changeset:

--
commit 498d3aa2b4f791059acd8c942ee8fa15c2ce36c2
Author: Jens Axboe <jens.axboe>
Date:   Thu Apr 26 12:54:48 2007 +0200

    [PATCH] cfq-iosched: style cleanups and comments
    
    Signed-off-by: Jens Axboe <jens.axboe>

:040000 040000 098f17624a0007b506367b4df9fc42580dcefe8a 8bf593a2d6aab9f0f8d1603e9d29656f56b1fd0c M      block
--

However, it may actually be the commit just after that, not entirely sure yet, because of the reversed 
logic and it getting late... For reference, the next commit:

cfq-iosched: slice offset should take ioprio into account
Jens Axboe [Fri, 20 Apr 2007 12:18:00 +0000 (14:18 +0200)]
Use the max_slice-cur_slice as the multipler for the insertion offset.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=67e6b49e39e9b9bf5ce1351ef21dad391856183f

More fun come morning...

Comment 34 Jarod Wilson 2007-09-25 16:02:20 UTC

I musta screwed up somewhere in the bisection again, manually reverting to the
version just before the commit git-bisect identified still gives me a working
nbd, which makes sense after looking at that changeset, which contains zero
functional changes. More after lunch...

Comment 35 Jarod Wilson 2007-09-25 19:29:52 UTC

Okay, figured out what went wrong in the last bisection, and manually stepped
through a few things to double-check. The commit that finally fixed things is
this one:

--
cfq-iosched: development update

Jens Axboe [Wed, 25 Apr 2007 10:44:27 +0000 (12:44 +0200)]

- Implement logic for detecting cooperating processes, so we
choose the best available queue whenever possible.

- Improve residual slice time accounting.

- Remove dead code: we no longer see async requests coming in on
sync queues. That part was removed a long time ago. That means
that we can also remove the difference between cfq_cfqq_sync()
and cfq_cfqq_class_sync(), they are now indentical. And we can
kill the on_dispatch array, just make it a counter.

- Allow a process to go into the current list, if it hasn't been
serviced in this scheduler tick yet.

Possible future improvements including caching the cfqq lookup
in cfq_close_cooperator(), so we don't have to look it up twice.
cfq_get_best_queue() should just use that last decision instead
of doing it again.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6d048f5310aa2dda2b5acd947eab3598c25e269f
--

However, the very next commit cleans up a few things in this commit, and the two
after that also look worth taking if we're going to take this much. Not to
mention I've not yet determined how difficult this will be to wedge into the
rhel5 kernel.

Meanwhile, in my own testing, PeterZ's patch here...

http://lkml.org/lkml/2006/7/7/164

...applies quite easily to the rhel5 kernel and also appears to remedy the problem.

Comment 36 Jarod Wilson 2007-09-25 19:39:48 UTC

Also, given that what PeterZ's patch largely does is simply force nbd to use the
noop scheduler, it should be possible to simply:

echo noop > /sys/block/nbd0/queue/scheduler

and/or boot with elevator=noop, and have working nbd with the current kernel.
Will try that myself shortly. Other schedulers may work as well...
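
For anyone trying this workaround, it is worth confirming which scheduler is actually active before and after the switch. The sysfs file lists the available schedulers with the active one in square brackets (for example "noop anticipatory deadline [cfq]"). The sketch below uses a hypothetical helper, active_scheduler, to extract it; the nbd0 device name is taken from the command above.

```shell
# Hypothetical helper: print the active I/O scheduler from a sysfs
# "scheduler" file, where the active entry is shown in [brackets].
# Usage: active_scheduler /sys/block/nbd0/queue/scheduler
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Example (as root):
#   active_scheduler /sys/block/nbd0/queue/scheduler    # e.g. prints "cfq"
#   echo noop > /sys/block/nbd0/queue/scheduler
#   active_scheduler /sys/block/nbd0/queue/scheduler    # should now print "noop"
```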

Comment 37 Mike Snitzer 2007-09-25 20:18:42 UTC

So this issue is two-fold:

1) RedHat's default IO scheduler is cfq; hence the fact that reverting/fixing
the area affected by Jens's commit eliminates the problem.

2) At the heart of it, nbd devices shouldn't have a scheduler at all.  I've been
using the very peterz patch you referenced.  It does two things: 1) pins nbd
devices to the noop scheduler 2) reduces the nbd devices' associated queue to
only have a single scatter gather segment and only accommodate a single page (in-flight
AFAIK).  The noop pinning is good; the excessive throttling of nbd devices is
quite bad!

I actually deployed a custom kernel that pins nbd to noop, but I had to leave out the
following because the system would be nearly unusable (almost deadlock!, the
very thing the throttle was to avoid) when doing mke2fs on an MD with a single
nbd member.  So yeah I left off these blk_queue_* calls:

+		blk_queue_max_segment_size(disk->queue, PAGE_SIZE);
+		blk_queue_max_hw_segments(disk->queue, 1);
+		blk_queue_max_phys_segments(disk->queue, 1);

Comment 38 Iain Wade 2007-09-26 02:30:06 UTC

I forced the scheduler to noop and mke2fs completed (a very rudimentary test) so I tried to build a RAID5 
array across a bunch of NBD devices. It stalled part way through the resync.

If I try to directly access any of the md member disks once the resync is stalled, the reading process hangs 
in D state and no NBD requests are generated.
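
Tasks stuck like this can be picked out of the process list before resorting to a full sysrq-t dump. A minimal sketch, assuming a procps-style ps that accepts "-eo pid,stat,wchan,comm"; the filter name d_state_only is hypothetical and simply keeps rows whose state field begins with D (uninterruptible sleep).

```shell
# Hypothetical filter: read "ps -eo pid,stat,wchan,comm" output on stdin
# and keep only tasks in uninterruptible sleep (state starts with "D").
# The first line is assumed to be the ps header and is dropped.
d_state_only() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Example:
#   ps -eo pid,stat,wchan,comm | d_state_only
```

The wchan column gives a rough idea of where each task is waiting, which can then be compared against the sysrq-t traces.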

Noop by itself does not solve all the problems I'm seeing.

http://members.optusnet.com.au/iwade/alt-sysrq-t.noop.txt

Comment 39 David Miller 2007-09-26 05:20:40 UTC

I would say I disagree with the assertion that nbd shouldn't have a block
I/O scheduler at all.  Just like any other block device, nbd will benefit
from the accumulation of block I/O chunks in the same range so that they
can be sent out as one linear I/O request.

If you pin nbd to no-op and this makes the problem go away, we'll never
fix whatever the real issue is and that bug might have other implications
that will just show up elsewhere.

Comment 40 Jarod Wilson 2007-09-26 22:23:12 UTC

Fixing cfq definitely seems like the right thing to do here, and I've got a
working back-port, but I'm trying to trim down the diff a bit, as there are
multiple changes in the single changeset, and I'd like to keep the changes to
the necessary minimum.

Iain, curious to know, does your raid5 scenario in comment #38 deadlock under
2.6.22 as well?

Comment 41 Mike Snitzer 2007-09-28 19:13:00 UTC

in response to Dave's comment#39: yes I agree it would be best to reduce the
amount of overhead related to NBD IO requests.  I lost sight of that aspect and
was focused on the fact that the nbd-servers' block device would have its own IO
scheduler to optimize the writeout.

PeterZ stated that _any_ IO scheduler would cause NBD to deadlock.  I myself
haven't seen noop _really_ help NBD; I was merely following supposed best practice.

Comment 42 Jarod Wilson 2007-10-01 14:31:01 UTC

I think I have finally isolated the bits within the upstream changeset that
actually remedy this particular problem... I'm patching that atop a RHEL5 kernel
build for testing right now, and assuming all goes well, will make the kernel
available for others to try out as well.

Comment 43 Jarod Wilson 2007-10-01 20:36:36 UTC

Test kernels for i686 and x86_64 available here:

http://people.redhat.com/jwilson/test_kernels/2.6.18-52.el5.xi12/

Please test and report back with results. The basic 'dd if=/dev/zero
of=/dev/nbd0' test passes with flying colors for me.

Basically, so far as I can tell, we're improperly checking the return value of
cfq_arm_slice_timer() to determine whether or not to schedule block I/O -- it's
possible we still need to schedule block I/O no matter what that return value
is, I just don't completely understand everything that is going on in
cfq_arm_slice_timer() yet. The problem is remedied by adjusting the check for
whether or not to call cfq_schedule_dispatch() to look at cfqd->rq_in_driver
instead -- dispatch when the driver has no requests outstanding (rq_in_driver is zero).


--- linux-2.6/block/cfq-iosched.c     2007-10-01 15:42:03.000000000 -0400
+++ linux-2.6.cfq/block/cfq-iosched.c 2007-10-01 15:42:33.000000000 -0400
@@ -1866,11 +1866,12 @@ static void cfq_completed_request(reques
        if (cfqd->active_queue == cfqq) {
                if (time_after(now, cfqq->slice_end))
                        cfq_slice_expired(cfqd, 0);
-               else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list)) {
-                       if (!cfq_arm_slice_timer(cfqd, cfqq))
-                               cfq_schedule_dispatch(cfqd);
-               }
+               else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
+                       cfq_arm_slice_timer(cfqd, cfqq);
        }
+
+       if (!cfqd->rq_in_driver)
+               cfq_schedule_dispatch(cfqd);
 }
 
 static struct request *
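
To see what the change does to the dispatch decision, the conditions in the diff can be written out as a truth table. This is a simplified sketch and not kernel code: it models only the branches visible in the hunk above, takes each condition as a 0/1 argument, and ignores whatever cfq_slice_expired() does internally. The function names are made up for illustration.

```shell
# old_dispatches ACTIVE EXPIRED SYNC EMPTY ARMED
# Succeeds when the pre-patch code would call cfq_schedule_dispatch():
# only for the active queue, with an unexpired slice, with sync set and
# an empty sort list, when cfq_arm_slice_timer() returned 0 (ARMED=0).
old_dispatches() {
    [ "$1" -eq 1 ] && [ "$2" -eq 0 ] && [ "$3" -eq 1 ] &&
        [ "$4" -eq 1 ] && [ "$5" -eq 0 ]
}

# new_dispatches RQ_IN_DRIVER
# Succeeds when the patched code would call cfq_schedule_dispatch():
# whenever rq_in_driver is zero, regardless of which queue the
# completed request belonged to.
new_dispatches() {
    [ "$1" -eq 0 ]
}

# A completion with sync unset and nothing left in the driver:
if ! old_dispatches 1 0 0 1 0 && new_dispatches 0; then
    echo "only the patched code schedules a dispatch here"
fi
```

Under this model the patched check is strictly broader than the old one whenever the driver is idle, which fits the observation that the deadlock goes away.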

Comment 45 Iain Wade 2007-10-05 03:33:21 UTC

Sorry for not getting back to you sooner. I have pushed a couple of TB through NBD on the new kernel 
and it has not had any problems so far.

* created a RAID5 across 4x400GB disks, resync
* created a RAID1 across 2x1TB disks, resync
* created filesystems on each
* copied a hundred GB around
* sync many times
* checksum the data OK

I used the noop IO scheduler at first but then switched back to cfq as that seems to be where the 
change was made. Both have been perfect.

Looks like a winner. Thanks.

Comment 46 RHEL Program Management 2007-11-20 05:06:07 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 48 Don Zickus 2007-11-29 17:05:36 UTC

in 2.6.18-58.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 50 Mike Snitzer 2008-05-06 15:00:09 UTC

Given comment#48, and the fact that Don's latest kernel is now 2.6.18-92.el5,
what is the status of this CFQ fix?

Has it made it into a kernel that RedHat is (or will be) releasing?

Comment 51 Jarod Wilson 2008-05-06 15:10:32 UTC

Yes, it'll be in the 5.2 GA kernel.

Comment 52 Mike Snitzer 2008-05-06 15:47:12 UTC

Awesome, thanks Jarod.

Comment 54 errata-xmlrpc 2008-05-21 14:43:21 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html

Comment 55 Erwan Velu 2008-07-02 09:51:45 UTC

Prior to that kernel update, I wasn't able to do any bonnie++ on my nbd server.
Since that update, I'm able to complete 3 bonnies, then it deadlocks again.

Are any of you able to stress this nbd driver without deadlocking?

Comment 56 Jarod Wilson 2008-07-02 13:44:41 UTC

Erwan, are you able to unlock nbd by switching schedulers? (see comment #36).
It's entirely possible there's another issue somewhere in the cfq code we need to
address, but it's also possible you're hitting something else, such as the memory
pressure deadlock, referenced in comment #3...

Really though, since this bug has been closed and the original problem reported
here is indeed confirmed as fixed, we should get another bug opened for this.

Comment 57 Jeff Moyer 2009-04-28 13:17:26 UTC

Sorry to dig up an old thread, but this fix does not look valid to me.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 99e492a..a8ac140 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1866,11 +1866,12 @@ static void cfq_completed_request(request_queue_t *q, struct request *rq)
        if (cfqd->active_queue == cfqq) {
                if (time_after(now, cfqq->slice_end))
                        cfq_slice_expired(cfqd, 0);
-               else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list)) {
-                       if (!cfq_arm_slice_timer(cfqd, cfqq))
-                               cfq_schedule_dispatch(cfqd);
-               }
+               else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
+                       cfq_arm_slice_timer(cfqd, cfqq);
        }
+
+       if (!cfqd->rq_in_driver)
+               cfq_schedule_dispatch(cfqd);

If cfq_arm_slice_timer returns 0, it means it did *not* arm the timer.  In this case, schedule a queue dispatch immediately.  If it did arm the timer, the timer should fire after the idle slice timeout has expired (8ms by default).  In no case should the I/O remain blocked.

It looks to me as though you've papered over a problem.