Bug 703150 - multiple resource leaks on error paths in blkfront and netfront
Summary: multiple resource leaks on error paths in blkfront and netfront
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.7
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Laszlo Ersek
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 697927
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-05-09 13:06 UTC by Laszlo Ersek
Modified: 2012-02-21 03:47 UTC (History)
9 users (show)

Fixed In Version: kernel-2.6.18-283.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-02-21 03:47:41 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
make sure PV-on-HVM dynamic IRQ allocation always succeeds (2.08 KB, patch)
2011-05-23 18:22 UTC, Laszlo Ersek
no flags Details | Diff
make *low-level* PV IRQ binding always successful in a domU (1.38 KB, patch)
2011-05-23 18:29 UTC, Laszlo Ersek
no flags Details | Diff
plug remaining leaks in netfront::setup_device() (690 bytes, patch)
2011-05-23 18:33 UTC, Laszlo Ersek
no flags Details | Diff
make sure PV-on-HVM dynamic IRQ allocation always succeeds (v2) (1.88 KB, patch)
2011-05-24 07:39 UTC, Laszlo Ersek
no flags Details | Diff
plug remaining leaks in netfront::setup_device() (v2) (1.69 KB, patch)
2011-05-24 10:29 UTC, Laszlo Ersek
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2012:0150 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux 5.8 kernel update 2012-02-21 07:35:24 UTC

Description Laszlo Ersek 2011-05-09 13:06:13 UTC
blkfront: see setup_blkring()
netfront: see setup_device()

bug 697927 comment 18
bug 697927 comment 24, bug 697927 comment 28

Comment 3 Laszlo Ersek 2011-05-23 16:38:07 UTC
Blkfront doesn't leak the shared ring page, nor the grant reference. What I could not have imagined is this: successful granting means ownership transfer of the page. That is, once the grant reference is stored in blkfront_info, the page is never released manually afterwards. gnttab_end_foreign_access(), which is called by blkif_free(), invokes free_page().

This is symmetrical to how successful IRQ allocation/binding is meant as an ownership transfer of the event channel.

The event channel can still be leaked in setup_blkring(), if and only if the underlying dynirq fails. I intend to backport the "panic on out-of-dynirq" from RHEL-6. We already increased the number of dynirqs under PV (see bug 650838 comment 23), and I intend to repeat both with PV-on-HVM too (ie. the increase and the panic). This should plug the evtchn leak in blkfront.

Re netfront, the same ownership transfer holds for rx/tx ring pages and grant references. However, unlike setup_blkring() calls blkif_free(), setup_device() does not call netif_disconnect_backend() on the error path. So netfront can leak pages, grant references and the event channel.

Comment 4 Laszlo Ersek 2011-05-23 18:15:09 UTC
The series I'm about to attach must be applied on top of the fix for bug 697927:
http://post-office.corp.redhat.com/archives/rhkernel-list/2011-May/msg00290.html

Compile-tested (both PV and PV-on-HVM).

Comment 5 Laszlo Ersek 2011-05-23 18:22:33 UTC
Created attachment 500481 [details]
make sure PV-on-HVM dynamic IRQ allocation always succeeds

Increase number of PV-on-HVM dynirqs to NR_EVENT_CHANNELS. Panic in alloc_xen_irq() if out-of-irqs. Adapt dependent evtchn functions.

This patch effects under PV-on-HVM that netfront and blkfront will never fail to bind the already allocated event channel to a dynamic IRQ. Since that's the last step in setup_blkring() / setup_device(), the setup will succeed and the event channel won't be leaked. The panic should never happen due to the increased number of dynamic IRQs.

I'm not sure if we want to mess with these parts.

Comment 6 Laszlo Ersek 2011-05-23 18:29:56 UTC
Created attachment 500482 [details]
make *low-level* PV IRQ binding always successful in a domU

find_unbound_irq() panics in a PV domU if it runs out of dynamic IRQs. This is (functionally) a backport from RHEL-6, and should not be dangerous since bug 650838 is fixed now.

Now the dependent functions:
- bind_evtchn_to_irq()
- bind_virq_to_irq()
- bind_ipi_to_irq()
can only succeed in a PV domU. Therefore the PV bind_XXX_to_irqhandler functions can't leak the event channel, with the binding already in place -- the unbinding releases the event channel as well.

blkfront and netfront can see bind_evtchn_to_irqhandler() fail, but in that case the event channel won't be leaked anymore.

Comment 7 Laszlo Ersek 2011-05-23 18:33:57 UTC
Created attachment 500484 [details]
plug remaining leaks in netfront::setup_device()

At this point blkfront is plugged both under PV and PV-on-HVM. However, netfront can still be manipulated into leaking ring pages and grant references. This patch intends to plug those.

Zeroing out info->evtchn, info->rx.sring and info->tx.sring wouldn't be strictly necessary, I only do so for good measure -- dangling references are generally dangerous.

Comment 8 Laszlo Ersek 2011-05-23 18:36:48 UTC
(In reply to comment #7)
> Created attachment 500484 [details]
> plug remaining leaks in netfront::setup_device()

This should be submitted to upstream as well. They don't leak event channels at all, but their setup_device() seems also susceptible to leaking pages/refs on the error path.

Comment 9 Paolo Bonzini 2011-05-24 06:52:34 UTC
I agree with everything except bumping the number of PV-on-HVM "irqs".  It wastes some memory and I think is not a real issue (256 devices attached to a single domain?).  It is only an issue for dom0, which is what prompted the change in bug 650838.

Comment 10 Laszlo Ersek 2011-05-24 07:39:46 UTC
Created attachment 500556 [details]
make sure PV-on-HVM dynamic IRQ allocation always succeeds (v2)

(In reply to comment #9)
> I agree with everything except bumping the number of PV-on-HVM "irqs".  It
> wastes some memory and I think is not a real issue (256 devices attached to a
> single domain?).

Resetting size of irq_evtchn[] to 256.

Comment 11 Laszlo Ersek 2011-05-24 10:29:52 UTC
Created attachment 500568 [details]
plug remaining leaks in netfront::setup_device() (v2)

Reuse existing code for cleanup. Also match patch sent to upstream:
http://lists.xensource.com/archives/html/xen-devel/2011-05/msg01539.html

Comment 12 Laszlo Ersek 2011-05-25 14:12:11 UTC
QE, you won't be able to reproduce the problem that this BZ represents. Please do your normal
- boot,
- suspend/resume,
- block and network attach-detach
- anything else
tests in order to verify I didn't introduce a regression. Thanks!

Comment 14 Laszlo Ersek 2011-05-27 12:44:08 UTC
Upstream commit(In reply to comment #11)
> Created attachment 500568 [details]
> plug remaining leaks in netfront::setup_device() (v2)
> 
> Reuse existing code for cleanup. Also match patch sent to upstream:
> http://lists.xensource.com/archives/html/xen-devel/2011-05/msg01539.html

Upstream c/s 1087:
http://xenbits.xen.org/hg/linux-2.6.18-xen.hg/rev/e7c536b81b6a

Comment 16 RHEL Program Management 2011-08-17 13:30:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 Jarod Wilson 2011-08-30 19:22:40 UTC
Patch(es) available in kernel-2.6.18-283.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 20 Yuyu Zhou 2012-01-17 02:59:28 UTC
Verified the bug with RHEL5.8 snapshot4(kernel-xen-2.6.18-304.el5 & xen-3.0.3-135.el5)

Since there is no reproducer available, we tested the build with acceptance
test, including boot, reboot, suspend/resume, block and network attach-detach, save/restore, migration.

Close as Verified:SanityOnly.

Comment 21 Yuyu Zhou 2012-01-18 09:11:33 UTC
Change status to Verified:SanityOnly.

Comment 23 errata-xmlrpc 2012-02-21 03:47:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html


Note You need to log in before you can comment on or make changes to this bug.