Bug 182970

Summary: slab leaking (bio and biovec-1)
Product: [Fedora] Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Status: CLOSED RAWHIDE
Severity: medium
Priority: medium
Reporter: Charles Lopes <tjarls>
Assignee: Dave Jones <davej>
QA Contact: Brian Brock <bbrock>
CC: mbrancaleoni, oliva, pfrields, wtogami
Doc Type: Bug Fix
Last Closed: 2006-03-02 04:12:55 UTC
Attachments:
Description               Flags
list of loaded modules    none
/proc/meminfo             none
/proc/slabinfo            none

Description Charles Lopes 2006-02-24 20:12:20 UTC
Description of problem:
After a few days, slab allocation has reached 750MB and is still increasing.
Most of it is in the bio and biovec-1 caches.

Version-Release number of selected component (if applicable):
At least from kernel-2.6.14-1.1773_FC5 up to kernel-2.6.15-1.1955_FC5.

Additional info:
My system is on logical volumes on top of MD RAID1 arrays, including swap.
I'll attach the list of loaded modules and a dump of both meminfo and
slabinfo.
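
For anyone who wants to track the growth of these two caches over time, here is a minimal sketch in C. It assumes the slabinfo 2.x line layout (name, active_objs, num_objs, objsize, ...). The function name slab_cache_bytes is made up for illustration, and the result is only an estimate because it ignores per-slab overhead.

```c
#include <stdio.h>
#include <string.h>

/*
 * Estimate the bytes held by one slab cache from a single line of
 * /proc/slabinfo. Assumes the 2.x layout:
 *   name <active_objs> <num_objs> <objsize> ...
 * Returns num_objs * objsize, or -1 if the line does not parse or
 * describes a different cache. slab_cache_bytes is a hypothetical
 * helper, not an existing tool or kernel interface.
 */
long long slab_cache_bytes(const char *line, const char *cache_name)
{
    char name[64];
    unsigned long active_objs, num_objs, objsize;

    if (sscanf(line, "%63s %lu %lu %lu",
               name, &active_objs, &num_objs, &objsize) != 4)
        return -1;              /* header or malformed line */
    if (strcmp(name, cache_name) != 0)
        return -1;              /* some other cache */
    return (long long)num_objs * (long long)objsize;
}
```

Calling this for "bio" and "biovec-1" on every line read from /proc/slabinfo, at regular intervals, should show whether the two caches keep growing.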

Comment 1 Charles Lopes 2006-02-24 20:13:30 UTC
Created attachment 125207 [details]
list of loaded modules

Comment 2 Charles Lopes 2006-02-24 20:14:20 UTC
Created attachment 125208 [details]
/proc/meminfo

Comment 3 Charles Lopes 2006-02-24 20:15:11 UTC
Created attachment 125209 [details]
/proc/slabinfo

Comment 4 Dave Jones 2006-02-27 07:21:53 UTC
*** Bug 183017 has been marked as a duplicate of this bug. ***

Comment 5 Matteo Brancaleoni 2006-02-27 20:39:08 UTC
FYI, to do some testing, I upgraded FC5-test to vanilla kernel 2.6.16-rc4 and the
problem remains. I then downgraded to vanilla kernel 2.6.14.7 and the problem
disappeared.

So perhaps it is a problem with bio in the mainline kernel?

Comment 6 Matteo Brancaleoni 2006-02-28 14:36:03 UTC
Of course, 2.6.14.7 doesn't work well on FC5 (mainly because of the newer udev).
I did some tests on 2.6.15-rcX kernels and found that everything is OK up to
2.6.15-rc5; things break in 2.6.15-rc6.
Hope this is useful.

Comment 7 Matteo Brancaleoni 2006-02-28 16:15:29 UTC
This patch:

commit 3795bb0fc52fe2af2749f3ad2185cb9c90871ef8
Author: NeilBrown <neilb>
Date:   Mon Dec 12 02:39:16 2005 -0800

    [PATCH] md: fix a use-after-free bug in raid1
    
    Who would submit code with a FIXME like that in it !!!!
    
    Signed-off-by: Neil Brown <neilb>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

causes the problem. I think this should now go to the kernel developers.
The patch correctly releases the bio later in the function (before, it was freed and
then used), but it adds an if (blah == NULL) guard around the bio_put ...
I'm by no means a kernel expert, but why is the if needed?



Comment 8 Charles Lopes 2006-02-28 17:01:49 UTC
The bio_put after the if statement doesn't look bad. You want to do it in the
cases where the statement "r1_bio->bios[mirror] = NULL;" was executed previously. I
believe the problem may be due to an exit point prior to the bio_put.

Here's the piece of code:

                if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
                        reschedule_retry(r1_bio);
                        /* Don't dec_pending yet, we want to hold
                         * the reference over the retry
                         */
                        return 0;
                }

A bio_put (with the test?) may be needed just before return. I'll give it a try
tonight.


Comment 9 Matteo Brancaleoni 2006-02-28 22:22:16 UTC
Hmm, so why wasn't that done previously?

I mean, previously bio_put was called and the bio was then used again,
so they simply moved bio_put after the last use of the bio (the use-after-free problem).

But why did they add the if?
What if the value is != NULL? Then bio_put is never called... hence the slab leak.

I rebuilt the latest 2.6.15 kernel with just the if statement deleted, and the machine
doesn't seem to have a single problem right now.

But again, I'm not a kernel expert, so perhaps the if is right and some other
cases must be considered.
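
The concern in the last two comments can be illustrated with a toy reference-counting model. This is a sketch only, and it is not kernel code: struct toy_bio, toy_bio_put and leaked_after are invented names, and the model does not say which change is the right fix for raid1. It only shows that a put guarded by a condition leaves objects allocated whenever that condition is false and no other path drops the reference.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted object; NOT the kernel's struct bio. */
struct toy_bio {
    int refs;
};

static int live_objects;    /* plays the role of the slab's active object count */

static void toy_bio_init(struct toy_bio *b)
{
    b->refs = 1;
    live_objects++;
}

static void toy_bio_put(struct toy_bio *b)
{
    assert(b->refs > 0);
    if (--b->refs == 0)
        live_objects--;     /* the object goes back to the allocator */
}

/*
 * Simulate `requests` completions and return how many objects stay allocated.
 * slot_cleared: whether the slot pointer was set to NULL before completion.
 * conditional:  whether the put is guarded by "slot == NULL".
 */
int leaked_after(int requests, int slot_cleared, int conditional)
{
    int i;

    live_objects = 0;
    for (i = 0; i < requests; i++) {
        struct toy_bio bio;
        struct toy_bio *slot;

        toy_bio_init(&bio);
        slot = slot_cleared ? NULL : &bio;
        if (!conditional || slot == NULL)
            toy_bio_put(&bio);
        /* otherwise nothing drops the reference in this model */
    }
    return live_objects;
}
```

In this model, leaked_after(1000, 0, 1) leaves all 1000 objects allocated, while the guarded put with a cleared slot, or the unguarded put, leaves none.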

Comment 10 Dave Jones 2006-03-02 04:12:55 UTC
Should be fixed in tomorrow's rawhide. (Grab it early from
http://people.redhat.com/davej/kernels/Fedora/devel)

Comment 11 Dave Jones 2006-03-02 04:16:25 UTC
*** Bug 183555 has been marked as a duplicate of this bug. ***

Comment 12 Alexandre Oliva 2006-03-02 13:30:34 UTC
It is definitely fixed in kernel build 2008, but today's rawhide still has build 1996.