Bug 155389 - mismatch in kmem_cache_free: expected cache cbf5c980, got ce96be20
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Miller
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-04-19 21:04 UTC by Jay Fenlason
Modified: 2014-08-31 23:27 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-05-10 20:06:40 UTC
Type: ---
Embargoed:


Attachments
Information from serial console on oops. (20.38 KB, text/plain)
2005-04-19 21:04 UTC, Jay Fenlason
setsockopt IPV6_ADDRFORM first stab at doing proper slab move (2.55 KB, patch)
2005-04-24 17:39 UTC, acme
oops when running with the attached patch (3.19 KB, text/plain)
2005-04-26 13:36 UTC, Jay Fenlason
Introduce sk_prot_creator, so that we always remember where this sock come from (2.12 KB, patch)
2005-04-30 21:00 UTC, acme

Description Jay Fenlason 2005-04-19 21:04:59 UTC
Description of problem: 
fenlason-rhide oopses randomly, with the above error message. 
 
Version-Release number of selected component (if applicable): 
2.6.11-1.1238_FC4smp 
 
How reproducible: 
Every night during the nightly Amanda run 
 
Steps to Reproduce: 
1. Boot fenlason-rhide into an affected kernel
2. Wait
   
Actual results: 
oops 
 
Expected results: 
no oops 
 
Additional info:

Comment 1 Jay Fenlason 2005-04-19 21:04:59 UTC
Created attachment 113377 [details]
Information from serial console on oops.

Comment 2 Dave Jones 2005-04-20 01:40:33 UTC
Assigning to davem, as there's networking stack all over those traces, though
this could just be a symptom of a bigger problem.

Jay, I take it this box passes a memtest run OK?


Comment 3 Jay Fenlason 2005-04-20 14:28:19 UTC
I ran memtest for a couple of hours on it yesterday.  Also, downgrading to a
previous kernel (/me has to look up the number) makes the problem go away.
1238 is neither the only nor the first one that exhibits the problem; it's just
the one I happened to be running when it last crashed.  1240 also fails.

Comment 4 David Miller 2005-04-21 23:38:03 UTC
Yeah, it looks like TCP ipv6 sockets are getting freed up into
the ipv4 slab cache and then again into the (correct) ipv6
slab cache.  I suspect some changes recently by Arnaldo, I'll
ask him to look at this.


Comment 5 Jay Fenlason 2005-04-22 19:00:25 UTC
FWIW, kernel-smp-2.6.11-1.1226_FC4 is not affected, but all newer kernels I've 
tried, including kernel-smp-2.6.11-1.1231_FC4, kernel-smp-2.6.11-1.1236_FC4, 
kernel-smp-2.6.11-1.1240_FC4, kernel-smp-2.6.11-1.1251_FC4, and 
2.6.11-1.1258_FC4smp are. 

Comment 6 acme 2005-04-24 15:42:21 UTC
OK, previously each sock had a pointer to the slab it was allocated from;
now the slab pointer is obtained from sk->sk_prot->slab. So I'm investigating
cases where a sock is allocated from the TCPv6 slab and sk->sk_prot is later
changed to point to the TCP (v4) struct proto, which means a sock allocated
from the TCPv6 slab ends up being released to the TCPv4 slab. The
IPV6_ADDRFORM setsockopt (net/ipv6/ipv6_sockglue.c) is one such case.
I'll look at the FC4 amanda sources. Anyway, the solution seems to be to
reallocate the sock from the TCPv4 slab, copy its contents from the old one
allocated from the TCPv6 slab, and free the old one. I'm investigating whether
this is the correct way of fixing this.
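
The mismatch described in this comment can be modelled in user space. The following is a minimal sketch, not kernel code: the struct layouts, the `origin` field, the `live` counters and every `*_model` name are assumptions made up for illustration. Only the names sk_prot, slab, tcp_prot and the idea of freeing through sk->sk_prot->slab come from the thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel types; the layouts are invented. */
struct slab_cache { const char *name; int live; };   /* models a kmem_cache */
struct proto      { const char *name; struct slab_cache *slab; };
struct sock {
    struct proto *sk_prot;        /* may be changed after allocation */
    struct slab_cache *origin;    /* model-only: cache the object really came from */
};

static struct slab_cache tcp_slab  = { "tcp_sock",  0 };
static struct slab_cache tcp6_slab = { "tcp6_sock", 0 };
static struct proto tcp_prot   = { "TCP",   &tcp_slab  };
static struct proto tcpv6_prot = { "TCPv6", &tcp6_slab };

/* Allocate a sock from the cache that belongs to prot. */
static struct sock *sk_alloc_model(struct proto *prot)
{
    struct sock *sk = calloc(1, sizeof(*sk));
    if (!sk)
        return NULL;
    sk->sk_prot = prot;
    sk->origin = prot->slab;
    prot->slab->live++;
    return sk;
}

/*
 * Free through sk->sk_prot->slab, as the thread says the affected kernels do.
 * Returns 0 if that is the cache the object came from, -1 on a mismatch.
 */
static int sk_free_model(struct sock *sk)
{
    struct slab_cache *target = sk->sk_prot->slab;
    int rc = (target == sk->origin) ? 0 : -1;

    target->live--;               /* accounting lands on whichever cache was chosen */
    free(sk);
    return rc;
}

/* Allocate as TCPv6, switch sk_prot as IPV6_ADDRFORM does, then free. */
static int demo_addrform_mismatch(void)
{
    struct sock *sk = sk_alloc_model(&tcpv6_prot);

    assert(sk != NULL);
    sk->sk_prot = &tcp_prot;      /* the pointer swap quoted from ipv6_sockglue.c */
    return sk_free_model(sk);
}
```

Calling demo_addrform_mismatch() once reports the mismatch and leaves the two toy caches unbalanced: the TCPv4 counter drops below zero while the TCPv6 counter still shows one live object.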

Comment 7 acme 2005-04-24 15:49:25 UTC
From code in net/ipv6/ipv6_sockglue.c:

                                struct tcp_sock *tp = tcp_sk(sk);

                                local_bh_disable();
                                sock_prot_dec_use(sk->sk_prot);
                                sock_prot_inc_use(&tcp_prot);
                                local_bh_enable();
                                sk->sk_prot = &tcp_prot;
                                tp->af_specific = &ipv4_specific;
                                sk->sk_socket->ops = &inet_stream_ops;
                                sk->sk_family = PF_INET;
                                tcp_sync_mss(sk, tp->pmtu_cookie);

I.e. the pre-nuke-sk_slab code already tried to do this, i.e. to decrement
the number of users in the previous family and increment it in the new family.
So I'll cook up a sk_reparent(struct sock *sk, int new_family, struct proto
*new_prot) function to do all the steps needed for the sock to move from one
family to the other. Part of it is above, but the "slab_move" operation is not
done. I'm quite confident that this will fix the bug and provide a new core
function for similar cases, with internet v4/v6 or (maybe, in the future)
other families.
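
A rough user-space model of the "slab move" idea is sketched below. It is not the attached patch, whose contents are not shown in this thread: the return type, the toy structs, the `payload` field, the `MODEL_PF_*` constants and all `*_model` names are assumptions. The sketch allocates from the new cache, copies the old object, and frees the old object into the cache it came from. Note that the sock's address changes, so any list or pointer still referring to the old object would dangle.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for the kernel types; the layouts are invented. */
struct slab_cache { const char *name; int live; };
struct proto      { struct slab_cache *slab; };
struct sock       { struct proto *sk_prot; int sk_family; int payload; };

enum { MODEL_PF_INET = 2, MODEL_PF_INET6 = 10 };   /* illustrative values */

static struct slab_cache tcp_slab  = { "tcp_sock",  0 };
static struct slab_cache tcp6_slab = { "tcp6_sock", 0 };
static struct proto tcp_prot   = { &tcp_slab  };
static struct proto tcpv6_prot = { &tcp6_slab };

static struct sock *sk_alloc_model(struct proto *prot, int family)
{
    struct sock *sk = calloc(1, sizeof(*sk));
    if (!sk)
        return NULL;
    sk->sk_prot = prot;
    sk->sk_family = family;
    prot->slab->live++;
    return sk;
}

static void sk_free_model(struct sock *sk)
{
    sk->sk_prot->slab->live--;
    free(sk);
}

/*
 * Hypothetical reparent: returns the new object, or NULL on allocation
 * failure (in which case the old object is left untouched).
 */
static struct sock *sk_reparent_model(struct sock *old, int new_family,
                                      struct proto *new_prot)
{
    struct sock *sk = sk_alloc_model(new_prot, new_family);

    if (!sk)
        return NULL;
    memcpy(sk, old, sizeof(*sk));   /* copy contents from the old sock */
    sk->sk_prot = new_prot;         /* then re-point it at the new family */
    sk->sk_family = new_family;
    sk_free_model(old);             /* old->sk_prot still names its own cache */
    return sk;
}

/* Returns 1 if contents survive the move and both caches end up balanced. */
static int demo_reparent(void)
{
    struct sock *sk = sk_alloc_model(&tcpv6_prot, MODEL_PF_INET6);
    int ok;

    assert(sk != NULL);
    sk->payload = 42;
    sk = sk_reparent_model(sk, MODEL_PF_INET, &tcp_prot);
    assert(sk != NULL);
    ok = sk->payload == 42 && sk->sk_prot == &tcp_prot &&
         tcp_slab.live == 1 && tcp6_slab.live == 0;
    sk_free_model(sk);
    return ok && tcp_slab.live == 0;
}
```

In this model the caller must replace every copy of the old pointer with the returned one, which is the kind of bookkeeping that makes a move-based fix fragile.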

Comment 8 acme 2005-04-24 17:39:21 UTC
Created attachment 113614 [details]
setsockopt IPV6_ADDRFORM first stab at doing proper slab move

This is still not the final patch, I think, what if the sock is linked to some
list?

Comment 9 acme 2005-04-24 17:41:07 UTC
Can somebody please test with this patch? As I said it may well not be the final
patch, but having the test case tested with it can provide some clues.

Comment 10 Jay Fenlason 2005-04-25 17:37:42 UTC
I've built a kernel with the attached patch, and I'm running it on 
fenlason-rhide.  We'll see if it survives overnight. 

Comment 11 Jay Fenlason 2005-04-26 13:36:52 UTC
Created attachment 113666 [details]
oops when running with the attached patch

Last night's Amanda run apparently worked, because the tape got ejected, but it
still oopsed before I got in this morning.  I'm attaching the console output.

Comment 12 acme 2005-04-30 21:00:31 UTC
Created attachment 113892 [details]
Introduce sk_prot_creator, so that we always remember where this sock come from

I think this one fixes this for good: at sk_alloc/sk_free time we use the
original sk_prot, not the new one set by code such as the ipv6_setsockopt
IPV6_ADDRFORM case. It introduces a new pointer in struct sock, but this is
better than having sk_slab and sk_owner back in struct sock.

Could you please give it a try?
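
The approach can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the attached patch: the toy structs, the `live` counters and the `*_model` function names are invented. Only sk_prot, sk_prot_creator, tcp_prot and the alloc-time/free-time behaviour come from the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel types; the layouts are invented. */
struct slab_cache { const char *name; int live; };
struct proto      { struct slab_cache *slab; };
struct sock {
    struct proto *sk_prot;          /* current protocol, may be switched later */
    struct proto *sk_prot_creator;  /* protocol the sock was allocated from */
};

static struct slab_cache tcp_slab  = { "tcp_sock",  0 };
static struct slab_cache tcp6_slab = { "tcp6_sock", 0 };
static struct proto tcp_prot   = { &tcp_slab  };
static struct proto tcpv6_prot = { &tcp6_slab };

/* Remember the creating proto once, at allocation time. */
static struct sock *sk_alloc_model(struct proto *prot)
{
    struct sock *sk = calloc(1, sizeof(*sk));
    if (!sk)
        return NULL;
    sk->sk_prot = sk->sk_prot_creator = prot;
    prot->slab->live++;
    return sk;
}

/* Free through the creator, ignoring whatever sk_prot points at now. */
static void sk_free_model(struct sock *sk)
{
    sk->sk_prot_creator->slab->live--;
    free(sk);
}

/* Returns 1 if both caches are balanced after an ADDRFORM-style switch. */
static int demo_prot_creator(void)
{
    struct sock *sk = sk_alloc_model(&tcpv6_prot);

    assert(sk != NULL);
    sk->sk_prot = &tcp_prot;        /* IPV6_ADDRFORM-style pointer swap */
    sk_free_model(sk);
    return tcp_slab.live == 0 && tcp6_slab.live == 0;
}
```

Compared with moving the object between caches, this keeps the sock at the same address, so nothing that already points at it needs to be updated.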

Comment 13 Jay Fenlason 2005-04-30 23:54:34 UTC
Do I need to apply both patches, or just the second one?

Comment 14 acme 2005-05-01 00:01:09 UTC
Just the second one; scrap the first, it's bogus.

Comment 15 acme 2005-05-02 11:59:48 UTC
Any news on the tests? :-)

Comment 16 Jay Fenlason 2005-05-02 15:51:57 UTC
Amanda only runs on weeknights, so all I can say so far is that it hasn't 
crashed yet.  " 11:51:31 up 1 day,  2:45" so far. . . 

Comment 17 acme 2005-05-02 18:05:18 UTC
OK, even amanda has to take some rest on weekends, I should learn :-\

Comment 18 acme 2005-05-04 22:07:35 UTC
Well, Russell King hasn't noticed any problems after applying the
sk_prot_creator patch, so we're considering this fixed. Jay, can you confirm
that in your (amandad) case the problem disappeared as well?

Comment 19 Jay Fenlason 2005-05-05 14:45:18 UTC
The patched smp kernel on fenlason-rhide ("2.6.11-1.1276_FC4.rootsmp") has 
been up for four days now (" 10:43:49 up 4 days,  1:37"), through several 
Amanda runs, so I think we can say the patch works. 

Comment 20 David Miller 2005-05-05 18:24:57 UTC
Great, I'll integrate upstream.


Comment 21 acme 2005-05-05 21:28:44 UTC
Thanks everybody for working with me on fixing this bug!

Comment 22 Dave Jones 2005-05-10 19:54:27 UTC
Thanks for chasing this down Arnaldo, I'll pick it up for FC4 later today, and
then drop it again when you get it integrated upstream :-)


Comment 24 Dave Jones 2005-05-10 20:06:40 UTC
Indeed, it made it into today's rawhide kernel due to last night's -rc4 rebase.

Thanks again.


