Bug 155389

Summary:

mismatch in kmem_cache_free: expected cache cbf5c980, got ce96be20

Product:

[Fedora] Fedora

Reporter:

Jay Fenlason <fenlason>

Component:

kernel

Assignee:

David Miller <davem>

Status:

CLOSED RAWHIDE

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Priority:

medium

CC:

acme, davej, jfeeney, rmk

Hardware:

i386

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2005-05-10 20:06:40 UTC

Attachments:

Description	Flags
Information from serial console on oops.	none
setsockopt IPV6_ADDRFORM first stab at doing proper slab move	none
oops when running with the attached patch	none
Introduce sk_prot_creator, so that we always remember where this sock come from	none

Description Jay Fenlason 2005-04-19 21:04:59 UTC

Description of problem: 
fenlason-rhide oopses randomly, with the above error message. 
 
Version-Release number of selected component (if applicable): 
2.6.11-1.1238_FC4smp 
 
How reproducible: 
Every night during the nightly Amanda run 
 
Steps to Reproduce:
1. Boot fenlason-rhide into an affected kernel
2. Wait

Actual results: 
oops 
 
Expected results: 
no oops 
 
Additional info:

Comment 1 Jay Fenlason 2005-04-19 21:04:59 UTC

Created attachment 113377 [details]
Information from serial console on oops.

Comment 2 Dave Jones 2005-04-20 01:40:33 UTC

Assigning to davem, as there's networking stack all over those traces, though
this could just be a symptom of a bigger problem.

Jay, I take it this box passes a memtest run OK?

Comment 3 Jay Fenlason 2005-04-20 14:28:19 UTC

I ran memtest for a couple of hours on it yesterday.  Also, downgrading to a
previous kernel (/me has to look up the number) makes the problem go away.
1238 is not the only or first one that exhibits the problem; it's just the one
I happened to be running when it crashed last.  1240 also fails.

Comment 4 David Miller 2005-04-21 23:38:03 UTC

Yeah, it looks like TCP ipv6 sockets are getting freed up into
the ipv4 slab cache and then again into the (correct) ipv6
slab cache.  I suspect some recent changes by Arnaldo; I'll
ask him to look at this.

Comment 5 Jay Fenlason 2005-04-22 19:00:25 UTC

FWIW, kernel-smp-2.6.11-1.1226_FC4 is not affected, but all newer kernels I've 
tried, including kernel-smp-2.6.11-1.1231_FC4, kernel-smp-2.6.11-1.1236_FC4, 
kernel-smp-2.6.11-1.1240_FC4, kernel-smp-2.6.11-1.1251_FC4, and 
2.6.11-1.1258_FC4smp are.

Comment 6 acme 2005-04-24 15:42:21 UTC

OK, previously each sock had a pointer to the slab it was allocated from;
now the slab pointer is obtained from sk->sk_prot->slab. So I'm investigating
cases where a sock is allocated from the TCPv6 slab and the sk->sk_prot
pointer is later changed to the TCP (v4) struct proto, which makes a sock
allocated from the TCPv6 slab be released to the TCPv4 slab. The
IPV6_ADDRFORM setsockopt (net/ipv6/ipv6_sockglue.c) is one such case.
I'll look at the FC4 amanda sources. Anyway, the solution seems to be to
reallocate the sock from the TCPv4 slab, copy its contents from the old one
allocated from the TCPv6 slab, and free the old one. I'm investigating
whether this is the correct way of fixing this.
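
To make the failure mode described in this comment concrete, here is a minimal userspace sketch. It is not kernel code: the model_* types and functions are hypothetical stand-ins, and the origin field exists only so the model can tell which cache an object really came from (the kernel's slab debugging derives that from its own bookkeeping).

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins only; these are NOT the kernel's definitions. */
struct model_cache { const char *name; long live; };
struct model_proto { struct model_cache *slab; };
struct model_sock {
    struct model_cache *origin;   /* model-only: cache the object came from */
    struct model_proto *sk_prot;  /* may be reassigned during the sock's life */
};

static struct model_cache model_tcp_cache   = { "TCP", 0 };
static struct model_cache model_tcpv6_cache = { "TCPv6", 0 };
static struct model_proto model_tcp_prot    = { &model_tcp_cache };
static struct model_proto model_tcpv6_prot  = { &model_tcpv6_cache };

/* Allocate a sock from prot's cache and count it as live there. */
static struct model_sock *model_sk_alloc(struct model_proto *prot)
{
    struct model_sock *sk;

    assert(prot != NULL && prot->slab != NULL);
    sk = calloc(1, sizeof(*sk));
    if (sk == NULL)
        return NULL;
    sk->origin = prot->slab;
    sk->sk_prot = prot;
    prot->slab->live++;
    return sk;
}

/*
 * Free by looking up the cache through sk->sk_prot->slab.
 * Returns 0 if that is the cache the object came from, -1 on a mismatch
 * (the situation the "mismatch in kmem_cache_free" message reports).
 */
static int model_sk_free(struct model_sock *sk)
{
    int rc = (sk->sk_prot->slab == sk->origin) ? 0 : -1;

    sk->origin->live--;   /* keep the model's bookkeeping sane either way */
    free(sk);
    return rc;
}
```

In this model, allocating with &model_tcpv6_prot, assigning sk->sk_prot = &model_tcp_prot, and then calling model_sk_free(sk) yields -1, while freeing an untouched sock yields 0.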

Comment 7 acme 2005-04-24 15:49:25 UTC

From code in net/ipv6/ipv6_sockglue.c:

    struct tcp_sock *tp = tcp_sk(sk);

    /* Move the in-use count from the current proto to tcp_prot. */
    local_bh_disable();
    sock_prot_dec_use(sk->sk_prot);
    sock_prot_inc_use(&tcp_prot);
    local_bh_enable();
    /* From here on sk->sk_prot->slab names the TCPv4 slab. */
    sk->sk_prot = &tcp_prot;
    tp->af_specific = &ipv4_specific;
    sk->sk_socket->ops = &inet_stream_ops;
    sk->sk_family = PF_INET;
    tcp_sync_mss(sk, tp->pmtu_cookie);

I.e. the pre-nuke-sk_slab code already tried to do this, i.e. to decrement
the number of users in the previous family and increment it in the new family.
So I'll cook up a sk_reparent(struct sock *sk, int new_family, struct proto
*new_prot) function to do all the steps needed for the sock to move from one
family to the other. Part of it is above, but the "slab_move" operation is not
done. I'm quite confident that this will fix the bug and provide a new
core function to be used in similar cases with other internet v4/v6 protocols
(or other families, in the future maybe).
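
For context, the kernel fragment quoted in this comment is reached from user space through setsockopt() with IPV6_ADDRFORM. The sketch below assumes the Linux interface as documented in ipv6(7), which only allows the conversion on a connected AF_INET6 socket using a v4-mapped address; the helper name convert_to_ipv4 is hypothetical and is not taken from amanda or from this report.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: ask the kernel to turn an AF_INET6 socket into an
 * AF_INET one. Returns 0 on success or a negative errno value on failure.
 */
static int convert_to_ipv4(int fd)
{
    int family = AF_INET;   /* the only target family ipv6(7) documents */

    if (setsockopt(fd, IPPROTO_IPV6, IPV6_ADDRFORM,
                   &family, sizeof(family)) < 0)
        return -errno;
    return 0;
}
```

A caller would typically invoke this on an accepted or connected IPv6 socket whose peer is a v4-mapped address; on other sockets the call is expected to fail with an error rather than convert anything.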

Comment 8 acme 2005-04-24 17:39:21 UTC

Created attachment 113614 [details]
setsockopt IPV6_ADDRFORM first stab at doing proper slab move

This is still not the final patch, I think: what if the sock is linked to some
list?

Comment 9 acme 2005-04-24 17:41:07 UTC

Can somebody please test with this patch? As I said, it may well not be the
final patch, but running the test case with it can provide some clues.

Comment 10 Jay Fenlason 2005-04-25 17:37:42 UTC

I've built a kernel with the attached patch, and I'm running it on 
fenlason-rhide.  We'll see if it survives overnight.

Comment 11 Jay Fenlason 2005-04-26 13:36:52 UTC

Created attachment 113666 [details]
oops when running with the attached patch

Last night's Amanda run apparently worked, because the tape got ejected, but it
still oopsed before I got in this morning.  I'm attaching the console output.

Comment 12 acme 2005-04-30 21:00:31 UTC

Created attachment 113892 [details]
Introduce sk_prot_creator, so that we always remember where this sock come from

I think this one fixes this for good: at sk_alloc/sk_free time we now use the
original sk_prot, not a new one set later by code such as the ipv6_setsockopt
IPV6_ADDRFORM case. It introduces a new pointer in struct sock
(sk_prot_creator), but this is better than having sk_slab and sk_owner back
in struct sock.

Could you please give it a try?
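
The idea behind sk_prot_creator can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the attached patch: the fix_* types and functions are hypothetical, and only the two field names sk_prot and sk_prot_creator come from this report.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins only; these are NOT the kernel's definitions. */
struct fix_cache { const char *name; long live; };
struct fix_proto { struct fix_cache *slab; };
struct fix_sock {
    struct fix_proto *sk_prot;          /* current protocol; may change */
    struct fix_proto *sk_prot_creator;  /* set at allocation, never changed */
};

static struct fix_cache fix_tcp_cache   = { "TCP", 0 };
static struct fix_cache fix_tcpv6_cache = { "TCPv6", 0 };
static struct fix_proto fix_tcp_prot    = { &fix_tcp_cache };
static struct fix_proto fix_tcpv6_prot  = { &fix_tcpv6_cache };

/* Allocate from prot's cache and remember prot as the creator. */
static struct fix_sock *fix_sk_alloc(struct fix_proto *prot)
{
    struct fix_sock *sk;

    assert(prot != NULL && prot->slab != NULL);
    sk = calloc(1, sizeof(*sk));
    if (sk == NULL)
        return NULL;
    sk->sk_prot = sk->sk_prot_creator = prot;
    prot->slab->live++;
    return sk;
}

/*
 * Release to the creator's cache, ignoring whatever sk_prot points to now.
 * Returns the cache that was used so callers can inspect it.
 */
static struct fix_cache *fix_sk_free(struct fix_sock *sk)
{
    struct fix_cache *cache = sk->sk_prot_creator->slab;

    cache->live--;
    free(sk);
    return cache;
}
```

With this layout, a sock allocated with &fix_tcpv6_prot is still returned to fix_tcpv6_cache even after sk->sk_prot has been switched to &fix_tcp_prot, at the cost of one extra pointer per sock.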

Comment 13 Jay Fenlason 2005-04-30 23:54:34 UTC

Do I need to apply both patches, or just the second one?

Comment 14 acme 2005-05-01 00:01:09 UTC

Just the second one; scrap the first, it's bogus.

Comment 15 acme 2005-05-02 11:59:48 UTC

Any news on the tests? :-)

Comment 16 Jay Fenlason 2005-05-02 15:51:57 UTC

Amanda only runs on weeknights, so all I can say so far is that it hasn't 
crashed yet.  " 11:51:31 up 1 day,  2:45" so far. . .

Comment 17 acme 2005-05-02 18:05:18 UTC

OK, even amanda has to take some rest on weekends, I should learn :-\

Comment 18 acme 2005-05-04 22:07:35 UTC

Well, Russell King didn't notice any problems after applying the
sk_prot_creator patch, so we're considering this fixed. Jay, can you confirm
that in your (amandad) case the problem disappeared as well?

Comment 19 Jay Fenlason 2005-05-05 14:45:18 UTC

The patched smp kernel on fenlason-rhide ("2.6.11-1.1276_FC4.rootsmp") has 
been up for four days now (" 10:43:49 up 4 days,  1:37"), through several 
Amanda runs, so I think we can say the patch works.

Comment 20 David Miller 2005-05-05 18:24:57 UTC

Great, I'll integrate upstream.

Comment 21 acme 2005-05-05 21:28:44 UTC

Thanks everybody for working with me on fixing this bug!

Comment 22 Dave Jones 2005-05-10 19:54:27 UTC

Thanks for chasing this down Arnaldo, I'll pick it up for FC4 later today, and
then drop it again when you get it integrated upstream :-)

Comment 23 acme 2005-05-10 20:03:10 UTC

http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=476e19cfa131e2b6eedc4017b627cdc4ca419ffb

Already merged upstream :-)

Comment 24 Dave Jones 2005-05-10 20:06:40 UTC

Indeed, it made it into today's rawhide kernel due to last night's -rc4 rebase.

thanks again.