Red Hat Bugzilla – Bug 155389
mismatch in kmem_cache_free: expected cache cbf5c980, got ce96be20
Last modified: 2014-08-31 19:27:24 EDT
Description of problem:
fenlason-rhide oopses randomly, with the above error message.
Version-Release number of selected component (if applicable):
Every night during the nightly Amanda run
Steps to Reproduce:
1. Boot fenlason-rhide into an affected kernel
Created attachment 113377 [details]
Information from serial console on oops.
Assigning to davem, as there's networking stack all over those traces, though
this could just be a symptom of a bigger problem.
Jay, I take it this box passes a memtest run OK?
I ran memtest for a couple of hours on it yesterday. Also, downgrading to a
previous kernel (/me has to look up the number) makes the problem go away.
1238 is not the only or first one that exhibits the problem; it's just the one
I happened to be running when it crashed last. 1240 also fails.
Yeah, it looks like TCP ipv6 sockets are getting freed up into
the ipv4 slab cache and then again into the (correct) ipv6
slab cache. I suspect some changes recently by Arnaldo, I'll
ask him to look at this.
FWIW, kernel-smp-2.6.11-1.1226_FC4 is not affected, but all newer kernels I've
tried, including kernel-smp-2.6.11-1.1231_FC4, kernel-smp-2.6.11-1.1236_FC4,
kernel-smp-2.6.11-1.1240_FC4, kernel-smp-2.6.11-1.1251_FC4, and
OK, previously each sock had a pointer to the slab it was allocated from;
now the slab pointer is taken from sk->sk_prot->slab. So I'm investigating
cases where a sock is allocated from the TCPv6 slab and the sk->sk_prot
pointer is later changed to the TCP (v4) struct proto, which means a sock
allocated from the TCPv6 slab ends up being released to the TCPv4 slab. The
IPV6_ADDRFORM setsockopt (net/ipv6/ipv6_sockglue.c) is one such case.
I'll look at the FC4 amanda sources. Anyway, the solution seems to be to
reallocate the sock from the TCPv4 slab, copy its contents from the old one
allocated from the TCPv6 slab, and free the old one. I'm investigating whether
this is the correct way of fixing this.
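To make the failure mode described above concrete, here is a minimal userspace sketch. It is a toy model, not kernel code: every toy_* name is hypothetical, the "caches" are just counters, and the allocated_from field exists only so the model can detect the mismatch (the real struct sock no longer carried such a pointer, which is the point of the bug).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy userspace model of the failure; every toy_* name is hypothetical. */
struct toy_cache { int live_objects; };
struct toy_proto { struct toy_cache *slab; };

struct toy_sock {
    struct toy_cache *allocated_from; /* ground truth, kept only for checking */
    struct toy_proto *sk_prot;        /* may be switched after allocation */
};

static struct toy_sock *toy_sk_alloc(struct toy_proto *prot)
{
    struct toy_sock *sk = malloc(sizeof(*sk));

    assert(sk != NULL);
    sk->allocated_from = prot->slab;
    sk->sk_prot = prot;
    prot->slab->live_objects++;
    return sk;
}

/* Releases to whatever cache the *current* sk_prot names.
 * Returns false when that is not the cache the sock came from. */
static bool toy_sk_free(struct toy_sock *sk)
{
    struct toy_cache *target = sk->sk_prot->slab;
    bool matched = (target == sk->allocated_from);

    target->live_objects--;
    free(sk);
    return matched;
}

/* Allocate as "TCPv6", switch sk_prot to "TCP (v4)", then free. */
static bool toy_addrform_scenario(int *v4_live, int *v6_live)
{
    struct toy_cache v4_cache = { 0 }, v6_cache = { 0 };
    struct toy_proto v4_proto = { &v4_cache }, v6_proto = { &v6_cache };
    struct toy_sock *sk = toy_sk_alloc(&v6_proto);
    bool matched;

    sk->sk_prot = &v4_proto; /* stands in for the IPV6_ADDRFORM switch */
    matched = toy_sk_free(sk);
    *v4_live = v4_cache.live_objects;
    *v6_live = v6_cache.live_objects;
    return matched;
}
```

In this model, toy_addrform_scenario() reports a mismatch: the v4 counter is decremented for an object it never handed out, and the v6 counter never goes back to zero.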
From code in net/ipv6/ipv6_sockglue.c:
struct tcp_sock *tp = tcp_sk(sk);
sk->sk_prot = &tcp_prot;
tp->af_specific = &ipv4_specific;
sk->sk_socket->ops = &inet_stream_ops;
sk->sk_family = PF_INET;
I.e. the pre-nuke-sk_slab code already tried to do this, i.e. to decrement
the number of users in the previous family and increment it in the new family.
So I'll cook up a sk_reparent(struct sock *sk, int new_family, struct proto
*new_prot) function to do all the steps needed for the sock to move from one
family to the other. Part of it is above, but the "slab_move" operation is not
done. I'm quite confident that this will fix the bug and provide a new
core function for similar cases in other internet v4/v6 protocols (or other
families, in the future maybe).
Created attachment 113614 [details]
setsockopt IPV6_ADDRFORM first stab at doing proper slab move
This is still not the final patch, I think, what if the sock is linked to some
Can somebody please test with this patch? As I said it may well not be the final
patch, but having the test case tested with it can provide some clues.
I've built a kernel with the attached patch, and I'm running it on
fenlason-rhide. We'll see if it survives overnight.
Created attachment 113666 [details]
oops when running with the attached patch
Last night's Amanda run apparently worked, because the tape got ejected, but it
still oopsed before I got in this morning. I'm attaching the console output.
Created attachment 113892 [details]
Introduce sk_prot_creator, so that we always remember where this sock come from
I think this one fixes this for good, as at sk_alloc/sk_free time we use the
original sk_prot, not the new one changed in some code such as the
IPV6_ADDRFORM case, it introduces a new pointer in sk_prot, but this is better
than having sk_slab and sk_owner back in struct sock.
Could you please give it a try?
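The idea behind sk_prot_creator, as described above, can be sketched in userspace as follows. Again this is a toy model under stated assumptions, not the attached patch: the model_* names are hypothetical, the "caches" are plain counters, and only the sk_prot and sk_prot_creator field names are taken from this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy userspace model of the fix; every model_* name is hypothetical. */
struct model_cache { int live_objects; };
struct model_proto { struct model_cache *slab; };

struct model_sock {
    struct model_proto *sk_prot;         /* may be switched after allocation */
    struct model_proto *sk_prot_creator; /* set once at allocation, never changed */
};

static struct model_sock *model_sk_alloc(struct model_proto *prot)
{
    struct model_sock *sk = malloc(sizeof(*sk));

    assert(sk != NULL);
    sk->sk_prot = sk->sk_prot_creator = prot;
    prot->slab->live_objects++;
    return sk;
}

/* Releases to the cache named by the creator, ignoring the current sk_prot.
 * Returns the cache the object was released to. */
static struct model_cache *model_sk_free(struct model_sock *sk)
{
    struct model_cache *target = sk->sk_prot_creator->slab;

    target->live_objects--;
    free(sk);
    return target;
}

/* Allocate as "TCPv6", switch sk_prot to "TCP (v4)", then free.
 * Returns true when the object went back to the cache it came from. */
static bool model_addrform_scenario(int *v4_live, int *v6_live)
{
    struct model_cache v4_cache = { 0 }, v6_cache = { 0 };
    struct model_proto v4_proto = { &v4_cache }, v6_proto = { &v6_cache };
    struct model_sock *sk = model_sk_alloc(&v6_proto);
    bool matched;

    sk->sk_prot = &v4_proto; /* stands in for the IPV6_ADDRFORM switch */
    matched = (model_sk_free(sk) == &v6_cache);
    *v4_live = v4_cache.live_objects;
    *v6_live = v6_cache.live_objects;
    return matched;
}
```

In this model the switch of sk_prot no longer matters at free time: model_addrform_scenario() returns true and both counters end at zero. The cost is one extra pointer per sock, which is the trade-off mentioned in the comment above.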
Do I need to apply both patches, or just the second one?
Just the second one; scrap the first, it's bogus.
Any news on the tests? :-)
Amanda only runs on weeknights, so all I can say so far is that it hasn't
crashed yet. " 11:51:31 up 1 day, 2:45" so far. . .
OK, even amanda has to take some rest on weekends, I should learn :-\
Well, Russell King didn't notice any problems after applying the
sk_prot_creator patch, so we're considering this fixed. Jay, can you confirm that
the problem disappeared in your case as well?
The patched smp kernel on fenlason-rhide ("2.6.11-1.1276_FC4.rootsmp") has
been up for four days now (" 10:43:49 up 4 days, 1:37"), through several
Amanda runs, so I think we can say the patch works.
Great, I'll integrate upstream.
Thanks everybody for working with me on fixing this bug!
Thanks for chasing this down Arnaldo, I'll pick it up for FC4 later today, and
then drop it again when you get it integrated upstream :-)
Already merged upstream :-)
Indeed, it made it into today's rawhide kernel due to last night's -rc4 rebase.