Bug 155389
| Summary: | mismatch in kmem_cache_free: expected cache cbf5c980, got ce96be20 | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Jay Fenlason <fenlason> |
| Component: | kernel | Assignee: | David Miller <davem> |
| Status: | CLOSED RAWHIDE | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Priority: | medium |
| Version: | 4 | CC: | acme, davej, jfeeney, rmk |
| Hardware: | i386 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2005-05-10 20:06:40 UTC |
Description
Jay Fenlason
2005-04-19 21:04:59 UTC
Created attachment 113377 [details]
Information from serial console on oops.
Assigning to davem, as there's networking stack all over those traces, though this could just be a symptom of a bigger problem. Jay, I take it this box passes a memtest run OK?

I ran memtest for a couple of hours on it yesterday. Also, downgrading to a previous kernel (/me has to look up the number) makes the problem go away. 1238 is not the only or first one that exhibits the problem; it's just the one I happened to be running when it crashed last. 1240 also fails.

Yeah, it looks like TCP ipv6 sockets are getting freed up into the ipv4 slab cache and then again into the (correct) ipv6 slab cache. I suspect some recent changes by Arnaldo; I'll ask him to look at this.

FWIW, kernel-smp-2.6.11-1.1226_FC4 is not affected, but all newer kernels I've tried are, including kernel-smp-2.6.11-1.1231_FC4, kernel-smp-2.6.11-1.1236_FC4, kernel-smp-2.6.11-1.1240_FC4, kernel-smp-2.6.11-1.1251_FC4, and 2.6.11-1.1258_FC4smp.

OK, previously each sock had a pointer to the slab it was allocated from; now the slab pointer is taken from sk->sk_prot->slab. So I'm investigating cases where a sock is allocated from the TCPv6 slab and the sk->sk_prot pointer is later changed to the TCP (v4) struct proto, which will make a sock allocated from the TCPv6 slab be released to the TCPv4 slab, as in the IPV6_ADDRFORM setsockopt (net/ipv6/ipv6_sockglue.c). I'll look at the FC4 amanda sources.

Anyway, the solution seems to be to reallocate the sock from the TCPv4 slab, copy its contents from the old one allocated from the TCPv6 slab, and free the old one. I'm investigating whether this is the correct way of fixing this. From code in net/ipv6/ipv6_sockglue.c:
struct tcp_sock *tp = tcp_sk(sk);
local_bh_disable();
sock_prot_dec_use(sk->sk_prot);
sock_prot_inc_use(&tcp_prot);
local_bh_enable();
sk->sk_prot = &tcp_prot;
tp->af_specific = &ipv4_specific;
sk->sk_socket->ops = &inet_stream_ops;
sk->sk_family = PF_INET;
tcp_sync_mss(sk, tp->pmtu_cookie);
I.e. the pre-nuke-sk_slab code already tried to do this, i.e. to decrement
the number of users in the previous family and increment it in the new family.
So I'll cook up a sk_reparent(struct sock *sk, int new_family, struct proto
*new_prot) function to do all the steps needed for the sock to move from one
family to the other. Part of it is above, but the "slab_move" operation is not
done. I'm quite confident that this will fix the bug and provide a new
core function for similar cases with other internet v4/v6 protocols (or other
families, in the future maybe).
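The mismatch described above can be sketched in userspace. This is a minimal model, not kernel code: `struct cache`, `home_cache`, `model_sk_alloc` and `model_sk_free` are hypothetical stand-ins invented for illustration, and only the names `sk_prot`, `slab`, `tcp_prot` and the idea of a TCPv6 proto come from the discussion. It assumes the free path looks up the cache through `sk->sk_prot->slab`, as the comment above says.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for a slab cache and struct proto. */
struct cache { const char *name; int live_objects; };
struct proto { const char *name; struct cache *slab; };

/* Model of a sock once the per-sock slab pointer is gone. */
struct sock {
    struct proto *sk_prot;    /* current proto; IPV6_ADDRFORM may change it */
    struct cache *home_cache; /* model-only bookkeeping: cache that owns the object */
};

static struct cache tcp_cache   = { "TCP",   0 };
static struct cache tcpv6_cache = { "TCPv6", 0 };
static struct proto tcp_prot    = { "TCP",   &tcp_cache };
static struct proto tcpv6_prot  = { "TCPv6", &tcpv6_cache };

/* Allocate a sock and account for it in the cache of the given proto. */
static struct sock *model_sk_alloc(struct proto *prot)
{
    struct sock *sk = malloc(sizeof(*sk));
    if (sk == NULL)
        return NULL;
    sk->sk_prot = prot;
    sk->home_cache = prot->slab;
    prot->slab->live_objects++;
    return sk;
}

/*
 * Free through sk->sk_prot->slab, as the post-sk_slab code is described
 * as doing. Returns -1 and leaves the object alone when that cache is not
 * the one the object came from (the "mismatch in kmem_cache_free" case),
 * and 0 after a successful free.
 */
static int model_sk_free(struct sock *sk)
{
    assert(sk != NULL);
    if (sk->sk_prot->slab != sk->home_cache)
        return -1;
    sk->home_cache->live_objects--;
    free(sk);
    return 0;
}
```

Allocating from `tcpv6_prot`, pointing `sk_prot` at `tcp_prot`, and then calling `model_sk_free` returns -1 in this model, which is the situation a slab move would have to prevent.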
Created attachment 113614 [details]
setsockopt IPV6_ADDRFORM first stab at doing proper slab move
This is still not the final patch, I think, what if the sock is linked to some
list?
Can somebody please test with this patch? As I said, it may well not be the final patch, but having the test case run with it can provide some clues.

I've built a kernel with the attached patch, and I'm running it on fenlason-rhide. We'll see if it survives overnight.

Created attachment 113666 [details]
oops when running with the attached patch
Last night's Amanda run apparently worked, because the tape got ejected, but it
still oopsed before I got in this morning. I'm attaching the console output.
Created attachment 113892 [details]
Introduce sk_prot_creator, so that we always remember where this sock come from
I think this one fixes this for good, as at sk_alloc/sk_free time we use the
original sk_prot, not the new one set by code such as the ipv6_setsockopt
IPV6_ADDRFORM case. It introduces a new pointer (sk_prot_creator) in struct sock,
but this is better than having sk_slab and sk_owner back in struct sock.
Could you please give it a try?
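The sk_prot_creator idea can be sketched in userspace as follows. This is only a model under stated assumptions, not the attached patch: `struct cache`, `model_sk_alloc`, `model_addrform` and `model_sk_free` are hypothetical names, and the only things taken from the thread are `sk_prot`, `sk_prot_creator`, the `slab` pointer in struct proto, and the rule that the free path uses the proto the sock was created with.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for a slab cache and struct proto. */
struct cache { const char *name; int live_objects; };
struct proto { const char *name; struct cache *slab; };

struct sock {
    struct proto *sk_prot;         /* current proto; may be switched later */
    struct proto *sk_prot_creator; /* proto used at allocation; never changed */
};

static struct cache tcp_cache   = { "TCP",   0 };
static struct cache tcpv6_cache = { "TCPv6", 0 };
static struct proto tcp_prot    = { "TCP",   &tcp_cache };
static struct proto tcpv6_prot  = { "TCPv6", &tcpv6_cache };

/* Allocate a sock and remember which proto created it. */
static struct sock *model_sk_alloc(struct proto *prot)
{
    struct sock *sk = malloc(sizeof(*sk));
    if (sk == NULL)
        return NULL;
    sk->sk_prot = prot;
    sk->sk_prot_creator = prot;
    prot->slab->live_objects++;
    return sk;
}

/* Model of the IPV6_ADDRFORM switch: only sk_prot changes. */
static void model_addrform(struct sock *sk, struct proto *new_prot)
{
    sk->sk_prot = new_prot;
}

/*
 * Free through sk_prot_creator->slab, so the object always goes back to
 * the cache it was allocated from. Returns that cache for inspection.
 */
static struct cache *model_sk_free(struct sock *sk)
{
    struct cache *origin;

    assert(sk != NULL);
    origin = sk->sk_prot_creator->slab;
    origin->live_objects--;
    free(sk);
    return origin;
}
```

In this model a sock allocated from `tcpv6_prot` and switched to `tcp_prot` with `model_addrform` is still returned to `tcpv6_cache`, at the cost of one extra pointer per sock.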
Do I need to apply both patches, or just the second one?

Just the second one; scrap the first, it's bogus.

Any news on the tests? :-)

Amanda only runs on weeknights, so all I can say so far is that it hasn't crashed yet. " 11:51:31 up 1 day, 2:45" so far. . .

OK, even amanda has to take some rest on weekends, I should learn :-\

Well, Russell King hasn't noticed any problems after applying the sk_prot_creator patch, so we're considering this fixed. Jay, do you confirm that in your (amandad) case the problem disappeared as well?

The patched smp kernel on fenlason-rhide ("2.6.11-1.1276_FC4.rootsmp") has
been up for four days now (" 10:43:49 up 4 days, 1:37"), through several
Amanda runs, so I think we can say the patch works.
Great, I'll integrate upstream. Thanks everybody for working with me on fixing this bug!

Thanks for chasing this down, Arnaldo. I'll pick it up for FC4 later today, and then drop it again when you get it integrated upstream :-)

http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=476e19cfa131e2b6eedc4017b627cdc4ca419ffb
Already merged upstream :-)

Indeed, it made it into today's rawhide kernel due to last night's -rc4 rebase. Thanks again.