Description of problem:
When the connection to the Internet fails outside the computer, the kernel reliably panics. This occurs even just after a reboot if the Internet connection is not up.

Here is the hand-transcribed call trace:
ip6_link_failure+0xbe/0xd0
ipip6_tunnel_xmit+0x7e7/0x860 [sit]
dev_hard_start_xmit+0x3e3/0x690
dev_queue_xmit+0x38f/0x610
neigh_direct_output+0x11/0x20
ip6_finish_output2+0x90/0x340
? ac6_proc_exit+0x20/0x20
ip6_finish_output+0x98/0xc0
ip6_output
? __ip6_local_out
ip6_local_out
ip6_push_pending
? ip6_append_data
udp_v6_push_pending_frames
udpv6_sendmsg
...

Version-Release number of selected component (if applicable):
This has been occurring in Fedora 17 for much of the kernel 3.6 series.

How reproducible:
Fairly reproducible. It typically occurs after the Internet link goes down, or after a reboot if the Internet link is down.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
What specific kernel version is the backtrace above from? Does the oops always indicate the sit module?
As Josh notes, it would be great if you could provide the specific kernel version. The complete backtrace (recorded via serial cable or more completely transcribed) would also be a great help, as would the boot log.

I don't see anything there that would expressly panic. If I had to guess, I would say that in ipip6_tunnel_xmit we are falling into this block:

    if (rt->rt_type != RTN_UNICAST) {
            ip_rt_put(rt);
            dev->stats.tx_carrier_errors++;
            goto tx_error_icmp;
    }

Possibly we are freeing the route table's dst entry without setting the skb's dst pointer to NULL, then oopsing when we try to send the dst_unreach message, or something along those lines. The complete backtrace would help confirm that.
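To make the guess above concrete, here is a minimal user-space sketch of the general hazard: a reference is dropped on an error path while another structure still caches the pointer. This is not the sit driver's code. Every name (demo_route, demo_pkt, demo_route_put, demo_pkt_drop_route) is hypothetical and invented for illustration, and the sketch only assumes a simple non-atomic reference count.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted route entry; NOT the kernel's struct rtable. */
struct demo_route {
    int refcnt;
};

/* Hypothetical packet that caches a pointer to its route. */
struct demo_pkt {
    struct demo_route *route;
};

/* Allocate a route holding one reference; returns NULL on failure. */
struct demo_route *demo_route_alloc(void)
{
    struct demo_route *rt = malloc(sizeof(*rt));
    if (rt)
        rt->refcnt = 1;
    return rt;
}

/* Drop one reference and free on the last one. Returns 1 if freed. */
int demo_route_put(struct demo_route *rt)
{
    if (rt && --rt->refcnt == 0) {
        free(rt);
        return 1;
    }
    return 0;
}

/*
 * Error path: release the packet's route and clear the cached pointer,
 * so that later error handling cannot touch freed memory.
 */
int demo_pkt_drop_route(struct demo_pkt *pkt)
{
    int freed = demo_route_put(pkt->route);
    pkt->route = NULL;
    return freed;
}
```

If the `pkt->route = NULL` step were skipped, any later code that followed `pkt->route` would be reading freed memory, which is the kind of failure being speculated about here.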
(In reply to comment #1)
> What specific kernel version is the backtrace above from?

3.6.9-2.fc17.x86_64

> Does the oops always indicate the sit module?

Yes, it appears so.

Here's the full trace, if it helps:
ip6_link_failure+0xbe/0xd0
ipip6_tunnel_xmit+0x7e7/0x860 [sit]
dev_hard_start_xmit+0x3e3/0x690
dev_queue_xmit+0x38f/0x610
neigh_direct_output+0x11/0x20
ip6_finish_output2+0x90/0x340
? ac6_proc_exit+0x20/0x20
ip6_finish_output+0x98/0xc0
ip6_output
? __ip6_local_out
ip6_local_out
ip6_push_pending
? ip6_append_data
udp_v6_push_pending_frames
udpv6_sendmsg
inet_sendmsg
? selinux_socket_sendmsg
sock_sendmsg
? ipv6_addr_label
? _raw_read_unlock_bh
? ipv6_dev_get_saddr
__sys_sendmsg
? sk_setup_caps
? ip6_datagram_connect
? inet_dgram_connect
? sys_connect
sys_sendmsg
system_call_fastpath

Here's an older call trace from a failure of a 3.5.6-1.fc17.x86_64 kernel:
ip6_link_failure+0xbe/0xd0
ipip6_tunnel_xmit+0x305/0x850 [sit]
? _raw_spin_unlock_bh+0x15/0x20
dev_hard_start_xmit+0x3c3/0x640
dev_queue_xmit+0x38f/0x610
neigh_direct_output+0x11/0x20
...

Another older call trace, from a failure of a 3.5.3-1.fc17.x86_64 kernel:
ip6_link_failure+0xbe/0xd0
ipip6_tunnel_xmit+0x7d0/0x850 [sit]
dev_hard_start_xmit+0x2f3/0x630
...
When I say the complete backtrace, what I really need are all the lines after the Panic or Oops line (The Register dump, including the RIP value, so I can see on exactly what line we crashed).
Created attachment 674874 [details] Trace image. Example trace.
Thank you, that helps. I'll try to work up a patch for you shortly.
Hmm, sorry, this is going to take some additional work. It appears as though the ip6_link_failure code is trying to call dst_release on a garbage pointer, not just a NULL one, which is very odd for this code. I'm going to have to try to set up a reproducer.
Hmm, further complications, I'm afraid. I've got a reproducer set up here (specifically, 2 qemu guests running a 3.3.4-5 kernel, with a point-to-point 6in4 sit tunnel between them). I ping the IPv6 address of each host from the other and do some iptables magic to return dest-unreachable ICMP traffic in response, but it still keeps working. I've also tried ifdown-ing the underlying physical interface, and I still just process the locally generated dest-unreachable messages, with no oops. I still need to update my hosts to run the latest f17 kernel and test again, but at the moment things seem to be working. In the interim, I notice I can't find specific details in the bz here on how you take down your Internet link. Can you expand on exactly what you do, or what happens, to trigger your network failure prior to the oops? It would help me re-create this. Thanks!
I've tested with several kernels up to 3.6.11-1.fc17, and I still can't force the failure to occur. I've confirmed that I'm calling dst_link_failure from ipip6_tunnel_xmit, so I'm not sure what the discrepancy is here (unless there's some hardware driver problem, as I'm using virt guests, but I don't see that that would be the case here). More details on your reproducer would be helpful.
Created attachment 693350 [details] Panic trace kernel-3.7.4-204.fc18.x86_64
(In reply to comment #9)
> I've tested with several kernels up to 3.6.11-1.fc17, and I still can't
> force the failure to occur, and I've confirmed that I'm calling
> dst_link_failure from ipip6_tunnel_xmit, so I'm not sure what the
> discrepancy is here (unless there's some hardware driver problem (as I'm
> using virt guests), but I don't see that that would be the case here. more
> details on your reproducer would be helpful.

I just got another panic that I was able to record a trace from; see the attachment. The system has been upgraded to Fedora 18.

The system has a PPP link to the wider Internet, and it appears to be a failure in this link that triggers the panic. The panic does not appear to be triggered immediately by the PPP link failure. Perhaps I could add some debugging code to the kernel to try to gather more information?

The trace is different now, but still has a lot of ip6 calls. A limited transcription follows:
...
panic+0xef/0x1d0
oops_end+0xda/0xe0
no_context+0x253/0x27e
__bad_area_nosemaphore+0x1bf/0x1de
bad_area_nosemaphore+0x13/0x15
__do_page_fault+0x39e/0x4e0
? icmpv6_push_pending_frames+0xc7/0x100
? ep_poll_callback+0xfb/0x170
? _raw_spin_unlock_bh+0x15/0x20
? __wake_up_common+0x55/0x90
do_page_fault+0xe/0x10
page_fault+0x28/0x30
? rt6_check_expired.isra.26+0x28/0x40
? ip6_pol_route.isra.46+0xe9/0x500
ip6_pol_route_output+0x2a/0x30
fib6_rule_action+0xd7/0x1f0
...
Debug code or stap scripts might be helpful, but having looked at what you just posted, there's not much to do. The stack trace indicates that you are most likely dereferencing a bad flowi6 pointer in ip6_pol_route_output, but that pointer is declared on the stack in udpv6_sendmsg, so that really can't be the case. At this point, I think what would be most helpful is a sosreport from the system in question, so that I can try again to recreate your setup here and see if I can trigger the problem myself.
I've posted a patch here: http://marc.info/?l=linux-netdev&m=136130572428908&w=2 It is based on some commentary from Eric Dumazet that seems related to this problem. I'm not sure how well received this solution will be, but some testing and confirmation that it solves the problem would certainly help.
Created attachment 700035 [details]
[PATCH] ipv6: fix race condition regarding dst->expires and dst->from.

Eric Dumazet wrote:
| Some strange crashes happen in rt6_check_expired(), with access
| to random addresses.
|
| At first glance, it looks like the RTF_EXPIRES and
| stuff added in commit 1716a96101c49186b
| (ipv6: fix problem with expired dst cache)
| are racy : same dst could be manipulated at the same time
| on different cpus.
|
| At some point, our stack believes rt->dst.from contains a dst pointer,
| while its really a jiffie value (as rt->dst.expires shares the same area
| of memory)
|
| rt6_update_expires() should be fixed, or am I missing something ?
|
| CC Neil because of https://bugzilla.redhat.com/show_bug.cgi?id=892060

Because we do not have any locks for dst_entry, we cannot change
essential structure in the entry; e.g., we cannot change reference
to other entity.

To fix this issue, split 'from' and 'expires' field in dst_entry
out of union. Once it is 'from' is assigned in the constructor,
keep the reference until the very last stage of the life time of
the object.

Of course, it is unsafe to change 'from', so make rt6_set_from simple
just for fresh entries.

Reported-by: Eric Dumazet <eric.dumazet>
Reported-by: Neil Horman <nhorman>
CC: Gao Feng <gaofeng.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji>
---
 include/net/dst.h     |  8 ++------
 include/net/ip6_fib.h | 39 ++++++++++++---------------------------
 net/core/dst.c        |  1 +
 net/ipv6/route.c      |  8 +++-----
 4 files changed, 18 insertions(+), 38 deletions(-)
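For anyone following along, here is a minimal user-space sketch of why sharing storage between 'from' and 'expires' is fragile, and what splitting them buys. It is an illustration under stated assumptions, not the patch itself: the struct and function names (demo_dst_shared, demo_dst_split, DEMO_RTF_EXPIRES and the demo_* helpers) are hypothetical, the layouts are heavily simplified, and the sketch is single-threaded, so it shows only the aliasing and not the cross-CPU timing that the real crash needs.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the kernel's struct dst_entry. */
#define DEMO_RTF_EXPIRES 0x1u

/* Before: both fields share storage, and 'flags' says which one is live. */
struct demo_dst_shared {
    union {
        unsigned long expires;          /* jiffies-style timestamp */
        struct demo_dst_shared *from;   /* entry this one was derived from */
    } u;
    unsigned int flags;
};

/* After: each field has its own storage, so neither can clobber the other. */
struct demo_dst_split {
    unsigned long expires;
    struct demo_dst_split *from;
    unsigned int flags;
};

/* Writer on the shared layout: storing a timestamp overwrites 'from'. */
void demo_shared_set_expires(struct demo_dst_shared *d, unsigned long when)
{
    d->u.expires = when;
    d->flags |= DEMO_RTF_EXPIRES;
}

/* Writer on the split layout: 'from' is left untouched. */
void demo_split_set_expires(struct demo_dst_split *d, unsigned long when)
{
    d->expires = when;
    d->flags |= DEMO_RTF_EXPIRES;
}

/* Returns 1 if 'from' still points at the parent after an expiry update. */
int demo_shared_from_survives(unsigned long when)
{
    struct demo_dst_shared parent = { .u = { .from = NULL }, .flags = 0 };
    struct demo_dst_shared child = { .u = { .from = &parent }, .flags = 0 };

    demo_shared_set_expires(&child, when);
    return child.u.from == &parent;
}

/* Same check on the split layout. */
int demo_split_from_survives(unsigned long when)
{
    struct demo_dst_split parent = { .expires = 0, .from = NULL, .flags = 0 };
    struct demo_dst_split child = { .expires = 0, .from = &parent, .flags = 0 };

    demo_split_set_expires(&child, when);
    return child.from == &parent;
}
```

In the shared layout, a reader that sampled the flags just before the writer ran would go on to treat the timestamp bits as a pointer, which matches the "access to random addresses" described in the quoted mail. With separate fields there is nothing to misinterpret, at the cost of one extra word per entry.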
http://koji.fedoraproject.org/koji/taskinfo?taskID=5036210 Here's a brew build with the latest upstream version of the patch. Please try it out and report results ASAP. Thank you!
Looks like this patch has been accepted upstream. Given the number of eyes on it, I think it's a fair bet that it fixes the problem, so I'm going to pull it into the kernel.
http://koji.fedoraproject.org/koji/taskinfo?taskID=5040305 Post commit test build, for reference
kernel-3.7.9-205.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/kernel-3.7.9-205.fc18
kernel-3.7.9-205.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.
*** Bug 971102 has been marked as a duplicate of this bug. ***