Description of problem:
Kernel crash: "Watchdog detected hard LOCKUP on cpu 1" after about 5-20 minutes of running on these kernels:

kernel-3.9.10-100.fc17.i686
kernel-3.9.11-200.fc18.i686

But kernel-3.8.13-100.fc17.i686 does not crash.

Booting with the kernel options "acpi=off noapic nolapic pnpacpi=off pci=noacpi" fixes the bug; the system remains stable with either of the buggy kernels (as well as kernel-3.10.5-201.fc19.i686). It may be stable with a subset of the above options, and I will test that as I find time. For now I went all out on the safety factor so that I have a stable system.

I am now on F19 and kernel-3.10.5-201.fc19.i686 and will do all testing around that kernel. I have not seen it crash with F19 yet, but that's because I've had the acpi/apic kludges in place since F18.

Also, booting into the system and immediately turning off iptables (empty chains, ACCEPT all) seems to make the bug go away, even without any acpi/apic options. Not the best solution, though!

Version-Release number of selected component (if applicable):
kernel-3.9.11-200.fc18.i686

How reproducible:
Always, but I only captured a call trace from one crash (got lucky); all other times the screen was black.

Steps to Reproduce:
1. boot into affected kernels without acpi/apic protection
2. wait 5-20 mins
3.

Actual results:
crash, call trace below

Expected results:
no crash

Additional info:
System is a quirky AMD with ECC that often has hardware compat issues (USB sucks, onboard LAN sucks), but once you resolve them the system runs 100% stable for years.
Manufacturer: TYAN
Product Name: S2466 TIGER MPX
Version: A2

[ 549.123989] ------------[ cut here ]------------
[ 549.123989] WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0xac/0xd0()
[ 549.123989] Hardware name: Unknown
[ 549.123989] Watchdog detected hard LOCKUP on cpu 1
[ 549.123989] Modules linked in: xfrm4_mode_tunnel authenc esp4 xfrm4_mode_transport deflate zlib_deflate twofish_generic twofish_i586 twofish_common camellia_generic serpent_generic glue_helper blowfish_generic blowfish_common cast5_generic cast_common des_generic xcbc rmd160 sha512_generic crypto_null af_key nf_conntrack_netbios_ns nf_conntrack_broadcast xt_TCPMSS xt_tcpmss xt_state xt_LOG xt_limit xt_nat ipt_MASQUERADE xt_REDIRECT iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_mark ip6table_filter iptable_mangle ip6_tables ppdev parport_pc parport r8169 e1000 3c59x mii i2c_amd756 amd_rng nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat nf_conntrack raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 ata_generic pata_acpi sata_sil24 pata_amd floppy nouveau mxm_wmi wmi video i2c_algo_bit drm_kms_helper ttm drm i2c_core
[ 549.123989] Pid: 0, comm: swapper/1 Not tainted 3.9.10-100.fc17.i686 #1
[ 549.123989] Call Trace:
[ 549.123989] [<c043f6f9>] warn_slowpath_common+0x69/0x90
[ 549.123989] [<c04bdccc>] ? watchdog_overflow_callback+0xac/0xd0
[ 549.123989] [<c04bdccc>] ? watchdog_overflow_callback+0xac/0xd0
[ 549.123989] [<c04bdc20>] ? touch_nmi_watchdog+0x70/0x70
[ 549.123989] [<c043f7c3>] warn_slowpath_fmt+0x33/0x40
[ 549.123989] [<c04bdccc>] watchdog_overflow_callback+0xac/0xd0
[ 549.123989] [<c04f0db6>] __perf_event_overflow+0xa6/0x270
[ 549.123989] [<c0410b8f>] ? x86_perf_event_set_period+0x11f/0x230
[ 549.123989] [<c04f1925>] perf_event_overflow+0x15/0x20
[ 549.123989] [<c041120e>] x86_pmu_handle_irq+0x10e/0x150
[ 549.123989] [<c06d3cb2>] ? bit_cursor+0x512/0x550
[ 549.123989] [<c06cf72f>] ? fbcon_clear+0x1bf/0x1f0
[ 549.123989] [<c09856cb>] perf_event_nmi_handler+0x1b/0x20
[ 549.123989] [<c0984ea9>] nmi_handle.isra.0+0x39/0x60
[ 549.123989] [<c09850af>] do_nmi+0x1df/0x3f0
[ 549.123989] [<c09874c1>] ? __atomic_notifier_call_chain+0x21/0x30
[ 549.123989] [<c09845e7>] nmi_stack_correct+0x2f/0x34
[ 549.123989] [<c0429d51>] ? native_apic_mem_read+0x11/0x20
[ 549.123989] [<c0425743>] ? native_apic_wait_icr_idle+0x23/0x30
[ 549.123989] [<c0406ac4>] arch_irq_work_raise+0x34/0x40
[ 549.123989] [<c04e9e75>] irq_work_queue+0xa5/0xb0
[ 549.123989] [<c044149d>] wake_up_klogd+0x2d/0x30
[ 549.123989] [<c04417dd>] console_unlock+0x33d/0x450
[ 549.123989] [<c0441ce5>] vprintk_emit+0x1f5/0x4e0
[ 549.123989] [<c068859a>] ? vsnprintf+0x2ca/0x3d0
[ 549.123989] [<c097b56a>] printk+0x4d/0x4f
[ 549.123989] [<f7f910e8>] sb_close+0x28/0x50 [xt_LOG]
[ 549.123989] [<f7f91fb6>] ipt_log_packet+0x126/0x180 [xt_LOG]
[ 549.123989] [<f7f92a6c>] log_tg+0x7c/0xcc [xt_LOG]
[ 549.123989] [<c090ba60>] ipt_do_table+0x320/0x710
[ 549.123989] [<c08ff084>] ? fib_table_lookup+0x2a4/0x350
[ 549.123989] [<c090cb83>] iptable_filter_hook+0x43/0x80
[ 549.123989] [<c08bb51a>] nf_iterate+0x6a/0x90
[ 549.123989] [<c08c4640>] ? ip_frag_mem+0x50/0x50
[ 549.123989] [<c08bb59c>] nf_hook_slow+0x5c/0x100
[ 549.123989] [<c08c4640>] ? ip_frag_mem+0x50/0x50
[ 549.123989] [<c08c4a65>] ip_forward+0x385/0x3b0
[ 549.123989] [<c08c4640>] ? ip_frag_mem+0x50/0x50
[ 549.123989] [<c08c2ae0>] ip_rcv_finish+0x60/0x320
[ 549.123989] [<c08c33ac>] ip_rcv+0x24c/0x370
[ 549.123989] [<c08c2a80>] ? inet_add_protocol+0x50/0x50
[ 549.123989] [<c0893d3b>] __netif_receive_skb_core+0x55b/0x6d0
[ 549.123989] [<c0893ecd>] __netif_receive_skb+0x1d/0x60
[ 549.123989] [<c089407e>] netif_receive_skb+0x2e/0x90
[ 549.123989] [<c089491f>] napi_gro_receive+0x7f/0xb0
[ 549.123989] [<f7ef24b7>] e1000_receive_skb+0x57/0x70 [e1000]
[ 549.123989] [<f7ef40f4>] e1000_clean_rx_irq+0x214/0x430 [e1000]
[ 549.123989] [<c07781cf>] ? scsi_request_fn+0x9f/0x4b0
[ 549.123989] [<c065e247>] ? __freed_request+0x87/0x90
[ 549.123989] [<f7ef53c6>] e1000_clean+0x1d6/0x850 [e1000]
[ 549.123989] [<c0753484>] ? put_device+0x14/0x20
[ 549.123989] [<c077900d>] ? scsi_next_command+0x3d/0x50
[ 549.123989] [<c0779222>] ? scsi_io_completion+0x1b2/0x650
[ 549.123989] [<c0778d26>] ? scsi_device_unbusy+0x76/0xa0
[ 549.123989] [<c048a1fe>] ? ktime_get+0x5e/0x100
[ 549.123989] [<c089467d>] net_rx_action+0x11d/0x1f0
[ 549.123989] [<c0447763>] __do_softirq+0xc3/0x1f0
[ 549.123989] [<c047150d>] ? sched_clock_idle_wakeup_event+0x1d/0x20
[ 549.123989] [<c04479f5>] irq_exit+0x95/0xa0
[ 549.123989] [<c0404c2b>] do_IRQ+0x4b/0xc0
[ 549.123989] [<c098b333>] common_interrupt+0x33/0x38
[ 549.123989] [<c042fad5>] ? native_safe_halt+0x5/0x10
[ 549.123989] [<c040a027>] default_idle+0x37/0xe0
[ 549.123989] [<c040a9c6>] cpu_idle+0xb6/0xe0
[ 549.123989] [<c09757f8>] start_secondary+0x262/0x267
[ 549.123989] ---[ end trace 6b5a14a0a996da62 ]---
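A note for reading the trace above: frames marked with `?` are addresses the kernel found on the stack but could not confirm as part of the active call chain, so the unmarked frames give the more trustworthy path (here: e1000 receive, netfilter, xt_LOG, printk, then the NMI watchdog). Below is a minimal Python sketch, assuming the dmesg frame layout shown in this report, that separates the two kinds of frame. `FRAME_RE` and `parse_frames` are hypothetical names chosen for illustration and are not part of any kernel tooling.

```python
import re

# One frame looks like "[<c043f6f9>] ? symbol+0x69/0x90 [module]".
# The "?" marker and the trailing "[module]" are both optional.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+"            # address in angle brackets
    r"(?P<guess>\?\s+)?"             # "?" marks an unverified frame
    r"(?P<symbol>[\w.]+)"            # function name, e.g. nmi_handle.isra.0
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"     # offset/size
    r"(?:\s+\[(?P<module>\w+)\])?"   # owning module, if not built in
)


def parse_frames(text):
    """Return (symbol, module, reliable) tuples in the order they appear."""
    return [
        (m.group("symbol"), m.group("module"), m.group("guess") is None)
        for m in FRAME_RE.finditer(text)
    ]


if __name__ == "__main__":
    # A few frames copied from the report; works on flattened or per-line text.
    sample = (
        "[ 549.123989] [<c097b56a>] printk+0x4d/0x4f "
        "[ 549.123989] [<f7f910e8>] sb_close+0x28/0x50 [xt_LOG] "
        "[ 549.123989] [<c08ff084>] ? fib_table_lookup+0x2a4/0x350"
    )
    for symbol, module, reliable in parse_frames(sample):
        marker = " " if reliable else "?"
        print(marker, symbol, f"[{module}]" if module else "")
```

Filtering on the third tuple element leaves only the unmarked frames, which is usually the first thing to look at when deciding where a CPU was stuck.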
I took out the "acpi=off noapic nolapic pnpacpi=off pci=noacpi" options one by one trying to see which was actually required to make the system not crash again. I now have *all* of them out and the system is not crashing! I think a kernel update in the interim has fixed whatever the issue was. The kernel used now that is not crashing is: 3.10.9-200.fc19.i686 Closing this bug.
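One way to organise this kind of elimination test is to drop a single option per boot and note whether the lockup returns. The following is a minimal Python sketch, assuming the five options quoted in this report. `leave_one_out` is a hypothetical helper; it only builds candidate command-line fragments and does not touch the bootloader.

```python
# The workaround options quoted in this report.
OPTIONS = ["acpi=off", "noapic", "nolapic", "pnpacpi=off", "pci=noacpi"]


def leave_one_out(options):
    """Yield (removed_option, remaining_cmdline) for each single removal."""
    for i, removed in enumerate(options):
        remaining = options[:i] + options[i + 1:]
        yield removed, " ".join(remaining)


if __name__ == "__main__":
    for removed, cmdline in leave_one_out(OPTIONS):
        print(f"without {removed}: {cmdline}")
```

Because the lockup here took anywhere from minutes to weeks to appear, a boot that survives one of these combinations is only weak evidence that the removed option was unnecessary.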
Darn, the box crashed again yesterday. I had recently rebooted into kernel-3.11.1-200.fc19.i686 for the first time in weeks (no acpi/apic options) and now it has crashed. Either the bug was reintroduced between 3.10.9-200.fc19.i686 and 3.11.1-200.fc19.i686, or it just took a long time to hit again. In either case, somewhat strange. I am working on getting a copy of this crash's stack trace to paste in the bz.
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 18 kernel bugs.

Fedora 18 has now been rebased to 3.11.4-101.fc18. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 19, and are still experiencing this issue, please change the version to Fedora 19.

If you experience different issues, please open a new bug report for those.
Hi Trevor, have you ever been able to get a stack trace with 3.11.x? Thanks, Michele
Here is a manual transcript of the oops trace:

? leave_mm+0x80/0x50
smp_call_function_single+0xa4/0x130
? leave_mm+0x50/0x50
smp_call_function_many+0x1ec/0x220
? leave_mm+0x50/0x50
native_flush_tlb_others+0x2b/0x30
flush_tlb_page+0x48/0x90
ptep_clear_flush+0x40/0x50
do_wp_page+0x223/0x7d0
handle_pte_fault+0x315/0x8d0
handle_mm_fault+0xb5/0x110
__do_page_fault+0x189/0x4d0
? task_stopped_code+0x50/0x50
? __do_page_fault+0x4d0/0x4d0
? do_page_fault+0xd/0x10
error_code+0x67/0x6c

Sorry for any typos. I can provide more info (I have a pic of it on my phone) if required.

Since the last report the computer has died completely: just the mobo or CPUs, with no visible bad caps or other obvious problems. Probably just the ol' it's-AMD-and-it-got-too-old death. This whole bug might therefore be nothing more than hardware dying, and I would be fine with closing it as such, especially if the traces don't seem related and/or useful.

I replaced the motherboard/CPU with a newer model and everything has worked fine for a month now.
Hi Trevor, given the completely different traces, I'd tend to point the finger at HW this time around (RAM, CPU, ...). So I'd vote for closing it at this point. Regards, Michele
Ok. Thanks for letting us know. We appreciate the heads up that it was likely hardware related.